Frequently Asked Questions (FAQ)

Old MPI binary from HLRN-III phase 1 fails to run (HLRN-III phase 2)

Problem description:
You built a statically linked MPI binary on the HLRN-III up to September 2014 (phase 1). On phase 2 the binary fails to run, and the job output file contains the following message:

MPICH2 ERROR [Rank 0] [job id 2245138] [Fri Oct 17 09:01:34 2014] [c2-0c0s6n2] [nid00410] - 
  MPID_nem_gni_check_localCQ(): GNI_CQ_EVENT_TYPE_SMSG had error (SOURCE_SSID:AT_MDD_INV:CPLTN_DREQ)
  Rank 0 [Fri Oct 17 09:02:14 2014] [c2-0c0s6n2] Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1534)......: MPI_Bcast(buf=0x7fffffff8848, count=3, MPI_FLOAT, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1372).:
MPIR_Bcast_intra(1163):
MPIR_SMP_Bcast(1080)..: Failure during collective
_pmiu_daemon(SIGCHLD): [NID 00410] [c2-0c0s6n2] [Fri Oct 17 09:02:14 2014] PE RANK 0 exit signal Aborted
[NID 00410] 2014-10-17 09:02:15 Apid 2245138: initiated application termination

Answer:
Your binary does not run on the new HLRN system, phase 2. Please rebuild your program.


Do binaries for mpp1 run on mpp2, too? (HLRN-III phase 2)

Answer:
Yes, all binaries built for mpp1 (Intel Ivybridge) run on mpp2 (Intel Haswell).


Do binaries for mpp2 run on mpp1, too? (HLRN-III phase 2)

Answer:
No, binaries built for mpp2 (Intel Haswell) cannot run on mpp1 (Intel Ivybridge).


A binary runs on mpp2 and fails on mpp1 (HLRN-III phase 2)

Problem description:
Binaries that use features of the mpp2 CPUs (Intel Haswell), e.g. AVX2 instructions, cannot run on mpp1. In that case the job output file contains the following message:

[NID 00141] 2014-10-17 14:45:01 Apid 2245455: initiated application termination
Application 2245455 exit signals: Illegal instruction

Answer:
Rebuild the binary without AVX2 features.
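
Before rebuilding, it can help to confirm whether the CPU of a given node reports AVX2 at all. The following is a minimal sketch, assuming a Linux node that lists CPU features in /proc/cpuinfo. The function name has_cpu_flag and its optional file argument are illustrative and not part of any HLRN tooling. Login nodes may have a different CPU than compute nodes, so the result only applies to the node where the check runs.

```shell
#!/bin/bash
# has_cpu_flag FLAG [FILE]
# Succeeds if FLAG appears as a whole word in a "flags" line of FILE
# (default: /proc/cpuinfo). Returns 2 if FILE is not readable.
# Name and interface are illustrative only.
has_cpu_flag() {
    local flag=$1
    local file=${2:-/proc/cpuinfo}
    [ -r "$file" ] || return 2
    grep -E '^flags[[:space:]]*:' "$file" | grep -qw -- "$flag"
}

if has_cpu_flag avx2; then
    echo "This CPU reports AVX2 support."
else
    echo "AVX2 not reported (or /proc/cpuinfo unavailable) on this CPU."
fi
```

The whole-word match (grep -w) keeps "avx" from matching "avx2" and vice versa.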


Batch system: Questions and Errors

Batch jobs are in Hold or Deferred state

Question: My batch jobs are in state Deferred, BatchHold or SystemHold, and I did not hold them myself. What is the reason for that, and what can I do?

Answer: The detailed state of a batch job shows up in the output of showq or checkjob.

The reason for a hold or deferred state is shown at the very bottom of the output of checkjob -v. There are many possible reasons. The most common ones, and the actions that follow from them, are:

  • Deferred and BatchHold: When a job fails to start, the batch system temporarily defers it and automatically sets it back to idle after some minutes. After about 24 attempts, the batch system gives up and puts the job into the BatchHold state. Only a system admin can release this state. System admins usually watch the queue, take care of these jobs and try to get them running. If you feel that nobody is taking care of your job, contact support@hlrn.de.
  • BatchHold: The resource requests of the job cannot be fulfilled by the batch system. Remove the job from the queue with canceljob and resubmit it with a correct setup.
  • SystemHold: The job was most likely put into this state by the accounting system because of insufficient funds (see also the question on "Holds: System:BankFailure" below) or manually by a system admin. Only a system admin can release it. Feel free to contact support@hlrn.de to ask what happened.

Note that jobs that are held for more than 14 days are removed from the queue automatically.


Job does not start, checkjob shows "Holds: System:BankFailure"

Question: My jobs are waiting in the queue without being executed. They have been put on SystemHold, and checkjob -v <jobid> shows the line

Holds:          System:BankFailure

and/or messages like

Message[0] Insufficient funds (-788.322) to run job under account <account_name>

or

Message[0] Account not recognized

What does this mean?

Answer: This means that the project the job is accounted against either no longer contains positive NPL funds or is no longer valid ("Expired"); in other words: you ran out of NPLs. You can check the account via the HLRN Service Portal. If you have a different account available, submit your job with it (#PBS -A ...). See also this FAQ.
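
To pull the relevant lines out of a long checkjob -v listing, a small filter can help. This is a minimal sketch: hold_summary is a hypothetical helper name, and it assumes only the two line formats quoted above (Holds: and Message[n]), optionally preceded by whitespace.

```shell
#!/bin/bash
# hold_summary: read checkjob -v output on stdin and print only the
# "Holds:" and "Message[n]" lines. Hypothetical helper; it assumes the
# line formats quoted in this FAQ.
hold_summary() {
    grep -E '^[[:space:]]*(Holds:|Message\[[0-9]+\])'
}

# Typical use on a login node (replace <jobid> with your job ID):
#   checkjob -v <jobid> | hold_summary
```

If the filter prints nothing, the job carries no hold or message in these formats, and the full checkjob -v output should be read instead.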


msub Error: "ERROR: cannot locate valid destination resource manager"

Question: When I try to submit a batch job with msub, the command fails with:

ERROR:    cannot locate valid destination resource manager -
cannot use partition green in RM green - class pipeq has no configured resources
What does this mean, and what do I have to do to submit a job successfully?

Answer: This happens when you request resources for your batch job that cannot be fulfilled by the currently configured batch system classes. The error occurs, for example, when you request too many nodes, when the walltime is too long, or when you do not request the minimum number of nodes. Please have a look at the currently configured resource limits and resubmit your job accordingly.


Batch job is terminated immediately with the error "Job cannot be executed"

Question: I successfully submitted a job, but it was terminated immediately and the job script was not run at all. I did not even get a job output file with the STDOUT/STDERR of my batch job. Instead, I got a mail from the batch system with the error message:

Aborted by PBS Server
Job cannot be executed

My job script looks like this:

#!/bin/bash
#PBS -l feature=mpp2:test
#PBS -l nodes=1
#PBS -l walltime=00:10:00
#PBS -j oe
#PBS -o /gfs2/work/<userid>/<subdirectory>

aprun <app.exe> > app.out

Answer and solution: Please add a slash at the end of the line specifying the directory for your job output file, so that it looks like this:

#PBS -o /gfs2/work/<userid>/<subdirectory>/
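
If the output directory is generated by a script, the trailing slash can be enforced automatically. The following is a minimal sketch; with_trailing_slash is a hypothetical helper name and the path in the example is a placeholder.

```shell
#!/bin/bash
# with_trailing_slash PATH
# Print PATH, appending "/" if it does not already end in one.
# Hypothetical helper name.
with_trailing_slash() {
    case "$1" in
        */) printf '%s\n' "$1" ;;
        *)  printf '%s/\n' "$1" ;;
    esac
}

# Example: emit the directive for a job script (placeholder path).
printf '#PBS -o %s\n' "$(with_trailing_slash /gfs2/work/myuserid/run01)"
# prints: #PBS -o /gfs2/work/myuserid/run01/
```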


Copy/move fails with the message "Protocol driver not attached" or "Das Protokoll ist nicht verfügbar"

Question: I try to copy/move a file from my $WORK/$HOME to the data archive $PERM. The command fails with the error message:
cp: writing `/qfs1/perm/user/dir/file': Protocol driver not attached
or, in a German locale environment,
cp: Schreiben von `/qfs1/perm/user/dir/file': Das Protokoll ist nicht verfügbar
What is wrong?

Answer: You have most probably hit your file or inode quota on $PERM. Ask your consultant to increase your quota on $PERM. You may also contact support@hlrn.de.


Warning: feupdateenv is not implemented and will always fail

Question: When I link my code with MVAPICH2, the link step ends with the message
.../libimf.so: warning: feupdateenv is not implemented and will always fail
I'm not sure what to do now. I'm using the Intel compilers.

Answer: This is a warning, which is generally harmless. The executable is linked correctly, and you can use it.

If the warning bothers you, add the -shared-intel option to the link flags. For older Intel compilers, use -i-dynamic instead. However, doing so may have much more impact than the warning itself.


Assembler Error: no such instruction: ...

Question: Compiling fails with strange assembler error messages like
Error: no such instruction: 'pminsd %xmm0,%xmm4'
What should I do?

Answer: You have the following choices:

  1. Compile with a less aggressive optimization option (e.g. -O2 instead of -O3).
  2. Request less processor-specific instruction sets (e.g. for gcc: -msse3 instead of -msse4.1).
  3. Use a more recent assembler version. Execute as --version to see what you are currently using. This is probably the assembler provided in /usr/bin as part of the current SuSE Linux distribution used at HLRN. Then check module avail binutils to see what is installed in addition to the system default. If a version higher than the system default is available, load the appropriate binutils modulefile and try compiling again. See module help binutils for further information.
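
Option 3 in the list above can be sketched as follows. The function name assembler_version is illustrative, and <version> is a placeholder for whatever module avail binutils actually lists on the system. The module commands are shown as comments because they only work where the modules environment is set up.

```shell
#!/bin/bash
# Print the first line of the assembler's version banner, or a note
# if no assembler is found on PATH. Illustrative helper name.
assembler_version() {
    if command -v as >/dev/null 2>&1; then
        as --version </dev/null 2>&1 | head -n 1
    else
        echo "no assembler found on PATH"
    fi
}

assembler_version

# Next steps from the list above:
#   module avail binutils            # list additionally installed versions
#   module load binutils/<version>   # <version> is a placeholder
#   as --version                     # confirm the newer assembler is active
```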

"X11 connection rejected because of wrong authentication."

Question: When I log in with either ssh -X myuserid@bicegate.hlrn.de or ssh -Y myuserid@hicegate.hlrn.de I get the message
X11 connection rejected because of wrong authentication.
Connection lost to X server `localhost:NN.0'
and I can use no graphical tools or GUIs. What is wrong?

Answer: Your home directory is probably full, i.e. you may have hit your file system quota on $HOME. Check the output of the quota command for numbers followed by a '*' character, e.g.

$ quota -s
Disk quotas for user myuserid (uid 98765):
     Filesystem  blocks   quota   limit   grace   files   quota limit   grace
bhome-iio:/mnt/home_b
                23280M*  20971M 209715M  21days   11335       0     0
If this is the case, try cleaning up your home directory.
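
The check for '*' markers can be automated with a small filter. This is a minimal sketch that assumes only the output format shown above, in which a figure that exceeds its quota is directly followed by '*'. The helper name over_quota is hypothetical.

```shell
#!/bin/bash
# over_quota: read `quota -s` output on stdin and print the lines in
# which a usage figure is marked with '*'. Hypothetical helper name.
over_quota() {
    grep -E '(^|[[:space:]])[0-9]+[KMGT]?\*'
}

# Typical use:
#   quota -s | over_quota
```

Because quota prints the file system name on a separate line when it is long, the matching line may contain only the numbers. The file system name is then on the line directly above it in the full output.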

I cannot contact external servers/services

Question: I have to contact an external server from within the HLRN, but it does not work. What can I do?

Answer: Outgoing connections are permitted only from the HLRN login and data nodes, and only to registered target hosts ("Zielrechner"). Users manage the list of their target hosts via the corresponding web form at the HLRN Service Portal. Changes to the list of target hosts become effective only after a short delay.

Note: XC30 nodes (MOM, compute) cannot establish outgoing connections.


Last modification: WolfgangBaumann - 13 Apr 2017 13:46 (Version: 51)
2001-2017 © Norddeutscher Verbund für Hoch- und Höchstleistungsrechnen (HLRN).