Reasons Why Jobs Fail to Execute

Blocked Jobs:

Even if the queue has adequate headroom, a job may still fail to run. The primary reasons are as follows:

  • Intermittent issues - The cluster may experience temporary glitches caused by anything from power fluctuations to software incompatibilities. During these outages, a job may be suspended until stability is restored.
  • Bad resource requests - Users often request more than the cluster can provide, such as a walltime beyond the maximum limit, or balance resources so poorly that the job cannot adequately handle the data download.
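
A walltime request can be checked against the limit before the job is submitted. The following is a minimal sketch, not a site-provided tool: the 72:00:00 limit and both function names are assumptions made for illustration, so substitute the limit that applies to your queue.

```shell
# Hypothetical maximum walltime; replace with your queue's real limit.
MAX_WALLTIME="72:00:00"

# Convert an HH:MM:SS walltime string to seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Return 0 if the requested walltime fits within MAX_WALLTIME, 1 otherwise.
check_walltime() {
    requested=$(walltime_to_seconds "$1")
    limit=$(walltime_to_seconds "$MAX_WALLTIME")
    if [ "$requested" -gt "$limit" ]; then
        echo "walltime $1 exceeds the limit $MAX_WALLTIME"
        return 1
    fi
    echo "walltime $1 is within the limit $MAX_WALLTIME"
}
```

For example, `check_walltime 100:00:00` reports that the request exceeds the assumed limit, while `check_walltime 24:00:00` passes.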

Furthermore, when a job fails, the scheduler attempts to reload it every 10 minutes over a 48-hour period. These repeated attempts put further load on the cluster and raise the risk of additional issues.
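
To put that retry schedule in numbers, the short calculation below works out how many reload attempts one failed job can generate. It assumes that attempts are spaced exactly 10 minutes apart for the full 48 hours. The variable names are illustrative only.

```shell
retry_interval_min=10   # minutes between reload attempts
retry_window_hr=48      # hours during which the scheduler keeps trying

# Number of attempts that fit in the retry window.
max_attempts=$(( retry_window_hr * 60 / retry_interval_min ))
echo "$max_attempts"    # prints 288
```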

How to Check a Blocked Job:

$ showq               Lists the jobs in the queue in priority order.
$ showstart ######    Estimates when a job will start.
$ checkjob ######     Diagnoses why a job has not started.

Replace ###### with the job ID.
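
The three commands above can be combined into one helper that takes a job ID. This is a sketch only: `diagnose_job` is a hypothetical name, and it assumes that `showq`, `showstart`, and `checkjob` are on your PATH, as they would be on a login node.

```shell
# Run the three diagnostic commands for a single job ID.
diagnose_job() {
    job_id=$1
    if [ -z "$job_id" ]; then
        echo "usage: diagnose_job JOBID" >&2
        return 2
    fi
    # Stop early if the scheduler tools are not available on this machine.
    for tool in showq showstart checkjob; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool not found; run this on a cluster login node" >&2
            return 1
        fi
    done
    # Where the job sits in the queue.
    showq | grep -- "$job_id" || echo "job $job_id not listed by showq"
    # Estimated start time.
    showstart "$job_id"
    # Reason the job has not started.
    checkjob "$job_id"
}
```

Call it as `diagnose_job ######`, using the same job ID you would pass to `checkjob`.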

 

Dependencies:

Sometimes the output of one job serves as the input of another. In that case, users can postpone the second job until the first one completes. To learn more about this process:

$ man qsub

Note: Once you are in the qsub manual, either scroll to the topic or type /dependencies to search for more information on how dependencies work.
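
As an illustration of a dependency, the sketch below submits two scripts so that the second starts only after the first finishes without error. It assumes a Torque/PBS-style qsub that prints the new job ID and accepts -W depend=afterok:JOBID. Confirm both against the man page on your cluster. The function names and the script names first.sh and second.sh are hypothetical.

```shell
# Build the value for qsub's -W option: run only after job $1 ends without error.
depend_arg() {
    printf 'depend=afterok:%s' "$1"
}

# Submit two scripts so that the second waits for the first.
submit_chain() {
    # qsub is assumed to print the ID of the job it just queued.
    first_id=$(qsub "$1") || return 1
    qsub -W "$(depend_arg "$first_id")" "$2"
}
```

Usage: `submit_chain first.sh second.sh`. The second job then stays on hold until the first job completes successfully.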

The following commands can also be used to define job hold specifications:

$ sethold
$ showhold
$ releasehold
$ jobshold