Resource Limits and Scheduling Policies

Why Job Resources Must Be Considered:

As the Halcyon cluster is a shared system, jobs should request only the resources they actually need (e.g., walltime, processors, and RAM). This practice prevents the unnecessary delay of other jobs in the queue. Additionally, specific queues can be chosen to best meet a job's resource requirements.

Programs typically provide documentation detailing how many processors they require under various workloads. Once the appropriate number of processors has been selected, memory and walltime requests can be aligned with it. There is no exact formula for matching memory and walltime to processor counts; experience will improve accuracy over time. Ideally, all resource requests should balance one another for optimum job performance. Even if processors and RAM are well matched, requesting too much walltime can slow a job's progress through the queue (consequently affecting all jobs behind it), and requesting too little can kill the job outright.
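For illustration, here is a minimal sketch of a job script, assuming Halcyon uses a PBS/Torque-style scheduler (as the qsub commands below suggest); the resource values and the program name (myprogram) are placeholders, not recommendations:

#!/bin/bash
#PBS -l nodes=1:ppn=4         # request one node with four processors
#PBS -l walltime=02:00:00     # request two hours of walltime
#PBS -l mem=8gb               # request eight gigabytes of RAM

cd $PBS_O_WORKDIR             # start in the directory the job was submitted from
./myprogram

A script along these lines is submitted with the qsub commands described in the next section; overestimating these values delays scheduling, while underestimating walltime or memory can kill the job.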

[Diagram: each square represents one of the cluster's 40 nodes.]

Four Different Queues:

The Halcyon cluster contains four queues: submit, interactive, compute, and bigiron. Because the submit queue routes jobs into either the interactive or compute queue, there are three ways to submit a job.

1. normal jobs (compute) - this queue is for standard jobs, and it is the cluster's default queue. To submit a job, enter the following:

$ qsub script

2. interactive jobs - this queue is automatically selected when submitting an interactive job:

$ interactive

3. bigiron jobs - the bigiron queue is used primarily for memory-intensive jobs. It is restricted, and a request must be made to bioservices for access. This queue is selected by submitting the job as follows:

$ qsub -q bigiron script
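
To confirm which queue a job has landed in, the standard PBS/Torque status commands can be used (assuming they are available on Halcyon); username is a placeholder for your own login:

$ qstat -q
$ qstat -u username

The first lists every queue along with its limits, and the second lists your own jobs and the queues they were assigned to.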

 

The interactive-compute queue and the bigiron queue run on discrete partitions within the overall cluster: jobs on one partition do not consume the compute resources of the other. Usage limits, however, are shared asymmetrically between the two queues:

  • Utilizing the bigiron queue does affect usage limits on the interactive-compute queue.
  • Utilizing the interactive-compute queue does not affect usage limits on the bigiron queue.

Scheduling Policies:

There are a number of scheduling policies designed to ensure overall fair use of the cluster. These policies are based on a hierarchy that determines job priority in relation to available resources, and most of them are built into the queue scheduler automatically. A host of factors influence the order of the queue; several of them are covered below.

How Priority is Determined:

There is no exact formula for determining job priority: it may be influenced by a variety of factors, ranging from rush jobs run at a client's request to diagnostic maintenance performed by the systems administrator. In general, priority is determined by two factors:

  • The order in which jobs are submitted to the queue.
  • Fairness factors based on an individual's or group's history of usage over a 24-hour period (which is divided into four 6-hour increments with an 80% rate of decay per increment); see the worked example below.
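
For example, assuming the decay means that each older increment is weighted at 80% of the one before it, one processor-hour consumed in the most recent 6-hour increment counts at full weight, while the same usage from three increments earlier counts at only 0.8 × 0.8 × 0.8 ≈ 51% of its face value. Recent usage therefore weighs more heavily against a user's priority than older usage does.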

Preemption:

With preemption, high-priority jobs can suspend low-priority ones. Preemption pertains strictly to the bigiron queue, and it is rarely implemented.

 

Backfilling:

Backfilling is a means of maximizing the cluster's utilization by fitting lower-priority jobs into the gaps between higher-priority jobs. Because the scheduler orders the queue on the basis of priority, this process invariably leaves unused resources between jobs. If a lower-priority job's resource requests allow it to run in that leftover space without delaying any higher-priority job, the scheduler will start it ahead of its turn.
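
For example, if the highest-priority job in the queue needs all 40 nodes and must wait two hours for the last of them to become free, a lower-priority job requesting only a few nodes for one hour can be started immediately without delaying it. This is one reason accurate walltime requests matter: a job that declares a realistically short walltime is far more likely to be backfilled.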