Information Technology
Why Are My Jobs Not Running?
PBS Scheduling on Ada and STIC

06-Oct-2009


Introduction

Ada and STIC use Torque (PBS) to launch jobs and Moab to schedule jobs.  The scheduler is configured with a "fairshare" policy rather than a FIFO (First-In First-Out) policy.  A FIFO policy could result in one or a few users dominating the job queue indefinitely.  A fairshare policy means that the job priority of an individual user will drop as their recent usage (jobs run and CPU time used) increases.  This can result in old jobs remaining in the queue waiting to run while newer jobs run immediately.  The remainder of this document describes this effect and how you can determine whether it is happening to your jobs.


Factors that Influence Job Priority

There are three primary factors that determine a job's priority in the queue:
  • Resource reservations - queue resources have been reserved in advance.  This is usually for time-specific projects or demonstrations.
  • Resources Requested - Moab assigns higher priorities to jobs that request large numbers of processors as compared to small numbers of processors.  The more processors requested, the higher the priority will be.
  • Fairshare - the primary factor for most jobs.  The more jobs a user runs (CPU time used), the lower their priority will be on future jobs.  This policy is based on up to 7 days of historical usage.  This will allow new jobs to run ahead of jobs that are already in the queue waiting to run.


Why is My Job Stuck in the Queue?

There are situations that may occur that can cause your job to appear stuck in the queue unable to run:

Incorrect Resource Request

  • One reason a job might be stuck in the queue is a request for an invalid set of resources.  The most common error is to request too many processors per node (ppn).  A request for more than four (Ada) or eight (STIC) processors per node cannot be satisfied, so PBS will never be able to schedule the job even though it is accepted into the queue.
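
As a quick check before submitting, a resource string can be tested against the per-node limits quoted above. The following is a minimal Python sketch and is not part of Torque or Moab. The function name, the lookup table, and the assumption that a request without ppn asks for one processor per node are illustrative only.

```python
import re

# Per-node processor limits quoted in this document.
MAX_PPN = {"ada": 4, "stic": 8}


def ppn_is_valid(resource_request, cluster):
    """Return True if the ppn value in a request such as
    'nodes=8:ppn=4' fits on a single node of the named cluster."""
    match = re.search(r"ppn=(\d+)", resource_request)
    # Assumption: a request with no ppn asks for one processor per node.
    ppn = int(match.group(1)) if match else 1
    return 1 <= ppn <= MAX_PPN[cluster.lower()]


print(ppn_is_valid("nodes=8:ppn=4", "Ada"))   # fits on Ada
print(ppn_is_valid("nodes=2:ppn=8", "Ada"))   # more than Ada's 4 per node
print(ppn_is_valid("nodes=2:ppn=8", "STIC"))  # fits on STIC
```

A request that fails this check would still be accepted into the queue, which is why it is worth catching before submission.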

Fairshare Policy

  • You submit a job, but jobs submitted after yours run first.  The most likely reason is that your priority is lower than that of other users because of the 7-day historical usage tracked by our fairshare policy.  To find out when a job is predicted to run, use the showstart <jobID> command, where <jobID> is the job ID number for your job.  To see how your Fairshare score is calculated, please see our FAQ.

Backfill Policy

  • This is a scheduling optimization which allows Moab to make better use of available resources by running jobs out of order. Using job data such as walltime and resources requested, the scheduler can start other, lower-priority jobs so long as they do not delay the highest priority jobs.  Because of the way it works, essentially filling in holes in node space, backfill tends to favor smaller and shorter running jobs more than larger and longer running ones.
NOTE:  It is important to specify an accurate walltime for your job in your PBS submission script.  Selecting the default walltime for jobs that are known to run for less time may result in the job being delayed by the scheduler due to an overestimation of the time the job needs to run.
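
The effect of an overestimated walltime can be illustrated with a toy model of the backfill decision. This sketch is a deliberate simplification and does not reproduce Moab's actual algorithm. The function name and all of the numbers are hypothetical.

```python
def can_backfill(job_procs, job_walltime_min, idle_procs, window_min):
    """Toy backfill test: the job must fit on the idle processors and
    finish (by its requested walltime) before the window closes."""
    return job_procs <= idle_procs and job_walltime_min <= window_min


# Hypothetical situation: 4 processors are idle for the next 120 minutes,
# after which they are needed by a higher-priority job.
idle_procs, window_min = 4, 120

# The job really needs about 60 minutes.
print(can_backfill(2, 60, idle_procs, window_min))    # accurate walltime
print(can_backfill(2, 480, idle_procs, window_min))   # overestimated walltime
```

In this model the same 2-processor job is eligible for backfill when it requests 60 minutes and is passed over when it requests 480 minutes, because the scheduler can only go by the requested walltime.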

Backfill Chunking

  • You submit a job requiring a small number of processors and there are enough idle processors available, but your job remains queued.  This likely means that the system is currently waiting for enough idle processors to become available so that higher priority jobs can run.  In this case, the system will allow running jobs to finish but will not start any new jobs until the higher priority jobs have enough resources to run.  In effect, the system is reserving node space in advance for a job that is waiting in the queue.

Node Fragmentation/Resources Requested

  • You submit a multiprocessor parallel job that should have a high priority, but the job remains queued.  Take as an example a job requesting 8 nodes with 4 processors per node (32 processors in total): the scheduler will not start the job until all processors on all eight nodes are idle at the same time.  On a busy system with hundreds of jobs submitted at random times, it is unlikely that running jobs will finish together and leave 8 nodes with 32 processors idle at once.  Essentially, the node space has become fragmented.  Your job will remain queued, and other jobs in the queue will backfill onto individual idle processors as they become available before your job is projected to start.  However, this situation is unlikely to occur, because the Moab scheduler assigns higher priority to jobs requesting large numbers of CPUs.  You may only have to wait until existing running jobs exit before your multiprocessor job runs, because it is given a high priority automatically and the system will delay lower priority jobs so that high priority jobs are able to run.  If you do not require all of the processors on each node, you might want to request 32 nodes with 1 processor per node instead.  This requires only one idle processor on each of any 32 nodes, so this type of job might be scheduled sooner than an 8-node, 4-processor-per-node job.  Keep in mind, however, that a job using only a single processor per node shares each node with other users, which might affect performance if your job is memory intensive.
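
The difference between the two request shapes can be seen with a small placement check. This is a sketch under simplified assumptions: it only asks whether the request fits the idle processors at one instant, and the cluster snapshot and function name are hypothetical.

```python
def can_place(idle_per_node, nodes, ppn):
    """Return True if at least `nodes` nodes each have `ppn` or more
    idle processors right now."""
    usable = sum(1 for idle in idle_per_node if idle >= ppn)
    return usable >= nodes


# Hypothetical snapshot of a busy 40-node cluster with 4 processors per node:
# idle processors are scattered, and only 4 nodes are completely idle.
idle_per_node = [1] * 20 + [2] * 10 + [3] * 6 + [4] * 4

print(can_place(idle_per_node, nodes=8, ppn=4))    # 8 fully idle nodes needed
print(can_place(idle_per_node, nodes=32, ppn=1))   # any 32 nodes with 1 idle
```

In this snapshot 74 processors are idle in total, yet the 8-node, 4-processor-per-node request cannot be placed, while the 32-node, 1-processor-per-node request can.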

When Will My Job Run?
Using the showstart command

To determine when your job is projected to start, use the showstart <jobID> command, where <jobID> is the ID number of your job. This command reports the projected start time of your job based on your priority and position in the queue relative to all other jobs. The projected start time is just a point-in-time estimate and will change as jobs enter and leave the queue.

In some cases showstart will indicate that your job should start in 00:00:00, which means it should start immediately, but the job remains waiting in the queue anyway. In this case, your job is most likely being blocked by Backfill Chunking as described above. Unfortunately the current version of our scheduler will not report this data to you. This will be addressed in a future release.


How Do I See My Job Priority?
Using the showq and diagnose commands

There are two ways to determine your job priority. One way is to use the showq -i command. This command will list all jobs waiting to run and will display them in order of priority with the highest priority first.

showq -i 
eligible jobs----------------------
JOBID                 PRIORITY  XFACTOR  Q  USERNAME    GROUP  NODES     WCLIMIT     CLASS      SYSTEMQUEUETIME
50451*                   47612     10.9  -       user  group     48     4:00:00   compute  Mon Aug 20 19:36:44
50452*                   47592     10.9  -       user  group     48     4:00:00   compute  Mon Aug 20 19:36:47
50456                    47592     10.9  -       user  group     48     4:00:00   compute  Mon Aug 20 19:37:45
50453                    47572     10.9  -       user  group     48     4:00:00   compute  Mon Aug 20 19:36:50
50458                    47572     10.9  -       user  group     48     4:00:00   compute  Mon Aug 20 19:37:51
50454                    47552     10.9  -       user  group     48     4:00:00   compute  Mon Aug 20 19:36:53
50459                    47552     10.9  -       user  group     48     4:00:00   compute  Mon Aug 20 19:37:54
        

In the sample output above, jobs 50451 and 50452 have asterisks by their job ID numbers. The asterisks indicate that the scheduler is currently reserving node space for these jobs and will continue to do so until they run, so they are the next jobs to run. The only exception is a higher priority job submitted in the interim, which will run first. If the jobs with asterisks by them are not the top priority jobs, then the scheduler is trying to backfill lower priority jobs.
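
If you want to pick out the reserved jobs from a long listing, the job lines can be split on whitespace. The following Python sketch works on lines copied from the sample above. It assumes the column layout shown there, with the job ID first and the priority second, and the function name is illustrative.

```python
SAMPLE = """\
50451*                   47612     10.9  -       user  group     48     4:00:00   compute  Mon Aug 20 19:36:44
50452*                   47592     10.9  -       user  group     48     4:00:00   compute  Mon Aug 20 19:36:47
50456                    47592     10.9  -       user  group     48     4:00:00   compute  Mon Aug 20 19:37:45
"""


def parse_showq_line(line):
    """Split one job line into (job ID, priority, has reservation)."""
    fields = line.split()
    job_id = fields[0].rstrip("*")
    reserved = fields[0].endswith("*")
    return job_id, int(fields[1]), reserved


jobs = [parse_showq_line(line) for line in SAMPLE.splitlines() if line.strip()]
reserved_ids = [job_id for job_id, _, reserved in jobs if reserved]
print(reserved_ids)
```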

The second way to determine job priority is to use the diagnose -p command. This command will list all jobs on the system in order of priority with the highest priority first. It will also show how the priority is calculated. Here is an example of the output:

diagnose -p | more
       diagnosing job priority information (partition: ALL)
Job                PRIORITY*   Cred( User:Group:Accnt:  QOS:Class)    FS( User)  Serv(QTime:Bypas) Res( Proc)
         Weights   --------       1(    1:    1:    1:    1:    1)   100(   12)     2(    1:   10)  100(    5)
507922                67684     0.0(  0.0:  0.0:  0.0:  0.0:  0.0)  44.0( 24.8)  11.7(2989.: 96.0)  44.3(200.0)
497961                61536     0.0(  0.0:  0.0:  0.0:  0.0:  0.0)  35.0( 17.9)  16.3(4039.: 96.0)  48.8( 60.0)
497962                61536     0.0(  0.0:  0.0:  0.0:  0.0:  0.0)  35.0( 17.9)  16.3(4039.: 96.0)  48.8( 60.0)
497963                61536     0.0(  0.0:  0.0:  0.0:  0.0:  0.0)  35.0( 17.9)  16.3(4039.: 96.0)  48.8( 60.0)
497964                61536     0.0(  0.0:  0.0:  0.0:  0.0:  0.0)  35.0( 17.9)  16.3(4039.: 96.0)  48.8( 60.0)
497965                61536     0.0(  0.0:  0.0:  0.0:  0.0:  0.0)  35.0( 17.9)  16.3(4039.: 96.0)  48.8( 60.0)
497966                61536     0.0(  0.0:  0.0:  0.0:  0.0:  0.0)  35.0( 17.9)  16.3(4039.: 96.0)  48.8( 60.0)
497967                61536     0.0(  0.0:  0.0:  0.0:  0.0:  0.0)  35.0( 17.9)  16.3(4039.: 96.0)  48.8( 60.0)
497968                61536     0.0(  0.0:  0.0:  0.0:  0.0:  0.0)  35.0( 17.9)  16.3(4039.: 96.0)  48.8( 60.0)
515236                61272     0.0(  0.0:  0.0:  0.0:  0.0:  0.0)  40.2( 20.5)  10.8(2358.: 96.0)  49.0(200.0)
      
        

Each column in the output for each job represents a component of the priority value. The component percentages on each line total 100% (allowing for rounding). The most important columns above are FS (Fairshare), Serv (Service), and Res (Resources Requested), although values might appear in all columns.

For example, job 507922 above has 44% of its priority determined by the Fairshare policy as described earlier in this document. The lower the historical usage of the owner of the job, the higher the Fairshare value will be. It has 11.7% determined by its Service time, which combines its wait time in the queue (2989 minutes) and the number of times it has been bypassed by other jobs (96 times). The longer this job waits for execution, the higher this percentage will become. This job also has 44.3% of its priority determined by the Resources (number of CPUs) that it has requested (200, as listed in parentheses).
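
The component percentages for a job can be pulled out of a diagnose -p line with a short script. This Python sketch uses the line for job 507922 from the sample above. It assumes the four-component layout shown there (Cred, FS, Serv, Res), and the function name is illustrative.

```python
import re

LINE = "507922                67684     0.0(  0.0:  0.0:  0.0:  0.0:  0.0)  44.0( 24.8)  11.7(2989.: 96.0)  44.3(200.0)"


def component_shares(line):
    """Return (job ID, priority, shares), where shares maps each component
    name to the percentage printed just before its opening parenthesis."""
    fields = line.split()
    percents = re.findall(r"(\d+\.\d+)\(", line)
    names = ["Cred", "FS", "Serv", "Res"]
    shares = dict(zip(names, (float(p) for p in percents)))
    return fields[0], int(fields[1]), shares


job_id, priority, shares = component_shares(LINE)
print(job_id, priority, shares)
print(round(sum(shares.values()), 1))
```

The last line prints the total of the four shares, which should be close to 100 for any job line.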

The scheduler does not treat each component of the priority equally, however. The current queue policy favors larger jobs, so the Resource component is weighted more heavily than the other components. The system then calculates the job priority by taking the values of all of the priority components into consideration. These priority values are dynamic and will change over time as jobs enter and exit the queue.

To see this, note that job 515236 above has a 40.2% FS component and a 49% Res component but is behind several jobs that have lower FS and Res components (jobs 497961 through 497968). This is because those jobs have a higher Serv component (16.3%) because they have been waiting longer. So the system is favoring those jobs slightly due to their wait time in the queue even though they have only requested 60 CPUs while job 515236 has requested 200. In this case the size of the job was not enough to grant it a higher priority because it had not been waiting in the queue very long relative to the other jobs.
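
Because each percentage is a share of that job's own total, the comparison is easier to follow when the shares are converted back into approximate priority points. The sketch below uses only numbers from the sample output. The function name is illustrative, and the results are approximate because the printed percentages are rounded.

```python
def component_points(priority, shares):
    """Convert percentage shares of a job's priority into approximate points."""
    return {name: priority * pct / 100.0 for name, pct in shares.items()}


# Values copied from the sample diagnose -p output above.
waiting_longer = component_points(61536, {"FS": 35.0, "Serv": 16.3, "Res": 48.8})  # job 497961
newer_job = component_points(61272, {"FS": 40.2, "Serv": 10.8, "Res": 49.0})       # job 515236

for name in ("FS", "Serv", "Res"):
    print(name, round(waiting_longer[name]), round(newer_job[name]))
```

In this sample the Serv advantage of the job that has waited longer is larger than the FS advantage of the newer job, while the Res points come out nearly equal, which is consistent with the ordering shown.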

Also note that a job with a very low, perhaps even zero, FS value is an indication that this user has had a very high historical usage over the last 7 days. This user's priority will be lower based on this fact. The higher the utilization for a user, the lower the FS value will be in this output. The reverse is also true. A high FS value means that this user has had low usage over the last 7 days.


How is my Fairshare Score Determined?

The Fairshare score (FS) is determined by the historical utilization of a user over a 7 day window. To see the Fairshare score, run the diagnose -f command:

diagnose -f | more
       
Depth: 7 intervals   Interval Length: 1:00:00:00   Decay Rate: 1.00
FS Policy: DEDICATEDPS 
         System FS Settings:  Target Usage: 0.00    Flags: 0
FSInterval        %     Target       0       1       2       3       4       5       6
FSWeight       ------- -------  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000
TotalUsage      100.00 -------  9344.2 12212.2 12949.5 12733.6 12645.3 12186.2 12100.5
USER
-------------
user1*            3.14  25.00+    1.11    4.17    2.96    2.73    2.18    5.34    3.11
user2*           14.52  25.00+   12.18   16.83   56.42   13.55 ------- ------- -------

The above sample output (truncated) shows the Fairshare score for two users. The first user, user1, has a score of 3.14. In this context this means that this user's utilization has been low (about 3%) over the last 7 days. The utilization for the last 7 days (columns 0 through 6) reflects low utilization on each day. In comparison, user2 has a score of 14.52, which reflects high utilization (about 14%), as shown on days 0 through 3. The user had no utilization on days 4 through 6.
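
One reading that reproduces both scores in the sample is a usage-weighted average, where each day's percentage is weighted by that day's TotalUsage and FSWeight. This is inferred from the sample numbers rather than taken from the Moab documentation, so treat the Python sketch below as an illustration. The function and variable names are hypothetical.

```python
# Columns 0 through 6 of the sample diagnose -f output above.
TOTAL_USAGE = [9344.2, 12212.2, 12949.5, 12733.6, 12645.3, 12186.2, 12100.5]
USER1 = [1.11, 4.17, 2.96, 2.73, 2.18, 5.34, 3.11]
USER2 = [12.18, 16.83, 56.42, 13.55, 0.0, 0.0, 0.0]  # "-------" treated as 0


def fairshare_percent(user_pct, total_usage, fs_weights=None):
    """Usage-weighted average of a user's per-interval percentages."""
    if fs_weights is None:
        fs_weights = [1.0] * len(total_usage)  # FSWeight row in the sample
    weighted = [w * t for w, t in zip(fs_weights, total_usage)]
    used = sum(p * w for p, w in zip(user_pct, weighted))
    return used / sum(weighted)


print(round(fairshare_percent(USER1, TOTAL_USAGE), 2))  # 3.14 in the sample
print(round(fairshare_percent(USER2, TOTAL_USAGE), 2))  # 14.52 in the sample
```

With a decay rate of 1.00 every day in the window counts equally, so a single heavy day such as day 2 for user2 keeps the score elevated until it ages out of the window.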

NOTE: The Fairshare value displayed here has an inverse relationship to the FS number shown with diagnose -p. The value shown here is a utilization percentage, while diagnose -p shows a priority component. The lower the utilization percentage, the higher the FS priority component.

NOTE: It is possible for the Fairshare percentage shown here to continue to rise even as your daily utilization drops. In the case of user2 above, the utilization value for days 4 through 6 is zero. Over time these days will age out of the 7-day window and be replaced by more recent days. If those more recent days have even low utilization, the average replaces zero utilization with low utilization, so the Fairshare percentage goes up while your daily utilization is dropping.


Why Are My Jobs Scheduled Out of Order?

It is possible for you to submit many jobs and have the newer jobs scheduled to run ahead of your older jobs that are already in the queue. This is based on the starting priority assigned to your job when it is submitted. Your starting priority is based on all of the scheduling factors listed above in this document. These scheduling factors can change minute by minute depending on cluster activity, especially due to the Fairshare policy. Your Fairshare priority changes over time as your historical usage increases or decreases. If your Fairshare priority is increasing while you are submitting jobs, then jobs submitted last are likely to have higher priority than jobs submitted first. To see your starting priority, run the checkjob <jobID> command where <jobID> is the job you are interested in. If the order of execution for your jobs is important, please see our FAQ.


Getting Help

If you need help understanding why your job has a particular position in the queue, please submit a request to the Help Desk. It is very important that you include the output of showq -i and diagnose -p. Because job priorities change dynamically, it is likely that the priorities will have changed significantly by the time we see your request. The output of these commands gives us a snapshot of the job priorities at the time you requested help.

 

Division of Information Technology
MS-119, P.O. Box 1892, Rice University, Houston, Texas 77251-1892
713-348-HELP(4357)