Information Technology

How to use the scavenge queue

Instructions for running single-core and multicore jobs for up to 168 hours on STIC

23-Aug-2010


Introduction

STIC is designed to run multicore jobs (9 cores or more) for up to 8 hours per job. The system prohibits jobs of 8 cores or fewer. These restrictions apply to the compute queue, which is a common, shared pool of nodes available to all users. STIC also has several sets of dedicated nodes called condos. A STIC condo is a dedicated part of STIC that is not accessible via the default compute queue. Condos are restricted to the departments and research groups that purchased the dedicated hardware for their own use. Access to a condo is available via queues restricted by access control lists (ACLs). At times, the condos are not 100% utilized, which creates an opportunity for users to use additional compute resources outside of the compute queue.

The scavenge queue was designed to use available cycles on STIC condos. This is called "compute cycle scavenging", hence the queue name scavenge. The queue uses a preemption-based policy: condo users have the highest priority and can preempt scavenge jobs when they are ready to run. Scavenge jobs run on condo nodes only when the condo users are not using them, and they are terminated when condo users submit jobs. Therefore, it is important that jobs running in the scavenge queue be checkpointable or tolerant of abrupt termination. The scavenge queue has a walltime limit of 168 hours, which allows a maximum, non-guaranteed runtime of up to a week for a typical scavenge job. The scavenge queue also allows single-core jobs, unlike the compute queue, where users are required to request at least 2 nodes.


Accessing the scavenge queue

To access the scavenge queue, please specify the scavenge queue in your PBS batch script as follows:


#PBS -q scavenge
              

If you plan on using the scavenge queue interactively, use the following options in your qsub command line:


qsub -I -q scavenge

You will want to add other options in your PBS batch script or qsub command line such as ppn, walltime, and so on.
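
As an illustration of how these options fit together, here is a minimal sketch of a single-core scavenge job script. The job name, the 24-hour walltime, and the echo line are placeholders chosen for this example, not site defaults; the walltime may be anything up to the 168-hour queue limit.

```shell
#!/bin/bash
# Sketch of a scavenge-queue batch script; all resource values are illustrative.
#PBS -N scav_example
#PBS -q scavenge
#PBS -l nodes=1:ppn=1
#PBS -l walltime=24:00:00

# PBS sets PBS_O_WORKDIR to the directory where qsub was invoked;
# fall back to the current directory when the script runs outside PBS.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

# Placeholder for the real workload.
echo "Job started on $(uname -n) in $workdir"
```

If the script were saved under a hypothetical name such as scavenge_job.pbs, it would be submitted with "qsub scavenge_job.pbs".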

Some nodes within the scavenge queue have an InfiniBand interconnect and some have only Gigabit Ethernet. If you need to run an MPI job over InfiniBand, you must request the InfiniBand feature in your resource list, similar to the following:


qsub -l nodes=2:ppn=8:ib
        

When will my jobs get preempted?

Jobs running in the scavenge queue are preempted (terminated) when a condo user submits jobs to the condo nodes. A preempted scavenge job is terminated and will not be restarted. Therefore, it is important that jobs running in the scavenge queue be checkpointable or tolerant of abrupt termination.
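
One way to tolerate abrupt termination is to record progress after every unit of work, so that a resubmitted job resumes where the previous one stopped. The following is a minimal sketch of that idea, not a STIC-provided mechanism. The checkpoint file name, the step count, and the function names are all hypothetical, and the echo line stands in for real work. It assumes you resubmit the same script yourself after a preemption.

```shell
#!/bin/bash
# Hypothetical checkpoint file and step count, for illustration only.
CHECKPOINT="${CHECKPOINT:-progress.ckpt}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

# Print the last completed step, or 0 if no checkpoint exists yet.
last_completed_step() {
    if [ -f "$CHECKPOINT" ]; then
        cat "$CHECKPOINT"
    else
        echo 0
    fi
}

# Record a completed step: write a temporary file, then rename it,
# so an abrupt kill does not leave a half-written checkpoint.
save_step() {
    echo "$1" > "$CHECKPOINT.tmp" && mv "$CHECKPOINT.tmp" "$CHECKPOINT"
}

# Run the remaining steps, starting after the last completed one.
run_steps() {
    local step
    step=$(last_completed_step)
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$((step + 1))
        echo "working on step $step"   # placeholder for real work
        save_step "$step"
    done
}

run_steps
```

Because progress is saved after each step rather than in response to a signal, the sketch does not depend on the job receiving any warning before it is terminated. At most one step of work is lost per preemption.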


How will I know if my job has been preempted?

You will receive an email when the job finishes or is preempted (terminated), provided you have enabled email notification in your PBS job script.
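
Email notification is typically enabled with the standard qsub -m and -M options. The fragment below is a sketch: the address is a placeholder, and it is an assumption that a preemption on STIC is reported as an abort (the "a" event) rather than as a normal end.

```shell
# Mail on abort (a), begin (b), and end (e) of the job.
#PBS -m abe
# Placeholder address; replace with your own.
#PBS -M your_netid@example.edu
```

These lines belong with the other #PBS directives at the top of the batch script, before the first executable command.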


How many cores are available in the scavenge queue?

A total of 1168 cores are available for scavenging; their availability depends on the current usage of the condos.


What is the maximum job size allowed in the scavenge queue?

Each user can consume up to 384 cores when the system is idle and 128 cores when the system is busy.

Division of Information Technology
MS-119, P.O. Box 1892, Rice University, Houston, Texas 77251-1892
713-348-HELP(4357)