Information Technology

Introduction to SUG@R - Shared University Grid at Rice

25-Aug-2009



    Introduction

    SUG@R is Rice's Intel Xeon compute cluster. It contains 134 SunFire x4150 nodes from Sun Microsystems. Each node has two quad-core Intel Xeon processors running at 2.83 GHz, yielding a system-wide total of 1072 processor cores. At most 102 nodes (816 processor cores) are available to all users; this number is subject to change due to special projects, maintenance tasks, and so on. Each processor can access up to 16 GB of RAM. All nodes use a Gigabit Ethernet interconnect. The system also has several filesystems: a 9 TB Panasas filesystem that provides fast I/O for running user applications, a 1 TB filesystem for user home directories, a 2 TB filesystem for group-based allocations, and 250 GB for software (/opt/apps). A complete system overview is online.

    SUG@R is running Red Hat Enterprise 5 Linux and the 2.6.18 kernel.

    Most installed software is in /opt/apps. See the module command for information on how to use these applications. If you need any software that is not present, please let us know.

    The SUG@R system is designed to support jobs that do not need a fast network interconnect. All jobs that require a fast network interconnect (MPI jobs that span more than one node) must be run on Ada or STIC. Therefore, only jobs of 8 processors (one node) or fewer should be submitted to this system. Recommended parallel job types within a node are MPI (OpenMPI), SMP (OpenMP and compiler-assisted autoparallelization) and threading (pthreads, Java threads). Jobs that exceed one node degrade performance for everyone and are subject to termination without notice at the discretion of the systems administrators.

    For information on the unix shell configuration program called module, PBS, compilers, OpenMPI, and contact information, see the remainder of this document.

    A final note: Be careful about changing your Unix shell's configuration (.profile, .cshrc, .bashrc, etc.) until you get things working. The system and the necessary shell environment are a little different from Ada and RTC, so use caution when trying to duplicate your environment from one of these clusters.


    Logging in to SUG@R

    SUG@R can be accessed with SSH from any machine on the Rice campus. If you need off-campus access, you will have to install VPN on your computer and then log in to SUG@R via SSH. For more information regarding off-campus access, please visit our Off-Campus Access FAQ.

    To log in to SUG@R from a Linux or Unix machine, type:


    ssh -Y (your_login_name)@sugar.rice.edu

    To transfer files into SUG@R from a Linux or Unix machine, use scp:

     
    scp some_file.dat *.incl *.txt (your_login_name)@sugar.rice.edu:

    For more information about using SSH, please see our SSH FAQ.

    Login Nodes

    Once you are logged in to SUG@R, you are logged into one of two login nodes as shown in the diagram below. These nodes are intended for users to compile software, prepare data files, and submit jobs to the job queue. They are not intended for running compute jobs. Please run all compute jobs in one of the job queues described later in this document.


    Diagram courtesy of Chris Hunter, Rice University.


    Filesystems, Quotas, and Job Output

    SUG@R currently enforces disk quotas for all users. There is a quota for home directories (accessed via $HOME) and for project directories (accessed via $PROJECTS). There are no quotas on $SHARED_SCRATCH. However, this filesystem is for applications that need fast I/O and is not for permanent storage. Any files on $SHARED_SCRATCH that have not been modified for more than one month will be deleted automatically! Permanent storage is in $HOME and $PROJECTS only. The filesystems available to all users are summarized in the following table:

    Filesystem                           Environment variable  Physical path    Size    Quota
    Home directories                     $HOME                 /users           1 TB    4 GB
    Group/Project directories            $PROJECTS             /projects        2 TB    50 GB
    Shared Scratch high performance I/O  $SHARED_SCRATCH       /shared.scratch  9 TB    None
    Local Scratch on each node           $LOCAL_SCRATCH        /scratch         100 GB  None

    NOTE: The $HOME filesystem is scheduled for an upgrade later this year which will result in more disk space and larger quotas.

    NOTE: $HOME and $PROJECTS cannot be used for job I/O. Jobs found to be using $HOME and $PROJECTS for job I/O are subject to termination without notice. Please see our FAQ for more details on job I/O.

    NOTE: The physical paths listed in the table above are subject to change. You should always access the filesystems using environment variables. For example, to access /shared.scratch/dirname, use this command:

     
    cd $SHARED_SCRATCH/dirname
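
    As a minimal sketch of the advice above, a job script can build its working directory from the environment variable instead of the physical path. The helper name job_workdir and the per-user, per-job directory layout are assumptions made for illustration, not a site convention. $PBS_JOBID is set by PBS inside a running job.

```shell
#!/bin/bash
# Hypothetical helper: print a per-job working directory under shared scratch.
# Assumes $SHARED_SCRATCH is set by the system, as described above.
job_workdir() {
    # Fail clearly if the variable is missing instead of building a path from "/".
    if [ -z "${SHARED_SCRATCH:-}" ]; then
        echo "SHARED_SCRATCH is not set" >&2
        return 1
    fi
    # $PBS_JOBID exists only inside a job; fall back to the shell PID otherwise.
    echo "${SHARED_SCRATCH}/${USER:-unknown}/${PBS_JOBID:-manual.$$}"
}
```

    Inside a job script, the helper could be used like this:

    workdir=$(job_workdir) && mkdir -p "$workdir" && cd "$workdir"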

    To see your current quota and your disk usage for your home directory, run this command:

     
    quota -s


    To see the quota and usage for the projects directories for all groups that you belong to, run this command:

     
    quota -s -g

    For information on how to use $PROJECTS, please see our FAQ.



    Customizing Your Environment with the module Command

    Each user can customize their environment using the module command. This command lets you select software and sets the appropriate paths and library locations for it. All the requested user applications are located under the /opt/apps directory.

    To list what applications are available, type:

     
    sugaruser@sugarhost:~> module avail
    ----------------------------------- /opt/apps/modules/Modules/versions -------------------------------------

    ----------------------------------- /opt/apps/modules/modulefiles ------------------------------------------
    R/2.3.1 fftw/3.1.2 intel/9.0 mpich/1.2.7-gcc3 mpich-gm/1.2.7-intel9
    afni/2006_06_30_1332 gaussian/g03-d1 jdk/1.4.2_12 mpich/1.2.7-intel9 namd/2.6b2
    amber/9.0 gromacs/3.3.1 mkl/8.0.1 mpich-gm/1.2.7-gcc3 octave/2.1.73

    sugaruser@sugarhost:~>


    To load the module for the Intel compiler, use:
     
    module load intel
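
    When several modules are needed, the same steps can be wrapped in a small shell function. This is only a sketch: the function name load_modules is hypothetical, and it relies on the standard module subcommands purge, load and list. Note that purge unloads every module that is currently loaded, which may not be what you want if the site loads defaults for you.

```shell
#!/bin/bash
# Hypothetical helper: start from a clean module environment, load each
# requested module in order, then show what is loaded.
load_modules() {
    local name
    module purge                 # unload everything that was loaded earlier
    for name in "$@"; do
        module load "$name"      # e.g. "intel"
    done
    module list                  # report the resulting set of modules
}
```

    For example:

    load_modules intel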


    For assistance with module, type man module

    For more information on using the module command in PBS batch scripts, please see our FAQ.


    Job Scheduling

    The batch job scheduling system implemented on SUG@R uses the Torque package and the Maui package.  Torque is responsible for resource management, while Maui is responsible for job scheduling and monitoring.

    Fairshare Scheduling Policy

    We implement the Maui fairshare feature to provide a fair utilization of the available resources. This is accomplished by allowing historical resource utilization information to be incorporated into job feasibility and priority decisions. Fairshare is normally the most significant component of a job's priority, which ultimately defines the position of the job in the queue. We do not use a FIFO (First-In-First-Out) scheduler on SUG@R.

    Backfill Scheduling Policy

    This is a scheduling optimization which allows Maui to make better use of available resources by running jobs out of order. Using job data such as walltime and resources requested, the scheduler can start other, lower-priority jobs so long as they do not delay the highest priority jobs.  Because of the way it works, essentially filling in holes in node space, backfill tends to favor smaller and shorter running jobs more than larger and longer running ones.

    NOTE:  It is important to specify an accurate walltime for your job in your PBS submission script.  Selecting the default walltime for jobs that are known to run for less time may result in the job being delayed by the scheduler due to an overestimation of the time the job needs to run.
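
    One way to arrive at a walltime request is to start from a measured or estimated run time and add a safety margin. The sketch below does that arithmetic and prints the HH:MM:SS form used by the walltime option. The function name walltime_request and the default margin of 20 percent are assumptions for illustration, not a site recommendation.

```shell
#!/bin/bash
# Hypothetical helper: convert an estimated run time in seconds into
# HH:MM:SS, after adding a safety margin given as a percentage.
# Usage: walltime_request SECONDS [MARGIN_PERCENT]
walltime_request() {
    local seconds=$1
    local margin_pct=${2:-20}    # assumed default margin of 20 percent
    local total=$(( seconds + seconds * margin_pct / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}
```

    For a program that is expected to run for 50 minutes (3000 seconds), walltime_request 3000 prints 01:00:00, which can be used as the walltime value in the job script.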

    Automatic Queue Routing

    SUG@R is configured for automatic queue routing. You do not need to specify which queue you want to use. The system will assign your job to the commons queue by default. The exceptions to this rule are the interactive and debugging queues. You must request these queues with the -q argument to qsub.

    Available Queues and System Load

    We currently provide the following queues for general use:


    Queue        Max nodes  Max CPUs  Max CPUs in use by a single       Max CPUs in  Minimum   Maximum
    Name         per job    per job   user at any given time            this queue   Walltime  Walltime
    commons      1          8         32 (normal load),                 768          00:00:00  24:00:00
                                      128 (light load)
    interactive  1          8         8 per interactive session         32           00:00:00  00:30:00

    Commons is a standard priority queue that can allocate the maximum number of CPUs per job and currently has a maximum job walltime of 24 hours. The total number of CPUs in this queue is subject to change at any time due to special projects and system maintenance tasks. This system is designed for small, single node jobs. Therefore, jobs requiring more than 8 CPU cores or more than one node are discouraged. These jobs should be run on Ada or STIC.

    Interactive is a higher priority queue with the purpose of serving interactive jobs.  The maximum number of CPUs that can be accessed through this queue is 32 with a maximum job walltime of 30 minutes.  This queue is available 8AM to 10PM each day. See our FAQ for more details.

    NOTE: The maximum number of cores (processors) allowed to be running at one time for any user is 32 under normal load regardless of how many jobs are in the queue or how many cores per job requested. This number will be increased to 128 automatically under light system load. The maximum number of cores (processors) that may be requested in any one job is 8 and they must be within the same node (no MPI traffic between nodes).
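
    The limits above translate directly into how many jobs of a given size one user can have running at once. The following sketch works through that arithmetic using the numbers from this section. The function and variable names are hypothetical, and the sketch does not account for exclusive-access jobs or for queue-wide limits.

```shell
#!/bin/bash
# Limits taken from the queue description above.
MAX_CORES_PER_JOB=8
USER_CORE_LIMIT_NORMAL=32
USER_CORE_LIMIT_LIGHT=128

# Print how many jobs of a given size one user can run at the same time.
# Usage: concurrent_jobs CORES_PER_JOB [USER_CORE_LIMIT]
concurrent_jobs() {
    local ppn=$1
    local limit=${2:-$USER_CORE_LIMIT_NORMAL}
    # Reject requests that do not fit within a single node.
    if [ "$ppn" -lt 1 ] || [ "$ppn" -gt "$MAX_CORES_PER_JOB" ]; then
        echo "cores per job must be between 1 and $MAX_CORES_PER_JOB" >&2
        return 1
    fi
    echo $(( limit / ppn ))
}
```

    Under normal load, concurrent_jobs 8 prints 4. Under light load, concurrent_jobs 8 "$USER_CORE_LIMIT_LIGHT" prints 16.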

    NOTE: Do not run CPU intensive processes on SUG@R's login nodes. Use one of the queues listed above. Any CPU intensive process running on the login nodes is subject to termination without notice.

    There may be other queues present on the system.  These are normally dedicated to special projects/allocations.

    A good way to obtain the status of all queues and their current usage is to run the following PBS command:


     
    sugaruser@sugarhost:~> qstat -q

    server: sugarhost

    Queue            Memory CPU Time Walltime Node   Run   Que   Lm State
    ---------------- ------ -------- -------- ---- ----- ----- ---- -----
    commons              --       -- 24:00:00   --     0     0   --   E S
                                                   ----- -----
                                                       0     0


    Here is a brief description of the relevant fields:

    Walltime:  Maximum walltime a job can request
    Run:  Number of jobs in running state
    Que:  Number of jobs in queued state
    State:  "E" means the queue is enabled; "R" means it is running (started) and "S" means it is stopped
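
    If only the fields described above are of interest, the qstat output can be filtered with awk. This is a sketch that assumes the column layout shown in the sample output; other Torque versions may print different columns. The function name summarize_queues is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "qstat -q" output on standard input and print one
# line per queue with its name, walltime limit, running jobs and queued jobs.
# Header, separator and total lines are skipped because their fourth field is
# not a walltime and their sixth field is not a number.
summarize_queues() {
    awk 'NF >= 9 && $4 ~ /^([0-9]+:[0-9][0-9]:[0-9][0-9]|--)$/ && $6 ~ /^[0-9]+$/ {
        printf "%s walltime=%s running=%s queued=%s\n", $1, $4, $6, $7
    }'
}
```

    It would be used as a filter:

    qstat -q | summarize_queues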

    Determining Why a Job is not Running

    There may be several reasons why a job is not running and appears to be stuck in the queue.  Please see our PBS Job Scheduling FAQ for more information.


    Batch Processing with PBS (Submitting Jobs)

    Once you have an executable program and are ready to run it on the compute nodes, you must create a job script that does the following:

    • Requests the resources that will be needed (e.g. number of processors, wall-clock time), and
    • Uses commands to prepare for execution of the executable (e.g. cd to the working directory, source shell environment files).

    After the job script has been constructed you must submit it to the job scheduler for execution. The remainder of this section will describe the anatomy of a PBS script and how to submit and monitor jobs.

    PBS Batch Script Options

    All jobs must be submitted via a PBS batch script or by invoking qsub at the command line. See the table below for PBS submission options.

    PBS Submission Options

    Option
        Description

    #PBS -N jobname
        Assigns a job name. The default is the name of the PBS job script.

    #PBS -l nodes=1:ppn=2
        The number of nodes and processors per node.

    #PBS -l nodes=1:ppn=1
    #PBS -W x=NACCESSPOLICY:SINGLEJOB
        Using both of these options gives your job exclusive access to a node, so that no other jobs can share the node. This combination of arguments assigns one processor to your job and gives it exclusive access to all of the resources (e.g. memory) of the entire node without interference from other jobs. Please see our FAQ for more details on exclusive access.

    #PBS -l walltime=01:00:00
        The maximum wall-clock time needed for this job to run.

    #PBS -l pmem=2000m
        The maximum amount of physical memory used by any single process of the job (in megabytes). See our FAQ for more details.

    #PBS -q queuename
        Specify the name of the queue to use.

    #PBS -o mypath
        The full path for the standard output (stdout) .OU files.

    #PBS -e mypath
        The full path for the standard error (stderr) .ER files.

    #PBS -j oe
        Join option that merges the standard error stream with the standard output stream of the job.

    #PBS -V
        Exports all environment variables to the job.

    #PBS -M username@rice.edu
        Email address for job status messages.

    #PBS -m bae
        PBS will notify the user via email when the job begins, aborts or terminates.

    #PBS -m n
        Turn off all email from the job.

    Job Launchers (mpiexec)

    The job launcher's purpose is to spawn copies of your executable across the resources allocated to your job. We currently recommend and support mpiexec for this task. It is a cleaner, safer and faster alternative to mpirun. By default, mpiexec only needs your executable; the rest of the information is extracted from PBS.

    The following is an example of how to use mpiexec inside your PBS batch script. This example will run myprogram.exe as a parallel OpenMPI code on all of the processors requested by this example and allocated by PBS:


    
    #PBS -l nodes=1:ppn=4
    
    mpiexec /path/to/myprogram.exe

    NOTE: The above example assumes that myprogram.exe is a program designed to be parallel (using MPI). If your program has not been parallelized, then running on more than one processor will not improve performance and will result in wasted processor time.

    Job Scripts

    A job script may consist of PBS directives, comments and executable statements. A PBS directive provides a way of specifying job attributes in addition to the command line options. For example, we could create a myjob.pbs script this way:


     
    #PBS -N JOBNAME
    #PBS -l nodes=1:ppn=2,pmem=4000m,walltime=00:30:00
    #PBS -M username@rice.edu
    #PBS -m abe
    #PBS -V

    echo "My job ran on: "
    cat $PBS_NODEFILE
    cd $PBS_O_WORKDIR
    mpiexec /path/to/myprogram.exe


    NOTE:  It is important to specify an accurate walltime for your job in your PBS submission script.  Selecting the default walltime for jobs that are known to run for less time may result in the job being delayed by the scheduler due to an overestimation of the time the job needs to run.

    If you need to debug your program and want to run in interactive mode, the same request could be constructed like this (via the qsub command):


     
    qsub -I -N JOBNAME -q interactive -V -l nodes=1:ppn=2,pmem=4000m,walltime=00:30:00

    For more details on interactive jobs, please see our FAQ on this topic.

    PBS Environment Variables in Job Scripts

    When you submit a job, it will inherit several environment variables that are automatically set by PBS. These environment variables can be useful in your job submission scripts, as seen in the examples above. A summary of the most important variables is presented in the table below.

    Variable Name     Description
    $TMPDIR           Location of scratch space on each node. See our FAQ for more details.
    $PBS_NODEFILE     Location of a file that contains a list of all nodes assigned to the job.
    $PBS_O_WORKDIR    Path from where the job was submitted.
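
    These variables are also useful for threaded (OpenMP) programs, which need to know how many cores the job was given. The file named by $PBS_NODEFILE contains one line per allocated processor, so on a single node its line count equals the requested ppn value. The sketch below counts those lines. The function name allocated_cores is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count the processor slots PBS assigned to this job.
# Reads the file given as the first argument, or $PBS_NODEFILE by default.
allocated_cores() {
    local nodefile=${1:-$PBS_NODEFILE}
    # grep -c . counts non-empty lines, so stray blank lines are ignored.
    grep -c . "$nodefile"
}
```

    In a job script for an OpenMP program (my_openmp_program is a placeholder name), the result can be passed to the standard OMP_NUM_THREADS variable:

    cd $PBS_O_WORKDIR
    export OMP_NUM_THREADS=$(allocated_cores)
    ./my_openmp_program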

    Submitting and Monitoring Jobs

    Once your job script is ready, use qsub to submit it:

     
    qsub /path/to/myjob.pbs

    This returns a job ID. The output and error streams of the job are saved to two files in the directory from which the job was submitted.
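
    In a script it is often convenient to capture the job identifier that qsub prints, so that it can be passed to later commands. The sketch below assumes the identifier has the form number.servername, which is common for Torque but should be checked against what qsub prints on this system. The function name submit_job is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: submit a job script and print only the numeric part of
# the job identifier (assumed form: "12345.servername").
submit_job() {
    local script=$1
    local jobid
    jobid=$(qsub "$script") || return 1   # qsub prints the job identifier
    echo "${jobid%%.*}"                   # strip everything after the first dot
}
```

    For example:

    id=$(submit_job /path/to/myjob.pbs) && echo "Submitted job $id"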

    The status of the job can be obtained using Maui commands.  See Table 2 for a list of Maui commands.


    Table 2. Maui commands

    Command              Description
    showq                Show a detailed list of all submitted jobs.
    checkjob job.ID      Show a detailed description of the job given by job.ID.
    showstart job.ID     Gives an estimate of the expected start time of the job given by job.ID.

    There are four different states that a job can be in after submission: active, idle, blocked or deferred. The showq command with no arguments will list all jobs in their current state.

    Active (Running): These are jobs that have been started.

    Idle: These jobs are eligible to run, but there are not enough resources to allocate to them at this time.

    Blocked: These jobs are not being considered for running, probably due to a policy violation. For instance, a queue may have reached the maximum number of active processors assigned to it, so it blocks all jobs until resources are released by active jobs. Blocked jobs will eventually leave this state and move to the idle queue.

    Deferred: Jobs in this state normally have a batch hold, which means that they requested resources of a type or amount that does not exist on the system (walltime, number of nodes, etc.). If your job is deferred, please review the resource requirements in your submission script and make sure that the destination queue can satisfy them.

    Modifying and Deleting Jobs

    It is possible to modify the attributes of a job after it has been submitted, as long as it is not yet in the running state. The PBS command qalter supports all of the parameters available on qsub. This example reduces the walltime originally requested for the job:


     
    qalter -l walltime=00:03:00 <jobid>

    A job can also be relocated to a different queue using the qmove command:


     
    qmove <queuename> <jobid>

    A job can be deleted by using the qdel command:


     
    qdel <jobid>
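
    To remove several jobs at once, qdel can be combined with qselect, the PBS command that lists job identifiers matching given criteria. The following is a sketch under the assumption that qselect is available on the login nodes. The function name delete_my_jobs is hypothetical. Use it with care, since it deletes every job you own, including running ones.

```shell
#!/bin/bash
# Hypothetical helper: delete every job owned by the current user.
delete_my_jobs() {
    local jobid
    # qselect -u prints one job identifier per line for the given user.
    for jobid in $(qselect -u "${USER:-$(id -un)}"); do
        qdel "$jobid"
    done
}
```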


    Compilers and Programming

    Several programming models are supported on SUG@R. Both sequential programs and parallel programs (within a node) can be submitted. Sequential programs require one processor to run. Parallel programs use multiple processors concurrently. The maximum size of a parallel job on SUG@R is 8 processors. Message passing and threaded applications generally fit under the scope of parallel computing. Recommended parallel job types within a node are MPI (OpenMPI), SMP (OpenMP and compiler-assisted autoparallelization) and threading (pthreads, Java threads).

    The supported compilers on SUG@R are Intel and GCC, with Intel being the preferred compiler. OpenMPI builds for the Intel and GCC compilers are available and can be loaded on demand using the module command.

    Compiling Serial Code

    First of all you will have to load the appropriate compiler environment. To do so you will have to type:

     
    module load intel


    Once the environment is set, you can compile your program with one of the following (using Intel compiler as an example):

     
    icc -o executablename sourcecode.c

    icpc -o executablename sourcecode.cc

    ifort -o executablename sourcecode.f77

    ifort -o executablename sourcecode.f90

    When invoked as described above, the compiler will perform the preprocessing, compilation, assembly and linking stages in a single step. The output file (or executable) is specified by executablename and the source code file is specified by sourcecode.f77, for example. Omitting the -o executablename option will result in the executable being named a.out by default. For additional instructions and advanced options, please view the online manual pages for each compiler (e.g. execute the command man ifort).

    Compiling Parallel Code

    To compile a parallel version of your code that has OpenMPI calls, use the appropriate OpenMPI library. Again, use module to load the appropriate compiler environment as follows (Intel versions highly recommended):

    module command                                Description
    module load openmpi/1.2.6-gcc                 For gcc compiled version
    module load openmpi/1.2.6-intel.10.1.015      For Intel compiled version

    NOTE:  All OpenMPI versions compiled with Intel automatically load the intel 10 module package.

    To compile your code, you will have to use the OpenMPI scripts that are currently in your default path. The OpenMPI scripts are responsible for invoking the compiler, linking your program with the OpenMPI library and setting the OpenMPI include files.

    Once the environment is set, you can compile your program with one of the following (assuming the Intel compiler as above):

    
    mpicc -o executablename mpi_sourcecode.c
    
    mpicxx -o executablename mpi_sourcecode.cc

    mpif77 -o executablename mpi_sourcecode.f77

    mpif90 -o executablename mpi_sourcecode.f90

    When invoked as described above, the compiler will perform the preprocessing, compilation, assembly and linking stages in a single step. The output file (or executable) is specified by executablename and the source code file is specified by mpi_sourcecode.f77, for example. Omitting the -o executablename option will result in the executable being named a.out by default. For additional instructions and advanced options, please view the online manual pages for each compiler (e.g. execute the command man mpif77).

    GNU Compiler

    The GNU compiler is installed as part of the Red Hat Enterprise Linux distribution. Use man gcc to view the online manual for the C and C++ compiler, and man gfortran to view the online manual for the Fortran compiler.


    Getting Help

    If you have any further questions please see our FAQ.  If you still have questions, please let us know:

        http://helpdesk.rice.edu
        helpdesk@rice.edu
        713-348-4357

    Please follow our guidelines when contacting the Help Desk for faster problem resolution.

    IT
    Division of Information Technology
    MS-119, P.O. Box 1892, Rice University, Houston, Texas 77251-1892
    713-348-HELP(4357)