Information Technology
Introduction to RTC -
Rice's HP Itanium 2 Cluster

07-Aug-2008


Introduction

The RTC is Rice's Itanium 2 cluster.  It has a total of 286 Intel Itanium 2 processors (900 MHz processors on 124 dual-processor nodes and 4 quad-processor nodes, 1.3 GHz processors on 6 nodes).  Each processor can access up to 4 GB of RAM (16 GB or 32 GB on the quad nodes).  Myrinet is available on 65 nodes, while Gigabit Ethernet is available on all nodes.  The system has four filesystems: a 5 TB PVFS (parallel) filesystem (/shared.scratch) that provides fast I/O for running user applications, 700 GB for user home directories (/users), 2.5 TB for group-based allocations (/projects), and 150 GB for software (/opt/apps).  A complete hardware overview is online.

RTC runs Red Hat Enterprise Linux 4 with the 2.6.9 kernel.

Most installed software is in /opt/apps.  See the module command for information on how to use these applications.  If you need any software that is not present, please let us know.

For information on the unix shell configuration program called module, PBS, compilers, MPI, and contact information, see the remainder of this document.

A final note:  Be careful about changing your Unix shell's configuration files (.profile, .cshrc, .bashrc, etc.) until you get things working.  The system and the necessary shell environment are a little different from Ada, so use caution when trying to duplicate your Ada environment on RTC.


Logging in to RTC

RTC can be accessed from any machine on the Rice campus with SSH. If you need off-campus access, you will have to install VPN on your computer and then log in to RTC via SSH. For more information regarding off-campus access, please visit our Off-Campus Access FAQ.

To log in to RTC from a Linux or Unix machine, type:


ssh -Y (your_login_name)@rtc.rice.edu

To transfer files into RTC from a Linux or Unix machine, use scp:

 
scp some_file.dat *.incl *.txt (your_login_name)@rtc.rice.edu:

For more information about using SSH, please see our SSH FAQ.

Login Nodes

Once you are logged in to RTC, you are logged into one of three login nodes. These nodes are intended for users to compile software, prepare data files, and submit jobs to the job queue. They are not intended for running compute jobs. Please run all compute jobs in one of the job queues described later in this document.


Filesystems and Disk Quotas

RTC currently enforces disk quotas for all users.  There is a 4 GB quota for home directories (/users).  There is a 50GB quota for the projects (/projects) allocation.  There are no quotas on /shared.scratch.  However, /shared.scratch is for applications that need fast I/O and is not for permanent storage.  Any files on /shared.scratch that are not modified for more than two weeks will be deleted automatically!  Permanent storage is in /users and /projects only. 
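To spot files that are at risk before the automatic cleanup removes them, a helper along the following lines may be useful. This is only a sketch: the function name list_stale_files, the 14-day default, and the example path under /shared.scratch are assumptions for illustration, and the actual cleanup may apply different criteria.

```sh
# list_stale_files DIR [DAYS]
# Print regular files under DIR that were last modified more than DAYS
# days ago (default 14, chosen to mirror the two-week policy above).
# Hypothetical helper, not a tool provided on RTC.
list_stale_files() {
    dir=$1
    days=${2:-14}
    # -mmin counts minutes since the last modification; 1440 minutes = 1 day
    find "$dir" -type f -mmin +"$((days * 1440))" -print
}

# Example usage (the per-user directory layout is an assumption):
#   list_stale_files "/shared.scratch/$USER" 10
```

Using a threshold lower than 14 days, as in the usage line, gives some warning before files reach the deletion age.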

NOTE: Do not use /users and /projects for job I/O. Please see our FAQ for more details on job I/O.

To see your current quota and your disk usage, run this command:

 
quota -s


To see the quota and usage for all groups that you belong to, run this command:

 
quota -s -g

For information on how to use /projects, please see our FAQ.



Customizing Your Environment with the module Command

Each user can customize their environment using the module command.  This command lets you select software and will set up the appropriate paths and libraries. All the requested user applications are located under the /opt/apps directory.

To list what applications are available, type:

 
rtcuser@rtchost:~> module avail
----------------------------------- /opt/apps/modules/Modules/versions -------------------------------------

----------------------------------- /opt/apps/modules/modulefiles ------------------------------------------
R/2.3.1 fftw/3.1.2 intel/9.0 mpich/1.2.7-gcc3 mpich-gm/1.2.7-intel9
afni/2006_06_30_1332 gaussian/g03-d1 jdk/1.4.2_12 mpich/1.2.7-intel9 namd/2.6b2
amber/9.0 gromacs/3.3.1 mkl/8.0.1 mpich-gm/1.2.7-gcc3 octave/2.1.73

rtcuser@rtchost:~>


To load the module for the Intel compiler, use:
 
module load intel

For assistance with module, type man module

For more information on using module with a PBS batch script, please see our FAQ.

Job Scheduling

The batch job scheduling system implemented on RTC uses the Torque package and the Maui package.  Torque is responsible for resource management, while Maui is responsible for job scheduling and monitoring.

Fairshare Scheduling Policy

We implement the Maui fairshare feature to provide a fair utilization of the available resources.  This is accomplished by allowing historical resource utilization information to be incorporated into job feasibility and priority decisions. This is normally the most significant component of a job's priority, which ultimately defines the position of the job on a queue. We do not use a FIFO (First-In-First-Out) scheduler on RTC.

Backfill Scheduling Policy

This is a scheduling optimization which allows Maui to make better use of available resources by running jobs out of order. Using job data such as walltime and resources requested, the scheduler can start other, lower-priority jobs so long as they do not delay the highest priority jobs.  Because of the way it works, essentially filling in holes in node space, backfill tends to favor smaller and shorter running jobs more than larger and longer running ones.

NOTE:  It is important to specify an accurate walltime for your job in your PBS submission script.  Selecting the default walltime for jobs that are known to run for less time may result in the job being delayed by the scheduler due to an overestimation of the time the job needs to run.

Automatic Queue Routing

The RTC is configured for automatic queue routing.  You do not need to specify which queue you want to use.  The system will determine which queue your job will run in based on the walltime you specify when you submit your job.  The exceptions to this rule are the interactive and super queues.  You must request these queues with the -q argument to qsub. If you request a specific queue other than the interactive and super queues, you will receive the error "Access to queue is denied" and your job will not run.
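The routing rule can be illustrated with a short shell sketch that maps a requested walltime to the queue it would fall into, using the walltime limits of the short, long, and verylong queues listed under Available Queues and System Load. The function names are hypothetical, and this only mimics the documented limits; it is not the scheduler's actual configuration.

```sh
# walltime_to_seconds HH:MM:SS
# Convert a walltime string to seconds; awk treats "08" as decimal 8.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# queue_for_walltime HH:MM:SS
# Print the queue whose documented walltime range contains the request.
queue_for_walltime() {
    secs=$(walltime_to_seconds "$1")
    if   [ "$secs" -le 86400 ];  then echo short      # up to 24:00:00
    elif [ "$secs" -le 172800 ]; then echo long       # up to 48:00:00
    elif [ "$secs" -le 604800 ]; then echo verylong   # up to 168:00:00
    else
        echo "walltime $1 exceeds the verylong limit" >&2
        return 1
    fi
}

queue_for_walltime 00:30:00   # prints: short
queue_for_walltime 36:00:00   # prints: long
```

A request that is only slightly over a boundary, such as 24:00:01, lands in the next queue, which is one more reason to request an accurate walltime.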

Available Queues and System Load

We currently provide six queues for general accessibility:


Queue Name    # of CPUs          Minimum Walltime  Maximum Walltime
short         <=262              N/A               24:00:00
long          <=196              24:00:01          48:00:00
verylong      <=134              48:00:01          168:00:00
super         <=38               N/A               336:00:00
interactive   <=8                N/A               01:00:00
dedicated     Requires Approval  N/A               Requires Approval

Interactive is a higher priority queue with the purpose of serving debugging sessions and interactive jobs.  The maximum number of CPUs that can be accessed through this queue is 4 with a maximum job walltime of 60 minutes.  This queue is only available from 8:00 a.m. to 8:00 p.m.

Super is a special queue for accessing quad-processor and large memory nodes. There are nine quad-processor nodes and one dual-processor node in this queue. Clock speeds range from 900MHz to 1.3GHz. RAM ranges from 16GB to 32GB. There are a total of 38 processors in this queue. More details about the super queue can be found in our FAQ.

NOTE: Do not run CPU intensive processes on RTC's login nodes. Use one of the queues listed above. Any CPU intensive process running on RTC's login nodes is subject to termination without notice.

There may be other queues present on the system.  These are normally dedicated to special projects/allocations.

A good way to obtain the status of all queues and their current usage is to run the following PBS command:


 
rtcuser@rtchost:~> qstat -q

server: rtchost

Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- ----- ----- ---- -----
interactive -- -- 01:00:00 -- 0 0 -- E S
short -- -- 24:00:00 -- 5 3 -- E S
long -- -- 48:00:00 -- 0 0 -- E S
verylong -- -- 168:00:00 -- 0 0 -- E S
----- -----
5 3


Here is a brief description of the relevant fields:

Walltime:  Maximum walltime a job can request
Run:  Number of jobs in running state
Que:  Number of jobs in queued state
State:  The queue is enabled “E” and running (started) "R"
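The fields described above can be pulled out of the qstat -q listing with a small awk filter. The sketch below assumes the column layout shown in the sample output (ten whitespace-separated fields per queue line, with the walltime limit in the fourth); the function name summarize_queues is hypothetical, and queues without a walltime limit are skipped.

```sh
# summarize_queues
# Read `qstat -q` output on stdin and print one line per queue with its
# walltime limit and its running and queued job counts.
summarize_queues() {
    awk 'NF == 10 && $4 ~ /^[0-9]+:[0-9]+:[0-9]+$/ {
        printf "%s walltime=%s running=%s queued=%s\n", $1, $4, $6, $7
    }'
}

# Demonstration on lines copied from the sample output above.
summarize_queues <<'EOF'
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- ----- ----- ---- -----
interactive -- -- 01:00:00 -- 0 0 -- E S
short -- -- 24:00:00 -- 5 3 -- E S
EOF
```

On the cluster itself, the same filter would be used as: qstat -q | summarize_queues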

Determining Why a Job is not Running

There may be several reasons why a job is not running and appears to be stuck in the queue.  Please see our PBS Job Scheduling FAQ for more information.


Batch Processing with PBS

Once you have an executable, you need to create a job script that does the following:

  • Requests the resources that will be needed (e.g. number of processors, wall-clock time, either Myrinet or Ethernet for the MPI network), and
  • Runs the commands that prepare for and start the executable (e.g. cd to the working directory, source shell environment files).

 

See Table 1 below for PBS submission options.

Table 1. PBS Submission Options

Option

Description

#PBS -N jobname

Assigns a job name. The default is the name of PBS job script.

#PBS -l nodes=2:ppn=2:myrinet

The number of nodes, processors per node, and MPI Myrinet network (only for parallel jobs).  Do not specify ppn greater than 2 (or 4 for the quad processor nodes) or the job will not run.

#PBS -l nodes=1:ppn=1
#PBS -W x=NACCESSPOLICY:SINGLEJOB

Using both of these options will give your job exclusive access to a node such that no other jobs can share the node.  This combination of arguments will assign one processor to your job and will give it exclusive access to all of the resources (i.e. memory) of the entire node without interference from other jobs.

Please see our FAQ for more details on exclusive access.

#PBS -l walltime=01:00:00

The maximum wall-clock time needed for this job to run.

#PBS -l pmem=1000m

The maximum amount of physical memory used by any single process of the job (in megabytes). See our FAQ for more details.

#PBS -q queuename

Specify the name of the queue to use.  Only required for the interactive and super queues.  Specifying any other queue name will prevent the job from running.

#PBS -o mypath

The full path for the standard output (stdout) .OU files.

#PBS -e mypath

The full path for the standard error (stderr) .ER files.

#PBS -j oe

Join option that merges the standard error stream with the standard output stream of the job.

#PBS -V

Exports all environment variables to the job.

#PBS -M username@rice.edu

Email address for job status messages.

#PBS -m bae

PBS will notify the user via email when the job begins, aborts or terminates.

#PBS -m n

Turn off all email from the job.
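As a sketch of how several options from Table 1 combine, the following job script requests exclusive access to one node for a serial program. The job name, the memory and walltime values, and the executable my_serial_program are placeholders chosen for illustration. Because PBS directives are shell comments, the script also runs as an ordinary shell script outside of PBS, which can be convenient for a quick syntax check.

```sh
#!/bin/sh
# One processor, with exclusive access to the whole node.
#PBS -N serialjob
#PBS -l nodes=1:ppn=1
#PBS -W x=NACCESSPOLICY:SINGLEJOB
#PBS -l walltime=01:00:00
#PBS -l pmem=1000m
# Merge stderr into stdout and send no email.
#PBS -j oe
#PBS -m n

# PBS_O_WORKDIR is the directory qsub was run from; fall back to the
# current directory when the script is run outside of PBS.
workdir=${PBS_O_WORKDIR:-$PWD}
cd "$workdir" || exit 1
echo "Running in $workdir on $(uname -n)"

# ./my_serial_program    # hypothetical executable; replace with your own
```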

Job Launchers (mpiexec, mpirun)

The job launcher's purpose is to spawn copies of your executable across the resources allocated to your job. We currently recommend and support mpiexec for this task. It is a cleaner, safer and faster alternative to mpirun. By default mpiexec only needs your executable; the rest of the information is extracted from PBS.

Examples:

Run “myprogram” as a parallel mpi code on each of the processors allocated by PBS using Myrinet:



# Include the myrinet option on the -l line in your PBS batch script
#PBS -l nodes=2:ppn=2:myrinet

mpiexec -comm mpich-gm ./myprogram

Run “myprogram” using Ethernet:



mpiexec -comm mpich-p4 ./myprogram

For more information on using mpiexec to launch your job with Ethernet or Myrinet, please see our FAQ.

We still provide mpirun for applications that cannot be launched any other way.  Note that rsh is the default communication protocol for mpirun, but RTC requires ssh. The following example launches an 8-process job using mpirun, with ssh configured as the communication protocol:


 
export RSHCOMMAND=/usr/bin/ssh
mpirun -np 8 -machinefile $PBS_NODEFILE myprogram

Make sure you have configured passwordless SSH in your account prior to running mpirun, or communication between the nodes assigned to your job will fail.
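Rather than hard-coding the process count passed to mpirun, it can be derived from the node file that PBS provides. The sketch below assumes that $PBS_NODEFILE contains one line per allocated processor, which is the usual Torque behavior; the function names are hypothetical.

```sh
# count_job_procs [NODEFILE]
# Number of processors allocated to the job: one non-empty line each.
count_job_procs() {
    nodefile=${1:-$PBS_NODEFILE}
    grep -c . "$nodefile"
}

# count_job_nodes [NODEFILE]
# Number of distinct nodes allocated to the job.
count_job_nodes() {
    nodefile=${1:-$PBS_NODEFILE}
    sort -u "$nodefile" | grep -c .
}

# Inside a job script, the mpirun line could then be written as:
#   export RSHCOMMAND=/usr/bin/ssh
#   mpirun -np "$(count_job_procs)" -machinefile "$PBS_NODEFILE" ./myprogram
```

With a request of nodes=2:ppn=2, count_job_procs would be expected to report 4 and count_job_nodes 2.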

Job Scripts

A job script may consist of PBS directives, comments and executable statements. A PBS directive provides a way of specifying job attributes in addition to the command line options. For example, we could create a myjob.pbs script this way:


 
#PBS -N JOBNAME
#PBS -l nodes=2:ppn=2,pmem=4000m,walltime=00:30:00
#PBS -M username@rice.edu
#PBS -m abe
#PBS -V

echo "My job ran on: "
cat $PBS_NODEFILE
cd $PBS_O_WORKDIR
mpiexec ./myprogram


If you need to debug your program and want to run in interactive mode, the same request could be constructed like this:


 
qsub -I -N JOBNAME -q interactive -l nodes=2:ppn=2,pmem=4000m,walltime=00:30:00 ./myprogram


NOTE:  It is important to specify an accurate walltime for your job in your PBS submission script.  Selecting the default walltime for jobs that are known to run for less time may result in the job being delayed by the scheduler due to an overestimation of the time the job needs to run.

Submitting and Monitoring Jobs

Once your job script is ready, use qsub to submit it:

 
qsub ./myjob.pbs

This returns a job ID.  The output and error streams of the job are saved to two files in the directory from which the job was submitted.
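Since qsub prints the job ID on standard output, a script can capture it for later use with the monitoring commands. The wrapper below is a minimal sketch; the function name submit_job is hypothetical, and it assumes qsub prints only the job ID on success.

```sh
# submit_job SCRIPT
# Submit SCRIPT with qsub and print the job ID that qsub returns.
submit_job() {
    script=$1
    if [ ! -f "$script" ]; then
        echo "job script not found: $script" >&2
        return 1
    fi
    # qsub writes the new job ID to stdout on success
    jobid=$(qsub "$script") || return 1
    echo "$jobid"
}

# Example usage on a login node:
#   jobid=$(submit_job ./myjob.pbs) && echo "Submitted $jobid"
#   qstat "$jobid"
```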

The status of the job can be obtained using Maui commands.  See Table 2 for a list of Maui commands.


Table 2. Maui commands

Command

Description

showq

Show a detailed list of all submitted jobs.

checkjob job.ID

Show a detailed description of the job given by job.ID.

showstart job.ID

Gives an estimate of the expected start time of the job given by job.ID

A job can be in one of four states after submission: active, idle, blocked or deferred. The showq command with no arguments will list all jobs in their current state.

Active (Running): These are jobs that have been started.

Idle: These jobs are eligible to run, but there are not enough resources to allocate to them at this time.

Blocked: These jobs are not currently being considered for running, usually because of a policy limit.  For instance, a queue may have reached the maximum number of active processors assigned to it, which blocks further jobs until active jobs release resources.  Blocked jobs eventually leave this state and move to the idle queue.

Deferred: Jobs in this state normally have a batch hold, which means that they requested resources of a type or amount that does not exist on the system (walltime, number of nodes, etc.). If your job is deferred, please review the resource requirements in your submission script and make sure that the destination queue can satisfy them.

Modifying and Deleting Jobs

Job attributes can be modified after a job has been submitted, as long as it is not yet in the running state. The PBS command qalter supports all of the parameters available on qsub.  This example reduces the walltime originally requested for the job:


 
qalter -l walltime=00:03:00 <jobid>

A job can also be relocated to a different queue using the qmove command:


 
qmove <queuename> <jobid>

A job can be deleted by using the qdel command:


 
qdel <jobid>


Compilers and Programming

Several programming models are supported on RTC.  Sequential, parallel, and distributed programs can all be run. Sequential programs require one processor to run. Parallel and distributed programs use multiple processors concurrently. Parallel programs are a subset of distributed programs. Generally speaking, distributed computing involves parametric sweeps, task farming, etc., while message-passing and threaded applications generally fall under parallel computing.

SPMD (Single Program, Multiple Data) is one of the most popular methods of parallelism: every process runs the same executable, and each process works on its own data.

The supported compilers on RTC are Intel, GCC, and J2EE SDK. MPICH builds for the Intel and GCC compilers are available and can be loaded on demand using the module command.

Compiling Serial Code

First, load the appropriate compiler environment by typing:

 
module load intel


Once the environment is set, you can compile your program with one of the following (using Intel compiler as an example):

 
icc foo.c

icc foo.cc

ifort foo.f77

ifort foo.f90


Compiling Parallel Code

To compile a parallel version of your code that has MPI calls, use the appropriate mpich library. Again, use module to load the appropriate compiler environment as follows (Intel versions highly recommended):

module command                       Description

for p4 (ethernet) MPI
module load mpich/1.2.7-gcc3         For gcc compiled version
module load mpich/1.2.7-intel9       For Intel compiled version

for gm (myrinet) MPI
module load mpich-gm/1.2.7-gcc3      For gcc compiled version
module load mpich-gm/1.2.7-intel9    For Intel compiled version

Prior to the RTC upgrade in August, 2006, the use command was required instead of the module command.  Here is how the use command translates to the module command:

use command            module command
use intel80-mpichp4    module load mpich/1.2.7-intel9
use intel80-mpichgm    module load mpich-gm/1.2.7-intel9

NOTE:  All mpich versions compiled with Intel automatically load the intel9 module package.

To compile your code you will have to use the MPICH scripts that are currently in your default path. The MPICH scripts are responsible for invoking the compiler, linking your program with the MPI library and setting the MPI include files (mpi.h and mpif.h).

Once the environment is set, you can compile your program with one of the following (assuming the Intel compiler as above):


mpicc -o foo mympifoo.c

mpicxx -o foo foo.cc

mpif77 -o foo foo.f77

mpif90 -o foo foo.f90
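For build scripts that handle mixed sources, the choice of MPICH wrapper can be made from the file extension. The sketch below covers only the four extensions used in the examples above; the function names mpi_compiler_for and build_mpi, and the default output name foo, are hypothetical.

```sh
# mpi_compiler_for SOURCE
# Print the MPICH wrapper script that matches the source file extension.
mpi_compiler_for() {
    case $1 in
        *.c)   echo mpicc  ;;
        *.cc)  echo mpicxx ;;
        *.f77) echo mpif77 ;;
        *.f90) echo mpif90 ;;
        *)
            echo "unrecognized source extension: $1" >&2
            return 1
            ;;
    esac
}

# build_mpi SOURCE [OUTPUT]
# Compile SOURCE with the matching wrapper (requires an mpich module
# to be loaded so that the wrapper scripts are in your path).
build_mpi() {
    src=$1
    out=${2:-foo}
    wrapper=$(mpi_compiler_for "$src") || return 1
    "$wrapper" -o "$out" "$src"
}

# Example usage after "module load mpich/1.2.7-intel9":
#   build_mpi foo.f90 foo
```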


For more information on compiling your code to support Ethernet or Myrinet, please see our FAQ.

Getting Help

If you have any further questions please see our FAQ.  If you still have questions, please let us know:

    http://helpdesk.rice.edu
    helpdesk@rice.edu
    713-348-4357

Please follow our guidelines when contacting the Help Desk for faster problem resolution.

Division of Information Technology
MS-119, P.O. Box 1892, Rice University, Houston, Texas 77251-1892
713-348-HELP(4357)