Information Technology
Guidelines for disk storage on Ada

10-Jun-2008


Introduction

There are several disk storage options on Ada for storing job output. The correct location depends on the size of the output and on the I/O characteristics of the job. Two temporary storage areas are available for job output: /scratch, which is local storage on each node, and /lustre, which is a high-performance filesystem shared by all nodes. The remainder of this document describes the appropriate ways to use the temporary storage areas, where to store job output permanently, and how to perform I/O redirection when running a job with mpiexec.


PBS job output (.OU) and error (.ER) files and I/O redirection

PBS writes the standard output (stdout) and standard error (stderr) of your jobs to files with .OU and .ER extensions, respectively. If your job writes any of its output to stdout, that output is automatically written to a .OU file in your working directory. However, while the job is running, these files are stored locally on each node in /var/spool/PBS. The .OU and .ER files are not moved to their final location until the job exits. Excessively large .OU files can fill up /var/spool/PBS and cause your job to crash. Furthermore, this directory is shared among all four processors on each node. If it becomes full, the jobs on all four processors will crash. Therefore, if your stdout is going to be larger than approximately 50MB, redirect it to an output file on /lustre or /scratch.

NOTE: It is not enough to specify the -o argument in your PBS batch script. This argument only sets the final location of your .OU file after your job exits. The file will still be created in /var/spool/PBS while the job is running. Instead, use Linux I/O redirection to avoid having the .OU file created in /var/spool/PBS, such as:


myprogram.exe > /lustre/username/output.file
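
The same redirection can also capture stderr, which PBS would otherwise spool to the .ER file. The following is a minimal sketch rather than a site-provided script. The function myprogram is a hypothetical stand-in for myprogram.exe, and OUTDIR is an assumed variable that defaults to a temporary directory so the sketch runs on any machine; on Ada it would point to a directory such as /lustre/username.

```shell
#!/bin/sh
# Assumption: OUTDIR stands in for a directory such as /lustre/username.
# It defaults to a temporary directory so this sketch runs on any machine.
OUTDIR="${OUTDIR:-$(mktemp -d)}"

# Hypothetical stand-in for myprogram.exe: one line on stdout, one on stderr.
myprogram() {
    echo "result line"
    echo "warning line" >&2
}

# Send stdout and stderr to separate files outside /var/spool/PBS.
myprogram > "$OUTDIR/output.file" 2> "$OUTDIR/error.file"
```

To keep both streams in a single file, the usual form is "> file 2>&1".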

I/O Redirection with mpiexec

If you are running your job with mpiexec and you need to perform I/O redirection, the proper format of the command will look like this:


mpiexec $XD1LAUNCHER myprogram.exe > /lustre/username/output.file

Use the man mpiexec command to view an online manual page for more information on mpiexec I/O redirection options.

See our tutorials for more information on Linux I/O redirection operators.



Using /lustre

All nodes on Ada have access to a 5TB /lustre storage space.   This storage is available to all users and is tuned for large block (1MB block size or larger), high throughput I/O.  This storage area is visible on all compute nodes and the login nodes. To use /lustre, simply create a directory under /lustre, copy your input files to this location, and redirect your output files here as well.  When your job is finished, copy the final results to your permanent storage space on /home (also called /users) or /projects.  /lustre is not for permanent storage.  Data files not accessed in two weeks will be deleted automatically by the system.  Recovery of these files is not possible.    Writing data to /lustre using small block sizes will result in degraded performance of the /lustre filesystem since it is not tuned for small data sets or I/O with small block sizes.
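
Because data files not accessed in two weeks are deleted, it can help to list which of your files are approaching that age. The sketch below uses the standard find -atime test. It is only an approximation: the exact rule the automatic cleanup applies is not described here, and access times may not be updated on every filesystem. The function name list_stale and the demonstration directory are illustrative, not part of any Ada tooling.

```shell
#!/bin/sh
# list_stale DIR DAYS: print regular files under DIR whose last access
# was more than DAYS full days ago (find's -atime +N semantics).
list_stale() {
    find "$1" -type f -atime +"$2" -print
}

# Demonstration in a temporary directory (stands in for /lustre/username).
demo=$(mktemp -d)
: > "$demo/recent.dat"
: > "$demo/old.dat"
# Backdate only the access time of old.dat to 1 January 2008.
touch -a -t 200801010000 "$demo/old.dat"

stale=$(list_stale "$demo" 14)
echo "$stale"
```

On Ada, a call such as list_stale /lustre/username 10 would give a few days of warning before the two-week mark.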

Data Organization within /lustre

Your data organization on /lustre can impact your job I/O performance. Many jobs performing I/O to the same directory concurrently will degrade I/O performance on that directory due to the way the Lustre filesystem keeps track of your files. It is best to distribute your data across multiple directories if you observe poor I/O performance.

More importantly, storing thousands of files in a single directory will seriously degrade I/O performance in this directory, which can result in jobs hanging and being unable to exit. This in turn will negatively impact the job scheduler causing a cascade effect across the cluster. It is important to note that the negative impact on the cluster grows exponentially with the number of active/running jobs that are using this overloaded directory. If your workload must access thousands of files, you must distribute the files proportionally across multiple subdirectories.

A good way to determine if a directory is overloaded is to go to that directory and get a directory listing with the ls command. The output of the command should only take a few seconds. If it takes longer than this then you should consider the directory to be overloaded. As an example, a directory with 50,000 files was observed to take almost 3 minutes to get a directory listing. This is unacceptable performance for most job scheduling components which will then fail and never recover. This in turn degrades the entire cluster.
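
One possible way to distribute files across subdirectories is sketched below. It is an illustration under stated assumptions, not a prescribed layout: the function name shard_dir, the partNNNN naming scheme, and the limit of three files per subdirectory are all hypothetical, and the demonstration runs in a temporary directory rather than on /lustre.

```shell
#!/bin/sh
# shard_dir DIR N: move the regular files in DIR into numbered
# subdirectories (part0000, part0001, ...) holding at most N files each.
# Prints the number of files moved.
shard_dir() {
    src=$1
    per_dir=$2
    count=0
    for f in "$src"/*; do
        [ -f "$f" ] || continue          # skip subdirectories and empty globs
        bucket=$(printf 'part%04d' $((count / per_dir)))
        mkdir -p "$src/$bucket"
        mv "$f" "$src/$bucket/"
        count=$((count + 1))
    done
    echo "$count"
}

# Demonstration: seven files, at most three per subdirectory.
demo=$(mktemp -d)
for i in 1 2 3 4 5 6 7; do : > "$demo/file$i.dat"; done
moved=$(shard_dir "$demo" 3)
echo "moved $moved files"
```

The directory listing check described above can be timed by prefixing the ls command with time and discarding the listing itself, for example by sending it to /dev/null.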


Using /scratch

Each node on Ada has a 60GB /scratch storage space available to all users. It is most appropriate to use this storage space for the output of your running job when your output will not exceed 10GB per job, your job performs I/O infrequently, and the block size per read/write request is small (less than 1MB). Jobs of this nature will perform better writing to /scratch than to /lustre, since /lustre is tuned for large data sets with frequent read/write requests of large amounts of data. If your output exceeds 10GB, then using /lustre is the only remaining option. Notice to Gaussian users: please use /lustre for Gaussian scratch files. Do not use /scratch.

NOTE: Each node has its own /scratch storage area, which is not shared with any other node. If you are running a parallel job, for example, and each node in the job needs to read the same data file(s), then you should use /lustre, not /scratch. All nodes have access to the same /lustre area, but each node has access only to its own /scratch area. Data copied to /scratch on one node will not be visible to any other node.

NOTE: /scratch is shared among all four processors on each node. Therefore, it might be used by four different jobs at once. If this partition becomes full, it is likely that all of the jobs on that node will crash. It is important not to exceed the 10GB recommendation to prevent jobs from crashing. Any data left on /scratch when a job exits will be deleted without notice and cannot be recovered.
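
A job script can check its own usage against the 10GB recommendation before writing more data. The following is a rough sketch using du. The function name within_limit and the variable LIMIT_KB are hypothetical, and the demonstration uses a temporary directory where a real job would pass its own directory on /scratch.

```shell
#!/bin/sh
# within_limit DIR LIMIT_KB: succeed if DIR currently uses no more than
# LIMIT_KB kilobytes, as reported by du.
within_limit() {
    used_kb=$(du -sk "$1" | awk '{print $1}')
    [ "$used_kb" -le "$2" ]
}

# 10GB expressed in kilobytes, matching the recommendation above.
LIMIT_KB=$((10 * 1024 * 1024))

# Demonstration directory; inside a job this would be a path on /scratch.
dir=$(mktemp -d)
echo "some output" > "$dir/outputfile"

if within_limit "$dir" "$LIMIT_KB"; then
    echo "within limit"
else
    echo "over limit" >&2
fi
```

Note that du walks the whole directory tree, so on a directory holding many files this check should be run sparingly.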

Best Practices for Using /scratch (Using $TMPDIR)

The best way to use /scratch is to make use of the PBS $TMPDIR variable. This is an environment variable that PBS sets for each running job. It points to a temporary directory that PBS creates for you on /scratch. Writing your output into the $TMPDIR directory ensures that you have your own directory, separate from any other job, so you can write your data without overwriting files that belong to other jobs. When your job is complete, copy your data from $TMPDIR to its permanent storage location in /home (also called /users) or /projects. When your job exits, PBS will automatically delete the temporary directory and all of its contents. Recovery of these files is not possible. Here are two example PBS batch scripts that use $TMPDIR (both assume that output goes to stdout):

Multiprocessor Job

#PBS -N JOBNAME
#PBS -q compute
#PBS -l nodes=2:ppn=2,walltime=00:30:00

       
echo "My job ran on: "
cat $PBS_NODEFILE
# Run the job with stdout redirected to $TMPDIR.
mpiexec $XD1LAUNCHER ./myprogram > $TMPDIR/outputfile
# Go back to each node and copy the data from $TMPDIR
# to your permanent storage location.
# This assumes that the jobs on each node need to write
# data to $TMPDIR, which might not be the case.
mpiexec $XD1LAUNCHER -comm none cp $TMPDIR/outputfile /yourworkingdirectory

NOTE: It is necessary to call mpiexec twice in this example.  The first call is to run your program which will generate output at $TMPDIR.  When your program terminates, it is necessary to go back to the nodes to retrieve your data.  This is accomplished by the second call of mpiexec.  If the data you have stored on $TMPDIR is scratch data only and does not need to be saved, then the second call to mpiexec is not necessary.  The data will be deleted automatically when the PBS job terminates.  It is very important to note that in order for the second call to mpiexec to work, the first call must be completely finished.  If your job runs out of walltime before the second mpiexec call is finished, your PBS job will exit and the data stored at $TMPDIR on all nodes will be deleted automatically.

Single Processor Job


#PBS -N JOBNAME
#PBS -q compute
#PBS -l nodes=1:ppn=1,walltime=00:30:00

       
echo "My job ran on: "
cat $PBS_NODEFILE
# Run the job with stdout redirected to $TMPDIR.
./myprogram > $TMPDIR/outputfile
# Copy the data from $TMPDIR to your permanent storage location.
cp -r $TMPDIR/outputfile /yourworkingdirectory
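
Because PBS deletes $TMPDIR when the job exits and the files cannot be recovered, it may be worth confirming that the copy succeeded before the script ends. The sketch below shows one way to do that with cp and cmp. It is an illustration only: scratch and dest are temporary directories standing in for $TMPDIR and /yourworkingdirectory, and the function myprogram is a hypothetical stand-in for ./myprogram.

```shell
#!/bin/sh
# Assumptions: scratch stands in for $TMPDIR and dest for
# /yourworkingdirectory; both are temporary directories here so the
# sketch runs outside PBS. myprogram is a hypothetical stand-in.
scratch=$(mktemp -d)
dest=$(mktemp -d)
myprogram() { echo "simulated output"; }

myprogram > "$scratch/outputfile"

# Copy the result, then compare the two files byte for byte.
if cp "$scratch/outputfile" "$dest/" &&
   cmp -s "$scratch/outputfile" "$dest/outputfile"; then
    copied=yes
    echo "copy verified"
else
    copied=no
    echo "copy failed; results exist only in scratch" >&2
fi
```

The check only helps if enough walltime remains for it to run, so the walltime request should allow for the copy as well as the computation.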

Using /home, /users and /projects

The /home (also called /users) and /projects directories are intended for permanent data storage, not job I/O.  These filesystems are NFS filesystems and are not designed to handle high performance applications.  Using them for job I/O might result in severely degraded performance across the entire cluster, especially if the I/O of your job is heavy.  Use of these filesystems for I/O should only be done at the direction of the system administrators.


Questions

If you have questions or need assistance, please contact the Help Desk at 713-348-4357.

Division of Information Technology
MS-119, P.O. Box 1892, Rice University, Houston, Texas 77251-1892
713-348-HELP(4357)