Information Technology
Guidelines for disk storage on RTC

13-Mar-2008


Introduction

RTC offers several disk storage options for job output.  The correct location depends on the size of the output and the performance characteristics of the job.  Two temporary storage areas are available for job output: /scratch, which is local storage on each node, and /shared.scratch, which is a high performance filesystem shared by all nodes.  The remainder of this document describes the appropriate ways to use the temporary storage areas, where to store job output permanently, and how to perform I/O redirection when running a job with mpiexec.


PBS job output (.OU) and error (.ER) files

It is important to note that PBS writes the standard output (stdout) and standard error (stderr) of your jobs to files with .OU and .ER extensions, respectively.  Any output your job writes to stdout is automatically written to a .OU file in your working directory.  However, while the job is running, these files are stored locally on each node in /var/spool/PBS.  The .OU and .ER files are not moved to your home directory until the job exits.  Excessively large .OU files can fill up /var/spool/PBS and cause your job to crash.  Furthermore, this directory is shared among all processors on each node, so if it becomes full, all of the jobs on that node will crash.  Therefore, if your stdout is going to be larger than approximately 50 MB, it is important that you redirect it to an output file somewhere on /shared.scratch or /scratch.
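
One way to judge whether a job is likely to cross the 50 MB mark is to check the size of the .OU file left by an earlier, similar run.  The following is a minimal sketch, not an RTC-provided tool: the function name exceeds_limit_mb is hypothetical, and the default of 50 is the approximate threshold given above.

```shell
# Hypothetical helper: succeed (exit status 0) if FILE is larger than LIMIT_MB megabytes.
# Usage: exceeds_limit_mb FILE [LIMIT_MB]   (LIMIT_MB defaults to 50)
exceeds_limit_mb() {
    file=$1
    limit_mb=${2:-50}
    # wc -c prints the size in bytes; the arithmetic expansion strips any padding
    bytes=$(( $(wc -c < "$file") ))
    [ "$bytes" -gt $(( limit_mb * 1024 * 1024 )) ]
}
```

For example, with a hypothetical file name, exceeds_limit_mb myjob.OU && echo "redirect stdout to scratch" prints the reminder only when the earlier output was over 50 MB.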

NOTE: It is not enough to simply specify the -o argument in your PBS batch script. This argument only specifies the final location of your .OU file after your job exits. The file will still be created in /var/spool/PBS while the job is running. Instead, you should use Linux I/O redirection to avoid having the .OU file created in /var/spool/PBS, such as:


myprogram.exe > /shared.scratch/username/output.file
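
Because stderr is spooled in the same way as stdout, it may be worth redirecting both streams.  The sketch below wraps the redirection in a small shell function.  The function name run_redirected and the file name error.file are hypothetical choices made for illustration, not RTC conventions.

```shell
# Hypothetical helper: run a command with stdout and stderr written to
# files in OUTDIR instead of the PBS spool directory.
# Usage: run_redirected OUTDIR COMMAND [ARGS...]
run_redirected() {
    outdir=$1
    shift
    # Create the output directory if it does not exist yet
    mkdir -p "$outdir" || return 1
    # ">" redirects stdout, "2>" redirects stderr
    "$@" > "$outdir/output.file" 2> "$outdir/error.file"
}
```

A call such as run_redirected /shared.scratch/username myprogram.exe would then leave both files under /shared.scratch/username.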

I/O Redirection with mpiexec

If you are running your job with mpiexec and you need to perform I/O redirection, the proper format of the command will look like this:


mpiexec -comm mpich.gm myprogram.exe > /shared.scratch/username/output.file

Use the man mpiexec command to view an online manual page for more information on mpiexec I/O redirection options.

See our tutorials for more information on Linux I/O redirection operators.




Using /shared.scratch

All nodes on RTC have access to a 5 TB /shared.scratch storage space.  This storage is available to all users, is tuned for high throughput I/O, and is visible on all compute nodes and the login nodes.  To use /shared.scratch, create a directory of your own under /shared.scratch, copy your input files to it, and redirect your output files there as well.  When your job is finished, copy the final results to your permanent storage space on /users or /projects.  /shared.scratch is not for permanent storage: data files that have not been accessed in two weeks are deleted automatically by the system, and these files cannot be recovered.
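
The create, copy in, run, and copy out steps described above can be sketched as a single shell function.  This is an illustration under stated assumptions rather than a supported script: the function name stage_and_run, the file name error.file, and the paths in the usage line are all hypothetical.

```shell
# Hypothetical helper illustrating the stage-in / run / stage-out pattern.
# Usage: stage_and_run WORKDIR INPUT RESULTS COMMAND [ARGS...]
stage_and_run() {
    workdir=$1
    input=$2
    results=$3
    shift 3
    # Create the scratch working directory and the permanent results directory
    mkdir -p "$workdir" "$results" || return 1
    # Stage the input file into scratch
    cp "$input" "$workdir/" || return 1
    # Run inside the scratch directory with stdout and stderr redirected there
    ( cd "$workdir" && "$@" > output.file 2> error.file )
    status=$?
    # Copy the output back to permanent storage even if the command failed
    cp "$workdir/output.file" "$workdir/error.file" "$results/" || return 1
    return "$status"
}
```

A call might look like stage_and_run /shared.scratch/username/run1 /users/username/input.dat /users/username/results myprogram.exe input.dat, where every path and file name is a placeholder.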

 


Using /scratch

Each node on RTC has between 25 GB and 65 GB of storage space available to all users on /scratch.  This space is most appropriate for the output of your running job when the output will not exceed 10 GB per job, the job performs I/O infrequently, and each read/write request is small (less than 1 MB block size).  Jobs of this nature will perform better writing to /scratch than to /shared.scratch, since /shared.scratch is tuned for large data sets with frequent read/write requests of large amounts of data.  If your output exceeds 10 GB, /shared.scratch is the only remaining option.

Notice to Gaussian users: please use /shared.scratch for Gaussian scratch files.  Do not use /scratch.

NOTE: Each node has its own /scratch storage area, which is not shared with other nodes. If you are running a parallel job, for example, and each node in the job needs to read the same data file(s), then you should use /shared.scratch, not /scratch. All nodes have access to the same /shared.scratch area, but each node has access only to its own /scratch area. Data copied to /scratch on one node will not be visible to any other node.

NOTE: /scratch is shared among all processors on each node, so it might be used by 2-4 different jobs at once.  If this partition becomes full, it is likely that all of the jobs on that node will crash.  To prevent this, do not exceed the 10 GB recommendation.  Any data left on /scratch when a job exits will be deleted without notice and cannot be recovered.
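
Since other jobs may already be using part of /scratch, it can help to check the free space on the node before writing to it.  The following is a minimal sketch using the standard df and awk utilities.  The function names free_gb and has_room are hypothetical, and the result is only a snapshot: other jobs on the node can consume space after the check.

```shell
# Hypothetical helper: print the free space, in whole gigabytes, on the
# filesystem that holds DIR.
# Usage: free_gb DIR
free_gb() {
    # df -P prints one POSIX-format line per filesystem; -k reports 1 KB blocks.
    # Column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Hypothetical helper: succeed only if DIR has at least NEED_GB gigabytes free.
# Usage: has_room DIR NEED_GB
has_room() {
    [ "$(free_gb "$1")" -ge "$2" ]
}
```

In a batch script, has_room /scratch 10 || echo "/scratch is low on space" would print a warning when less than 10 GB is free on the node where the script runs.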

To use /scratch, copy your datasets from your source directory to the desired scratch directory at the start of your job, and copy the results back in the reverse direction at the end of your job. Here are examples of how to copy your data at the beginning and the end of each job:

# copy the data out to the nodes:
mpiexec -allstdin -comm=none -pernode cp /path/to/dataset /scratch
# run your MPI program:
mpiexec ./my-mpi-executable
# copy the data back in to the original location:
mpiexec -allstdin -comm=none -pernode cp /scratch/files /path/to/dataset

NOTE: It is necessary to call mpiexec three times in this example.  The first call copies your data from your dataset to /scratch.  The second call runs your program.  When your program terminates, it is necessary to go back to the nodes to retrieve your data, which is accomplished by the third call.  The third call can start only after the second call has completely finished, and it must itself finish before the walltime of the job expires.  If your job runs out of walltime before the third mpiexec call is finished, your PBS job will exit and the data stored on /scratch on all nodes will be deleted automatically.
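
One possible way to leave time for the third mpiexec call is to cap the run time of the second call below the requested walltime.  The sketch below uses the GNU coreutils timeout command for this.  It is an assumption, not an RTC recommendation: the function name run_with_reserve is hypothetical, and whether mpiexec on RTC shuts down its processes cleanly when it receives SIGTERM should be verified before relying on it.

```shell
# Hypothetical helper: run COMMAND for at most WALLTIME - RESERVE seconds,
# leaving RESERVE seconds for the copy-back step.
# Usage: run_with_reserve WALLTIME_SECONDS RESERVE_SECONDS COMMAND [ARGS...]
run_with_reserve() {
    walltime=$1
    reserve=$2
    shift 2
    limit=$(( walltime - reserve ))
    # Refuse to run if the reserve leaves no time for the command itself
    [ "$limit" -gt 0 ] || return 2
    # GNU timeout sends SIGTERM when the limit expires and exits with status 124
    timeout "$limit" "$@"
}
```

With a 24-hour walltime and 30 minutes reserved, the second call in the example above would become run_with_reserve 86400 1800 mpiexec ./my-mpi-executable.  A program stopped this way may leave incomplete output, so the reserve should be generous rather than exact.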


Using /users and /projects

The /users and /projects directories are intended for permanent data storage, not job I/O.  These filesystems are NFS filesystems and are not designed to handle high performance applications.  Using them for job I/O might result in severely degraded performance across the entire cluster, especially if the I/O of your job is heavy.  Use of these filesystems for I/O should only be done at the direction of the system administrators.


Questions

If you have questions or need assistance, please contact the Help Desk at 713-348-4357.

IT
Division of Information Technology
MS-119, P.O. Box 1892, Rice University, Houston, Texas 77251-1892
713-348-HELP(4357)