Guidelines for disk storage on Ada
10-Jun-2008

Introduction

There are several disk storage options on Ada for storing job output. The correct location depends on the size of the output and the I/O characteristics of the job. Two temporary storage areas are available for job output: /scratch, which is local storage on each node, and /lustre, which is a high performance filesystem shared by all nodes. The remainder of this document describes the appropriate ways to use the temporary storage areas, where to store job output permanently, and how to perform I/O redirection when running a job with mpiexec.

PBS job output (.OU) and error (.ER) files and I/O redirection

PBS writes the standard output (stdout) and standard error (stderr) of your jobs to files with .OU and .ER extensions, respectively. Anything your job writes to stdout is automatically written to a .OU file in your working directory. However, while the job is running these files are stored locally on each node in /var/spool/PBS. The .OU and .ER files are not moved to your home directory until the job exits.

Excessively large .OU files can fill up /var/spool/PBS and cause your job to crash. Furthermore, this directory is shared among all four processors on each node, so if it becomes full, the jobs on all four processors will crash. Therefore, if your stdout is going to be larger than approximately 50MB, redirect it to an output file on /lustre or /scratch.

NOTE: It is not enough to specify the -o argument in your PBS batch script. This argument only sets the final location of your .OU file after your job exits. The file is still created in /var/spool/PBS while the job is running. Instead, use Linux I/O redirection in the command that runs your program, so that the output is written directly to a file of your choice rather than to the .OU file in /var/spool/PBS.
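A minimal sketch of this kind of redirection, assuming a POSIX shell. The variable OUTDIR and the stand-in function my_program are hypothetical names used only for illustration: in a real batch script OUTDIR would be a directory you created under /lustre or /scratch, and my_program would be your executable.

```shell
#!/bin/sh
# Hypothetical output directory. In a real job, point this at a directory
# you created under /lustre or /scratch. The default below exists only so
# the sketch can be tried on any machine.
OUTDIR="${OUTDIR:-${TMPDIR:-/tmp}/redirect_demo}"
mkdir -p "$OUTDIR"

# Stand-in for your real program: one line to stdout, one line to stderr.
my_program() {
    echo "result line"
    echo "warning line" >&2
}

# '>' sends stdout to job.out and '2>' sends stderr to job.err, so neither
# stream is collected in the .OU/.ER files under /var/spool/PBS.
my_program > "$OUTDIR/job.out" 2> "$OUTDIR/job.err"
```

To keep both streams in a single file, `my_program > "$OUTDIR/job.out" 2>&1` works as well.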
If you are running your job with mpiexec and you need to perform I/O redirection, the proper format of the command will look like this:
Use the man mpiexec command to view an online manual page for more information on mpiexec I/O redirection options. See our tutorials for more information on Linux I/O redirection operators.
Using /lustre

All nodes on Ada have access to a 5TB /lustre storage space. This storage is available to all users and is tuned for large block (1MB block size or larger), high throughput I/O. It is visible on all compute nodes and the login nodes. To use /lustre, create a directory under /lustre, copy your input files to this location, and redirect your output files here as well. When your job is finished, copy the final results to your permanent storage space on /home (also called /users) or /projects.

/lustre is not for permanent storage. Data files not accessed in two weeks will be deleted automatically by the system. Recovery of these files is not possible.

Writing data to /lustre using small block sizes will degrade the performance of the /lustre filesystem, since it is not tuned for small data sets or I/O with small block sizes.

Data Organization within /lustre

Your data organization on /lustre can impact your job I/O performance. Many jobs performing I/O to the same directory concurrently will degrade I/O performance on that directory due to the way the Lustre filesystem keeps track of your files. If you observe poor I/O performance, distribute your data across multiple directories.

More importantly, storing thousands of files in a single directory will seriously degrade I/O performance in that directory, which can result in jobs hanging and being unable to exit. This in turn negatively impacts the job scheduler, causing a cascade effect across the cluster. The negative impact on the cluster grows exponentially with the number of running jobs that are using the overloaded directory. If your workload must access thousands of files, you must distribute the files proportionally across multiple subdirectories.

A good way to determine if a directory is overloaded is to go to that directory and get a directory listing with the ls command.
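One way to take this measurement is sketched here, assuming a shell where `date +%s` is available (it is on Linux). DIR is a hypothetical variable naming the directory to check; it defaults to the current directory.

```shell
#!/bin/sh
# DIR is a hypothetical variable: the directory to check.
DIR="${DIR:-.}"

start=$(date +%s)
# Count the entries instead of printing them to the terminal.
count=$(ls -A "$DIR" | wc -l | tr -d ' ')
end=$(date +%s)
elapsed=$((end - start))

echo "$DIR: $count entries listed in $elapsed seconds"
```

The timing has a resolution of one second, which is enough to tell a listing of a few seconds apart from one of several minutes.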
The output of the command should only take a few seconds. If it takes longer than this, consider the directory overloaded. As an example, a directory with 50,000 files was observed to take almost 3 minutes to produce a directory listing. This is unacceptable performance for most job scheduling components, which will then fail and never recover. This in turn degrades the entire cluster.

Using /scratch

Each node on Ada has a 60GB /scratch storage space available to all users. It is most appropriate to use this storage space for the output of your running job when your output will not exceed 10GB per job, your job performs I/O infrequently, and the block size per read/write request is small (less than 1MB). Jobs of this nature will perform better writing to /scratch than to /lustre, since /lustre is tuned for large data sets with frequent read/write requests of large amounts of data. If your output exceeds 10GB, then using /lustre is the only remaining option.

Notice to Gaussian users: please use /lustre for Gaussian scratch files. Do not use /scratch.

NOTE: /scratch is shared among all four processors on each node. Therefore, it might be used by four different jobs at once. If this partition becomes full, it is likely that all of the jobs on that node will crash. It is important not to exceed the 10GB recommendation to prevent jobs from crashing. Any data left on /scratch when a job exits will be deleted without notice and cannot be recovered.

Best Practices for Using /scratch (Using $TMPDIR)

The best way to use /scratch is through the PBS $TMPDIR variable. This is an environment variable that PBS sets for each running job. It points to a temporary directory that PBS creates for you on /scratch. Writing your output into the $TMPDIR directory ensures that you have your own directory, separate from any other job, and can write your data without concern for overwriting the files of other jobs. When your program has finished, and before the job exits, copy your data from $TMPDIR to its permanent storage location in /home (also called /users) or /projects. When your job exits, PBS will automatically delete the temporary directory and all of its contents. Recovery of these files is not possible.

Here are two example PBS batch scripts that use $TMPDIR (assuming output goes to stdout):
NOTE: It is necessary to call mpiexec twice in this example. The first call runs your program, which generates output in $TMPDIR. When your program terminates, it is necessary to go back to the nodes to retrieve your data. This is accomplished by the second call to mpiexec.

If the data you have stored in $TMPDIR is scratch data only and does not need to be saved, then the second call to mpiexec is not necessary. The data will be deleted automatically when the PBS job terminates.
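The run-then-copy sequence can be sketched in its simplest form as follows. This is a single-node illustration that does not call mpiexec, and several names are assumptions: my_program is a stand-in for the real executable, DEST is a hypothetical results directory, the #PBS directive values are placeholders, and the mktemp fallback exists only so the script can be tried outside PBS, where $TMPDIR may be unset.

```shell
#!/bin/sh
#PBS -N tmpdir_sketch
#PBS -l walltime=00:10:00

# Inside a PBS job, TMPDIR points at the per-job directory on /scratch.
# The mktemp fallback is only for trying this sketch outside PBS.
WORK="${TMPDIR:-$(mktemp -d)}"

# Hypothetical permanent location for the results.
DEST="${DEST:-${HOME:-/tmp}/tmpdir_sketch_results}"
mkdir -p "$DEST"

# Stand-in for the real program; it writes its results to stdout.
my_program() {
    echo "computed on $(uname -n)"
}

# Step 1: run the program with stdout redirected into the job's
# temporary directory.
my_program > "$WORK/result.out"

# Step 2: copy the results to permanent storage before the job exits,
# because PBS deletes the temporary directory afterwards.
cp "$WORK/result.out" "$DEST/"
```

In a multi-node job, each of the two steps would be launched through mpiexec so that it runs on the nodes holding the data, as the note above describes.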
It is very important to note that the second call to mpiexec can only work once the first call has completely finished. If your job runs out of walltime before the second mpiexec call is finished, your PBS job will exit and the data stored in $TMPDIR on all nodes will be deleted automatically.
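One way to reduce this risk is to cap the run time of the first step so that time remains for the copy. The following is only a sketch and not a documented Ada procedure: it relies on the `timeout` command from GNU coreutils, and RUN_LIMIT, DEST, and the `sh -c` stand-in for the real program are hypothetical. RUN_LIMIT is given in seconds and would be chosen comfortably below the walltime requested from PBS.

```shell
#!/bin/sh
# Hypothetical time budget for the compute step, in seconds.
# Choose a value below the walltime requested from PBS.
RUN_LIMIT="${RUN_LIMIT:-3000}"

WORK="${TMPDIR:-$(mktemp -d)}"
DEST="${DEST:-${HOME:-/tmp}/walltime_sketch_results}"
mkdir -p "$DEST"

# Compute step, stopped by timeout if it runs longer than RUN_LIMIT.
# The 'sh -c' command is a stand-in for the real program.
timeout "$RUN_LIMIT" sh -c 'echo "partial or complete results"' > "$WORK/run.out"
status=$?

# timeout exits with status 124 when it had to stop the command.
if [ "$status" -eq 124 ]; then
    echo "compute step hit the ${RUN_LIMIT}s limit; saving what exists" >&2
fi

# Copy step: starts only after the compute step has ended.
cp "$WORK/run.out" "$DEST/"
```

Whether partially written output is worth saving depends on the program, so the copy step here is unconditional only for simplicity.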
Using /home, /users and /projects

The /home (also called /users) and /projects directories are intended for permanent data storage, not job I/O. These filesystems are NFS filesystems and are not designed to handle high performance applications. Using them for job I/O might result in severely degraded performance across the entire cluster, especially if the I/O of your job is heavy. Use of these filesystems for I/O should only be done at the direction of the system administrators.