Why is my program slow?
Job Efficiency on RTC
15-May-2008

Introduction

Programs can run slowly on RTC for a variety of reasons. Among them are "floating point assist errors", NFS disk performance issues, and resource bottlenecks on shared nodes. This document describes each of these issues.

Floating Point Assist Errors

If your program slows down consistently every time you run it, you may be suffering from a common problem on Itanium CPUs called "floating point assist errors". To determine if this is the case, recompile, relink, and rerun using the -ftz flag to see if your code speeds up. See the remainder of this FAQ for more details.

What is this 'floating point assist' error?

When running at the shell command prompt or looking through the kernel's log file in /var/log/messages, you may encounter messages of the form:

    test-fpsr(1416): floating-point assist fault at ip 40000000000005d2

This message means that program "test-fpsr" with process id 1416 performed a floating-point operation that required software assistance. On Itanium, this usually happens when operating on IEEE denormals (nonzero floating point numbers too small in magnitude to be represented in IEEE normalized form). As with unaligned accesses, these operations are emulated in the kernel, so there is nothing to worry about in terms of correctness. It is the logging of the events that slows down your code.

To turn off the exception logging, Intel's Itanium library provides a function to turn on "flush-to-zero" mode, which avoids preserving and logging IEEE denormals. This mode can be turned on in two ways:

1. Compile and link using the -ftz or -ffast-math flag.
2. Use the following code fragment in each affected function:

    #include ?

If 'floating point assist' is not your problem, then you may need to profile your program to see where the time is being spent. See our profiling FAQ for more details.
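Since floating point assist faults are usually triggered by IEEE denormals, it can help to check whether your data contains any before changing compiler flags. The following is a minimal sketch in standard C99, not Itanium-specific and not part of any RTC tooling. The function names is_denormal and count_denormals are hypothetical, chosen only for illustration. It uses the standard fpclassify macro from math.h and does not change the floating point mode.

```c
#include <math.h>
#include <stddef.h>

/* Return 1 if x is an IEEE denormal (subnormal) value, 0 otherwise.
 * Zero, normal numbers, infinities, and NaNs all return 0. */
int is_denormal(double x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}

/* Count the denormal entries in an array of n doubles.
 * Hypothetical helper: call it on intermediate results you suspect
 * of underflowing, e.g. after a long chain of multiplications. */
size_t count_denormals(const double *data, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_denormal(data[i]))
            count++;
    }
    return count;
}
```

If count_denormals reports a nonzero count for arrays in your hot loops, flush-to-zero mode is likely to help. Note that a program already built with flush-to-zero enabled may never produce denormals, so run this check on a build without that flag.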

Filesystems for Job I/O

If your program's slowdown is intermittent, appearing especially when you read or write more data than usual or when the cluster is very busy, the cause may be slow NFS disk performance. RTC has several filesystems available for storing data. The two most commonly used are /users and /projects, which are designed for permanent storage of data and are not intended for job I/O. Using these filesystems for job I/O might degrade the performance of your jobs. For better performance, use the local disk on each node or the /shared.scratch filesystem. To determine which filesystem to use, please see our FAQ regarding disk storage guidelines.

Exclusive Access to Nodes

If you have a job that is resource intensive (e.g. in disk or memory use), it might be desirable to request exclusive access to nodes for your jobs. By default, RTC will pack up to two jobs per node (one per processor), or up to four on the quad-processor nodes. These jobs might all belong to you, or some might belong to someone else. If your jobs are CPU intensive, this might not be a problem. However, if your jobs are memory or disk intensive, they might be competing for resources with other jobs on the same node, which degrades the performance of all of the jobs on the node. Instructions for requesting exclusive access can be found in our node access FAQ.

Physical Memory Limitations

Each node on RTC has 2GB of physical memory shared among all of the processors on that node (except for the largemem nodes, which have more). Because all nodes on the cluster are shared among all users, a single process owned by a single user might be consuming all of the memory on a node. If this happens, new jobs assigned to that node will likely not have enough memory to run. If your job is memory intensive, it is best to request a node that has enough free physical memory to run the job, as described in our FAQ.
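
To see how much physical memory a node actually has, and how much is currently free, a job can query the operating system at startup. The sketch below assumes a Linux node with glibc, where sysconf accepts the _SC_PHYS_PAGES and _SC_AVPHYS_PAGES names (these are glibc extensions, not part of POSIX). The function names are hypothetical, and this is not how the RTC scheduler itself selects nodes.

```c
#include <unistd.h>

/* Total physical memory on this node in bytes, or -1 on error.
 * Assumes glibc: _SC_PHYS_PAGES is an extension, not POSIX. */
long long total_physical_bytes(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages < 0 || page_size < 0)
        return -1;
    return (long long)pages * page_size;
}

/* Physical memory currently available on this node in bytes,
 * or -1 on error. Assumes glibc: _SC_AVPHYS_PAGES is an extension. */
long long available_physical_bytes(void)
{
    long pages = sysconf(_SC_AVPHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages < 0 || page_size < 0)
        return -1;
    return (long long)pages * page_size;
}
```

A job could print both values when it starts, so that a slow run can later be matched against a node that was already short on free memory.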