Shared Tightly Integrated Cluster (STIC)
stic.rice.edu
System News
STIC system stability enhancements
posted on 2009-11-12 10:34:41
We have resolved two major outstanding system issues with STIC in the past two weeks: one concerning the scheduling stack (queuing system) and the other affecting local storage in the nodes. If you experienced queue delays or mysterious job terminations, these issues were very likely the cause.
You should expect the stability of the system to increase gradually over the next few months, and we hope to reach full production state early in 2010.
Thank you for your understanding, and keep the feedback coming. It is the most effective way we have to enhance the productivity of Rice's computing resources.
STIC online 10/13, 5PM
posted on 2009-10-14 22:01:48
The electrical upgrade has been completed and the system has been released back to the users.
The system is still in limited production. The only job crashes reported so far were known symptoms of the local disk anomaly, so *please* let us know if your job exits abnormally, hangs, dies, or exhibits any strange behavior that cannot clearly be attributed to the application.
Maintenance extended until 6PM
posted on 2009-10-13 12:28:16
We have completed the maintenance on the drives, but we have decided to upgrade the power distribution and redundancy now, while the system is offline. The electrical equipment arrived earlier than we anticipated, so we will take advantage of this opportunity. This saves us from scheduling another maintenance period in a week. The system will be back online no later than 6 PM.
System reboot at 10AM on Oct. 13
posted on 2009-10-13 10:04:01
We have been monitoring an intermittent SATA drive problem that has caused sporadic local disk crashes. We have traced the problem to a very peculiar uptime-triggered hard drive issue that requires a drive firmware fix to correct. We will roll the firmware update into the upcoming full maintenance period and, for now, perform a simple power cycle instead. This will allow us to avoid the disk problem for 49 more days.
Following the vendor's recommendation, we will shut down (completely power off) all nodes on Tuesday, October 13th at 10 AM in order to remove the threat of local storage crashes on the compute blades. The Compute queue will be safely drained, but jobs in the Exception queue may be terminated prematurely.
This maintenance should take no more than 2-3 hours to complete and confirm.
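For anyone planning long-running work around this window, the 49-day figure can be turned into a rough date. The sketch below is only illustrative: the function name is hypothetical, and it assumes the uptime clock restarts at the scheduled 10 AM power-off on October 13, 2009, which may differ from when the nodes actually come back up.

```python
from datetime import datetime, timedelta

# Figure quoted in the announcement: a power cycle avoids the disk problem for 49 days.
SAFE_UPTIME = timedelta(days=49)


def next_power_cycle_deadline(last_power_cycle: datetime) -> datetime:
    """Return the latest time by which the nodes should be power cycled again."""
    return last_power_cycle + SAFE_UPTIME


# Assumption: the clock restarts at the scheduled 10 AM shutdown on October 13, 2009.
deadline = next_power_cycle_deadline(datetime(2009, 10, 13, 10, 0))
print(deadline.strftime("%Y-%m-%d %H:%M"))  # prints 2009-12-01 10:00
```

Under these assumptions the firmware fix, or another power cycle, would need to happen by about December 1, 2009.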