[torqueusers] Strange problem in Torque 2.5.7+

Wickliffe, Blake W blake.wickliffe at aramco.com
Fri Sep 9 22:42:20 MDT 2011


Hello,

We've been bitten by a strange problem in Torque twice now, so I thought I'd check whether anyone else has run into it.  We are running Torque 2.5.7 on a large-ish cluster (3000+ nodes), and the pbs_server daemon hangs.  While it is hung, all qstat and pbsnodes commands fail.  The pbs_server process is still in memory, but it drops to 0% CPU utilization.

Restarting pbs_server brings it back up for a few seconds, but then it hangs again.  If I clear out all the jobs in the "jobs" directory and restart the server, it comes back up fine.  The last time this happened, I was able to move jobs back into the directory a few at a time, restarting pbs_server after each batch, until I had isolated the few jobs that were causing the server to hang.  When I checked the job files, all of these jobs had been running on two nodes that had crashed.
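
The isolation procedure described above (restore a batch of jobs, restart, see whether the server hangs) amounts to a bisection over the job files. A minimal sketch in Python follows, assuming each problem job hangs the server on its own rather than only in combination with others. The `hangs` callback and the job file names are hypothetical and not part of Torque. A real callback would have to stop pbs_server, place only the given files in the "jobs" directory, restart the server, and report whether qstat stops responding.

```python
from typing import Callable, List, Sequence


def isolate_bad_jobs(
    jobs: Sequence[str],
    hangs: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Return the jobs that make the server hang.

    `hangs(subset)` is supplied by the caller (hypothetical helper): it
    must return True if the server hangs when only `subset` is loaded.
    Assumption: a bad job hangs the server by itself, so a batch hangs
    exactly when it contains at least one bad job.
    """
    # An empty batch, or one the server survives, contains no bad jobs.
    if not jobs or not hangs(jobs):
        return []
    # A hanging batch of one job is itself the culprit.
    if len(jobs) == 1:
        return [jobs[0]]
    # Otherwise split the batch and test each half separately.
    mid = len(jobs) // 2
    return isolate_bad_jobs(jobs[:mid], hangs) + isolate_bad_jobs(jobs[mid:], hangs)


if __name__ == "__main__":
    # Simulated run with made-up job names and a fake predicate.
    all_jobs = [f"job-{n}" for n in range(1000, 1016)]
    bad = {"job-1003", "job-1011"}

    def fake_hangs(subset: Sequence[str]) -> bool:
        return any(job in bad for job in subset)

    print(isolate_bad_jobs(all_jobs, fake_hangs))
```

With only a few bad jobs among thousands, halving each hanging batch needs far fewer server restarts than adding jobs back a handful at a time, although every test still costs one restart.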

So, in essence, a pbs_mom node crashed and took down the entire cluster with it.  As I said, we've seen this happen twice now.  Has anyone else seen this?

Regards,

Blake Wickliffe
Saudi Aramco
ENOD/CSYS/USG HPC Team
(873-4417)



