[torqueusers] Strange problem in Torque 2.5.7+

Charles Johnson charles.johnson at accre.vanderbilt.edu
Mon Sep 12 06:39:01 MDT 2011

On Sep 9, 2011, at 11:42 PM, Wickliffe, Blake W wrote:

> We’ve been bitten by a strange problem twice now in Torque, so I thought I’d check to see if anyone else has run into it.  We are running Torque 2.5.7 on a large-ish cluster (3000+ nodes) and the pbs_server daemon hangs.  All qstat or pbsnodes commands fail.  The process is still in memory but it drops to 0% CPU utilization.
> Restarting the pbs_server allows it to come back up for a few seconds but then it hangs again.  If I clear out all the jobs in the “jobs” directory and restart the server it comes back up fine.  The last time this happened, I was able to move jobs back into the directory a few at a time and keep restarting the pbs_server until I isolated the few jobs that were causing the server to hang.  Checking the files, all of these jobs were running on two nodes that had crashed.


Yes, we experience the same problem with the same symptoms. It takes me a while to track down the offending nodes. We have upgraded to 2.5.8 without relief. We use Moab 6.0.1 as well.
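
The "move jobs back a few at a time and restart" procedure quoted above is essentially a bisection, and it could be scripted along these lines. This is only a rough sketch in Python, not something taken from Torque itself: the function name isolate_bad_jobs and the server_hangs callback are hypothetical, and the job file names are made up. In real use the callback would have to copy the given subset into the jobs directory, restart pbs_server, and report whether it stops responding. The sketch also assumes that each bad job causes the hang on its own, so any subset containing one will hang.

```python
from typing import Callable, List, Sequence


def isolate_bad_jobs(
    jobs: Sequence[str],
    server_hangs: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Return the jobs that trigger a hang, found by repeated halving.

    `server_hangs` is a hypothetical callback supplied by the caller: it
    loads only the given jobs, restarts the server, and returns True if
    the server hangs.
    """
    # An empty or healthy subset contains no offenders.
    if not jobs or not server_hangs(jobs):
        return []
    # A single job that hangs the server is an offender.
    if len(jobs) == 1:
        return [jobs[0]]
    # Otherwise split in half and test each half separately.
    mid = len(jobs) // 2
    return (isolate_bad_jobs(jobs[:mid], server_hangs)
            + isolate_bad_jobs(jobs[mid:], server_hangs))


if __name__ == "__main__":
    # Simulated run: no real server is touched, names are illustrative.
    bad = {"1003.JB", "1007.JB"}
    all_jobs = [f"{n}.JB" for n in range(1000, 1010)]

    def fake_server_hangs(subset: Sequence[str]) -> bool:
        return any(job in bad for job in subset)

    print(isolate_bad_jobs(all_jobs, fake_server_hangs))
    # prints ['1003.JB', '1007.JB']
```

With a few thousand job files and only a handful of bad ones, halving needs far fewer restarts than adding jobs back in small fixed batches, which matters when every test means restarting pbs_server.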

Charles Johnson, Vanderbilt University
Advanced Computing Center for Research & Education
Mailing Address:  Peabody #34, 230 Appleton Place, Nashville, TN 37203
Shipping Address: 1231 18th Avenue South, Hill Center, Suite 143, Nashville, TN 37212
Office: 615-343-4134
Cell: 615-478-7788
Fax: 615-343-7216
charles.johnson at accre.vanderbilt.edu
