[torqueusers] Strange problem in Torque 2.5.7+

Charles Johnson charles.johnson at accre.vanderbilt.edu
Mon Sep 12 06:39:01 MDT 2011


On Sep 9, 2011, at 11:42 PM, Wickliffe, Blake W wrote:

> We’ve been bitten by a strange problem twice now in Torque, so I thought I’d check to see if anyone else has run into it.  We are running Torque 2.5.7 on a large-ish cluster (3000+ nodes) and the pbs_server daemon hangs.  All qstat or pbsnodes commands fail.  The process is still in memory but it drops to 0% CPU utilization.
>  
> Restarting the pbs_server allows it to come back up for a few seconds but then it hangs again.  If I clear out all the jobs in the “jobs” directory and restart the server it comes back up fine.  The last time this happened, I was able to move jobs back into the directory a few at a time and keep restarting the pbs_server until I isolated the few jobs that were causing the server to hang.  Checking the files, all of these jobs were running on two nodes that had crashed.

+1

Yes, we experience the same problem with the same symptoms. It takes me a while to track down the offending nodes. We have upgraded to 2.5.8 without relief. We use Moab 6.0.1 as well.
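
A rough sketch of how the search for the offending jobs might be shortened, offered as an illustration rather than a tested fix. It assumes the hostname of each crashed node appears as plain text somewhere in the job files (Blake found the nodes by checking the files). The function name, the directory argument, and the node names are all hypothetical; point it at a copy of your own "jobs" directory with pbs_server stopped.

```python
from pathlib import Path


def jobs_referencing_nodes(jobs_dir, down_nodes):
    """Return {job file name: sorted list of down nodes mentioned in it}.

    Assumption: hostnames are stored as plain bytes inside the job files.
    This is a raw substring search, so "node1" also matches "node10";
    use full, unambiguous hostnames.
    """
    needles = {node: node.encode() for node in down_nodes}
    hits = {}
    for path in sorted(Path(jobs_dir).iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()
        found = sorted(name for name, raw in needles.items() if raw in data)
        if found:
            hits[path.name] = found
    return hits
```

Calling jobs_referencing_nodes() with the directory and the list of crashed hostnames gives a short list of candidate job files to move aside first, instead of restarting pbs_server after every few files. It only narrows the search; it does not explain why the server hangs.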

~Charles~
--
Charles Johnson, Vanderbilt University
Advanced Computing Center for Research & Education
Mailing Address:  Peabody #34, 230 Appleton Place, Nashville, TN 37203
Shipping Address: 1231 18th Avenue South, Hill Center, Suite 143, Nashville, TN 37212
Office: 615-343-4134
Cell: 615-478-7788
Fax: 615-343-7216
charles.johnson at accre.vanderbilt.edu


