[torqueusers] Strange problem in Torque 2.5.7+
Charles Johnson
charles.johnson at accre.vanderbilt.edu
Mon Sep 12 06:39:01 MDT 2011
On Sep 9, 2011, at 11:42 PM, Wickliffe, Blake W wrote:
> We’ve been bitten by a strange problem twice now in Torque, so I thought I’d check to see if anyone else has run into it. We are running Torque 2.5.7 on a large-ish cluster (3000+ nodes) and the pbs_server daemon hangs. All qstat or pbsnodes commands fail. The process is still in memory but it drops to 0% CPU utilization.
>
> Restarting the pbs_server allows it to come back up for a few seconds but then it hangs again. If I clear out all the jobs in the “jobs” directory and restart the server it comes back up fine. The last time this happened, I was able to move jobs back into the directory a few at a time and keep restarting the pbs_server until I isolated the few jobs that were causing the server to hang. Checking the files, all of these jobs were running on two nodes that had crashed.
+1
Yes, we experience the same problem with the same symptoms. It takes me a while to track down the offending nodes. We have upgraded to 2.5.8 without relief. We use Moab 6.0.1 as well.
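For what it's worth, the "move jobs back a few at a time and restart" hunt described above could probably be shortened by bisecting the set of job files instead of walking through it linearly. The following is only a rough sketch of that idea, not something taken from Torque itself. The function name `find_bad_jobs`, the `hangs` callback, and the job file names in the demo are all hypothetical. It also assumes each bad job triggers the hang on its own, independent of which other jobs are loaded.

```python
from typing import Callable, List, Sequence


def find_bad_jobs(jobs: Sequence[str],
                  hangs: Callable[[Sequence[str]], bool]) -> List[str]:
    """Bisect `jobs` to find every job that makes `hangs` return True.

    Assumption: each bad job causes the hang by itself, so a subset
    hangs if and only if it contains at least one bad job.
    """
    # A subset that does not hang contains no bad jobs at all.
    if not jobs or not hangs(jobs):
        return []
    # A single job that still hangs the server is a culprit.
    if len(jobs) == 1:
        return [jobs[0]]
    # Otherwise split in half and search both halves, since there
    # may be more than one bad job.
    mid = len(jobs) // 2
    return find_bad_jobs(jobs[:mid], hangs) + find_bad_jobs(jobs[mid:], hangs)


# Demo with a stand-in for the real check. The file names are made up.
all_jobs = ["job%02d" % n for n in range(16)]
known_bad = {"job03", "job11"}


def fake_hangs(subset: Sequence[str]) -> bool:
    # Hypothetical placeholder: pretend the server hangs whenever
    # a known-bad job is present in the subset.
    return any(job in known_bad for job in subset)


print(find_bad_jobs(all_jobs, fake_hangs))
```

In real use, `hangs` would have to do what was done by hand: put only that subset of files into the "jobs" directory, restart pbs_server, and see whether qstat still answers after a few seconds. That part is site-specific and is deliberately left out here. With a handful of bad jobs among thousands, this should need far fewer restarts than adding jobs back a few at a time.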
~Charles~
--
Charles Johnson, Vanderbilt University
Advanced Computing Center for Research & Education
Mailing Address: Peabody #34, 230 Appleton Place, Nashville, TN 37203
Shipping Address: 1231 18th Avenue South, Hill Center, Suite 143, Nashville, TN 37212
Office: 615-343-4134
Cell: 615-478-7788
Fax: 615-343-7216
charles.johnson at accre.vanderbilt.edu