[torqueusers] Strange problem in Torque 2.5.7+

"Mgr. Šimon Tóth" toth at fi.muni.cz
Tue Sep 13 03:05:33 MDT 2011


> We’ve been bitten by a strange problem twice now in Torque, so I thought
> I’d check to see if anyone else has run into it.  We are running Torque
> 2.5.7 on a large-ish cluster (3000+ nodes) and the pbs_server daemon
> hangs.  All qstat or pbsnodes commands fail.  The process is still in
> memory but it drops to 0% CPU utilization.
>
> Restarting the pbs_server allows it to come back up for a few seconds
> but then it hangs again.  If I clear out all the jobs in the “jobs”
> directory and restart the server it comes back up fine.  The last time
> this happened, I was able to move jobs back into the directory a few at
> a time and keep restarting the pbs_server until I isolated the few jobs
> that were causing the server to hang.  Checking the files, all of these
> jobs were running on two nodes that had crashed.
>
> So, in essence, a pbs_mom node crashed and took down the entire cluster
> with it.  As I said, we’ve seen this happen twice now.  Has anyone else
> seen this?

The issue is that the dis_tcp_wflush() function can block for a very long
time. The server waits until all of the data has been sent, which can take
hours if the other side is slow enough. If the other side is dead, the
call hangs until the connection times out.

The attached patch shows what we have done with the tcp_dis.c file.
Sorry that the patch isn't clean; unfortunately, we carry a lot of local
fixes and I don't really have the time to separate out this specific one.

Specifically, ignore the GSSAPI (Kerberos) parts and concentrate on the
dis_tcp_wflush() function. The alarms are the important part.
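
To illustrate the alarm idea without the rest of the patch, here is a
minimal sketch. It is not the code from tcp.diff and not the Torque
source: write_all_with_alarm(), on_alarm() and the timed_out flag are
hypothetical names made up for this example, and it assumes a blocking
file descriptor and that nothing else in the process uses SIGALRM.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Set by the SIGALRM handler once the time budget is used up. */
static volatile sig_atomic_t timed_out;

static void on_alarm(int sig)
{
    (void)sig;
    timed_out = 1;
}

/*
 * Hypothetical helper (illustration only): write len bytes to a blocking
 * fd, giving up after timeout_sec seconds in total. A timeout_sec of 0
 * disables the alarm, so the call can then block indefinitely.
 * Returns 0 on success, -1 on failure; errno is ETIMEDOUT on timeout.
 */
int write_all_with_alarm(int fd, const char *buf, size_t len,
                         unsigned timeout_sec)
{
    struct sigaction sa, old_sa;
    int rc = 0;
    int saved_errno = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* no SA_RESTART: a blocked write() must be interrupted */
    if (sigaction(SIGALRM, &sa, &old_sa) != 0)
        return -1;

    timed_out = 0;
    alarm(timeout_sec);

    /*
     * An interrupted write() may return a short count instead of EINTR,
     * so the flag is checked before every call. A small race remains if
     * the alarm fires between the check and the write() call.
     */
    while (len > 0 && !timed_out) {
        ssize_t n = write(fd, buf, len);
        if (n > 0) {
            buf += n;
            len -= (size_t)n;
        } else if (n < 0 && errno != EINTR) {
            rc = -1;            /* real I/O error */
            saved_errno = errno;
            break;
        }
    }

    alarm(0);                           /* cancel any pending alarm */
    sigaction(SIGALRM, &old_sa, NULL);  /* restore the previous handler */

    if (rc == 0 && len > 0) {           /* loop ended because of the alarm */
        rc = -1;
        saved_errno = ETIMEDOUT;
    }
    if (rc != 0)
        errno = saved_errno;
    return rc;
}
```

A caller would treat a -1 return with errno set to ETIMEDOUT as "the peer
is too slow or dead" and close the connection instead of waiting. Using
a non-blocking socket with poll() avoids the remaining race, at the cost
of a larger change.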

We are using 60-second timeouts; you should tailor that to your cluster.
Most of the data sent through here is qstat replies, which for 8000 jobs
should be somewhere around 100MB (roughly 12.5KB per job).

-- 
Mgr. Simon Toth
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tcp.diff
Type: text/x-patch
Size: 5271 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20110913/dfc1ae0f/attachment-0001.bin 

