[torqueusers] Strange problem in Torque 2.5.7+

Ken Nielson knielson at adaptivecomputing.com
Tue Sep 13 09:08:16 MDT 2011


----- Original Message -----
> From: "Mgr. Šimon Tóth" <toth at fi.muni.cz>
> To: "Torque Users Mailing List" <torqueusers at supercluster.org>
> Sent: Tuesday, September 13, 2011 3:05:33 AM
> Subject: Re: [torqueusers] Strange problem in Torque 2.5.7+
> 
> > We’ve been bitten by a strange problem twice now in Torque, so I
> > thought I’d check to see if anyone else has run into it.  We are
> > running Torque 2.5.7 on a large-ish cluster (3000+ nodes) and the
> > pbs_server daemon hangs.  All qstat or pbsnodes commands fail.  The
> > process is still in memory but it drops to 0% CPU utilization.
> >
> > Restarting the pbs_server allows it to come back up for a few
> > seconds but then it hangs again.  If I clear out all the jobs in the
> > “jobs” directory and restart the server, it comes back up fine.  The
> > last time this happened, I was able to move jobs back into the
> > directory a few at a time and keep restarting the pbs_server until I
> > isolated the few jobs that were causing the server to hang.
> > Checking the files, all of these jobs were running on two nodes that
> > had crashed.
> >
> > So, in essence, a pbs_mom node crashed and took down the entire
> > cluster with it.  As I said, we’ve seen this happen twice now.  Has
> > anyone else seen this?
> 
> The issue is that the dis_tcp_wflush() function can hang for a long
> time.  The server waits until all data are sent, which can take hours
> if the other side is slow enough.  It also hangs until the TCP
> timeouts expire when the other side is dead.
> 
> The patch I included is what we have done with the tcp_dis.c file.
> Sorry that the patch isn't clean, but unfortunately we carry a lot of
> local fixes and I don't really have the time to dig this specific one
> out.
> 
> Specifically, ignore the GSSAPI (Kerberos) stuff and concentrate on
> the dis_tcp_wflush() function.  The alarms are the important part.
> 
> We are using 60-second timeouts; you should tailor that to your
> cluster.  The largest thing sent here is a qstat reply, which for
> 8000 jobs should be somewhere around 100 MB of data.
> 
> --
> Mgr. Simon Toth
> 

I just looked at Simon's patch and it is headed in the right direction. If the problem you have is that the TCP calls do not return promptly, then you can be waiting a long time (10800 seconds by default).
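For anyone who wants to experiment before a proper fix lands, the general idea is to bracket the blocking write with an alarm so that a dead or slow peer cannot stall the server indefinitely. The sketch below only illustrates that technique and is not Simon's actual patch: the names flush_with_timeout() and WRITE_TIMEOUT are made up, and the real dis_tcp_wflush() works on the DIS transmit buffers rather than a raw byte buffer.

/* Illustrative sketch only: SIGALRM-based timeout around a blocking
 * write().  flush_with_timeout() and WRITE_TIMEOUT are hypothetical
 * names, not part of the Torque source. */
#include <errno.h>
#include <signal.h>
#include <unistd.h>

#define WRITE_TIMEOUT 60        /* seconds; tune to your cluster */

static void alarm_handler(int sig)
  {
  (void)sig;                    /* only purpose is to interrupt write() */
  }

static ssize_t flush_with_timeout(int fd, const char *buf, size_t len)
  {
  struct sigaction sa, old;
  size_t written = 0;

  sa.sa_handler = alarm_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;              /* no SA_RESTART, so write() returns EINTR */
  sigaction(SIGALRM, &sa, &old);

  alarm(WRITE_TIMEOUT);         /* one budget for the whole flush */

  while (written < len)
    {
    ssize_t n = write(fd, buf + written, len - written);

    if (n < 0)
      {
      /* EINTR here means the alarm fired: give up instead of hanging.
       * Any other errno is a real I/O error. */
      break;
      }

    written += (size_t)n;
    }

  alarm(0);                     /* cancel any pending alarm */
  sigaction(SIGALRM, &old, NULL);

  return (written == len) ? (ssize_t)written : -1;
  }

In a real server the timeout would come from configuration rather than a compile-time constant, and the failure would need to be propagated so the connection gets closed instead of retried forever.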

Regards

Ken Nielson
Adaptive Computing

