[torqueusers] Serious torque failure problems

Marc Langlois marc at keyseismic.com
Thu Aug 11 09:24:15 MDT 2005


On Thu, 2005-08-11 at 07:52, Chris Johnson wrote:
>      Hi,
> 
>      Running torque-1.2.0p2 here on CentOS 4.X across a cluster composed 
> of AMDs, P4s, XEONs, and Opterons, several hundred total.
> 
>      At the moment torque is becoming useless.  We are using the
> out-of-the-box C scheduler and it keeps dying completely.  We are seeing jobs
> running on nodes which they are not listed in any log as officially having
> been run on.  The scheduler gets hung up on problem nodes and tries to
> run many many jobs on the bad node.  The scheduler/server can't pick up
> on the fact that a node is bad due to a situation in which mom seems to 
> respond but other linux services have failed.  And my boss is about ready
> to trash torque.  Can't blame him, we're spending way too much time
> maintaining this cluster.  Researchers aren't any too thrilled either.  
> 
>      As far as I know torque is used in a lot of places.  And I don't hear
> about these problems other places.  What the hell is going on?  I REALLY
> need to get this corrected.  And I'll provide any information I can.  
> But my community is about ready to start telling people what not to use for 
> cluster operations.  
> 
>      Help would be GREATLY appreciated.  
> 

Hi Chris,

I had a similar experience with PBS a few years ago. My experience may
be a bit dated, but I found that the default C scheduler worked fine for
testing and failed miserably as soon as I rolled it into production.
Switching to the Maui scheduler solved all my problems.

Another thing that helped was using the "--disable-rpp" flag when
running configure for torque. It seems that RPP was flooding the network
with UDP traffic that hung the PBS server and scheduler (we were running
on Solaris).
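
A rough sketch of that build, assuming the usual configure/make sequence
run from the top of the unpacked torque source tree. Only the
--disable-rpp flag is specific here; the make steps and any install
prefix are assumptions that depend on your site:

```shell
# Run from the top of the unpacked torque source tree.
# --disable-rpp is the flag mentioned above; the remaining steps are the
# standard autoconf-style build and are an assumption about your setup.
./configure --disable-rpp
make
make install   # usually needs root
```

The server and the moms would presumably need to come from the same
build so that both ends agree on the transport.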

As for finding an alternative system, I recently gave SGE a try to see
how it compared to torque. Could be that I'm just familiar with the PBS
way, but SGE had its own set of quirks that I couldn't get around, so I
dropped it and came back to torque.

Hope this helps. Good luck!

Marc.
-- 
Marc Langlois
Key Seismic Solutions Ltd., Calgary, AB, Canada.
marc at keyseismic dot com


