[torqueusers] Serious torque failure problems

Chris Johnson johnson at nmr.mgh.harvard.edu
Thu Aug 11 07:52:45 MDT 2005


     Running torque-1.2.0p2 here on CentOS 4.X across a cluster composed 
of AMDs, P4s, XEONs, and Opterons, several hundred total.

     At the moment torque is becoming useless.  We are using the
out-of-the-box C scheduler and it keeps dying completely.  We are seeing
jobs running on nodes that no log lists them as officially having run on.
The scheduler gets hung up on problem nodes and tries to run many, many
jobs on the bad node.  The scheduler/server can't detect that a node is
bad when mom still seems to respond but other Linux services on the node
have failed.  And my boss is about ready to trash torque.  Can't blame
him, we're spending way too much time maintaining this cluster.
Researchers aren't any too thrilled either.

     As far as I know torque is used in a lot of places.  And I don't hear
about these problems in other places.  What the hell is going on?  I REALLY
need to get this corrected.  And I'll provide any information I can.  
But my community is about ready to start telling people what not to use for 
cluster operations.  

     Help would be GREATLY appreciated.  

Chris Johnson               |Internet: johnson at nmr.mgh.harvard.edu
Systems Administrator       |Web:      http://www.nmr.mgh.harvard.edu/~johnson
NMR Center                  |Voice:    617.726.0949
Mass. General Hospital      |FAX:      617.726.7422
149 (2301) 13th Street      |"The two most abundant things in the Universe
Charlestown, MA., 02129 USA | are hydrogen and stupidity."  Harlan Ellison
