[torqueusers] Serious torque failure problems

Simon Robbins robbins at physik.uni-wuppertal.de
Fri Aug 12 07:12:12 MDT 2005


On Thu, 11 Aug 2005, Chris Johnson wrote:

>      Hi,
>      Running torque-1.2.0p2 here on CentOS 4.X across a cluster composed 
> of AMDs, P4s, XEONs, and Opterons, several hundred total.
>      At the moment torque is becoming useless.  We are using the
> out-of-the-box C scheduler and it keeps dying completely.  We are seeing jobs
> running on nodes which they are not listed in any log as officially having
> been run on.  The scheduler gets hung up on problem nodes and tries to
> run many many jobs on the bad node.  The scheduler/server can't pick up
> on the fact that a node is bad due to a situation in which mom seems to 
> respond but other linux services have failed.  And my boss is about ready
> to trash torque.  Can't blame him, we're spending way too much time
> maintaining this cluster.  Researchers aren't any too thrilled either.  

I am running torque-1.2.0p4 with Maui on a large 
cluster of AMD Opterons.  We experience similar 
problems and are trying to track them down - they 
appear to be related to breaks in the network 
between the mom and the pbs_server.  I plan to try 
building torque with "--disable-rpp" soon.

I get jobs which do not run - "unable to run job, 
MOM rejected/rc=1" - and are then tied to a specific 
node.  `qalter -lneednodes= JOBID` does not clear 
this state, and the job stays stuck waiting for 
that node.  Such a node gobbles up lots of jobs.  Is 
there a way to take them out of this "stuck" state?
Is there a way to take such a node offline 
_automatically_?  (I tried to write a patch to do 
this but found it too complicated.)
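
Short of a patch, something outside the server might be 
enough.  Below is a minimal, untested sketch of that 
idea, not anything torque provides: it counts the 
REJHOST= field that pbs_mom writes in its reject 
messages (see the log excerpt at the end of this mail) 
and prints a `pbsnodes -o` command for every host that 
crosses a threshold.  The function names and the 
threshold of 3 are my own assumptions, and the host 
name in REJHOST may need mapping to the node name the 
server knows (n471 here).

```python
import re
from collections import Counter

# Matches the REJHOST=<hostname> field seen in pbs_mom reject lines.
REJHOST_RE = re.compile(r"REJHOST=(\S+)")


def count_rejections(log_lines):
    """Count req_reject log lines per rejecting host (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        # Only reject records are of interest; skip everything else.
        if "req_reject" not in line:
            continue
        match = REJHOST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def offline_commands(counts, threshold=3):
    """Build 'pbsnodes -o' commands for hosts at or above the threshold.

    The threshold is an arbitrary assumption; tune it for your site.
    The commands are only returned, not executed, so an admin (or a
    cron wrapper) can review them first.
    """
    return [
        f"pbsnodes -o {host}"
        for host, n in sorted(counts.items())
        if n >= threshold
    ]
```

One would feed it the lines of the mom logs collected 
from the nodes, e.g. 
`offline_commands(count_rejections(open(path)))`, and 
review the printed commands before running them.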

>      As far as I know torque is used in a lot of places.  And I don't hear
> about these problems other places.  What the hell is going on?  I REALLY
> need to get this corrected.  And I'll provide any information I can.  
> But my community is about ready to start telling people what not to use for 
> cluster operations.  
>      Help would be GREATLY appreciated.  

I attach some output from our server related to this 
problem.  Anyone have any ideas?

Dr. Simon Robbins                  robbins at physik.uni-wuppertal.de
BU Wuppertal, FB-C,                                 (Room. F11.08)
Gaussstrasse, 20,                      Phone : +49 (0)202 439 3750
D-42119 Wuppertal                      Fax   : +49 (0)202 439 2662

Job is allocated node: n471  but state is queued:

59398.alicenext wiebusch idle     mu.z-610.a    --   --  --    --  120:0 Q   --

# tracejob -n 28 59398

Job: 59398.alicenext.alicenext

07/25/2005 10:24:30  S    Job Queued at request of wiebusch at sam1.alicenext, owner =
                          wiebusch at sam1.alicenext, job name = mu.z-610.a10, queue =
07/25/2005 10:24:30  A    queue=idle
07/25/2005 11:01:32  S    Job Modified at request of root at ALiCEnext.alicenext
07/25/2005 11:01:32  S    Job Run at request of root at ALiCEnext.alicenext
07/25/2005 11:01:33  S    unable to run job, MOM rejected/rc=1
08/09/2005 01:10:36  S    Requeueing job, substate: 10 Requeued in queue: idle

07/25/2005 11:01:32;0001;   pbs_mom;Svr;pbs_mom;Success (0) in req_jobscript, job in unexpected state 'TRANSICM'
07/25/2005 11:01:32;0080;   pbs_mom;Req;req_reject;Reject reply code=15004(Invalid request REJHOST=node-471.alicenext.uni-wuppertal.de MSG=job in unexpected state 'TRANSICM'), aux=0, type=JobScript, from PBS_Server at alicenext.alicenext

07/25/2005 11:01:32;0008;PBS_Server;Job;59398.alicenext.alicenext;Job Modified at request of root at ALiCEnext.alicenext
07/25/2005 11:01:32;0008;PBS_Server;Job;59398.alicenext.alicenext;Job Run at request of root at ALiCEnext.alicenext
07/25/2005 11:01:33;0008;PBS_Server;Job;59398.alicenext.alicenext;unable to run job, MOM rejected/rc=1
07/25/2005 11:01:33;0080;PBS_Server;Req;req_reject;Reject reply code=15041(Execution server rejected request MSG=send failed, STARTING), aux=0, type=RunJob, from root at ALiCEnext.alicenext
