[torqueusers] Serious torque failure problems
Simon Robbins
robbins at physik.uni-wuppertal.de
Fri Aug 12 07:12:12 MDT 2005
Hello,
On Thu, 11 Aug 2005, Chris Johnson wrote:
> Hi,
>
> Running torque-1.2.0p2 here on CentOS 4.X across a cluster composed
> of AMDs, P4s, XEONs, and Opterons, several hundred total.
>
> At the moment torque is becoming useless. We are using the
> out-of-the-box C scheduler and it keeps dying completely. We are seeing jobs
> running on nodes which they are not listed in any log as officially having
> been run on. The scheduler gets hung up on problem nodes and tries to
> run many many jobs on the bad node. The scheduler/server can't pick up
> on the fact that a node is bad due to a situation in which mom seems to
> respond but other linux services have failed. And my boss is about ready
> to trash torque. Can't blame him, we're spending way too much time
> maintaining this cluster. Researchers aren't any too thrilled either.
I am running torque-1.2.0p4 with Maui on a large
cluster of AMD Opterons. We experience similar
problems and are trying to track them down - they
appear to be related to breaks in the network
between the mom and the pbs_server. I plan to try
running with "--disable-rpp" soon.
I get jobs which do not run - "unable to run job,
MOM rejected/rc=1" - and are then tied to a specific
node. `qalter -lneednodes= JOBID` does not clear
this state, and the job becomes stuck waiting for
this node. Such a node gobbles up lots of jobs. Is
there a way to take them out of this "stuck" state?
Is there a way to take such a node offline
_automatically_? (I tried to write a patch to do
this but found it too complicated.)
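Short of patching torque itself, one workaround might be a small
script outside the server that watches for repeated rejections. The
following is only a rough sketch, not something we run in production.
It assumes the mom logs from all nodes can be read in one place, that
the log lines have the semicolon-separated layout shown in the mom_log
excerpt attached below, and that the REJHOST value can be mapped to
the node name known to pbs_server (here the log says
node-471.alicenext.uni-wuppertal.de while the node is n471, so a
site-specific mapping would be needed). The function names and the
threshold of 3 are hypothetical. The script only prints the
`pbsnodes -o` commands; it does not execute them.

```python
import re
from collections import Counter

# Matches the REJHOST=<hostname> field seen in pbs_mom req_reject messages.
REJHOST_RE = re.compile(r"REJHOST=(\S+)")


def count_rejections(log_lines):
    """Count req_reject entries per rejecting host (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        # Records look like: timestamp;event code;daemon;class;name;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6 or fields[4].strip() != "req_reject":
            continue
        match = REJHOST_RE.search(fields[5])
        if match:
            counts[match.group(1)] += 1
    return counts


def offline_commands(counts, threshold=3):
    """Build 'pbsnodes -o' commands for hosts with >= threshold rejections.

    The threshold is an assumed value; host names may still need mapping
    to the node names configured on the server.
    """
    return [
        "pbsnodes -o " + host
        for host, n in sorted(counts.items())
        if n >= threshold
    ]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): cat /path/to/collected/mom_logs/* | python this_script.py
    for command in offline_commands(count_rejections(sys.stdin)):
        print(command)
```

Printing the commands rather than running them keeps a human in the
loop until the threshold has been tuned for the site.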
> As far as I know torque is used in a lot of places. And I don't hear
> about these problems other places. What the hell is going on? I REALLY
> need to get this corrected. And I'll provide any information I can.
> But my community is about ready to start telling people what not to use for
> cluster operations.
>
> Help would be GREATLY appreciated.
I attach some output from our server related to this
problem. Anyone have any ideas?
Dr. Simon Robbins robbins at physik.uni-wuppertal.de
==================================================================
BU Wuppertal, FB-C, (Room. F11.08)
Gaussstrasse, 20, Phone : +49 (0)202 439 3750
D-42119 Wuppertal Fax : +49 (0)202 439 2662
Job is allocated node: n471 but state is queued:
59398.alicenext wiebusch idle mu.z-610.a -- -- -- -- 120:0 Q --
*****************************************************
# tracejob -n 28 59398
Job: 59398.alicenext.alicenext
07/25/2005 10:24:30 S Job Queued at request of wiebusch at sam1.alicenext, owner =
wiebusch at sam1.alicenext, job name = mu.z-610.a10, queue =
idle
07/25/2005 10:24:30 A queue=idle
07/25/2005 11:01:32 S Job Modified at request of root at ALiCEnext.alicenext
07/25/2005 11:01:32 S Job Run at request of root at ALiCEnext.alicenext
07/25/2005 11:01:33 S unable to run job, MOM rejected/rc=1
08/09/2005 01:10:36 S Requeueing job, substate: 10 Requeued in queue: idle
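
To put a number on how long this job sat in the stuck state, the two
tracejob timestamps above (the MOM rejection and the eventual requeue)
can be subtracted. This is only an illustrative sketch; the function
name is hypothetical and the timestamp format is assumed to be the
MM/DD/YYYY HH:MM:SS layout shown in the output.

```python
from datetime import datetime

# Timestamp layout as it appears in the tracejob output above.
TS_FORMAT = "%m/%d/%Y %H:%M:%S"


def stuck_duration(rejected_at, requeued_at):
    """Return the time between the MOM rejection and the requeue."""
    start = datetime.strptime(rejected_at, TS_FORMAT)
    end = datetime.strptime(requeued_at, TS_FORMAT)
    return end - start


delta = stuck_duration("07/25/2005 11:01:33", "08/09/2005 01:10:36")
print(delta)  # prints: 14 days, 14:09:03
```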
*****************************************************
mom_log:
07/25/2005 11:01:32;0001; pbs_mom;Svr;pbs_mom;Success (0) in req_jobscript, job in unexpected state 'TRANSICM'
07/25/2005 11:01:32;0080; pbs_mom;Req;req_reject;Reject reply code=15004(Invalid request REJHOST=node-471.alicenext.uni-wuppertal.de MSG=job in unexpected state 'TRANSICM'), aux=0, type=JobScript, from PBS_Server at alicenext.alicenext
*****************************************************
server_log:
07/25/2005 11:01:32;0008;PBS_Server;Job;59398.alicenext.alicenext;Job Modified at request of root at ALiCEnext.alicenext
07/25/2005 11:01:32;0008;PBS_Server;Job;59398.alicenext.alicenext;Job Run at request of root at ALiCEnext.alicenext
07/25/2005 11:01:33;0008;PBS_Server;Job;59398.alicenext.alicenext;unable to run job, MOM rejected/rc=1
07/25/2005 11:01:33;0080;PBS_Server;Req;req_reject;Reject reply code=15041(Execution server rejected request MSG=send failed, STARTING), aux=0, type=RunJob, from root at ALiCEnext.alicenext