[torqueusers] Torque behaving strangely, jobs becoming blocked indefinitely
Leigh Gordon
Leigh.Gordon at utas.edu.au
Tue Aug 14 19:10:11 MDT 2007
Hi everyone,
(I'm not 100% sure whether this is a problem with maui or torque, so please
tell me if the maui list would be more appropriate!)
We've got an issue which has only occurred since we increased the number of
CPUs on our Altix 4700 to 128 (all in a single node). When the queue is
full (usually when there are a lot of single-CPU jobs), there seems to be a
communication issue between maui/torque/MOM, which results in the system doing
a couple of strange, annoying things:
A) The node reports as down, with 0 processors active (which isn't the case, as
the running jobs continue running fine and the processors are definitely active):
whiteout:~ # showq
<joblist snipped>
101 Active Jobs 0 of 0 Processors Active (0.00%)
------------------------------------------------------------------------------------------------------------------------------
whiteout:~ # checknode -v whiteout
checking node whiteout
State: Down (in current state for 00:00:00)
Configured Resources: PROCS: 118 MEM: 305G SWAP: 284G DISK: 5020G
Utilized Resources: PROCS: 118 DISK: 3783G
Dedicated Resources: PROCS: 117 MEM: 70G
Opsys: SLES9 Arch: ia64
Speed: 1.00 Load: 118.180
Location: Partition: DEFAULT Frame/Slot: 1/1
Network: [DEFAULT]
Features: [batch]
Attributes: [Batch]
Classes: [batch 1:118]
Total Time: 73:04:30:05 Up: 71:06:58:29 (97.41%) Active: 71:06:53:45 (97.40%)
<joblist snipped>
ALERT: jobs active on node but state is Down
ALERT: node is in state Down but load is high (118.180)
-------------------------------------------------------------------------------------------------------------------
B) Jobs get placed in the "BatchHold" state and stay there. Tracing the
history of one of the affected jobs reveals the following:
-------------------------------------------------------------------------------------------------------------------
whiteout:~ # grep 12527 /var/spool/torque/*_logs/*
/var/spool/torque/mom_logs/20070224:02/24/2007 13:17:45;0008;
pbs_mom;Job;2994.whiteout.sf.utas.edu.au;kill_task: killing pid 12527 task 1
with sig 15
/var/spool/torque/server_logs/20070814:08/14/2007
19:19:47;0100;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;enqueuing into
batch, state 1 hop 1
/var/spool/torque/server_logs/20070814:08/14/2007
19:19:47;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Queued at
request of wpsijp at whiteout.sf.utas.edu.au, owner =
wpsijp at whiteout.sf.utas.edu.au, job name = M2.pbs, queue = batch
/var/spool/torque/server_logs/20070815:08/15/2007
01:11:32;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Modified at
request of maui at whiteout.sf.utas.edu.au
/var/spool/torque/server_logs/20070815:08/15/2007
01:11:32;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Run at request
of maui at whiteout.sf.utas.edu.au
/var/spool/torque/server_logs/20070815:08/15/2007
01:11:40;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to
whiteout failed error = 15020
/var/spool/torque/server_logs/20070815:08/15/2007
01:11:53;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to
whiteout failed error = 15020
/var/spool/torque/server_logs/20070815:08/15/2007
01:11:53;0001;PBS_Server;Svr;PBS_Server;Expired credential (15020) in
send_job, child timed-out attempting to start job
12527.whiteout.sf.utas.edu.au
/var/spool/torque/server_logs/20070815:08/15/2007
01:11:53;0002;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;child reported
failure for job after 21 seconds (dest=whiteout), rc=10
/var/spool/torque/server_logs/20070815:08/15/2007
01:11:53;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;unable to run job,
MOM rejected/timeout
/var/spool/torque/server_logs/20070815:08/15/2007
01:17:06;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Modified at
request of maui at whiteout.sf.utas.edu.au
/var/spool/torque/server_logs/20070815:08/15/2007
01:17:06;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Run at request
of maui at whiteout.sf.utas.edu.au
/var/spool/torque/server_logs/20070815:08/15/2007
01:17:14;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to
whiteout failed error = 15020
/var/spool/torque/server_logs/20070815:08/15/2007
01:17:27;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to
whiteout failed error = 15020
/var/spool/torque/server_logs/20070815:08/15/2007
01:17:27;0001;PBS_Server;Svr;PBS_Server;Expired credential (15020) in
send_job, child timed-out attempting to start job
12527.whiteout.sf.utas.edu.au
/var/spool/torque/server_logs/20070815:08/15/2007
01:17:27;0002;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;child reported
failure for job after 21 seconds (dest=whiteout), rc=10
/var/spool/torque/server_logs/20070815:08/15/2007
01:17:27;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;unable to run job,
MOM rejected/timeout
------------------------------------------------------------------------------------------
whiteout:~ # checkjob -v 12527
checking job 12527 (RM job '12527.whiteout.sf.utas.edu.au')
State: Idle
Creds: user:wpsijp group:users class:batch qos:DEFAULT
WallTime: 00:00:00 of 6:00:00
SubmitTime: Tue Aug 14 19:19:47
(Time Queued Total: 15:24:13 Eligible: 00:00:00)
StartDate: -9:26:53 Wed Aug 15 01:17:07
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
Exec: '' ExecSize: 0 ImageSize: 0
Dedicated Resources Per Task: PROCS: 1 MEM: 500M
NodeAccess: SHARED
NodeCount: 1
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 2
PartitionMask: [ALL]
SystemQueueTime: Wed Aug 15 01:22:52
Flags: RESTARTABLE
Holds: Batch (hold reason: NoResources)
Messages: exceeds available partition procs
PE: 1.00 StartPriority: 1
cannot select job 12527 for partition DEFAULT (job hold active)
------------------------------------------------------------------------------------------
whiteout:~ # tracejob -v 12527
/var/spool/torque/server_priv/accounting/20070815: No matching job records
located
/var/spool/torque/server_logs/20070815: Successfully located matching job
records
/var/spool/torque/mom_logs/20070815: No matching job records located
/var/spool/torque/sched_logs/20070815: No such file or directory
Job: 12527.whiteout.sf.utas.edu.au
08/15/2007 01:11:32 S Job Modified at request of
maui at whiteout.sf.utas.edu.au
08/15/2007 01:11:32 S Job Run at request of maui at whiteout.sf.utas.edu.au
08/15/2007 01:11:40 S send of job to whiteout failed error = 15020
08/15/2007 01:11:53 S send of job to whiteout failed error = 15020
08/15/2007 01:11:53 S child reported failure for job after 21 seconds
(dest=whiteout), rc=10
08/15/2007 01:11:53 S unable to run job, MOM rejected/timeout
08/15/2007 01:17:06 S Job Modified at request of
maui at whiteout.sf.utas.edu.au
08/15/2007 01:17:06 S Job Run at request of maui at whiteout.sf.utas.edu.au
08/15/2007 01:17:14 S send of job to whiteout failed error = 15020
08/15/2007 01:17:27 S send of job to whiteout failed error = 15020
08/15/2007 01:17:27 S child reported failure for job after 21 seconds
(dest=whiteout), rc=10
08/15/2007 01:17:27 S unable to run job, MOM rejected/timeout
------------------------------------------------------------------------------------------
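To make the retry pattern in the tracejob output easier to see, here is a minimal Python sketch that groups the server ("S") events into run attempts and reports how long each one lasted before the server gave up. The function name `summarize_attempts` and the dictionary keys are my own for illustration, and the parsing assumes the line layout shown above (wrapped continuation lines are simply skipped).

```python
import re
from datetime import datetime

# Matches server-side tracejob event lines such as:
#   08/15/2007 01:11:40 S send of job to whiteout failed error = 15020
EVENT_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+S\s+(.*)$")


def summarize_attempts(tracejob_text):
    """Return one dict per 'Job Run' attempt found in tracejob output."""
    attempts = []
    current = None
    for line in tracejob_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if not match:
            continue  # headers and wrapped continuation lines
        when = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
        message = match.group(2)
        if message.startswith("Job Run"):
            # A new attempt begins each time the scheduler asks for a run.
            current = {"started": when, "send_failures": 0, "gave_up_after": None}
            attempts.append(current)
        elif current is None:
            continue  # events before the first run request
        elif "send of job" in message and "failed" in message:
            current["send_failures"] += 1
        elif message.startswith("unable to run job"):
            # Seconds between the run request and the final rejection.
            current["gave_up_after"] = (when - current["started"]).total_seconds()
    return attempts
```

Fed the output above, this should show two attempts, each with two failed sends and a rejection 21 seconds after the run request, which matches the "child reported failure for job after 21 seconds" lines.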
It does this periodically during the day (very frequently, in fact, so I'm not
sure whether it's linked to a scheduled event that runs every few minutes).
The node returns to a normal "running" state and the CPUs report as being
used, but the problem keeps recurring.
Can anyone shed some light on where this problem might be and what can be done
to resolve it? I can provide more information if required!
The main issue it causes is that we have to manually release blocked jobs when
there are available CPUs (the jobs then run fine), but obviously the scheduler
should be able to manage the queue automatically without this
intervention! Thanks
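For reference, the manual workaround could be scripted roughly as in the sketch below. It only decides which jobs carry a batch hold, based on the "Holds:" line that checkjob prints, and builds the release commands without running them. The helper names are hypothetical, and the `releasehold -b <jobid>` form is an assumption about the Maui command line that should be checked against the installed version before anything is executed. It would only automate the workaround, not fix the underlying problem.

```python
import re


def held_by_batch(checkjob_text):
    """True if the checkjob output contains a 'Holds:' line naming a Batch hold."""
    for line in checkjob_text.splitlines():
        match = re.match(r"\s*Holds:\s*(.*)", line)
        if match:
            # Ignore the parenthesised reason, e.g. "(hold reason: NoResources)".
            hold_types = match.group(1).split("(")[0]
            return "Batch" in hold_types
    return False


def release_command(job_id):
    # ASSUMPTION: Maui's releasehold accepts "-b <jobid>" to clear a batch hold.
    return ["releasehold", "-b", str(job_id)]


def plan_releases(checkjob_outputs):
    """Map {job_id: checkjob text} to the list of release commands to run."""
    return [
        release_command(job_id)
        for job_id, text in sorted(checkjob_outputs.items())
        if held_by_batch(text)
    ]
```

The returned lists could be passed to `subprocess.run` from a cron job once the commands have been verified by hand.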
Regards,
Leigh Gordon
High Performance Computing Systems Administrator
IT Resources, University of Tasmania
Phone: 03 6226 6389
http://www.tpac.org.au