[torqueusers] Torque behaving strangely, jobs becoming blocked indefinitely

Leigh Gordon Leigh.Gordon at utas.edu.au
Tue Aug 14 19:10:11 MDT 2007


Hi everyone,

(I'm not 100% sure whether this is a problem with Maui or Torque, so please 
tell me if the Maui list would be more appropriate!)

We've got an issue which has only occurred since we increased the number of CPUs 
on our Altix 4700 to 128 (all in a single node). When the queue is 
full (usually when there are a lot of single-CPU jobs), there seems to be a 
communications issue between Maui, Torque and the MOM, which results in the 
system doing a couple of strange and annoying things:

A) The node reports as Down, with 0 processors active (which isn't the case, as the 
running jobs continue to run fine and the processors are definitely active):

whiteout:~ # showq
<joblist snipped>
101 Active Jobs       0 of    0 Processors Active (0.00%)

------------------------------------------------------------------------------------------------------------------------------
whiteout:~ # checknode -v whiteout


checking node whiteout

State:      Down  (in current state for 00:00:00)
Configured Resources: PROCS: 118  MEM: 305G  SWAP: 284G  DISK: 5020G
Utilized   Resources: PROCS: 118  DISK: 3783G
Dedicated  Resources: PROCS: 117  MEM: 70G
Opsys:         SLES9  Arch:        ia64
Speed:      1.00  Load:      118.180
Location:   Partition: DEFAULT  Frame/Slot:  1/1
Network:    [DEFAULT]
Features:   [batch]
Attributes: [Batch]
Classes:    [batch 1:118]

Total Time: 73:04:30:05  Up: 71:06:58:29 (97.41%)  Active: 71:06:53:45 
(97.40%)
<joblist snipped>
ALERT:  jobs active on node but state is Down
ALERT:  node is in state Down but load is high (118.180)

-------------------------------------------------------------------------------------------------------------------
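(In case it helps with diagnosis: while the node is showing as Down I can also grab 
the server's and the MOM's own view of it, e.g.

whiteout:~ # pbsnodes whiteout            # pbs_server's idea of the node state
whiteout:~ # momctl -d 3 -h whiteout      # ask pbs_mom directly for its status

and post that output as well; those invocations assume a standard Torque install, so 
tell me if something more detailed would be useful.)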

and B) jobs get placed in the "BatchHold" state and stay there indefinitely. Tracing 
the history of one of the affected jobs reveals the following:

-------------------------------------------------------------------------------------------------------------------
whiteout:~ # grep 12527 /var/spool/torque/*_logs/*
/var/spool/torque/mom_logs/20070224:02/24/2007 13:17:45;0008;   
pbs_mom;Job;2994.whiteout.sf.utas.edu.au;kill_task: killing pid 12527 task 1 
with sig 15
/var/spool/torque/server_logs/20070814:08/14/2007 
19:19:47;0100;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;enqueuing into 
batch, state 1 hop 1
/var/spool/torque/server_logs/20070814:08/14/2007 
19:19:47;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Queued at 
request of wpsijp at whiteout.sf.utas.edu.au, owner = 
wpsijp at whiteout.sf.utas.edu.au, job name = M2.pbs, queue = batch
/var/spool/torque/server_logs/20070815:08/15/2007 
01:11:32;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Modified at 
request of maui at whiteout.sf.utas.edu.au
/var/spool/torque/server_logs/20070815:08/15/2007 
01:11:32;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Run at request 
of maui at whiteout.sf.utas.edu.au
/var/spool/torque/server_logs/20070815:08/15/2007 
01:11:40;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to 
whiteout failed error = 15020
/var/spool/torque/server_logs/20070815:08/15/2007 
01:11:53;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to 
whiteout failed error = 15020
/var/spool/torque/server_logs/20070815:08/15/2007 
01:11:53;0001;PBS_Server;Svr;PBS_Server;Expired credential (15020) in 
send_job, child timed-out attempting to start job 
12527.whiteout.sf.utas.edu.au
/var/spool/torque/server_logs/20070815:08/15/2007 
01:11:53;0002;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;child reported 
failure for job after 21 seconds (dest=whiteout), rc=10
/var/spool/torque/server_logs/20070815:08/15/2007 
01:11:53;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;unable to run job, 
MOM rejected/timeout
/var/spool/torque/server_logs/20070815:08/15/2007 
01:17:06;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Modified at 
request of maui at whiteout.sf.utas.edu.au
/var/spool/torque/server_logs/20070815:08/15/2007 
01:17:06;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Run at request 
of maui at whiteout.sf.utas.edu.au
/var/spool/torque/server_logs/20070815:08/15/2007 
01:17:14;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to 
whiteout failed error = 15020
/var/spool/torque/server_logs/20070815:08/15/2007 
01:17:27;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to 
whiteout failed error = 15020
/var/spool/torque/server_logs/20070815:08/15/2007 
01:17:27;0001;PBS_Server;Svr;PBS_Server;Expired credential (15020) in 
send_job, child timed-out attempting to start job 
12527.whiteout.sf.utas.edu.au
/var/spool/torque/server_logs/20070815:08/15/2007 
01:17:27;0002;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;child reported 
failure for job after 21 seconds (dest=whiteout), rc=10
/var/spool/torque/server_logs/20070815:08/15/2007 
01:17:27;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;unable to run job, 
MOM rejected/timeout


------------------------------------------------------------------------------------------
whiteout:~ # checkjob -v 12527


checking job 12527 (RM job '12527.whiteout.sf.utas.edu.au')

State: Idle
Creds:  user:wpsijp  group:users  class:batch  qos:DEFAULT
WallTime: 00:00:00 of 6:00:00
SubmitTime: Tue Aug 14 19:19:47
  (Time Queued  Total: 15:24:13  Eligible: 00:00:00)

StartDate: -9:26:53  Wed Aug 15 01:17:07
Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1  MEM: 500M
NodeAccess: SHARED
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 2
PartitionMask: [ALL]
SystemQueueTime: Wed Aug 15 01:22:52

Flags:       RESTARTABLE

Holds:    Batch  (hold reason:  NoResources)
Messages:  exceeds available partition procs
PE:  1.00  StartPriority:  1
cannot select job 12527 for partition DEFAULT (job hold active)

------------------------------------------------------------------------------------------
whiteout:~ # tracejob -v 12527
/var/spool/torque/server_priv/accounting/20070815: No matching job records 
located
/var/spool/torque/server_logs/20070815: Successfully located matching job 
records
/var/spool/torque/mom_logs/20070815: No matching job records located
/var/spool/torque/sched_logs/20070815: No such file or directory

Job: 12527.whiteout.sf.utas.edu.au

08/15/2007 01:11:32  S    Job Modified at request of 
maui at whiteout.sf.utas.edu.au
08/15/2007 01:11:32  S    Job Run at request of maui at whiteout.sf.utas.edu.au
08/15/2007 01:11:40  S    send of job to whiteout failed error = 15020
08/15/2007 01:11:53  S    send of job to whiteout failed error = 15020
08/15/2007 01:11:53  S    child reported failure for job after 21 seconds 
(dest=whiteout), rc=10
08/15/2007 01:11:53  S    unable to run job, MOM rejected/timeout
08/15/2007 01:17:06  S    Job Modified at request of 
maui at whiteout.sf.utas.edu.au
08/15/2007 01:17:06  S    Job Run at request of maui at whiteout.sf.utas.edu.au
08/15/2007 01:17:14  S    send of job to whiteout failed error = 15020
08/15/2007 01:17:27  S    send of job to whiteout failed error = 15020
08/15/2007 01:17:27  S    child reported failure for job after 21 seconds 
(dest=whiteout), rc=10
08/15/2007 01:17:27  S    unable to run job, MOM rejected/timeout
------------------------------------------------------------------------------------------
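(At a guess, the "child timed-out attempting to start job" / "failure for job after 
21 seconds" lines look like pbs_server giving up on the MOM before it can answer 
while the machine is this busy. If that reading is right, I assume the relevant knob 
is the server's tcp_timeout attribute, e.g. something like

whiteout:~ # qmgr -c 'print server'                  # check what is currently set
whiteout:~ # qmgr -c 'set server tcp_timeout = 30'   # example value only

but I'd appreciate confirmation that this is the right parameter before I change 
anything.)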

It does this periodically during the day (very frequently, in fact, so I'm not 
sure whether it's linked to some scheduled event that runs every few minutes?). 
The node returns to a normal "running" state and the CPUs report as being 
used, but it inevitably keeps happening.
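(If it is tied to a periodic event, the timers I know of to check, assuming a stock 
Maui/Torque setup, would be Maui's RMPOLLINTERVAL and the server's node_check_rate, 
e.g.

whiteout:~ # grep RMPOLLINTERVAL /path/to/maui.cfg          # path to maui.cfg varies by install
whiteout:~ # qmgr -c 'print server' | grep node_check_rate  # unset means the built-in default

though I don't know whether either of those actually matches the interval I'm 
seeing.)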

Can anyone shed some light on where this problem might be and what can be done 
to resolve it? I can provide more information if required!

The main issue it causes is having to manually release the blocked jobs when there 
are CPUs available (the released jobs then run fine; see the command sketched 
below), but obviously the scheduler should be able to manage the queue automatically 
without resorting to this intervention! Thanks
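
(For reference, the manual release I mean is just something along the lines of

whiteout:~ # releasehold -a 12527    # release all holds on the job

using Maui's releasehold, or the equivalent qrls on the Torque side; exact syntax may 
differ with the Maui version. The job then starts and runs normally.)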

Regards,

Leigh Gordon
High Performance Computing Systems Administrator
IT Resources, University of Tasmania
Phone: 03 6226 6389
http://www.tpac.org.au

