[torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3

Lech Nieroda nieroda.lech at uni-koeln.de
Thu Nov 22 08:46:56 MST 2012


Dear list,

we have another serious problem since our upgrade to Torque 4.1.3, which 
we run with Maui 3.3.1. The problem in a nutshell: a few random jobs 
suddenly become "unknown" to the server, which changes their state to 
EXITING-SUBSTATE55 and requests a silent kill on the compute nodes. The 
job then dies: the processes are killed on the node, there is no 
"Exit_status" in the server log, no entry in maui/stats, and no 
stdout/stderr files. The users are, understandably, not amused.

It doesn't seem to be user- or application-specific. Even a single 
instance of a job array can be blown away in this way while all the 
other instances finish normally.
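(For context: a job array here is a single qsub -t submission whose 
instances share a base ID, so 681684[35] below is simply instance 35 of 
array job 681684. A sketch of such a submission; the script name is a 
placeholder:)

```shell
# A Torque job array: one submission, many numbered instances.
# 'myscript.sh' is a placeholder; the instances get IDs of the form
# 681684[0], 681684[1], ..., and 681684[35] is instance 35.
qsub -t 0-99 myscript.sh
```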

Here are some logs of such a job (681684[35]):

Maui simply assumes successful completion:
[snip]
11/21 19:24:49 MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0)
11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is  1.000000, 
1.000000, 1
11/21 19:25:55 INFO:     active PBS job 681684[35] has been removed from 
the queue.  assuming successful completion
11/21 19:25:55 MJobProcessCompleted(681684[35])
11/21 19:25:55 INFO:     job '681684[35]' completed  X: 0.063356  T: 
10903  PS: 10903  A: 0.063096
11/21 19:25:55 MJobSendFB(681684[35])
11/21 19:25:55 INFO:     job usage sent for job '681684[35]'
11/21 19:25:55 MJobRemove(681684[35])
11/21 19:25:55 MJobDestroy(681684[35])
[snap]

pbs_server decides at 19:25:11, after about three hours of runtime, that 
the job is unknown (lines grepped by job ID from the server logs):
[snip]
11/21/2012 
16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate: 
setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to 
RUNNING-RUNNING (4-42)
11/21/2012 
19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: 
setting job 681684[35].cheops10 state from RUNNING-RUNNING to 
QUEUED-SUBSTATE55 (1-55)
11/21/2012 
19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: 
setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to 
EXITING-SUBSTATE55 (5-55)
11/21/2012 
19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing from 
smp, state EXITING
11/21/2012 
19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray 
job 681684[35].cheops10 found on cheops21316
[snap]
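(As an aside, extracting these lines is slightly tricky: job-array IDs 
contain regex metacharacters, so grep needs fixed-string matching. A 
minimal demonstration on abbreviated sample lines:)

```shell
# Job-array IDs like 681684[35] contain regex metacharacters: as a
# regex, '[35]' is a character class, so the unescaped pattern would
# never match the literal array ID. grep -F matches it as a fixed
# string instead.
log='setting job 681684[35].cheops10 state RUNNING
setting job 681684[3].cheops10 state RUNNING'
matched=$(printf '%s\n' "$log" | grep -F '681684[35].cheops10')
echo "$matched"    # only the 681684[35] line
```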

pbs_mom then simply kills the processes:
[snip]
11/21/2012 16:23:43;0001;   pbs_mom.32254;Job;TMomFinalizeJob3;job 
681684[35].cheops10 started, pid = 17452
11/21/2012 19:25:14;0008; 
pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17452 task 
1 gracefully with sig 15
11/21/2012 19:25:14;0008; 
pbs_mom.32254;Job;681684[35].cheops10;kill_task: process 
(pid=17452/state=R) after sig 15
11/21/2012 19:25:14;0008; 
pbs_mom.32254;Job;681684[35].cheops10;kill_task: process 
(pid=17452/state=Z) after sig 15
11/21/2012 19:25:14;0008; 
pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17692 task 
1 gracefully with sig 15
11/21/2012 19:25:14;0008; 
pbs_mom.32254;Job;681684[35].cheops10;kill_task: process 
(pid=17692/state=R) after sig 15
11/21/2012 19:25:14;0008; 
pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17703 task 
1 gracefully with sig 15
11/21/2012 19:25:14;0008; 
pbs_mom.32254;Job;681684[35].cheops10;kill_task: process 
(pid=17703/state=R) after sig 15
11/21/2012 19:25:14;0008; 
pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17731 task 
1 gracefully with sig 15
11/21/2012 19:25:14;0008; 
pbs_mom.32254;Job;681684[35].cheops10;kill_task: process 
(pid=17731/state=R) after sig 15
11/21/2012 19:25:15;0080; 
pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job 
681684[35].cheops10 task 1 terminated, sid=17452
11/21/2012 19:25:15;0008;   pbs_mom.32254;Job;681684[35].cheops10;job 
was terminated
11/21/2012 19:25:50;0001; 
pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on server, 
deleting locally
11/21/2012 19:25:50;0080; 
pbs_mom.32254;Job;681684[35].cheops10;removed job script
[snap]

Sometimes, the pbs_mom logs include this message before the killing starts:
[snip]
Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, 
type=StatusJob, from PBS_Server at cheops10
[snap]

And finally, the job information passed to the epilogue:
[snip]
Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: 
681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue
Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: 
681684[35].cheops10,hthiele0,cheops21316,Job Information: 
userid=hthiele0, 
resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00', 
resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34', 
queue=smp, account=ccg-ngs, exitcode=271
[snap]
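(The exitcode=271 in that epilogue line is consistent with the SIGTERM 
kills in the mom log: in Torque, an exit status above 256 indicates the 
job was killed by a signal, the signal number being the exit code minus 
256. A quick check:)

```shell
# Decode a Torque job exit code: values above 256 mean the job was
# killed by a signal, and exitcode - 256 gives the signal number.
exitcode=271
if [ "$exitcode" -gt 256 ]; then
  sig=$((exitcode - 256))
  echo "killed by signal $sig"    # 271 - 256 = 15, i.e. SIGTERM
else
  echo "exited with status $exitcode"
fi
```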

This happens rarely (roughly 1 job in 3000). However, the silent 
deletion of random jobs isn't exactly a trifling matter.
I could try disabling the mom_job_sync option, which might prevent the 
killing of unknown jobs' processes, but it would also leave 
corrupt/pre-execution jobs alive.
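(For reference, mom_job_sync is a pbs_server attribute toggled via qmgr; 
a sketch of what disabling it would look like, not that I recommend it 
as a fix, for the reasons above:)

```shell
# mom_job_sync is a pbs_server attribute; when True (the default), jobs
# reported by a mom that the server does not recognize are killed off.
# Disabling it would stop the stray-job kills, at the cost of leaving
# corrupt/pre-execution jobs running on the nodes.
qmgr -c 'set server mom_job_sync = False'
qmgr -c 'print server' | grep mom_job_sync   # verify the setting
```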

Can this be fixed?

On a side note, here are some further minor bugs I've noticed in Torque 
4.1.3:
- the epilogue script is usually invoked twice, and sometimes even more 
often
- when explicit node lists are used, e.g. nodes=node1:ppn=2+node2:ppn=2, 
the number of "tasks" shown by qstat is zero
- some API changes between Torque 2.x and Torque 4.x meant that two Maui 
calls (get_svrport, openrm) had to be altered in order to build against 
Torque 4.x.


Regards,
Lech Nieroda

-- 
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)
Universität zu Köln
Weyertal 121
Raum 309 (3. Etage)
D-50931 Köln
Deutschland

Tel.: +49 (221) 470-89606
E-Mail: nieroda.lech at uni-koeln.de