[torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3

Lech Nieroda nieroda.lech at uni-koeln.de
Mon Nov 26 06:02:32 MST 2012


Dear list,

I've raised the loglevel on one of the clients to loglevel=7 in order to 
collect more information on the event.

maui just assumes a successful completion of the job.

pbs_server log regarding the job 686938[4] that suddenly becomes unknown 
at 00:15:21:
[snip]
/var/spool/torque/server_logs/20121125:11/25/2012 
10:52:15;0008;PBS_Server.12742;Job;686938[4].cheops10;Job Run at request 
of maui at localhost.localdomain
/var/spool/torque/server_logs/20121125:11/25/2012 
10:52:15;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate: 
setting job 686938[4].cheops10 state from QUEUED-QUEUED to 
RUNNING-PRERUN (4-40)
/var/spool/torque/server_logs/20121125:11/25/2012 
10:52:16;0002;PBS_Server.12742;Job;686938[4].cheops10;child reported 
success for job after 1 seconds (dest=???), rc=0
/var/spool/torque/server_logs/20121125:11/25/2012 
10:52:16;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate: 
setting job 686938[4].cheops10 state from RUNNING-TRNOUTCM to 
RUNNING-RUNNING (4-42)
/var/spool/torque/server_logs/20121125:11/25/2012 
10:52:16;000d;PBS_Server.12742;Job;686938[4].cheops10;Not sending email: 
User does not want mail of this type.
/var/spool/torque/server_logs/20121125:11/25/2012 
00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate: 
setting job 686938[4].cheops10 state from RUNNING-RUNNING to 
QUEUED-SUBSTATE55 (1-55)
/var/spool/torque/server_logs/20121126:11/26/2012 
00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate: 
setting job 686938[4].cheops10 state from QUEUED-SUBSTATE55 to 
EXITING-SUBSTATE55 (5-55)
/var/spool/torque/server_logs/20121126:11/26/2012 
00:15:21;0100;PBS_Server.12702;Job;686938[4].cheops10;dequeuing from 
smp, state EXITING
/var/spool/torque/server_logs/20121126:11/26/2012 
00:16:02;0001;PBS_Server.12640;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray 
job 686938[4].cheops10 found on cheops11801
[snap]

pbs_mom log at loglevel 7, shortly before the kill:
[snip]
11/26/2012 00:14:26;0008;   pbs_mom.26562;Req;send_sisters;sending 
command POLL_JOB for job 686938[4].cheops10 (7)
11/26/2012 00:15:02;0002; 
pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: sending 
to server "jobs=686659[1].cheops10 686694[8].cheops10 
686717[10].cheops10 686938[2].cheops10 686938[3].cheops10 
686938[4].cheops10 686941[2].cheops10 686941[128].cheops10 
686941[130].cheops10 686941[132].cheops10 686941[133].cheops10 
686941[150].cheops10 686941[151].cheops10 686941[153].cheops10 
686941[186].cheops10 686941[187].cheops10 686941[188].cheops10 
686941[212].cheops10 687103[1].cheops10 687224.cheops10 
687252[1].cheops10 687279[1].cheops10 687279[2].cheops10 
687279[3].cheops10 687334[1].cheops10 687334[2].cheops10 
687367[1].cheops10 687367[2].cheops10 687366.cheops10"
11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:15:11;0008;   pbs_mom.26562;Req;send_sisters;sending 
command POLL_JOB for job 686938[4].cheops10 (7)
11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:15:56;0008;   pbs_mom.26562;Req;send_sisters;sending 
command POLL_JOB for job 686938[4].cheops10 (7)
11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:16:00;0008; 
pbs_mom.26562;Job;686938[4].cheops10;scan_for_exiting:job is in 
non-exiting substate RUNNING, no obit sent at this time
11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop 
start - jobid = 686938[4].cheops10
11/26/2012 00:16:01;0080; 
pbs_mom.26562;Job;686938[4].cheops10;checking job w/subtask pid=0 (child 
pid=4214)
11/26/2012 00:16:02;0002; 
pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: sending 
to server "jobs=686659[1].cheops10 686694[8].cheops10 
686717[10].cheops10 686938[2].cheops10 686938[3].cheops10 
686938[4].cheops10 686941[2].cheops10 686941[128].cheops10 
686941[130].cheops10 686941[132].cheops10 686941[150].cheops10 
686941[151].cheops10 686941[153].cheops10 686941[186].cheops10 
686941[187].cheops10 686941[188].cheops10 686941[212].cheops10 
687103[1].cheops10 687224.cheops10 687252[1].cheops10 687279[1].cheops10 
687279[2].cheops10 687279[3].cheops10 687334[1].cheops10 
687334[2].cheops10 687367[1].cheops10 687367[2].cheops10 687366.cheops10"
11/26/2012 00:16:02;0008; 
pbs_mom.26562;Job;686938[4].cheops10;signaling job with signal SIGKILL
11/26/2012 00:16:02;0008;   pbs_mom.26562;Job;kill_job;req_signaljob: 
sending signal 9, "KILL" to job 686938[4].cheops10, reason: killing job
[snap]

As we can see, the pbs_mom sends complete job lists to the pbs_server 
right before (00:15:02) and after (00:16:02) the SUBSTATE55 setting 
(00:15:21), where the job 686938[4].cheops10 is included. However, the 
server claims that the job is unknown/doesn't exist and sends a kill 
command which the pbs_mom then executes (0:16:02).

I hope this helps in tracking down the bug.

Regards,
Lech Nieroda


On 22.11.2012 16:46, Lech Nieroda wrote:
> Dear list,
>
> we have another serious problem since our upgrade to Torque 4.1.3. We
> are using it with Maui 3.3.1. The problem in a nutshell: some few,
> random jobs are suddenly "unknown" to the server, it changes their
> status to EXITING-SUBSTATE55 and requests a silent kill on the compute
> nodes. The job then dies, the processes are killed on the node, there is
> no "Exit_status" in the server-log, no entry in maui/stats, no
> stdout/stderr files. The users are, understandably, not amused.
>
> It doesn't seem to be user or application specific. Even a single
> instance from a job array can get blown away in this way while all other
> instances end normally.
>
> Here are some logs of such a job (681684[35]):
>
> maui just assumes a successful completion:
> [snip]
> 11/21 19:24:49 MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0)
> 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is  1.000000,
> 1.000000, 1
> 11/21 19:25:55 INFO:     active PBS job 681684[35] has been removed from
> the queue.  assuming successful completion
> 11/21 19:25:55 MJobProcessCompleted(681684[35])
> 11/21 19:25:55 INFO:     job '681684[35]' completed  X: 0.063356  T:
> 10903  PS: 10903  A: 0.063096
> 11/21 19:25:55 MJobSendFB(681684[35])
> 11/21 19:25:55 INFO:     job usage sent for job '681684[35]'
> 11/21 19:25:55 MJobRemove(681684[35])
> 11/21 19:25:55 MJobDestroy(681684[35])
> [snap]
>
> pbs_server decides at 19:25:11, after 3 hours runtime, that the job is
> unknown (grepped by JobID from the server logs):
> [snip]
> 11/21/2012
> 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate:
> setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to
> RUNNING-RUNNING (4-42)
> 11/21/2012
> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate:
> setting job 681684[35].cheops10 state from RUNNING-RUNNING to
> QUEUED-SUBSTATE55 (1-55)
> 11/21/2012
> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate:
> setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to
> EXITING-SUBSTATE55 (5-55)
> 11/21/2012
> 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing from
> smp, state EXITING
> 11/21/2012
> 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray
> job 681684[35].cheops10 found on cheops21316
> [snap]
>
> pbs_client just kills the processes:
> [snip]
> 11/21/2012 16:23:43;0001;   pbs_mom.32254;Job;TMomFinalizeJob3;job
> 681684[35].cheops10 started, pid = 17452
> 11/21/2012 19:25:14;0008;
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17452 task
> 1 gracefully with sig 15
> 11/21/2012 19:25:14;0008;
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> (pid=17452/state=R) after sig 15
> 11/21/2012 19:25:14;0008;
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> (pid=17452/state=Z) after sig 15
> 11/21/2012 19:25:14;0008;
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17692 task
> 1 gracefully with sig 15
> 11/21/2012 19:25:14;0008;
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> (pid=17692/state=R) after sig 15
> 11/21/2012 19:25:14;0008;
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17703 task
> 1 gracefully with sig 15
> 11/21/2012 19:25:14;0008;
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> (pid=17703/state=R) after sig 15
> 11/21/2012 19:25:14;0008;
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17731 task
> 1 gracefully with sig 15
> 11/21/2012 19:25:14;0008;
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> (pid=17731/state=R) after sig 15
> 11/21/2012 19:25:15;0080;
> pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job
> 681684[35].cheops10 task 1 terminated, sid=17452
> 11/21/2012 19:25:15;0008;   pbs_mom.32254;Job;681684[35].cheops10;job
> was terminated
> 11/21/2012 19:25:50;0001;
> pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on server,
> deleting locally
> 11/21/2012 19:25:50;0080;
> pbs_mom.32254;Job;681684[35].cheops10;removed job script
> [snap]
>
> Sometimes, the pbs_mom logs include this message before the killing starts:
> [snip]
> Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0,
> type=StatusJob, from PBS_Server at cheops10
> [snap]
>
> And finally, some job informations given to epilogue:
> [snip]
> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared:
> 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue
> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared:
> 681684[35].cheops10,hthiele0,cheops21316,Job Information:
> userid=hthiele0,
> resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00',
> resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34',
> queue=smp, account=ccg-ngs, exitcode=271
> [snap]
>
> This happens rarely (about 1 in 3000). However, silent deletions of
> random jobs aren't exactly a trifling matter.
> I could try to disable the mom_job_sync option, which could perhaps
> prevent the process killing of unknown jobs, but it would also leave
> corrupt/pre-execution jobs alive.
>
> Can this be fixed?
>
> On a side-note, here are some further, minor Bugs I've noticed in the
> Torque 4.1.3. Version:
> - the epilogue script is usually invoked twice and sometimes even
> several times
> - when explicit node lists are used, e.g. nodes=node1:ppn=2+node2:ppn=2,
> then the number of "tasks" as seen by qstat is zero
> - there have been some API changes between Torque 2.x and Torque 4.x, so
> that two maui calls had to be altered in order to build against Torque
> 4.x (get_svrport, openrm).
>
>
> Regards,
> Lech Nieroda
>


-- 
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)
Universität zu Köln
Weyertal 121
Raum 309 (3. Etage)
D-50931 Köln
Deutschland

Tel.: +49 (221) 470-89606
E-Mail: nieroda.lech at uni-koeln.de


More information about the torqueusers mailing list