[torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3

Lech Nieroda nieroda.lech at uni-koeln.de
Wed Dec 12 09:57:55 MST 2012


Dear list,

the problem continues to persist: ca. 1 every 1000-2000 jobs is killed. 
We've noticed that it almost disappears when the pbs_server loglevel is 
raised to 7 (ca. one occurrence of the error per week). Considering that 
higher loglevels simply generate more output and may thus prolong the 
execution of some tasks for a minuscule amount of time, this may turn 
out to be a race condition issue.
We can't permanently set the loglevel to 7 though, because of two 
problems: the daily logfile size is ca. 30GB and pbs_server crashes 
every 2-3 days.

The error: initially, the server sends an abort for no apparent reason, 
however it mails the user "Job does not exist on node". Even though the 
pbs_mom sends the correct joblist to the pbs_server, it receives a 
SIGKILL signal for the job (here with the jobid 721711, no dependencies, 
etc) without an error message of its own:

[pbs_server snip]
12/12/2012 02:49:46;0004;PBS_Server.27962;Svr;svr_connect;attempting 
connect to host 172.18.1.220 port 15002
12/12/2012 
02:49:46;0008;PBS_Server.27962;Job;svr_setjobstate;svr_setjobstate: 
setting job 721711.cheops10 state from RUNNING-RUNNING to 
QUEUED-SUBSTATE55 (1-55)
[...]
12/12/2012 02:49:46;000d;PBS_Server.27962;Job;721711.cheops10;preparing 
to send 'a' mail for job 721711.cheops10 to fernandl at cheops0 (Job does 
not exist on node)
[pbs_server snap]

[pbs_mom snip]
12/12/2012 02:49:50;0002; 
pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending 
to server "jobs=722456.cheops10 721711.cheops10 722457.cheops10 
722458.cheops10"
12/12/2012 02:49:50;0002; 
pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending 
to server "varattr= "
12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;read_tcp_reply;protocol: 4 
  version: 3  command:4  sock:8
12/12/2012 02:49:50;0002; 
pbs_mom.31686;n/a;mom_server_update_stat;status update successfully sent 
to cheops10
12/12/2012 02:49:50;0080;   pbs_mom.31686;Req;dis_request_read;decoding 
command SignalJob from PBS_Server
12/12/2012 02:49:50;0100;   pbs_mom.31686;Req;;Type SignalJob request 
received from PBS_Server at cheops10, sock=8
12/12/2012 02:49:50;0008; 
pbs_mom.31686;Job;mom_process_request;request type SignalJob from host 
cheops10 received
12/12/2012 02:49:50;0008; 
pbs_mom.31686;Job;mom_process_request;request type SignalJob from host 
cheops10 allowed
12/12/2012 02:49:50;0008; 
pbs_mom.31686;Job;mom_dispatch_request;dispatching request SignalJob on sd=8
12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;721711.cheops10;signaling 
job with signal SIGKILL
12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;kill_job;req_signaljob: 
sending signal 9, "KILL" to job 721711.cheops10, reason: killing job
[...]
12/12/2012 02:50:35;0080;   pbs_mom.31686;Svr;preobit_reply;top of 
preobit_reply
12/12/2012 02:50:35;0080; 
pbs_mom.31686;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr 
worked, top of while loop
12/12/2012 02:50:35;0001; 
pbs_mom.31686;Job;721711.cheops10;preobit_reply, unknown on server, 
deleting locally
[pbs_mom snap]

Ideas? Fixes?

Regards,
Lech Nieroda



In case it helps to analyze the bug

On 26.11.2012 14:02, Lech Nieroda wrote:
> Dear list,
>
> I've raised the loglevel on one of the clients to loglevel=7 in order to
> collect more information on the event.
>
> maui just assumes a successful completion of the job.
>
> pbs_server log regarding the job 686938[4] that suddenly becomes unknown
> at 00:15:21:
> [snip]
> /var/spool/torque/server_logs/20121125:11/25/2012
> 10:52:15;0008;PBS_Server.12742;Job;686938[4].cheops10;Job Run at request
> of maui at localhost.localdomain
> /var/spool/torque/server_logs/20121125:11/25/2012
> 10:52:15;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate:
> setting job 686938[4].cheops10 state from QUEUED-QUEUED to
> RUNNING-PRERUN (4-40)
> /var/spool/torque/server_logs/20121125:11/25/2012
> 10:52:16;0002;PBS_Server.12742;Job;686938[4].cheops10;child reported
> success for job after 1 seconds (dest=???), rc=0
> /var/spool/torque/server_logs/20121125:11/25/2012
> 10:52:16;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate:
> setting job 686938[4].cheops10 state from RUNNING-TRNOUTCM to
> RUNNING-RUNNING (4-42)
> /var/spool/torque/server_logs/20121125:11/25/2012
> 10:52:16;000d;PBS_Server.12742;Job;686938[4].cheops10;Not sending email:
> User does not want mail of this type.
> /var/spool/torque/server_logs/20121125:11/25/2012
> 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate:
> setting job 686938[4].cheops10 state from RUNNING-RUNNING to
> QUEUED-SUBSTATE55 (1-55)
> /var/spool/torque/server_logs/20121126:11/26/2012
> 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate:
> setting job 686938[4].cheops10 state from QUEUED-SUBSTATE55 to
> EXITING-SUBSTATE55 (5-55)
> /var/spool/torque/server_logs/20121126:11/26/2012
> 00:15:21;0100;PBS_Server.12702;Job;686938[4].cheops10;dequeuing from
> smp, state EXITING
> /var/spool/torque/server_logs/20121126:11/26/2012
> 00:16:02;0001;PBS_Server.12640;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray
> job 686938[4].cheops10 found on cheops11801
> [snap]
>
> pbs_mom log at loglevel 7, shortly before the kill:
> [snip]
> 11/26/2012 00:14:26;0008;   pbs_mom.26562;Req;send_sisters;sending
> command POLL_JOB for job 686938[4].cheops10 (7)
> 11/26/2012 00:15:02;0002;
> pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: sending
> to server "jobs=686659[1].cheops10 686694[8].cheops10
> 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10
> 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10
> 686941[130].cheops10 686941[132].cheops10 686941[133].cheops10
> 686941[150].cheops10 686941[151].cheops10 686941[153].cheops10
> 686941[186].cheops10 686941[187].cheops10 686941[188].cheops10
> 686941[212].cheops10 687103[1].cheops10 687224.cheops10
> 687252[1].cheops10 687279[1].cheops10 687279[2].cheops10
> 687279[3].cheops10 687334[1].cheops10 687334[2].cheops10
> 687367[1].cheops10 687367[2].cheops10 687366.cheops10"
> 11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:15:11;0008;   pbs_mom.26562;Req;send_sisters;sending
> command POLL_JOB for job 686938[4].cheops10 (7)
> 11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:15:56;0008;   pbs_mom.26562;Req;send_sisters;sending
> command POLL_JOB for job 686938[4].cheops10 (7)
> 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:16:00;0008;
> pbs_mom.26562;Job;686938[4].cheops10;scan_for_exiting:job is in
> non-exiting substate RUNNING, no obit sent at this time
> 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop
> start - jobid = 686938[4].cheops10
> 11/26/2012 00:16:01;0080;
> pbs_mom.26562;Job;686938[4].cheops10;checking job w/subtask pid=0 (child
> pid=4214)
> 11/26/2012 00:16:02;0002;
> pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: sending
> to server "jobs=686659[1].cheops10 686694[8].cheops10
> 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10
> 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10
> 686941[130].cheops10 686941[132].cheops10 686941[150].cheops10
> 686941[151].cheops10 686941[153].cheops10 686941[186].cheops10
> 686941[187].cheops10 686941[188].cheops10 686941[212].cheops10
> 687103[1].cheops10 687224.cheops10 687252[1].cheops10 687279[1].cheops10
> 687279[2].cheops10 687279[3].cheops10 687334[1].cheops10
> 687334[2].cheops10 687367[1].cheops10 687367[2].cheops10 687366.cheops10"
> 11/26/2012 00:16:02;0008;
> pbs_mom.26562;Job;686938[4].cheops10;signaling job with signal SIGKILL
> 11/26/2012 00:16:02;0008;   pbs_mom.26562;Job;kill_job;req_signaljob:
> sending signal 9, "KILL" to job 686938[4].cheops10, reason: killing job
> [snap]
>
> As we can see, the pbs_mom sends complete job lists to the pbs_server
> right before (00:15:02) and after (00:16:02) the SUBSTATE55 setting
> (00:15:21), where the job 686938[4].cheops10 is included. However, the
> server claims that the job is unknown/doesn't exist and sends a kill
> command which the pbs_mom then executes (0:16:02).
>
> I hope this helps in tracking down the bug.
>
> Regards,
> Lech Nieroda
>
>
> On 22.11.2012 16:46, Lech Nieroda wrote:
>> Dear list,
>>
>> we have another serious problem since our upgrade to Torque 4.1.3. We
>> are using it with Maui 3.3.1. The problem in a nutshell: some few,
>> random jobs are suddenly "unknown" to the server, it changes their
>> status to EXITING-SUBSTATE55 and requests a silent kill on the compute
>> nodes. The job then dies, the processes are killed on the node, there is
>> no "Exit_status" in the server-log, no entry in maui/stats, no
>> stdout/stderr files. The users are, understandably, not amused.
>>
>> It doesn't seem to be user or application specific. Even a single
>> instance from a job array can get blown away in this way while all other
>> instances end normally.
>>
>> Here are some logs of such a job (681684[35]):
>>
>> maui just assumes a successful completion:
>> [snip]
>> 11/21 19:24:49 MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0)
>> 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is  1.000000,
>> 1.000000, 1
>> 11/21 19:25:55 INFO:     active PBS job 681684[35] has been removed from
>> the queue.  assuming successful completion
>> 11/21 19:25:55 MJobProcessCompleted(681684[35])
>> 11/21 19:25:55 INFO:     job '681684[35]' completed  X: 0.063356  T:
>> 10903  PS: 10903  A: 0.063096
>> 11/21 19:25:55 MJobSendFB(681684[35])
>> 11/21 19:25:55 INFO:     job usage sent for job '681684[35]'
>> 11/21 19:25:55 MJobRemove(681684[35])
>> 11/21 19:25:55 MJobDestroy(681684[35])
>> [snap]
>>
>> pbs_server decides at 19:25:11, after 3 hours runtime, that the job is
>> unknown (grepped by JobID from the server logs):
>> [snip]
>> 11/21/2012
>> 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate:
>> setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to
>> RUNNING-RUNNING (4-42)
>> 11/21/2012
>> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate:
>> setting job 681684[35].cheops10 state from RUNNING-RUNNING to
>> QUEUED-SUBSTATE55 (1-55)
>> 11/21/2012
>> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate:
>> setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to
>> EXITING-SUBSTATE55 (5-55)
>> 11/21/2012
>> 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing from
>> smp, state EXITING
>> 11/21/2012
>> 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray
>> job 681684[35].cheops10 found on cheops21316
>> [snap]
>>
>> pbs_client just kills the processes:
>> [snip]
>> 11/21/2012 16:23:43;0001;   pbs_mom.32254;Job;TMomFinalizeJob3;job
>> 681684[35].cheops10 started, pid = 17452
>> 11/21/2012 19:25:14;0008;
>> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17452 task
>> 1 gracefully with sig 15
>> 11/21/2012 19:25:14;0008;
>> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
>> (pid=17452/state=R) after sig 15
>> 11/21/2012 19:25:14;0008;
>> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
>> (pid=17452/state=Z) after sig 15
>> 11/21/2012 19:25:14;0008;
>> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17692 task
>> 1 gracefully with sig 15
>> 11/21/2012 19:25:14;0008;
>> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
>> (pid=17692/state=R) after sig 15
>> 11/21/2012 19:25:14;0008;
>> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17703 task
>> 1 gracefully with sig 15
>> 11/21/2012 19:25:14;0008;
>> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
>> (pid=17703/state=R) after sig 15
>> 11/21/2012 19:25:14;0008;
>> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17731 task
>> 1 gracefully with sig 15
>> 11/21/2012 19:25:14;0008;
>> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
>> (pid=17731/state=R) after sig 15
>> 11/21/2012 19:25:15;0080;
>> pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job
>> 681684[35].cheops10 task 1 terminated, sid=17452
>> 11/21/2012 19:25:15;0008;   pbs_mom.32254;Job;681684[35].cheops10;job
>> was terminated
>> 11/21/2012 19:25:50;0001;
>> pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on server,
>> deleting locally
>> 11/21/2012 19:25:50;0080;
>> pbs_mom.32254;Job;681684[35].cheops10;removed job script
>> [snap]
>>
>> Sometimes, the pbs_mom logs include this message before the killing starts:
>> [snip]
>> Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0,
>> type=StatusJob, from PBS_Server at cheops10
>> [snap]
>>
>> And finally, some job informations given to epilogue:
>> [snip]
>> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared:
>> 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue
>> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared:
>> 681684[35].cheops10,hthiele0,cheops21316,Job Information:
>> userid=hthiele0,
>> resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00',
>> resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34',
>> queue=smp, account=ccg-ngs, exitcode=271
>> [snap]
>>
>> This happens rarely (about 1 in 3000). However, silent deletions of
>> random jobs aren't exactly a trifling matter.
>> I could try to disable the mom_job_sync option, which could perhaps
>> prevent the process killing of unknown jobs, but it would also leave
>> corrupt/pre-execution jobs alive.
>>
>> Can this be fixed?
>>
>> On a side-note, here are some further, minor Bugs I've noticed in the
>> Torque 4.1.3. Version:
>> - the epilogue script is usually invoked twice and sometimes even
>> several times
>> - when explicit node lists are used, e.g. nodes=node1:ppn=2+node2:ppn=2,
>> then the number of "tasks" as seen by qstat is zero
>> - there have been some API changes between Torque 2.x and Torque 4.x, so
>> that two maui calls had to be altered in order to build against Torque
>> 4.x (get_svrport, openrm).
>>
>>
>> Regards,
>> Lech Nieroda
>>
>
>


-- 
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)
Universität zu Köln
Weyertal 121
Raum 309 (3. Etage)
D-50931 Köln
Deutschland

Tel.: +49 (221) 470-89606
E-Mail: nieroda.lech at uni-koeln.de


More information about the torqueusers mailing list