[torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3

David Beer dbeer at adaptivecomputing.com
Wed Dec 12 10:16:58 MST 2012


Lech,

From your log snippet, it appears that the server sent a poll request to
the mom (for the job) and the mom replied that it didn't know about that
job, which caused the job to be cancelled on the server's side. I'm going
to do a little more investigation to see if there's some way that the poll
request to the mom can time out or otherwise fail and erroneously create
this condition. If you can provide any more details about how this
scenario comes about, that would be great. Helpful information would
include the typical lengths of the jobs that are running and whether they
are parallel or serial (or an estimate of the ratio of parallel to
serial); details like that might help us figure it out.
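
To make the suspected failure mode concrete, here is a rough sketch (an
illustration only, not the actual TORQUE source; every name in it is made
up): if the reply to the poll is lost or times out, and that error path is
handled the same way as an explicit "unknown job" answer from the mom, the
server would purge and kill a perfectly healthy job.

  /* poll_sketch.c - illustration of the suspected failure mode, not TORQUE code */
  #include <stdio.h>

  enum poll_result
    {
    POLL_JOB_PRESENT,   /* mom confirmed it is running the job      */
    POLL_JOB_UNKNOWN,   /* mom explicitly answered "unknown job id" */
    POLL_NO_REPLY       /* timeout or dropped connection, no answer */
    };

  /* Correct handling: only abort on an explicit "unknown job" answer. */
  static void handle_poll_reply_correct(const char *jobid, enum poll_result r)
    {
    if (r == POLL_JOB_UNKNOWN)
      printf("%s: abort job (Job does not exist on node)\n", jobid);
    else if (r == POLL_NO_REPLY)
      printf("%s: poll failed, retry later, do NOT abort\n", jobid);
    else
      printf("%s: job confirmed by mom\n", jobid);
    }

  /* Suspected bug: a failed or timed-out poll falls into the same error
   * path as "unknown job", so the job is aborted and SIGKILLed anyway. */
  static void handle_poll_reply_buggy(const char *jobid, enum poll_result r)
    {
    if (r != POLL_JOB_PRESENT)
      printf("%s: abort job (Job does not exist on node)\n", jobid);
    else
      printf("%s: job confirmed by mom\n", jobid);
    }

  int main(void)
    {
    handle_poll_reply_correct("721711.cheops10", POLL_NO_REPLY);
    handle_poll_reply_buggy("721711.cheops10", POLL_NO_REPLY);
    return 0;
    }

If that turns out to be the failure mode, the fix on our side would be to
make sure a transient poll failure is retried rather than being treated as
the mom disowning the job.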

David

On Wed, Dec 12, 2012 at 9:57 AM, Lech Nieroda <nieroda.lech at uni-koeln.de> wrote:

> Dear list,
>
> the problem persists: roughly 1 in every 1000-2000 jobs is killed.
> We've noticed that it almost disappears when the pbs_server loglevel is
> raised to 7 (roughly one occurrence of the error per week). Considering
> that higher loglevels simply generate more output and may thus prolong
> the execution of some tasks by a minuscule amount of time, this may turn
> out to be a race condition.
> We can't permanently set the loglevel to 7, though, because of two
> problems: the daily logfile size is about 30GB and pbs_server crashes
> every 2-3 days.
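>
> To illustrate the kind of race I have in mind, here is a toy sketch
> (made-up names only, not TORQUE code): if one server thread briefly
> empties and rebuilds its view of the node's job list while another
> thread answers "is this job still on the node?", the checker can hit the
> empty window and conclude the job is gone. Anything that changes the
> relative timing of the two threads, such as extra log output, can make
> that window much harder (or easier) to hit, which would fit the
> observation that the error almost vanishes at loglevel 7.
>
> /* race_sketch.c - toy illustration only, not TORQUE code.
>  * Build with: gcc -pthread race_sketch.c */
> #include <pthread.h>
> #include <stdio.h>
> #include <string.h>
>
> static char node_jobs[64];   /* shared "job list", deliberately unlocked */
>
> /* "status update" thread: tears down and rebuilds the job list */
> static void *apply_status_update(void *arg)
>   {
>   (void)arg;
>   for (int i = 0; i < 1000000; i++)
>     {
>     node_jobs[0] = '\0';                    /* list briefly empty...  */
>     /* extra log output here would change how long that window lasts  */
>     strcpy(node_jobs, "721711.cheops10");   /* ...then repopulated    */
>     }
>   return NULL;
>   }
>
> int main(void)
>   {
>   pthread_t t;
>   volatile char *list = node_jobs;          /* force re-reads below */
>
>   strcpy(node_jobs, "721711.cheops10");
>   pthread_create(&t, NULL, apply_status_update, NULL);
>
>   /* "poll handler": asks whether the job is still in the list */
>   for (int i = 0; i < 1000000; i++)
>     if (list[0] == '\0')
>       {
>       printf("job looks unknown -> would be aborted and killed\n");
>       break;
>       }
>
>   pthread_join(&t, NULL);
>   return 0;
>   }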
>
> The error: the server sends an abort for no apparent reason, but mails
> the user "Job does not exist on node". Even though the pbs_mom sends the
> correct job list to the pbs_server, it receives a SIGKILL request for
> the job (here with jobid 721711, no dependencies, etc.) without logging
> an error message of its own:
>
> [pbs_server snip]
> 12/12/2012 02:49:46;0004;PBS_Server.27962;Svr;svr_connect;attempting
> connect to host 172.18.1.220 port 15002
> 12/12/2012
> 02:49:46;0008;PBS_Server.27962;Job;svr_setjobstate;svr_setjobstate:
> setting job 721711.cheops10 state from RUNNING-RUNNING to
> QUEUED-SUBSTATE55 (1-55)
> [...]
> 12/12/2012 02:49:46;000d;PBS_Server.27962;Job;721711.cheops10;preparing
> to send 'a' mail for job 721711.cheops10 to fernandl at cheops0 (Job does
> not exist on node)
> [pbs_server snap]
>
> [pbs_mom snip]
> 12/12/2012 02:49:50;0002;
> pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending
> to server "jobs=722456.cheops10 721711.cheops10 722457.cheops10
> 722458.cheops10"
> 12/12/2012 02:49:50;0002;
> pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending
> to server "varattr= "
> 12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;read_tcp_reply;protocol: 4
>   version: 3  command:4  sock:8
> 12/12/2012 02:49:50;0002;
> pbs_mom.31686;n/a;mom_server_update_stat;status update successfully sent
> to cheops10
> 12/12/2012 02:49:50;0080;   pbs_mom.31686;Req;dis_request_read;decoding
> command SignalJob from PBS_Server
> 12/12/2012 02:49:50;0100;   pbs_mom.31686;Req;;Type SignalJob request
> received from PBS_Server at cheops10, sock=8
> 12/12/2012 02:49:50;0008;
> pbs_mom.31686;Job;mom_process_request;request type SignalJob from host
> cheops10 received
> 12/12/2012 02:49:50;0008;
> pbs_mom.31686;Job;mom_process_request;request type SignalJob from host
> cheops10 allowed
> 12/12/2012 02:49:50;0008;
> pbs_mom.31686;Job;mom_dispatch_request;dispatching request SignalJob on
> sd=8
> 12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;721711.cheops10;signaling
> job with signal SIGKILL
> 12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;kill_job;req_signaljob:
> sending signal 9, "KILL" to job 721711.cheops10, reason: killing job
> [...]
> 12/12/2012 02:50:35;0080;   pbs_mom.31686;Svr;preobit_reply;top of
> preobit_reply
> 12/12/2012 02:50:35;0080;
> pbs_mom.31686;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr
> worked, top of while loop
> 12/12/2012 02:50:35;0001;
> pbs_mom.31686;Job;721711.cheops10;preobit_reply, unknown on server,
> deleting locally
> [pbs_mom snap]
>
> Ideas? Fixes?
>
> Regards,
> Lech Nieroda
>
>
>
> In case it helps to analyze the bug, my earlier reports follow below.
>
> On 26.11.2012 14:02, Lech Nieroda wrote:
> > Dear list,
> >
> > I've raised the loglevel on one of the clients to loglevel=7 in order to
> > collect more information on the event.
> >
> > maui just assumes a successful completion of the job.
> >
> > pbs_server log regarding the job 686938[4] that suddenly becomes unknown
> > at 00:15:21:
> > [snip]
> > /var/spool/torque/server_logs/20121125:11/25/2012
> > 10:52:15;0008;PBS_Server.12742;Job;686938[4].cheops10;Job Run at request
> > of maui at localhost.localdomain
> > /var/spool/torque/server_logs/20121125:11/25/2012
> > 10:52:15;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate:
> > setting job 686938[4].cheops10 state from QUEUED-QUEUED to
> > RUNNING-PRERUN (4-40)
> > /var/spool/torque/server_logs/20121125:11/25/2012
> > 10:52:16;0002;PBS_Server.12742;Job;686938[4].cheops10;child reported
> > success for job after 1 seconds (dest=???), rc=0
> > /var/spool/torque/server_logs/20121125:11/25/2012
> > 10:52:16;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate:
> > setting job 686938[4].cheops10 state from RUNNING-TRNOUTCM to
> > RUNNING-RUNNING (4-42)
> > /var/spool/torque/server_logs/20121125:11/25/2012
> > 10:52:16;000d;PBS_Server.12742;Job;686938[4].cheops10;Not sending email:
> > User does not want mail of this type.
> > /var/spool/torque/server_logs/20121125:11/25/2012
> > 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate:
> > setting job 686938[4].cheops10 state from RUNNING-RUNNING to
> > QUEUED-SUBSTATE55 (1-55)
> > /var/spool/torque/server_logs/20121126:11/26/2012
> > 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate:
> > setting job 686938[4].cheops10 state from QUEUED-SUBSTATE55 to
> > EXITING-SUBSTATE55 (5-55)
> > /var/spool/torque/server_logs/20121126:11/26/2012
> > 00:15:21;0100;PBS_Server.12702;Job;686938[4].cheops10;dequeuing from
> > smp, state EXITING
> > /var/spool/torque/server_logs/20121126:11/26/2012
> > 00:16:02;0001;PBS_Server.12640;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray
> > job 686938[4].cheops10 found on cheops11801
> > [snap]
> >
> > pbs_mom log at loglevel 7, shortly before the kill:
> > [snip]
> > 11/26/2012 00:14:26;0008;   pbs_mom.26562;Req;send_sisters;sending
> > command POLL_JOB for job 686938[4].cheops10 (7)
> > 11/26/2012 00:15:02;0002;
> > pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: sending
> > to server "jobs=686659[1].cheops10 686694[8].cheops10
> > 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10
> > 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10
> > 686941[130].cheops10 686941[132].cheops10 686941[133].cheops10
> > 686941[150].cheops10 686941[151].cheops10 686941[153].cheops10
> > 686941[186].cheops10 686941[187].cheops10 686941[188].cheops10
> > 686941[212].cheops10 687103[1].cheops10 687224.cheops10
> > 687252[1].cheops10 687279[1].cheops10 687279[2].cheops10
> > 687279[3].cheops10 687334[1].cheops10 687334[2].cheops10
> > 687367[1].cheops10 687367[2].cheops10 687366.cheops10"
> > 11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:15:11;0008;   pbs_mom.26562;Req;send_sisters;sending
> > command POLL_JOB for job 686938[4].cheops10 (7)
> > 11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:15:56;0008;   pbs_mom.26562;Req;send_sisters;sending
> > command POLL_JOB for job 686938[4].cheops10 (7)
> > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:16:00;0008;
> > pbs_mom.26562;Job;686938[4].cheops10;scan_for_exiting:job is in
> > non-exiting substate RUNNING, no obit sent at this time
> > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop
> > start - jobid = 686938[4].cheops10
> > 11/26/2012 00:16:01;0080;
> > pbs_mom.26562;Job;686938[4].cheops10;checking job w/subtask pid=0 (child
> > pid=4214)
> > 11/26/2012 00:16:02;0002;
> > pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: sending
> > to server "jobs=686659[1].cheops10 686694[8].cheops10
> > 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10
> > 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10
> > 686941[130].cheops10 686941[132].cheops10 686941[150].cheops10
> > 686941[151].cheops10 686941[153].cheops10 686941[186].cheops10
> > 686941[187].cheops10 686941[188].cheops10 686941[212].cheops10
> > 687103[1].cheops10 687224.cheops10 687252[1].cheops10 687279[1].cheops10
> > 687279[2].cheops10 687279[3].cheops10 687334[1].cheops10
> > 687334[2].cheops10 687367[1].cheops10 687367[2].cheops10 687366.cheops10"
> > 11/26/2012 00:16:02;0008;
> > pbs_mom.26562;Job;686938[4].cheops10;signaling job with signal SIGKILL
> > 11/26/2012 00:16:02;0008;   pbs_mom.26562;Job;kill_job;req_signaljob:
> > sending signal 9, "KILL" to job 686938[4].cheops10, reason: killing job
> > [snap]
> >
> > As we can see, the pbs_mom sends complete job lists to the pbs_server
> > right before (00:15:02) and right after (00:16:02) the SUBSTATE55
> > transition (00:15:21), and the job 686938[4].cheops10 is included in
> > both. Nevertheless, the server claims that the job is unknown/doesn't
> > exist and sends a kill command, which the pbs_mom then executes
> > (00:16:02).
> >
> > I hope this helps in tracking down the bug.
> >
> > Regards,
> > Lech Nieroda
> >
> >
> > On 22.11.2012 16:46, Lech Nieroda wrote:
> >> Dear list,
> >>
> >> we have had another serious problem since our upgrade to Torque 4.1.3,
> >> which we use with Maui 3.3.1. The problem in a nutshell: a few random
> >> jobs suddenly become "unknown" to the server, which changes their state
> >> to EXITING-SUBSTATE55 and requests a silent kill on the compute nodes.
> >> The job then dies, its processes are killed on the node, and there is
> >> no "Exit_status" in the server log, no entry in maui/stats, and no
> >> stdout/stderr files. The users are, understandably, not amused.
> >>
> >> It doesn't seem to be user or application specific. Even a single
> >> instance from a job array can get blown away in this way while all other
> >> instances end normally.
> >>
> >> Here are some logs of such a job (681684[35]):
> >>
> >> maui just assumes a successful completion:
> >> [snip]
> >> 11/21 19:24:49 MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0)
> >> 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is  1.000000,
> >> 1.000000, 1
> >> 11/21 19:25:55 INFO:     active PBS job 681684[35] has been removed from
> >> the queue.  assuming successful completion
> >> 11/21 19:25:55 MJobProcessCompleted(681684[35])
> >> 11/21 19:25:55 INFO:     job '681684[35]' completed  X: 0.063356  T:
> >> 10903  PS: 10903  A: 0.063096
> >> 11/21 19:25:55 MJobSendFB(681684[35])
> >> 11/21 19:25:55 INFO:     job usage sent for job '681684[35]'
> >> 11/21 19:25:55 MJobRemove(681684[35])
> >> 11/21 19:25:55 MJobDestroy(681684[35])
> >> [snap]
> >>
> >> pbs_server decides at 19:25:11, after 3 hours runtime, that the job is
> >> unknown (grepped by JobID from the server logs):
> >> [snip]
> >> 11/21/2012
> >> 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate:
> >> setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to
> >> RUNNING-RUNNING (4-42)
> >> 11/21/2012
> >> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate:
> >> setting job 681684[35].cheops10 state from RUNNING-RUNNING to
> >> QUEUED-SUBSTATE55 (1-55)
> >> 11/21/2012
> >> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate:
> >> setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to
> >> EXITING-SUBSTATE55 (5-55)
> >> 11/21/2012
> >> 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing from
> >> smp, state EXITING
> >> 11/21/2012
> >> 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray
> >> job 681684[35].cheops10 found on cheops21316
> >> [snap]
> >>
> >> The pbs_mom just kills the processes:
> >> [snip]
> >> 11/21/2012 16:23:43;0001;   pbs_mom.32254;Job;TMomFinalizeJob3;job
> >> 681684[35].cheops10 started, pid = 17452
> >> 11/21/2012 19:25:14;0008;
> >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17452 task
> >> 1 gracefully with sig 15
> >> 11/21/2012 19:25:14;0008;
> >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> >> (pid=17452/state=R) after sig 15
> >> 11/21/2012 19:25:14;0008;
> >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> >> (pid=17452/state=Z) after sig 15
> >> 11/21/2012 19:25:14;0008;
> >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17692 task
> >> 1 gracefully with sig 15
> >> 11/21/2012 19:25:14;0008;
> >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> >> (pid=17692/state=R) after sig 15
> >> 11/21/2012 19:25:14;0008;
> >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17703 task
> >> 1 gracefully with sig 15
> >> 11/21/2012 19:25:14;0008;
> >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> >> (pid=17703/state=R) after sig 15
> >> 11/21/2012 19:25:14;0008;
> >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17731 task
> >> 1 gracefully with sig 15
> >> 11/21/2012 19:25:14;0008;
> >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> >> (pid=17731/state=R) after sig 15
> >> 11/21/2012 19:25:15;0080;
> >> pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job
> >> 681684[35].cheops10 task 1 terminated, sid=17452
> >> 11/21/2012 19:25:15;0008;   pbs_mom.32254;Job;681684[35].cheops10;job
> >> was terminated
> >> 11/21/2012 19:25:50;0001;
> >> pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on server,
> >> deleting locally
> >> 11/21/2012 19:25:50;0080;
> >> pbs_mom.32254;Job;681684[35].cheops10;removed job script
> >> [snap]
> >>
> >> Sometimes, the pbs_mom logs include this message before the killing
> >> starts:
> >> [snip]
> >> Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0,
> >> type=StatusJob, from PBS_Server at cheops10
> >> [snap]
> >>
> >> And finally, some job information passed to the epilogue:
> >> [snip]
> >> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared:
> >> 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue
> >> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared:
> >> 681684[35].cheops10,hthiele0,cheops21316,Job Information:
> >> userid=hthiele0,
> >> resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00',
> >> resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34',
> >> queue=smp, account=ccg-ngs, exitcode=271
> >> [snap]
> >>
> >> This happens rarely (about 1 in 3000). However, silent deletions of
> >> random jobs aren't exactly a trifling matter.
> >> I could try to disable the mom_job_sync option, which could perhaps
> >> prevent the process killing of unknown jobs, but it would also leave
> >> corrupt/pre-execution jobs alive.
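> >>
> >> (For the record, if I do go that route it should just be a server
> >> attribute change, presumably something along the lines of
> >>   qmgr -c 'set server mom_job_sync = False'
> >> but that only masks the symptom; jobs shouldn't become unknown to the
> >> server in the first place.)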
> >>
> >> Can this be fixed?
> >>
> >> On a side note, here are some further minor bugs I've noticed in
> >> Torque 4.1.3:
> >> - the epilogue script is usually invoked twice and sometimes even
> >> several times
> >> - when explicit node lists are used, e.g. nodes=node1:ppn=2+node2:ppn=2,
> >> then the number of "tasks" as seen by qstat is zero
> >> - there have been some API changes between Torque 2.x and Torque 4.x, so
> >> that two maui calls had to be altered in order to build against Torque
> >> 4.x (get_svrport, openrm).
> >>
> >>
> >> Regards,
> >> Lech Nieroda
> >>
> >
> >
>
>
> --
> Dipl.-Wirt.-Inf. Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)
> Universität zu Köln
> Weyertal 121
> Room 309 (3rd floor)
> D-50931 Köln
> Germany
>
> Tel.: +49 (221) 470-89606
> E-Mail: nieroda.lech at uni-koeln.de
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>



-- 
David Beer | Senior Software Engineer
Adaptive Computing