[torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3

David Beer dbeer at adaptivecomputing.com
Thu Aug 1 10:03:04 MDT 2013


There have been two fixes for this issue:

1. Add more logging and checking to verify that the mother superior is
actually rejecting the specified job. This fix went into 4.1.6/4.2.3 and
resolved the problem for most users who reported it.
2. Have pbs_server remember when the mother superior last reported on the
job, and not abort the job for this reason if the mother superior has
reported it to pbs_server within the last 180 seconds. This fix has been
released with 4.2.4 and will be released with 4.1.7. All of the users I
know of who were still experiencing this defect after 4.1.6 are no longer
seeing it with this change in place.
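
The second fix can be sketched roughly as follows. This is an illustrative
model only, not the actual Torque source: the names (record_mom_report,
may_abort_as_stray, MOM_REPORT_GRACE_SECONDS) are hypothetical, but the
logic matches the description above — the server timestamps each report
from the mother superior and refuses to abort a job as "stray" if a report
arrived within the last 180 seconds.

```c
#include <time.h>

/* Grace window described above: do not abort a job as stray if the
 * mother superior has reported on it within this many seconds. */
#define MOM_REPORT_GRACE_SECONDS 180

struct job_info {
    time_t last_mom_report;  /* when mother superior last reported this job */
};

/* Called whenever the mother superior sends a status update for the job. */
void record_mom_report(struct job_info *job, time_t now)
{
    job->last_mom_report = now;
}

/* Return 1 only if the job may be aborted as "stray", i.e. the mother
 * superior has NOT reported on it within the grace window. */
int may_abort_as_stray(const struct job_info *job, time_t now)
{
    return (now - job->last_mom_report) > MOM_REPORT_GRACE_SECONDS;
}
```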

David


> On Thu, Aug 1, 2013 at 6:33 AM, Kenneth Hoste <kenneth.hoste at ugent.be> wrote:

> Was this problem ever resolved?
>
> I noticed through
> http://www.clusterresources.com/pipermail/torqueusers/2012-December/015352.html
> that David looked into this, but the archive doesn't show any further
> followup.
>
> It seems we're currently suffering from a very similar problem with
> Torque v4.1.6...
>
>
> regards,
>
> Kenneth
>
> PS: Please keep me in CC when replying, for some reason I'm no longer
> receiving mails from torqueusers@ even though I'm subscribed...
>
>
> On 22 Nov 2012, at 16:46, Lech Nieroda wrote:
>
> > Dear list,
> >
> > we have another serious problem since our upgrade to Torque 4.1.3 (we
> > are using it with Maui 3.3.1). The problem in a nutshell: a few random
> > jobs suddenly become "unknown" to the server, which changes their
> > status to EXITING-SUBSTATE55 and requests a silent kill on the compute
> > nodes. The job then dies, the processes are killed on the node, there
> > is no "Exit_status" in the server log, no entry in maui/stats, and no
> > stdout/stderr files. The users are, understandably, not amused.
> >
> > It doesn't seem to be user or application specific. Even a single
> > instance from a job array can get blown away in this way while all other
> > instances end normally.
> >
> > Here are some logs of such a job (681684[35]):
> >
> > maui just assumes a successful completion:
> > [snip]
> > 11/21 19:24:49 MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0)
> > 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is  1.000000,
> > 1.000000, 1
> > 11/21 19:25:55 INFO:     active PBS job 681684[35] has been removed from
> > the queue.  assuming successful completion
> > 11/21 19:25:55 MJobProcessCompleted(681684[35])
> > 11/21 19:25:55 INFO:     job '681684[35]' completed  X: 0.063356  T:
> > 10903  PS: 10903  A: 0.063096
> > 11/21 19:25:55 MJobSendFB(681684[35])
> > 11/21 19:25:55 INFO:     job usage sent for job '681684[35]'
> > 11/21 19:25:55 MJobRemove(681684[35])
> > 11/21 19:25:55 MJobDestroy(681684[35])
> > [snap]
> >
> > pbs_server decides at 19:25:11, after 3 hours of runtime, that the job
> > is unknown (grepped by job ID from the server logs):
> > [snip]
> > 11/21/2012
> > 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate:
> > setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to
> > RUNNING-RUNNING (4-42)
> > 11/21/2012
> > 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate:
> > setting job 681684[35].cheops10 state from RUNNING-RUNNING to
> > QUEUED-SUBSTATE55 (1-55)
> > 11/21/2012
> > 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate:
> > setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to
> > EXITING-SUBSTATE55 (5-55)
> > 11/21/2012
> > 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing from
> > smp, state EXITING
> > 11/21/2012
> > 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom,
> > stray job 681684[35].cheops10 found on cheops21316
> > [snap]
> >
> > pbs_mom just kills the processes:
> > [snip]
> > 11/21/2012 16:23:43;0001;   pbs_mom.32254;Job;TMomFinalizeJob3;job
> > 681684[35].cheops10 started, pid = 17452
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17452 task
> > 1 gracefully with sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> > (pid=17452/state=R) after sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> > (pid=17452/state=Z) after sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17692 task
> > 1 gracefully with sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> > (pid=17692/state=R) after sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17703 task
> > 1 gracefully with sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> > (pid=17703/state=R) after sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17731 task
> > 1 gracefully with sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> > (pid=17731/state=R) after sig 15
> > 11/21/2012 19:25:15;0080;
> > pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job
> > 681684[35].cheops10 task 1 terminated, sid=17452
> > 11/21/2012 19:25:15;0008;   pbs_mom.32254;Job;681684[35].cheops10;job
> > was terminated
> > 11/21/2012 19:25:50;0001;
> > pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on server,
> > deleting locally
> > 11/21/2012 19:25:50;0080;
> > pbs_mom.32254;Job;681684[35].cheops10;removed job script
> > [snap]
> >
> > Sometimes, the pbs_mom logs include this message before the killing
> > starts:
> > [snip]
> > Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0,
> > type=StatusJob, from PBS_Server at cheops10
> > [snap]
> >
> > And finally, some job information given to the epilogue:
> > [snip]
> > Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared:
> > 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue
> > Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared:
> > 681684[35].cheops10,hthiele0,cheops21316,Job Information:
> > userid=hthiele0,
> > resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00',
> > resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34',
> > queue=smp, account=ccg-ngs, exitcode=271
> > [snap]
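
One detail worth noting in the epilogue output above is exitcode=271.
Torque reports a job killed by a signal with an exit code of 256 plus the
signal number, so 271 decodes to signal 15 (SIGTERM), which matches the
"kill_task ... sig 15" lines in the pbs_mom log. A minimal sketch of the
decoding (the function name is illustrative):

```c
/* Return the signal that killed the job, or 0 if the job exited
 * normally. Torque encodes death-by-signal as 256 + signal number,
 * so e.g. exitcode 271 means the job was killed by SIGTERM (15). */
int decode_pbs_exitcode(int exitcode)
{
    return (exitcode > 256) ? exitcode - 256 : 0;
}
```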
> >
> > This happens rarely (about 1 in 3000). However, silent deletions of
> > random jobs aren't exactly a trifling matter.
> > I could try to disable the mom_job_sync option, which could perhaps
> > prevent the process killing of unknown jobs, but it would also leave
> > corrupt/pre-execution jobs alive.
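
For reference, mom_job_sync is a pbs_server attribute, so disabling it as
suggested above would look roughly like this. This is a sketch only;
verify the attribute name and its side effects against the documentation
for your Torque version before changing it:

```shell
# Run as root on the pbs_server host. Disabling mom_job_sync stops the
# cleanup of jobs the server no longer knows about -- at the cost of
# leaving corrupt/pre-execution jobs alive on the nodes.
qmgr -c "set server mom_job_sync = False"

# Verify the current setting:
qmgr -c "print server" | grep mom_job_sync
```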
> >
> > Can this be fixed?
> >
> > On a side note, here are some further, minor bugs I've noticed in
> > Torque 4.1.3:
> > - the epilogue script is usually invoked twice, and sometimes even
> > several times
> > - when explicit node lists are used, e.g. nodes=node1:ppn=2+node2:ppn=2,
> > the number of "tasks" as seen by qstat is zero
> > - there have been some API changes between Torque 2.x and Torque 4.x,
> > so two Maui calls (get_svrport, openrm) had to be altered in order to
> > build against Torque 4.x.
> >
> >
> > Regards,
> > Lech Nieroda
> >
> > --
> > Dipl.-Wirt.-Inf. Lech Nieroda
> > Regionales Rechenzentrum der Universität zu Köln (RRZK)
> > Universität zu Köln
> > Weyertal 121
> > Raum 309 (3. Etage)
> > D-50931 Köln
> > Deutschland
> >
> > Tel.: +49 (221) 470-89606
> > E-Mail: nieroda.lech at uni-koeln.de
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>
>



-- 
David Beer | Senior Software Engineer
Adaptive Computing