[torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3

Kenneth Hoste kenneth.hoste at ugent.be
Fri Aug 2 08:49:51 MDT 2013


Hi David,

On 01 Aug 2013, at 18:03, David Beer wrote:

> There have been two fixes for this issue: 
> 
> 1. Add more logging and checking to verify that the mother superior is rejecting the specified job. This fix went into 4.1.6/4.2.3 and resolved the problem for most users who reported it.
> 2. Have pbs_server remember when the mother superior last reported on the job, and not abort the job for this reason if the mother superior has reported it to pbs_server within the last 180 seconds. This fix has been released with 4.2.4 and will be released with 4.1.7. Of the users I know of who were still experiencing this defect after 4.1.6, none are still seeing it with this change in place.
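> 
> To illustrate the idea (a rough sketch with made-up names, not the actual patch):
> 
>     #include <time.h>
> 
>     #define MOM_REPORT_GRACE 180   /* seconds, per the fix above */
> 
>     struct job_info
>     {
>         time_t last_mom_report;   /* updated on every status report from
>                                      the job's mother superior */
>     };
> 
>     /* called when pbs_server is about to abort a job it no longer knows */
>     static int should_abort_stray_job(const struct job_info *job)
>     {
>         /* a recent report from the mother superior means the "unknown
>            job" verdict is probably a race; in that case, keep the job */
>         return ((time(NULL) - job->last_mom_report) >= MOM_REPORT_GRACE);
>     }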

Is there any ETA for Torque v4.1.7?

Can you point me to a commit ID for the fix, so we can check whether it also solves the issue on our end?
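
(Once we have the commit ID, I assume something like "git branch --contains <commit-id>" in a clone of the torque repository would tell us which release branches already include the fix.)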


regards,

Kenneth

> 
> David
> 
> 
> On Thu, Aug 1, 2013 at 6:33 AM, Kenneth Hoste <kenneth.hoste at ugent.be> wrote:
> Was this problem ever resolved?
> 
> I noticed through http://www.clusterresources.com/pipermail/torqueusers/2012-December/015352.html that David looked into this, but the archive doesn't show any further follow-up.
> 
> It seems we're currently suffering from a very similar problem with Torque v4.1.6...
> 
> 
> regards,
> 
> Kenneth
> 
> PS: Please keep me in CC when replying; for some reason I'm no longer receiving mail from torqueusers@ even though I'm subscribed...
> 
> 
> On 22 Nov 2012, at 16:46, Lech Nieroda wrote:
> 
> > Dear list,
> >
> > we have another serious problem since our upgrade to Torque 4.1.3,
> > which we are using with Maui 3.3.1. The problem in a nutshell: a few
> > random jobs suddenly become "unknown" to the server, which changes
> > their status to EXITING-SUBSTATE55 and requests a silent kill on the
> > compute nodes. The job then dies and its processes are killed on the
> > node; there is no "Exit_status" in the server log, no entry in
> > maui/stats, and no stdout/stderr files. The users are,
> > understandably, not amused.
> >
> > It doesn't seem to be user or application specific. Even a single
> > instance from a job array can get blown away in this way while all other
> > instances end normally.
> >
> > Here are some logs of such a job (681684[35]):
> >
> > maui just assumes a successful completion:
> > [snip]
> > 11/21 19:24:49 MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0)
> > 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is  1.000000,
> > 1.000000, 1
> > 11/21 19:25:55 INFO:     active PBS job 681684[35] has been removed from
> > the queue.  assuming successful completion
> > 11/21 19:25:55 MJobProcessCompleted(681684[35])
> > 11/21 19:25:55 INFO:     job '681684[35]' completed  X: 0.063356  T:
> > 10903  PS: 10903  A: 0.063096
> > 11/21 19:25:55 MJobSendFB(681684[35])
> > 11/21 19:25:55 INFO:     job usage sent for job '681684[35]'
> > 11/21 19:25:55 MJobRemove(681684[35])
> > 11/21 19:25:55 MJobDestroy(681684[35])
> > [snap]
> >
> > pbs_server decides at 19:25:11, after 3 hours of runtime, that the
> > job is unknown (grepped by job ID from the server logs):
> > [snip]
> > 11/21/2012
> > 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate:
> > setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to
> > RUNNING-RUNNING (4-42)
> > 11/21/2012
> > 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate:
> > setting job 681684[35].cheops10 state from RUNNING-RUNNING to
> > QUEUED-SUBSTATE55 (1-55)
> > 11/21/2012
> > 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate:
> > setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to
> > EXITING-SUBSTATE55 (5-55)
> > 11/21/2012
> > 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing from
> > smp, state EXITING
> > 11/21/2012
> > 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray
> > job 681684[35].cheops10 found on cheops21316
> > [snap]
> >
> > pbs_mom just kills the processes:
> > [snip]
> > 11/21/2012 16:23:43;0001;   pbs_mom.32254;Job;TMomFinalizeJob3;job
> > 681684[35].cheops10 started, pid = 17452
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17452 task
> > 1 gracefully with sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> > (pid=17452/state=R) after sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> > (pid=17452/state=Z) after sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17692 task
> > 1 gracefully with sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> > (pid=17692/state=R) after sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17703 task
> > 1 gracefully with sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> > (pid=17703/state=R) after sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17731 task
> > 1 gracefully with sig 15
> > 11/21/2012 19:25:14;0008;
> > pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
> > (pid=17731/state=R) after sig 15
> > 11/21/2012 19:25:15;0080;
> > pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job
> > 681684[35].cheops10 task 1 terminated, sid=17452
> > 11/21/2012 19:25:15;0008;   pbs_mom.32254;Job;681684[35].cheops10;job
> > was terminated
> > 11/21/2012 19:25:50;0001;
> > pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on server,
> > deleting locally
> > 11/21/2012 19:25:50;0080;
> > pbs_mom.32254;Job;681684[35].cheops10;removed job script
> > [snap]
> >
> > Sometimes, before the killing starts, the pbs_mom logs include this
> > reject message (code 15001 is PBSE_UNKJOBID, i.e. the server no
> > longer recognizes the job when the MOM reports on it):
> > [snip]
> > Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0,
> > type=StatusJob, from PBS_Server at cheops10
> > [snap]
> >
> > And finally, some job information passed to the epilogue:
> > [snip]
> > Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared:
> > 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue
> > Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared:
> > 681684[35].cheops10,hthiele0,cheops21316,Job Information:
> > userid=hthiele0,
> > resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00',
> > resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34',
> > queue=smp, account=ccg-ngs, exitcode=271
> > [snap]
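> >
> > Incidentally, the exitcode=271 above is consistent with the SIGTERM
> > kill in the mom log: Torque exit codes above 256 generally mean the
> > job was killed by a signal, with signal = exitcode - 256, so
> > 271 = 256 + 15 = SIGTERM. A tiny illustration (my own snippet, not
> > Torque code):
> >
> >     #include <stdio.h>
> >
> >     /* Decode a Torque job exit code: values above 256 conventionally
> >        mean "killed by signal (code - 256)"; 271 -> 15 (SIGTERM). */
> >     int main(void)
> >     {
> >         int exitcode = 271;   /* from the epilogue log above */
> >
> >         if (exitcode > 256)
> >             printf("killed by signal %d\n", exitcode - 256);
> >         else
> >             printf("exited with status %d\n", exitcode);
> >         return 0;
> >     }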
> >
> > This happens rarely (roughly 1 in 3000 jobs). However, silent
> > deletion of random jobs isn't exactly a trifling matter.
> > I could try to disable the mom_job_sync option, which might prevent
> > the killing of unknown jobs' processes, but it would also leave
> > corrupt or pre-execution jobs alive (see the command below).
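> >
> > (Assuming the usual qmgr syntax, disabling it would be:
> >
> >     qmgr -c "set server mom_job_sync = False"
> >
> > on the server host.)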
> >
> > Can this be fixed?
> >
> > On a side note, here are some further minor bugs I've noticed in
> > Torque 4.1.3:
> > - the epilogue script is usually invoked twice, and sometimes even
> > more often
> > - when explicit node lists are used, e.g. nodes=node1:ppn=2+node2:ppn=2,
> > the number of "tasks" as seen by qstat is zero
> > - there have been some API changes between Torque 2.x and Torque 4.x,
> > so two Maui calls (get_svrport, openrm) had to be altered in order to
> > build against Torque 4.x
> >
> >
> > Regards,
> > Lech Nieroda
> >
> > --
> > Dipl.-Wirt.-Inf. Lech Nieroda
> > Regionales Rechenzentrum der Universität zu Köln (RRZK)
> > Universität zu Köln
> > Weyertal 121
> > Raum 309 (3. Etage)
> > D-50931 Köln
> > Deutschland
> >
> > Tel.: +49 (221) 470-89606
> > E-Mail: nieroda.lech at uni-koeln.de
> 
> -- 
> David Beer | Senior Software Engineer
> Adaptive Computing


