[torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3

David Beer dbeer at adaptivecomputing.com
Mon Sep 23 09:48:44 MDT 2013


Kenneth,

4.1.7 is set for release on Wednesday, but 4.2.5 is out now and has both of
these fixes.

David


On Mon, Sep 23, 2013 at 9:34 AM, Kenneth Hoste <kenneth.hoste at ugent.be> wrote:

> Hello,
>
> We just got bitten by this problem again, i.e. the MOM killing one of the
> job's processes without any apparent reason (nowhere near the walltime or
> memory limits, and neither the MOM nor the server was restarted at that
> point, ...).
>
> I can't find any trace of a possible reason in the pbs_server logs, and
> the MOM is just reporting this:
>
> > 09/23/2013 15:50:48;0002;   pbs_mom.3350;Svr;pbs_mom;Torque Mom Version = 4.1.6, loglevel = 0
> > 09/23/2013 15:55:48;0002;   pbs_mom.3350;Svr;pbs_mom;Torque Mom Version = 4.1.6, loglevel = 0
> > 09/23/2013 15:58:37;0008;   pbs_mom.3350;Job;112871.master9.x.y.z;kill_task: killing pid 49500 task 1 gracefully with sig 15
> > 09/23/2013 15:58:37;0008;   pbs_mom.3350;Job;112871.master9.x.y.z;kill_task: process (pid=49500/state=R) after sig 15
> > 09/23/2013 15:58:37;0080;   pbs_mom.3350;Job;112871.master9.x.y.z;scan_for_terminated: job 112871.master9.x.y.z task 1 terminated, sid=49303
> > 09/23/2013 15:58:37;0008;   pbs_mom.3350;Job;112871.master9.x.y.z;job was terminated
>
>
> In a previous mail (Aug 2nd 2013), I asked for an estimated release date
> for Torque 4.1.7, which should include a bug fix for this, but didn't get
> a reply.
>
> It also strikes me as **very** surprising that there hasn't been a new
> release of either Torque 4.1.x or Torque 2.5.x that includes the fix for
> the (quite serious) security issue for which an advisory was sent out on
> Sept 6th 2013.
>
> So, is there an ETA for Torque 4.1.7 that would include a fix for this
> (and also for the security issue)?
>
> If not, can anyone please point out the commit ID for the additional fix
> that was added between 4.1.6 and the (future) 4.1.7?
>
> Please keep me/us in CC when replying; I'm still having issues receiving
> mail from this list, even though I'm subscribed to it (can someone check
> whether I've been blacklisted or some such?).
>
>
> regards,
>
> Kenneth
>
>
> On 01 Aug 2013, at 18:03, David Beer wrote:
>
> > There have been two fixes for this issue:
> >
> > 1. Add more logging and checking to verify that the mother superior is
> > rejecting the specified job. This fix went into 4.1.6/4.2.3 and resolved
> > the problem for most of the users who reported it.
> > 2. Have pbs_server remember when the mother superior last reported on the
> > job, and not abort the job for this reason if the mother superior has
> > reported it to pbs_server within the last 180 seconds (sketched below).
> > This fix has been released with 4.2.4 and will be released with 4.1.7. Of
> > the users I know of who were still experiencing this defect after 4.1.6,
> > none are experiencing it with this change in place.
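> >
> > In pseudo-code, the new check amounts to something like the following
> > (a simplified sketch, not the actual pbs_server source; the struct and
> > the names are illustrative only):
> >
> >   #include <time.h>
> >
> >   #define MOM_REPORT_GRACE 180  /* seconds, per the description above */
> >
> >   struct job_track
> >     {
> >     time_t last_mom_report;  /* updated each time the mother superior
> >                                 reports this job to pbs_server */
> >     };
> >
> >   /* return 1 if the job may be treated as stray and killed on the mom,
> >    * 0 if the mother superior has reported it recently enough */
> >   int can_abort_as_stray(struct job_track *jt)
> >     {
> >     if (time(NULL) - jt->last_mom_report <= MOM_REPORT_GRACE)
> >       return(0);  /* reported within the last 180 seconds -- keep it */
> >
> >     return(1);
> >     }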
> >
> > David
> >
> >
> > On Thu, Aug 1, 2013 at 6:33 AM, Kenneth Hoste <kenneth.hoste at ugent.be> wrote:
> > Was this problem ever resolved?
> >
> > I noticed through
> > http://www.clusterresources.com/pipermail/torqueusers/2012-December/015352.html
> > that David looked into this, but the archive doesn't show any further
> > followup.
> >
> > It seems we're currently suffering from a very similar problem with
> > Torque v4.1.6...
> >
> >
> > regards,
> >
> > Kenneth
> >
> > PS: Please keep me in CC when replying; for some reason I'm no longer
> > receiving mails from torqueusers@ even though I'm subscribed...
> >
> >
> > On 22 Nov 2012, at 16:46, Lech Nieroda wrote:
> >
> > > Dear list,
> > >
> > > we have another serious problem since our upgrade to Torque 4.1.3. We
> > > are using it with Maui 3.3.1. The problem in a nutshell: a few random
> > > jobs are suddenly "unknown" to the server, which changes their status
> > > to EXITING-SUBSTATE55 and requests a silent kill on the compute nodes.
> > > The job then dies, the processes are killed on the node, and there is
> > > no "Exit_status" in the server log, no entry in maui/stats, and no
> > > stdout/stderr files. The users are, understandably, not amused.
> > >
> > > It doesn't seem to be user- or application-specific. Even a single
> > > instance of a job array can get blown away in this way while all other
> > > instances end normally.
> > >
> > > Here are some logs of such a job (681684[35]):
> > >
> > > maui just assumes a successful completion:
> > > [snip]
> > > 11/21 19:24:49 MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0)
> > > 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is  1.000000, 1.000000, 1
> > > 11/21 19:25:55 INFO:     active PBS job 681684[35] has been removed from the queue.  assuming successful completion
> > > 11/21 19:25:55 MJobProcessCompleted(681684[35])
> > > 11/21 19:25:55 INFO:     job '681684[35]' completed  X: 0.063356  T: 10903  PS: 10903  A: 0.063096
> > > 11/21 19:25:55 MJobSendFB(681684[35])
> > > 11/21 19:25:55 INFO:     job usage sent for job '681684[35]'
> > > 11/21 19:25:55 MJobRemove(681684[35])
> > > 11/21 19:25:55 MJobDestroy(681684[35])
> > > [snap]
> > >
> > > pbs_server decides at 19:25:11, after 3 hours of runtime, that the job
> > > is unknown (grepped by JobID from the server logs):
> > > [snip]
> > > 11/21/2012 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate: setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to RUNNING-RUNNING (4-42)
> > > 11/21/2012 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: setting job 681684[35].cheops10 state from RUNNING-RUNNING to QUEUED-SUBSTATE55 (1-55)
> > > 11/21/2012 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to EXITING-SUBSTATE55 (5-55)
> > > 11/21/2012 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing from smp, state EXITING
> > > 11/21/2012 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray job 681684[35].cheops10 found on cheops21316
> > > [snap]
> > >
> > > pbs_mom just kills the processes:
> > > [snip]
> > > 11/21/2012 16:23:43;0001;   pbs_mom.32254;Job;TMomFinalizeJob3;job 681684[35].cheops10 started, pid = 17452
> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17452 task 1 gracefully with sig 15
> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17452/state=R) after sig 15
> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17452/state=Z) after sig 15
> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17692 task 1 gracefully with sig 15
> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17692/state=R) after sig 15
> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17703 task 1 gracefully with sig 15
> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17703/state=R) after sig 15
> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17731 task 1 gracefully with sig 15
> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17731/state=R) after sig 15
> > > 11/21/2012 19:25:15;0080;   pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job 681684[35].cheops10 task 1 terminated, sid=17452
> > > 11/21/2012 19:25:15;0008;   pbs_mom.32254;Job;681684[35].cheops10;job was terminated
> > > 11/21/2012 19:25:50;0001;   pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on server, deleting locally
> > > 11/21/2012 19:25:50;0080;   pbs_mom.32254;Job;681684[35].cheops10;removed job script
> > > [snap]
> > >
> > > Sometimes, the pbs_mom logs include this message before the killing
> > > starts:
> > > [snip]
> > > Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=StatusJob, from PBS_Server at cheops10
> > > [snap]
> > >
> > > And finally, some job information given to the epilogue:
> > > [snip]
> > > Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue
> > > Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: 681684[35].cheops10,hthiele0,cheops21316,Job Information: userid=hthiele0, resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00', resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34', queue=smp, account=ccg-ngs, exitcode=271
> > > [snap]
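> > >
> > > (As an aside: in Torque, an exit code above 256 usually means the job
> > > was killed by a signal, i.e. exitcode = 256 + signal number. Here
> > > 271 - 256 = 15, SIGTERM, which matches the kill_task entries above.)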
> > >
> > > This happens rarely (about 1 in 3000). However, silent deletions of
> > > random jobs aren't exactly a trifling matter.
> > > I could try to disable the mom_job_sync option, which might prevent
> > > the killing of processes belonging to jobs the server considers
> > > unknown, but it would also leave corrupt/pre-execution jobs alive.
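> > >
> > > If I'm not mistaken, mom_job_sync is a server attribute, so disabling
> > > it would be something along the lines of the following (untested):
> > >
> > >   qmgr -c "set server mom_job_sync = False"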
> > >
> > > Can this be fixed?
> > >
> > > On a side note, here are some further minor bugs I've noticed in
> > > Torque 4.1.3:
> > > - the epilogue script is usually invoked twice and sometimes even
> > > several times
> > > - when explicit node lists are used, e.g. nodes=node1:ppn=2+node2:ppn=2,
> > > the number of "tasks" as seen by qstat is zero
> > > - there have been some API changes between Torque 2.x and Torque 4.x,
> > > so two Maui calls had to be altered in order to build against Torque 4.x
> > > (get_svrport, openrm)
> > >
> > >
> > > Regards,
> > > Lech Nieroda
> > >
> > > --
> > > Dipl.-Wirt.-Inf. Lech Nieroda
> > > Regionales Rechenzentrum der Universität zu Köln (RRZK)
> > > Universität zu Köln
> > > Weyertal 121
> > > Room 309 (3rd floor)
> > > D-50931 Köln
> > > Germany
> > >
> > > Tel.: +49 (221) 470-89606
> > > E-Mail: nieroda.lech at uni-koeln.de
> > > _______________________________________________
> > > torqueusers mailing list
> > > torqueusers at supercluster.org
> > > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
> >
> > --
> > David Beer | Senior Software Engineer
> > Adaptive Computing
>
>


-- 
David Beer | Senior Software Engineer
Adaptive Computing