[torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3

David Beer dbeer at adaptivecomputing.com
Mon Sep 23 09:49:29 MDT 2013


I do apologize about not responding to the previous email, it appears to
have just gotten lost in the shuffle. Thanks for pinging me again.


On Mon, Sep 23, 2013 at 9:48 AM, David Beer <dbeer at adaptivecomputing.com> wrote:

> Kenneth,
>
> 4.1.7 is set for release on Wednesday, but 4.2.5 is out now and has both
> of these fixes.
>
> David
>
>
> On Mon, Sep 23, 2013 at 9:34 AM, Kenneth Hoste <kenneth.hoste at ugent.be> wrote:
>
>> Hello,
>>
>> We just got bit by this problem again, i.e. the MOM killing one of the
>> processes of the job without any apparent reason (nowhere near walltime or
>> memory limits, and neither the MOM nor the server was restarted at that point, ...).
>>
>> I can't find any trace of a possible reason in the pbs_server logs, and
>> the MOM is just reporting this:
>>
>> > 09/23/2013 15:50:48;0002;   pbs_mom.3350;Svr;pbs_mom;Torque Mom Version = 4.1.6, loglevel = 0
>> > 09/23/2013 15:55:48;0002;   pbs_mom.3350;Svr;pbs_mom;Torque Mom Version = 4.1.6, loglevel = 0
>> > 09/23/2013 15:58:37;0008;   pbs_mom.3350;Job;112871.master9.x.y.z;kill_task: killing pid 49500 task 1 gracefully with sig 15
>> > 09/23/2013 15:58:37;0008;   pbs_mom.3350;Job;112871.master9.x.y.z;kill_task: process (pid=49500/state=R) after sig 15
>> > 09/23/2013 15:58:37;0080;   pbs_mom.3350;Job;112871.master9.x.y.z;scan_for_terminated: job 112871.master9.x.y.z task 1 terminated, sid=49303
>> > 09/23/2013 15:58:37;0008;   pbs_mom.3350;Job;112871.master9.x.y.z;job was terminated
>>
>>
>> In a previous mail (Aug 2nd 2013), I asked for an estimated release date
>> for Torque 4.1.7, which includes a bug fix for this, but didn't get a reply.
>>
>> It also strikes me as **very** surprising that there hasn't been a new
>> release of either Torque 4.1.x or Torque 2.5.x
>> that includes the fix for the (quite serious) security issue for which an
>> advisory was sent out on Sept 6th 2013.
>>
>> So, is there an ETA for Torque 4.1.7, that would include a fix for this
>> (and also for the security issue)?
>>
>> If not, can anyone please point out the commit ID for the additional fix
>> that was added between 4.1.6 and the (future) 4.1.7?
>>
>> Please keep me/us in CC when replying; I'm still having issues with
>> receiving mail from this list,
>> even though I'm subscribed to it (can someone check whether I've been
>> blacklisted or some such?).
>>
>>
>> regards,
>>
>> Kenneth
>>
>>
>> On 01 Aug 2013, at 18:03, David Beer wrote:
>>
>> > There have been two fixes for this issue:
>> >
>> > 1. Add more logging and checking to verify that the mother superior is
>> really rejecting the specified job. This fix went into 4.1.6/4.2.3 and
>> resolved the problem for most users who reported it.
>> > 2. Have pbs_server remember when the mother superior has reported on the
>> job, and not abort the job for this reason if the mother superior has
>> reported it to pbs_server within the last 180 seconds. This fix has been
>> released with 4.2.4 and will be released with 4.1.7. Of the users I know of
>> who were still experiencing this defect after 4.1.6, none are experiencing
>> it with this change in place.
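[Editor's note: fix #2 above boils down to a freshness check on the mother superior's last report before treating a job as stray. A minimal shell sketch of that guard, for illustration only — the 180-second threshold comes from the description above, and all variable names are invented, not Torque's actual code:]

```shell
# Illustrative sketch, not Torque source: only treat a job as stray if
# mother superior has NOT reported on it within the last 180 seconds.
now=$(date +%s)
last_report=$((now - 60))   # assume mother superior reported 60s ago
threshold=180

if [ $((now - last_report)) -lt "$threshold" ]; then
  decision="job recently reported; do not treat as stray"
else
  decision="no recent report; kill_job_on_mom"
fi
echo "$decision"
```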
>> >
>> > David
>> >
>> >
>> > On Thu, Aug 1, 2013 at 6:33 AM, Kenneth Hoste <kenneth.hoste at ugent.be>
>> wrote:
>> > Was this problem ever resolved?
>> >
>> > I noticed through
>> http://www.clusterresources.com/pipermail/torqueusers/2012-December/015352.html
>> that David looked into this, but the archive doesn't show any further
>> follow-up.
>> >
>> > It seems we're currently suffering from a very similar problem with
>> Torque v4.1.6...
>> >
>> >
>> > regards,
>> >
>> > Kenneth
>> >
>> > PS: Please keep me in CC when replying, for some reason I'm no longer
>> receiving mails from torqueusers@ even though I'm subscribed...
>> >
>> >
>> > On 22 Nov 2012, at 16:46, Lech Nieroda wrote:
>> >
>> > > Dear list,
>> > >
>> > > we have another serious problem since our upgrade to Torque 4.1.3. We
>> > > are using it with Maui 3.3.1. The problem in a nutshell: a few random
>> > > jobs suddenly become "unknown" to the server, which changes their
>> > > status to EXITING-SUBSTATE55 and requests a silent kill on the compute
>> > > nodes. The job then dies, its processes are killed on the node, and
>> > > there is no "Exit_status" in the server log, no entry in maui/stats,
>> > > and no stdout/stderr files. The users are, understandably, not amused.
>> > >
>> > > It doesn't seem to be user or application specific. Even a single
>> > > instance of a job array can get blown away in this way while all other
>> > > instances end normally.
>> > >
>> > > Here are some logs of such a job (681684[35]):
>> > >
>> > > maui just assumes a successful completion:
>> > > [snip]
>> > > 11/21 19:24:49 MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0)
>> > > 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is 1.000000, 1.000000, 1
>> > > 11/21 19:25:55 INFO:     active PBS job 681684[35] has been removed from the queue.  assuming successful completion
>> > > 11/21 19:25:55 MJobProcessCompleted(681684[35])
>> > > 11/21 19:25:55 INFO:     job '681684[35]' completed  X: 0.063356  T: 10903  PS: 10903  A: 0.063096
>> > > 11/21 19:25:55 MJobSendFB(681684[35])
>> > > 11/21 19:25:55 INFO:     job usage sent for job '681684[35]'
>> > > 11/21 19:25:55 MJobRemove(681684[35])
>> > > 11/21 19:25:55 MJobDestroy(681684[35])
>> > > [snap]
>> > >
>> > > pbs_server decides at 19:25:11, after 3 hours of runtime, that the job
>> > > is unknown (grepped by job ID from the server logs):
>> > > [snip]
>> > > 11/21/2012 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate: setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to RUNNING-RUNNING (4-42)
>> > > 11/21/2012 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: setting job 681684[35].cheops10 state from RUNNING-RUNNING to QUEUED-SUBSTATE55 (1-55)
>> > > 11/21/2012 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to EXITING-SUBSTATE55 (5-55)
>> > > 11/21/2012 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing from smp, state EXITING
>> > > 11/21/2012 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray job 681684[35].cheops10 found on cheops21316
>> > > [snap]
>> > >
>> > > pbs_mom just kills the processes:
>> > > [snip]
>> > > 11/21/2012 16:23:43;0001;   pbs_mom.32254;Job;TMomFinalizeJob3;job 681684[35].cheops10 started, pid = 17452
>> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17452 task 1 gracefully with sig 15
>> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17452/state=R) after sig 15
>> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17452/state=Z) after sig 15
>> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17692 task 1 gracefully with sig 15
>> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17692/state=R) after sig 15
>> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17703 task 1 gracefully with sig 15
>> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17703/state=R) after sig 15
>> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17731 task 1 gracefully with sig 15
>> > > 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17731/state=R) after sig 15
>> > > 11/21/2012 19:25:15;0080;   pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job 681684[35].cheops10 task 1 terminated, sid=17452
>> > > 11/21/2012 19:25:15;0008;   pbs_mom.32254;Job;681684[35].cheops10;job was terminated
>> > > 11/21/2012 19:25:50;0001;   pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on server, deleting locally
>> > > 11/21/2012 19:25:50;0080;   pbs_mom.32254;Job;681684[35].cheops10;removed job script
>> > > [snap]
>> > >
>> > > Sometimes, the pbs_mom logs include this message before the killing
>> starts:
>> > > [snip]
>> > > Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0,
>> > > type=StatusJob, from PBS_Server at cheops10
>> > > [snap]
>> > >
>> > > And finally, some job information passed to the epilogue:
>> > > [snip]
>> > > Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue
>> > > Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: 681684[35].cheops10,hthiele0,cheops21316,Job Information: userid=hthiele0, resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00', resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34', queue=smp, account=ccg-ngs, exitcode=271
>> > > [snap]
>> > >
>> > > This happens rarely (about 1 job in 3000). However, silent deletions
>> > > of random jobs aren't exactly a trifling matter.
>> > > I could try to disable the mom_job_sync option, which might prevent
>> > > the killing of processes of unknown jobs, but it would also leave
>> > > corrupt/pre-execution jobs alive.
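[Editor's note: mom_job_sync is a documented pbs_server attribute, so disabling it as suggested above would presumably be done via qmgr; a hedged sketch, untested against 4.1.x:]

```shell
# Hypothetical qmgr invocation to disable mom_job_sync. With it off, the
# mom should no longer purge jobs the server reports as unknown, at the
# cost of leaving corrupt/pre-execution jobs alive, as noted above.
qmgr -c 'set server mom_job_sync = False'
```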
>> > >
>> > > Can this be fixed?
>> > >
>> > > On a side note, here are some further minor bugs I've noticed in
>> > > Torque 4.1.3:
>> > > - the epilogue script is usually invoked twice, and sometimes even
>> > > more often
>> > > - when explicit node lists are used, e.g. nodes=node1:ppn=2+node2:ppn=2,
>> > > the number of "tasks" as seen by qstat is zero
>> > > - there have been some API changes between Torque 2.x and Torque 4.x,
>> > > so two Maui calls (get_svrport, openrm) had to be altered in order to
>> > > build against Torque 4.x
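[Editor's note: for context, the explicit-node-list case in the second bug above corresponds to a submission like the following; node and script names are invented for illustration:]

```shell
# Request two cores each on two named nodes. Per the report, qstat -a
# then shows a task count of zero for such jobs under Torque 4.1.3.
qsub -l nodes=node1:ppn=2+node2:ppn=2 myjob.sh
```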
>> > >
>> > >
>> > > Regards,
>> > > Lech Nieroda
>> > >
>> > > --
>> > > Dipl.-Wirt.-Inf. Lech Nieroda
>> > > Regionales Rechenzentrum der Universität zu Köln (RRZK)
>> > > Universität zu Köln
>> > > Weyertal 121
>> > > Raum 309 (3. Etage)
>> > > D-50931 Köln
>> > > Deutschland
>> > >
>> > > Tel.: +49 (221) 470-89606
>> > > E-Mail: nieroda.lech at uni-koeln.de
>> > > _______________________________________________
>> > > torqueusers mailing list
>> > > torqueusers at supercluster.org
>> > > http://www.supercluster.org/mailman/listinfo/torqueusers
>> >
>> >
>> >
>> >
>> > --
>> > David Beer | Senior Software Engineer
>> > Adaptive Computing
>>
>>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>



-- 
David Beer | Senior Software Engineer
Adaptive Computing