[torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3

Kenneth Hoste kenneth.hoste at ugent.be
Thu Aug 1 06:33:25 MDT 2013


Was this problem ever resolved?

I noticed through http://www.clusterresources.com/pipermail/torqueusers/2012-December/015352.html that David looked into this, but the archive doesn't show any further followup.

It seems we're currently suffering from a very similar problem with the Torque v4.1.6...


regards,

Kenneth

PS: Please keep me in CC when replying, for some reason I'm no longer receiving mails from torqueusers@ even though I'm subscribed...


On 22 Nov 2012, at 16:46, Lech Nieroda wrote:

> Dear list,
> 
> we have another serious problem since our upgrade to Torque 4.1.3. We 
> are using it with Maui 3.3.1. The problem in a nutshell: some few, 
> random jobs are suddenly "unknown" to the server, it changes their 
> status to EXITING-SUBSTATE55 and requests a silent kill on the compute 
> nodes. The job then dies, the processes are killed on the node, there is 
> no "Exit_status" in the server-log, no entry in maui/stats, no 
> stdout/stderr files. The users are, understandably, not amused.
> 
> It doesn't seem to be user or application specific. Even a single 
> instance from a job array can get blown away in this way while all other 
> instances end normally.
> 
> Here are some logs of such a job (681684[35]):
> 
> maui just assumes a successful completion:
> [snip]
> 11/21 19:24:49 MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0)
> 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is  1.000000, 
> 1.000000, 1
> 11/21 19:25:55 INFO:     active PBS job 681684[35] has been removed from 
> the queue.  assuming successful completion
> 11/21 19:25:55 MJobProcessCompleted(681684[35])
> 11/21 19:25:55 INFO:     job '681684[35]' completed  X: 0.063356  T: 
> 10903  PS: 10903  A: 0.063096
> 11/21 19:25:55 MJobSendFB(681684[35])
> 11/21 19:25:55 INFO:     job usage sent for job '681684[35]'
> 11/21 19:25:55 MJobRemove(681684[35])
> 11/21 19:25:55 MJobDestroy(681684[35])
> [snap]
> 
> pbs_server decides at 19:25:11, after 3 hours runtime, that the job is 
> unknown (grepped by JobID from the server logs):
> [snip]
> 11/21/2012 
> 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate: 
> setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to 
> RUNNING-RUNNING (4-42)
> 11/21/2012 
> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: 
> setting job 681684[35].cheops10 state from RUNNING-RUNNING to 
> QUEUED-SUBSTATE55 (1-55)
> 11/21/2012 
> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: 
> setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to 
> EXITING-SUBSTATE55 (5-55)
> 11/21/2012 
> 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing from 
> smp, state EXITING
> 11/21/2012 
> 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray 
> job 681684[35].cheops10 found on cheops21316
> [snap]
> 
> pbs_client just kills the processes:
> [snip]
> 11/21/2012 16:23:43;0001;   pbs_mom.32254;Job;TMomFinalizeJob3;job 
> 681684[35].cheops10 started, pid = 17452
> 11/21/2012 19:25:14;0008; 
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17452 task 
> 1 gracefully with sig 15
> 11/21/2012 19:25:14;0008; 
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process 
> (pid=17452/state=R) after sig 15
> 11/21/2012 19:25:14;0008; 
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process 
> (pid=17452/state=Z) after sig 15
> 11/21/2012 19:25:14;0008; 
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17692 task 
> 1 gracefully with sig 15
> 11/21/2012 19:25:14;0008; 
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process 
> (pid=17692/state=R) after sig 15
> 11/21/2012 19:25:14;0008; 
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17703 task 
> 1 gracefully with sig 15
> 11/21/2012 19:25:14;0008; 
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process 
> (pid=17703/state=R) after sig 15
> 11/21/2012 19:25:14;0008; 
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17731 task 
> 1 gracefully with sig 15
> 11/21/2012 19:25:14;0008; 
> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process 
> (pid=17731/state=R) after sig 15
> 11/21/2012 19:25:15;0080; 
> pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job 
> 681684[35].cheops10 task 1 terminated, sid=17452
> 11/21/2012 19:25:15;0008;   pbs_mom.32254;Job;681684[35].cheops10;job 
> was terminated
> 11/21/2012 19:25:50;0001; 
> pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on server, 
> deleting locally
> 11/21/2012 19:25:50;0080; 
> pbs_mom.32254;Job;681684[35].cheops10;removed job script
> [snap]
> 
> Sometimes, the pbs_mom logs include this message before the killing starts:
> [snip]
> Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, 
> type=StatusJob, from PBS_Server at cheops10
> [snap]
> 
> And finally, some job informations given to epilogue:
> [snip]
> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: 
> 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue
> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared: 
> 681684[35].cheops10,hthiele0,cheops21316,Job Information: 
> userid=hthiele0, 
> resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00', 
> resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34', 
> queue=smp, account=ccg-ngs, exitcode=271
> [snap]
> 
> This happens rarely (about 1 in 3000). However, silent deletions of 
> random jobs aren't exactly a trifling matter.
> I could try to disable the mom_job_sync option, which could perhaps 
> prevent the process killing of unknown jobs, but it would also leave 
> corrupt/pre-execution jobs alive.
> 
> Can this be fixed?
> 
> On a side-note, here are some further, minor Bugs I've noticed in the 
> Torque 4.1.3. Version:
> - the epilogue script is usually invoked twice and sometimes even 
> several times
> - when explicit node lists are used, e.g. nodes=node1:ppn=2+node2:ppn=2, 
> then the number of "tasks" as seen by qstat is zero
> - there have been some API changes between Torque 2.x and Torque 4.x, so 
> that two maui calls had to be altered in order to build against Torque 
> 4.x (get_svrport, openrm).
> 
> 
> Regards,
> Lech Nieroda
> 
> -- 
> Dipl.-Wirt.-Inf. Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)
> Universität zu Köln
> Weyertal 121
> Raum 309 (3. Etage)
> D-50931 Köln
> Deutschland
> 
> Tel.: +49 (221) 470-89606
> E-Mail: nieroda.lech at uni-koeln.de
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers



More information about the torqueusers mailing list