[torqueusers] Jobs exceeding walltime not removed
Stephen Childs
childss at cs.tcd.ie
Thu Jan 15 02:37:15 MST 2009
Hi,
Jobs that are killed for exceeding their allocated walltime seem to
stay in the Torque queue forever. For example, Maui reports that job
144673 was 'successfully' cancelled:
01/15 09:20:34 MPBSJobCancel(144673,base,CMsg,Msg,MOAB_INFO: job exceeded wallclock limit
01/15 09:20:34 INFO: job '144673' successfully cancelled
But it is still in the queue:
[root@gridgate ~]# qstat 144673
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
144673.gridgate           STDIN            xxxxx           00:00:00 R coneday
In the torque logs I see this again and again:
01/15/2009 09:35:21;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;Job deleted at request of root@gridgate.cs.tcd.ie
01/15/2009 09:35:21;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;Job sent signal SIGTERM on delete
01/15/2009 09:35:23;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;Job sent signal SIGKILL on delete
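To put a number on "again and again", something like the following sketch could count the delete requests logged per job. It assumes each record sits on a single line and uses the semicolon-separated layout seen in the excerpt above (timestamp, event code, daemon, object type, object id, message). The function name count_delete_requests is hypothetical and is not part of Torque.

```python
from collections import Counter


def count_delete_requests(lines):
    """Count 'Job deleted at request of' records per job id.

    Assumption: one record per line, six semicolon-separated fields:
    timestamp;event code;daemon;object type;object id;message
    """
    counts = Counter()
    for line in lines:
        # Split at most five times so semicolons inside the message survive.
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) != 6:
            continue  # skip wrapped or malformed lines
        _timestamp, _code, _daemon, obj_type, obj_id, message = fields
        if obj_type == "Job" and message.startswith("Job deleted at request of"):
            counts[obj_id] += 1
    return counts


# Sample records copied from the log excerpt in this message.
sample = [
    "01/15/2009 09:35:21;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;"
    "Job deleted at request of root@gridgate.cs.tcd.ie",
    "01/15/2009 09:35:21;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;"
    "Job sent signal SIGTERM on delete",
    "01/15/2009 09:35:23;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;"
    "Job sent signal SIGKILL on delete",
]

for job_id, n in count_delete_requests(sample).items():
    print(job_id, n)
```

Fed the full server log instead of the three sample lines, the per-job count would show how many times the delete has been retried for the stuck job.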
There is no trace of any processes belonging to the job on the cluster node.
Any idea what's going on or how to debug further?
Stephen
--
Dr. Stephen Childs,
Research Fellow, EGEE Project, phone: +353-1-8961797
Computer Architecture Group, email: Stephen.Childs @ cs.tcd.ie
Trinity College Dublin, Ireland web: http://www.cs.tcd.ie/Stephen.Childs