[torqueusers] Jobs exceeding walltime not removed

Stephen Childs childss at cs.tcd.ie
Thu Jan 15 02:37:15 MST 2009


Jobs that are killed for exceeding their allocated walltime seem to 
stick around in the torque queue forever. For example, job 144673 is 
'successfully' cancelled by Maui:

01/15 09:20:34 MPBSJobCancel(144673,base,CMsg,Msg,MOAB_INFO:  job exceeded wallclock limit
01/15 09:20:34 INFO:     job '144673' successfully cancelled

But it is still in the queue:

[root at gridgate ~]# qstat  144673
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
144673.gridgate           STDIN            xxxxx           00:00:00 R coneday

In the torque logs I see this again and again:

01/15/2009 09:35:21;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;Job deleted at request of root at gridgate.cs.tcd.ie
01/15/2009 09:35:21;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;Job sent signal SIGTERM on delete
01/15/2009 09:35:23;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;Job sent signal SIGKILL on delete
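
To put a number on "again and again", the delete requests per job could be
counted with something like the sketch below. It assumes the server log
records are semicolon-separated exactly as in the excerpt above, one record
per line; the function name count_delete_requests and the sample list are
made up for illustration and are not part of torque.

```python
from collections import Counter


def count_delete_requests(lines):
    """Count 'Job deleted at request of' records per job id.

    Assumes each record has the form seen in the excerpt:
    timestamp;code;PBS_Server;Job;<job id>;<message>
    """
    counts = Counter()
    for line in lines:
        # Split into at most 6 fields so semicolons in the message survive.
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6 or fields[3] != "Job":
            continue  # not a job record in the assumed format
        job_id, message = fields[4], fields[5]
        if message.startswith("Job deleted at request of"):
            counts[job_id] += 1
    return counts


# The three records quoted above, unwrapped to one record per line.
sample = [
    "01/15/2009 09:35:21;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;"
    "Job deleted at request of root at gridgate.cs.tcd.ie",
    "01/15/2009 09:35:21;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;"
    "Job sent signal SIGTERM on delete",
    "01/15/2009 09:35:23;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;"
    "Job sent signal SIGKILL on delete",
]

for job_id, n in count_delete_requests(sample).items():
    print(job_id, n)
```

On a real system the same function could be given an open server log file
instead of the sample list; where that file lives depends on the
installation.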

There is no trace of any processes belonging to the job on the cluster node.

Any idea what's going on or how to debug further?

Dr. Stephen Childs,
Research Fellow, EGEE Project,    phone:                    +353-1-8961797
Computer Architecture Group,      email:        Stephen.Childs @ cs.tcd.ie
Trinity College Dublin, Ireland   web: http://www.cs.tcd.ie/Stephen.Childs