[torqueusers] Jobs exceeding walltime not removed

Glen Beane glen.beane at gmail.com
Thu Jan 15 05:23:56 MST 2009


On Thu, Jan 15, 2009 at 4:37 AM, Stephen Childs <childss at cs.tcd.ie> wrote:
> Hi,
>
> Jobs that are killed due to exceeding their allocated walltime seem to stick
> around in the torque queue for ever. For example, job 144673 is
> 'successfully' cancelled by Maui:
>
> 01/15 09:20:34 MPBSJobCancel(144673,base,CMsg,Msg,MOAB_INFO:  job exceeded
> wallclock limit
> 01/15 09:20:34 INFO:     job '144673' successfully cancelled
>
> But is still in the queue:
>
> [root at gridgate ~]# qstat  144673
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 144673.gridgate           STDIN            xxxxx          00:00:00 R coneday
>
> In the torque logs I see this again and again:
>
> 01/15/2009 09:35:21;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;Job
> deleted at request of root at gridgate.cs.tcd.ie
> 01/15/2009 09:35:21;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;Job sent
> signal SIGTERM on delete
> 01/15/2009 09:35:23;0008;PBS_Server;Job;144673.gridgate.cs.tcd.ie;Job sent
> signal SIGKILL on delete
>
> There is no trace of any processes belonging to the job on the cluster node.
>
> Any idea what's going on or how to debug further?

Before anyone can help you, I think we'll need to know which version of
TORQUE you are using.

