[torquedev] jobs owners purging own jobs?

Garrick Staples garrick at usc.edu
Fri Feb 24 12:10:05 MST 2006

On Fri, Feb 24, 2006 at 08:54:06AM -0500, Andrew J Caird alleged:
> On Fri, 24 Feb 2006, Garrick Staples wrote:
> >On Thu, Feb 23, 2006 at 11:28:16PM -0500, Caird, Andrew J alleged:
> >>Thanks Garrick,
> >>
> >>What makes it dangerous, and can it be made safer?  I'm willing to give 
> >>a try if it's possible.
> >
> >qdel -p simply tells pbs_server to purge everything it knows about the 
> >job, without talking to MS.  The dangerous part is that it doesn't talk 
> >to MS.  The job may possibly still be running, epilogue scripts may 
> >still be run, and possibly random processes are going to get killed. 
> >It is basically intentionally breaking the entire job state machine and 
> >is only a last resort for removing a job.
> >
> >This functionality should not be needed by users.  If you frequently 
> >find that jobs aren't killable, then maybe that is something we need to 
> >fix.
> >
> >>My goal is to keep functionality we had with PBSPro ('-W force').
> >
> >I don't know PBSPro, but I doubt a user-accessible -W force is 
> >equivalent to -p.
>   I, of course, haven't seen the PBSPro source, but from the man page:
>    "The -W force option, where force is the literal character string
>     force, directs  that the  job  is  to be deleted even if the node
>     on which the job is executing is unreachable"
>   This is what we need.
>   When a node crashes, we often see PBS continue to consider the job 
> active.  Because of the per-user and per-group job limits we have, our 
> users feel cheated.
>   If there is a way for PBS to remove a job when the node is marked 
> 'down', that would also be fine.  My experience with Torque is that when a 
> node is down, 'qdel <jobid>' doesn't work.

A crashed node isn't a good enough reason for qdel -p.  When the node is
rebooted, it will still have the job and the results may be undesirable.
Not to mention, sister nodes will still have the job running without
ever being cleaned up.

To do this properly, pbs_server would need to contact the sister MOMs to
cleanly delete the job, and MS would need to check with pbs_server
before running epilogue.  We'd need to decide on the exact behaviour.
Clearly all MOMs should kill any running processes, but should
epilogue.parallel/epilogue.user still be run?
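
The sequence above could be sketched roughly as follows. This is a toy
Python model, not TORQUE code: the Server and Mom classes, their method
names, and the run_epilogue switch are all hypothetical and exist only
to show the two pieces (server tells every MOM to delete, and a
recovering MOM checks with the server before running epilogue). Whether
epilogue should run is left as a parameter because that is still
undecided.

```python
# Toy model only: every class, method, and parameter name here is
# hypothetical and does not correspond to the TORQUE source.

class Server:
    def __init__(self):
        self.jobs = {}  # jobid -> list of Mom objects (first one is MS)

    def knows(self, jobid):
        """True if the server still considers the job valid."""
        return jobid in self.jobs

    def force_delete(self, jobid):
        """Drop the job and ask every MOM (MS and sisters) to delete it.

        Returns the names of MOMs that could not be reached; those are
        expected to clean up on their own when they recover.
        """
        unreachable = []
        for mom in self.jobs.pop(jobid):
            try:
                mom.delete_job(jobid)
            except ConnectionError:
                unreachable.append(mom.name)
        return unreachable


class Mom:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.jobs = set()        # job ids this MOM believes it is running
        self.epilogues_run = []  # job ids for which an epilogue was run

    def delete_job(self, jobid):
        """Stand-in for killing the job's processes and forgetting it."""
        if not self.reachable:
            raise ConnectionError(self.name)
        self.jobs.discard(jobid)

    def recover(self, server, run_epilogue=False):
        """After a reboot, check each leftover job with the server first."""
        self.reachable = True
        for jobid in sorted(self.jobs):
            if server.knows(jobid):
                continue  # server still has the job; leave it alone
            if run_epilogue:
                self.epilogues_run.append(jobid)
            self.jobs.discard(jobid)
```

In this model, a force delete against a job whose MS is down removes
the job from the reachable sister right away, and the crashed MS drops
its stale copy when it calls recover() after reboot instead of blindly
running epilogue.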

If and when we get these bits into place, then I'd say we add a -W force

Garrick Staples, Linux/HPCC Administrator
University of Southern California