On Fri, Jul 01, 2005 at 08:41:23AM -0600, Dave Jackson alleged:
> - Adding an option to TORQUE to issue an 'asynchronous' cancel which
> will cancel the job from pbs_server and, when, if ever, the MOM does
> report back to the server, cancels the jobs on the compute nodes.
> However, from the time the asynchronous cancel is issued, pbs_server
> will no longer report the job.

I don't know if I like that last part.  How would the admin or user know if
there is a problem?  In normal cases, I think pbs_server should continue to
report the job (in an E state) when the cancel is issued.

How about if pbs_server immediately sets the state to E when a cancel request
is received?  It then sends the cancel request off to mom superior.  After 5
minutes, if mom superior hasn't responded or completed the request, pbs_server
goes asynchronous: it stops reporting the job, sends one email to the user and
admin, and reissues the cancel request every 5 minutes without sending additional

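Roughly, that escalation could be sketched in C as below.  This is only an
illustration of the timing logic, not TORQUE code: every name in it
(cancel_track, cancel_start, cancel_poll, the CANCEL_* values) is made up,
and the 5-minute interval is simply the figure suggested above.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical names throughout; none of these exist in TORQUE. */
enum cancel_action {
    CANCEL_WAIT,      /* within the grace period: keep reporting the job in E */
    CANCEL_GO_ASYNC,  /* first timeout: stop reporting, mail once, reissue    */
    CANCEL_REISSUE    /* later timeouts: reissue the cancel, no more mail     */
};

struct cancel_track {
    time_t last_sent; /* when the latest cancel request went to mom superior  */
    bool   is_async;  /* true once the server has stopped reporting the job   */
};

#define CANCEL_RETRY_SECS (5 * 60)

/* Record the initial cancel request (the job would enter state E here). */
void cancel_start(struct cancel_track *t, time_t now)
{
    t->last_sent = now;
    t->is_async = false;
}

/* Decide what the server should do at time `now` for a job whose
 * mom superior has not yet confirmed the cancel. */
enum cancel_action cancel_poll(struct cancel_track *t, time_t now)
{
    if (now - t->last_sent < CANCEL_RETRY_SECS)
        return CANCEL_WAIT;

    t->last_sent = now;           /* a new cancel request goes out now */
    if (!t->is_async) {
        t->is_async = true;
        return CANCEL_GO_ASYNC;   /* the only point where mail is sent */
    }
    return CANCEL_REISSUE;
}
```

The caller would invoke cancel_poll() from whatever periodic loop the server
already runs and stop tracking the job once mom superior reports back.
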
And maybe, just maybe, nodes involved with an async cancel should be marked

>   Also, while on a somewhat related subject, we are looking into
> modifying pbs_mom to use the 'killpg()' system call when terminating a
> job.  Currently, mom will issue a SIGTERM to the child process for a
> couple of seconds, then issue a SIGKILL.  This killpg() call will send
> signals to the child process and the entire process tree (all other
> processes in the child's process group).  We are currently using this
> capability in some other non-TORQUE projects but would like to know
> people's experiences with it.  Is it missing or broken on some
> platforms?  Are there 'gotcha' associated with its use on certain
> systems?  Are there other comparable projects already utilizing it?

Couldn't users get around this easily with setpgrp()?  Maybe even accidentally?
Maybe fall back on kill() if killpg() doesn't kill the parent process?  I dunno,
I'm pretty much happy with things right now.

