[Mauiusers] Re: [torqueusers] excessive emails

Garrick Staples garrick at usc.edu
Fri Jul 1 20:12:53 MDT 2005


On Fri, Jul 01, 2005 at 08:41:23AM -0600, Dave Jackson alleged:
> - Adding an option to TORQUE to issue an 'asynchronous' cancel which
> will cancel the job from pbs_server and, when, if ever, the MOM does
> report back to the server, cancels the jobs on the compute nodes.
> However, from the time the asynchronous cancel is issued, pbs_server
> will no longer report the job.

I don't know if I like that last part.  How would the admin or user know if
there is a problem?  In normal cases, I think pbs_server should continue to
report the job (in an E state) when the cancel is issued.

How about if pbs_server immediately sets the state to E when a cancel request
is received?  Then it sends off the cancel request to mom superior.  After 5
minutes, if mom superior hasn't responded or completed the request, pbs_server
then goes asynchronous: it stops reporting the job, sends 1 email to the user
and admin, and reissues cancel requests every 5 minutes without sending
additional emails.
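
The escalation above can be sketched as a small state tracker.  This is only
an illustration of the proposal, not TORQUE code: the class CancelTracker, its
fields, the tick() method, and the "DONE" state are all hypothetical names.
It also assumes that the first reissued cancel goes out at the moment the
server goes asynchronous, which the proposal does not spell out.

```python
from dataclasses import dataclass

ASYNC_TIMEOUT = 5 * 60   # seconds to wait for mom superior before going async
RETRY_INTERVAL = 5 * 60  # seconds between reissued cancel requests


@dataclass
class CancelTracker:
    """Hypothetical per-job bookkeeping for a pending cancel request."""
    job_id: str
    requested_at: float
    state: str = "E"          # set to E as soon as the cancel is received
    visible: bool = True      # whether the server still reports the job
    is_async: bool = False
    emails_sent: int = 0
    cancels_sent: int = 1     # the initial request to mom superior
    last_cancel_at: float = 0.0

    def __post_init__(self):
        self.last_cancel_at = self.requested_at

    def tick(self, now, mom_done):
        """Advance the tracker; 'now' is in seconds, same clock as requested_at."""
        if mom_done:
            # mom superior completed the cancel: the job is gone for good.
            self.state = "DONE"
            self.visible = False
            return
        if not self.is_async:
            if now - self.requested_at >= ASYNC_TIMEOUT:
                # Go asynchronous: hide the job, send exactly one email,
                # and reissue the cancel (assumption: first retry is now).
                self.is_async = True
                self.visible = False
                self.emails_sent += 1
                self.cancels_sent += 1
                self.last_cancel_at = now
        elif now - self.last_cancel_at >= RETRY_INTERVAL:
            # Already async: keep retrying quietly, with no further email.
            self.cancels_sent += 1
            self.last_cancel_at = now
```

Calling tick() periodically with the current time keeps the job visible in
state E for the first 5 minutes, then hides it and retries silently.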

And maybe, just maybe, nodes involved with an async cancel should be marked
"down".

 
>   Also, while on a somewhat related subject, we are looking into
> modifying pbs_mom to use the 'killpg()' system call when terminating a
> job.  Currently, mom will issue a SIGTERM to the child process for a
> couple of seconds, then issue a SIGKILL.  This killpg() call will send
> signals to the child process and the entire process tree (all other
> processes in the child's process group).  We are currently using this
> capability in some other non-TORQUE projects but would like to know
> people's experiences with it.  Is it missing or broken on some
> platforms?  Are there 'gotchas' associated with its use on certain
> systems?  Are there other comparable projects already utilizing it?

Couldn't users get around this easily with setpgrp()?  Maybe even accidentally?
Maybe fall back on kill() if killpg() doesn't kill the parent process?  I dunno,
I'm pretty much happy with things right now.
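
A rough sketch of that fallback, written in Python rather than mom's C for
brevity.  The function terminate_job() and its parameters are hypothetical;
it assumes the caller holds a subprocess.Popen handle for the job's top-level
process and knows the process group the job was started in.  os.killpg() and
os.kill() are thin wrappers over the system calls of the same name.

```python
import os
import signal
import subprocess


def terminate_job(proc, pgid, grace=2.0):
    """Signal the job's process group, falling back to the top-level process.

    proc  -- subprocess.Popen for the job's top-level process (a stand-in
             for the child that mom forked; hypothetical interface)
    pgid  -- process group the job was started in
    grace -- seconds to wait after each signal
    Returns "killpg" if the process exited after a group signal, or "kill"
    if the direct fallback was needed.
    """
    if pgid <= 1:
        # pgid 0 would mean "my own group"; never signal that by mistake.
        raise ValueError("refusing to signal pgid <= 1")

    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(pgid, sig)
        except ProcessLookupError:
            pass  # nothing left in that group, e.g. the job moved out of it
        try:
            proc.wait(timeout=grace)
            return "killpg"
        except subprocess.TimeoutExpired:
            continue

    # The group signals did not end the top-level process: signal it directly.
    proc.kill()
    proc.wait()
    return "kill"
```

This only covers the top-level process.  Descendants that moved themselves
into a different process group would still be missed by both paths.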

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California