[Mauiusers] Re: [torqueusers] excessive emails
garrick at usc.edu
Fri Jul 1 20:12:53 MDT 2005
On Fri, Jul 01, 2005 at 08:41:23AM -0600, Dave Jackson alleged:
> - Adding an option to TORQUE to issue an 'asynchronous' cancel which
> will cancel the job from pbs_server and, when, if ever, the MOM does
> report back to the server, cancels the jobs on the compute nodes.
> However, from the time the asynchronous cancel is issued, pbs_server
> will no longer report the job.
I don't know if I like that last part. How would the admin or user know if
there is a problem? In normal cases, I think pbs_server should continue to
report the job (in an E state) when the cancel is issued.
How about if pbs_server immediately sets the state to E when a cancel request
is received? It then sends the cancel request to mom superior. After 5
minutes, if mom superior hasn't responded or completed the request, pbs_server
goes asynchronous: it stops reporting the job, sends one email to the user and
admin, and reissues cancel requests every 5 minutes without sending additional
And maybe, just maybe, nodes involved with an async cancel should be marked
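To make the proposed escalation concrete, here is a rough sketch of the bookkeeping in Python. It is not TORQUE code: the CancelTracker class, its field names, and the tick() method are all hypothetical, and only the E state, the 5-minute timeout, the single email, and the 5-minute reissue interval come from the proposal above.

```python
from dataclasses import dataclass, field

ASYNC_TIMEOUT = 5 * 60   # seconds before the server goes asynchronous
RETRY_INTERVAL = 5 * 60  # seconds between reissued cancel requests


@dataclass
class CancelTracker:
    """Hypothetical per-job bookkeeping; not a TORQUE data structure."""
    requested_at: float
    state: str = "E"        # job shows as E as soon as the cancel arrives
    reported: bool = True   # whether the server still lists the job
    emails_sent: int = 0
    cancels_sent: int = 1   # the initial request sent to mom superior
    last_cancel_at: float = field(init=False)

    def __post_init__(self):
        self.last_cancel_at = self.requested_at

    def tick(self, now, mom_done):
        """Advance the tracker; return True once mom has finished the cancel."""
        if mom_done:
            return True
        if self.reported and now - self.requested_at >= ASYNC_TIMEOUT:
            # Go asynchronous: stop reporting the job and notify exactly once.
            self.reported = False
            self.emails_sent += 1
        if not self.reported and now - self.last_cancel_at >= RETRY_INTERVAL:
            # Reissue the cancel request without sending another email.
            self.cancels_sent += 1
            self.last_cancel_at = now
        return False
```

Calling tick() once a minute with the current time would keep the job visible for the first five minutes, then hide it, send one email, and retry the cancel every five minutes until mom reports back.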
> Also, while on a somewhat related subject, we are looking into
> modifying pbs_mom to use the 'killpg()' system call when terminating a
> job. Currently, mom will issue a SIGTERM to the child process for a
> couple of seconds, then issue a SIGKILL. This killpg() call will send
> signals to the child process and the entire process tree (all other
> processes in the child's process group). We are currently using this
> capability in some other non-TORQUE projects but would like to know
> people's experiences with it. Is it missing or broken on some
> platforms? Are there 'gotchas' associated with its use on certain
> systems? Are there other comparable projects already utilizing it?
Couldn't users get around this easily with setpgrp()? Maybe even accidentally?
Maybe fall back on kill() if killpg() doesn't kill the parent process? I dunno,
I'm pretty much happy with things right now.
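A minimal sketch of the fallback idea, in Python rather than mom's C, and not taken from pbs_mom: terminate_job() and its grace parameter are hypothetical names, and the two-second wait is borrowed from the "couple of seconds" mentioned above. It signals the process group first and falls back on a plain kill() when no group with that id exists. It does nothing about descendants that have moved themselves into a different group with setpgrp().

```python
import os
import signal
import subprocess
import sys


def terminate_job(proc, grace=2.0):
    """SIGTERM, wait, then SIGKILL; process group first, single pid as fallback.

    Assumes proc is a subprocess.Popen that is normally a process group
    leader (for example, started with start_new_session=True).
    """
    if proc.poll() is not None:
        return proc.returncode          # already exited and reaped
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(proc.pid, sig)    # signal the whole process group
        except ProcessLookupError:
            # No group with that id: fall back on signalling the one process.
            try:
                os.kill(proc.pid, sig)
            except ProcessLookupError:
                pass
        try:
            return proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            continue                    # still alive, escalate to SIGKILL
    return proc.wait()


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    job = subprocess.Popen(sleeper, start_new_session=True)
    print("exit status:", terminate_job(job))
```

On POSIX systems a negative return value from terminate_job() is the number of the signal that ended the process.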
Garrick Staples, Linux/HPCC Administrator
University of Southern California