[torqueusers] Re: kill_delay

Garrick Staples garrick at clusterresources.com
Sun Feb 25 18:51:07 MST 2007


On Sun, Feb 25, 2007 at 12:07:49AM +0100, Roy Dragseth alleged:
> This is hardcoded into src/resmom/linux/mom_mach.c around line 1910.  The kill
> procedure iterates up to 20 times with a nanosleep call of 0.25 seconds
> before it continues and does a hard kill on the process.  I do not know why
> this has to be there; the -TERM and -KILL signalling should be left to the
> discretion of pbs_server, which has the kill_delay variable.  I do not think
> the kill_delay variable is forwarded to the moms.

That loop has always really bugged me.  If you do something largish
with TM, like launch >1000 tasks, that loop takes forever to complete.
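
For illustration, the loop described in the quoted text can be sketched roughly
as below. This is not the code from mom_mach.c: the function names
term_then_kill and worst_case_seconds are hypothetical, and the sketch assumes
that tasks are handled one after another and that the process is probed with
kill(pid, 0).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper, not the TORQUE source: send SIGTERM, poll for the
 * process to disappear, then fall back to SIGKILL.
 * Returns 0 if the process was gone before the hard kill,
 *         1 if SIGKILL was sent,
 *        -1 on an unexpected error (for example EPERM). */
int term_then_kill(pid_t pid, int max_polls, long interval_ns)
{
    struct timespec delay;

    assert(max_polls >= 0 && interval_ns >= 0);
    delay.tv_sec = interval_ns / 1000000000L;
    delay.tv_nsec = interval_ns % 1000000000L;

    if (kill(pid, SIGTERM) == -1)
        return (errno == ESRCH) ? 0 : -1;

    for (int i = 0; i < max_polls; i++) {
        nanosleep(&delay, NULL);
        /* Signal 0 only checks whether the pid still exists. An exited
         * child that has not been reaped yet still counts as existing. */
        if (kill(pid, 0) == -1 && errno == ESRCH)
            return 0;
    }

    if (kill(pid, SIGKILL) == -1)
        return (errno == ESRCH) ? 0 : -1;
    return 1;
}

/* Worst-case wall-clock seconds, assuming tasks are processed sequentially
 * and every task uses all of its polls before the hard kill. */
double worst_case_seconds(int tasks, int max_polls, double interval_s)
{
    return (double)tasks * max_polls * interval_s;
}
```

With the values from the quoted text (20 polls of 0.25 seconds), a single
stubborn task costs up to 5 seconds, so worst_case_seconds(1000, 20, 0.25)
gives 5000 seconds under the sequential assumption above.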

I don't know exactly why or when it was added, but OpenPBS didn't have
it.  I can easily imagine someone was trying to make pbs_mom very
thorough in the art of process massacre.

That said, kill_delay does actually work correctly.  pbs_server will
send out KILLs after kill_delay seconds if the job hasn't exited.
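
The server-side rule can be sketched as a small predicate. This is only an
illustration, not pbs_server code: kill_is_due is a hypothetical name, and the
sketch assumes the delay is counted from the time the first signal was sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical predicate: true once kill_delay_s seconds have passed since
 * the first signal was sent and the job still has not exited. */
bool kill_is_due(time_t term_sent_at, time_t now, long kill_delay_s,
                 bool job_exited)
{
    assert(kill_delay_s >= 0);

    if (job_exited)
        return false;   /* nothing left to kill */

    return difftime(now, term_sent_at) >= (double)kill_delay_s;
}
```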


