[torquedev] kill_delay or something like it...

Patrick McGuigan mcguigan at uta.edu
Wed Jul 9 12:44:25 MDT 2008


Does anyone have suggestions for a configurable (or even hard-coded) 
control that would lengthen the time between SIGTERM and SIGKILL when a 
job is killed by a qdel or a resource violation?

kill_delay should be this control, but the mom's handling of a qdel 
suggests that it has no effect when kill_delay is longer than ~5 
seconds.  I turned on full logging in the mom and killed a job.  I 
noticed that kill_job() calls kill_task() to send the initial SIGTERM 
to the process tree, but the mom never seems to wait afterwards. 
Instead, scan_for_terminated() appears to be called almost immediately. 
That leads to a second kill_task() call (this time for a kill), which 
sends a second SIGTERM to any processes still running, enters a 
hard-coded timing loop, and then sends SIGKILL.

I don't need to support MPI or other parallel jobs in the cluster.  I 
would prefer the simpler behavior: send a single SIGTERM to the process 
tree, then do nothing until the server sends a SIGKILL.  If that is too 
complicated, is it possible to extend the timing loop from the default 
5 seconds to 90 seconds?
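
To make the requested behavior concrete, here is a minimal sketch in 
Python.  It is not TORQUE code and does not reflect how the mom is 
actually implemented.  The function name terminate_then_kill() and its 
kill_delay parameter are hypothetical, and the sketch assumes the job 
was started as the leader of its own process group, so that one signal 
reaches the whole tree.  It only waits on the group leader.

```python
import os
import signal
import subprocess


def terminate_then_kill(proc, kill_delay=90.0):
    """Send one SIGTERM to proc's process group, then SIGKILL after kill_delay.

    Hypothetical helper for illustration only. Assumes proc was started
    with start_new_session=True, so proc.pid is also its process group id.
    Returns the name of the signal that ended the group leader.
    """
    # A single SIGTERM to the whole group; it is never re-sent.
    os.killpg(proc.pid, signal.SIGTERM)
    try:
        # Do nothing but wait until the leader exits or the delay expires.
        proc.wait(timeout=kill_delay)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        # Grace period is over: force-kill whatever is left in the group.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        return "SIGKILL"
```

With kill_delay=90.0 a job that handles SIGTERM gets the full 90 
seconds to clean up, and only a job that is still running after that 
receives SIGKILL.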

All insights are appreciated,

