[torquedev] kill_delay or something like it...
Patrick McGuigan
mcguigan at uta.edu
Wed Jul 9 12:44:25 MDT 2008
Hi,
Are there any suggestions on how to provide a configurable (or even
hard-coded) control that allows a longer time between SIGTERM and
SIGKILL in response to a qdel or a resource violation?
kill_delay should be this control, but the way the mom processes a qdel
suggests that it has no effect once kill_delay is longer than
~5 seconds. I turned on full logging in the mom and killed a
job. I noticed that kill_job() calls kill_task() to send the initial
SIGTERM to the process tree, but the processing never seems to wait.
Instead, it appears that scan_for_terminated() gets called quickly. This
results in kill_task() being called again (this time to kill the job),
which sends a second SIGTERM to any processes still running, enters a
hard-coded timing loop, and then sends SIGKILL.
I don't need to support MPI or other parallel jobs in the cluster. I
would prefer the simple idea of a single SIGTERM being sent to the
process tree and then doing nothing until the server sends a SIGKILL.
If that is too complicated, is it possible to extend the timing loop to
90 seconds from the default 5 seconds?
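To make the behaviour I am after concrete, here is a minimal sketch in Python (not TORQUE code, and not how the mom is implemented). It sends a single SIGTERM to the job's process group, waits up to a configurable grace period, and only then sends SIGKILL. The name terminate_then_kill and its parameters grace_seconds and poll_interval are hypothetical, and the sketch assumes the job was started as the leader of its own process group.

```python
import os
import signal
import subprocess
import time


def terminate_then_kill(proc, grace_seconds=90.0, poll_interval=0.5):
    """Send one SIGTERM to proc's process group, wait, then SIGKILL.

    Hypothetical helper for illustration only. Returns the name of the
    last signal that was sent. Only the group leader (proc) is watched;
    descendants that outlive the leader are not tracked in this sketch.
    """
    # Assumes proc leads its own group (e.g. start_new_session=True).
    pgid = os.getpgid(proc.pid)

    # Exactly one SIGTERM for the whole group; it is never re-sent.
    os.killpg(pgid, signal.SIGTERM)

    # Do nothing but wait until the grace period has elapsed.
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return "SIGTERM"  # leader exited within the grace period
        time.sleep(poll_interval)

    # Grace period expired: escalate to SIGKILL for the whole group.
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group disappeared just before the kill
    proc.wait()
    return "SIGKILL"
```

A job started with subprocess.Popen(cmd, start_new_session=True) could then be stopped with terminate_then_kill(job, grace_seconds=90.0), which corresponds to the 90 second window asked about above.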
All insights are appreciated,
Patrick