[torquedev] Is kill_delay broken?
josh at clusterresources.com
Tue Mar 31 13:13:42 MDT 2009
We've had a customer report a possible regression in how the pbs_server
attribute "kill_delay" is supposed to work. This is what the man page says about it:
The amount of the time delay between the sending of SIGTERM and
SIGKILL when a qdel command is issued against a running job. This
is overridden by the execution queue attribute of the same name.
Format: integer seconds; default value: 2 seconds.
In other words, kill_delay controls when pbs_server sends a SIGKILL to a
job. For example, when qdel is used on a running job, pbs_server sends a
SIGTERM to the job immediately. The server then adds an internal task to send
the SIGKILL, scheduled <kill_delay> seconds in the future.
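For anyone who wants to see the intended timing outside of TORQUE, here is a rough sketch. It is not pbs_server code: the name term_then_kill and the Python child process standing in for a job are assumptions made for illustration only. Unlike the server, which schedules the SIGKILL unconditionally, this sketch skips the SIGKILL if the child exits on its own first.

```python
import signal
import subprocess
import sys
import time

# Stand-in for a job (hypothetical): ignores SIGTERM, reports readiness, sleeps.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)

def term_then_kill(proc, kill_delay):
    """Send SIGTERM now; send SIGKILL if proc is still alive after kill_delay seconds."""
    start = time.monotonic()
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=kill_delay)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()
    return time.monotonic() - start

def demo(kill_delay=2):
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    proc.stdout.readline()  # block until the child has installed SIG_IGN
    elapsed = term_then_kill(proc, kill_delay)
    proc.stdout.close()
    return elapsed, proc.returncode
```

With a child that ignores SIGTERM, demo() should take roughly kill_delay seconds and report that the child died from SIGKILL, which is the behaviour the man page describes.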
When the MOM gets the SIGTERM request, it passes that signal on to all of the
tasks in the job's session. For example, our typical test job has three tasks in
the job's session:
root     11147     1  0 Mar20 ?        00:01:08 pbs_mom
josh     26482 11147  0 12:59 ?        00:00:00 -bash
josh     26483 26482  0 12:59 ?        00:00:00 -bash
josh     26484 26483  0 12:59 ?        00:00:00 /home/josh/sigtest
The sigtest process has a handler that catches and ignores the SIGTERM, but
bash does not. This means bash is killed immediately.
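A test program along the lines of sigtest can be very small. The following is a minimal sketch, not the actual /home/josh/sigtest source (which is not shown here); the names on_sigterm, install_handler, and run are made up for the example.

```python
import os
import signal
import time

received = []  # signal numbers seen by the handler

def on_sigterm(signum, frame):
    # Record the signal and return; the process keeps running.
    received.append(signum)

def install_handler():
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, on_sigterm)

def run(seconds):
    """Stay alive for the given time, shrugging off any SIGTERM that arrives."""
    install_handler()
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        time.sleep(0.1)
    return len(received)
```

A process running run() survives SIGTERM for the whole interval, so the only way to end it early is a signal it cannot catch, such as SIGKILL.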
Next, the MOM runs scan_for_terminated(), sees the -bash task terminate, and
then does several things, one of which is to call kill_task() with a SIGKILL.
kill_task() then issues a SIGKILL for any pid that is still in the /proc table
and matches the session ID. This kills sigtest *early*. In other words,
kill_delay is subverted because the pbs_mom sends a SIGKILL before the server
tells it to. This seems to make kill_delay, well, useless. :)
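To make the session sweep concrete, here is a rough Linux-only sketch of the same idea: walk /proc, read each process's session ID, and signal every match. This is not the pbs_mom source; session_of, pids_in_session, and kill_session are hypothetical names, and the real kill_task() may differ in its details.

```python
import os
import signal

def session_of(pid):
    """Return the session ID from /proc/<pid>/stat, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
        # The command name is in parentheses and may contain spaces,
        # so parse the fields that follow the last ')'.
        fields = data[data.rfind(")") + 2:].split()
        # Fields after the command: state, ppid, pgrp, session, ...
        return int(fields[3])
    except (OSError, IndexError, ValueError):
        return None  # process exited or entry unreadable

def pids_in_session(sid):
    """List every pid in /proc whose session ID equals sid."""
    return sorted(
        int(name) for name in os.listdir("/proc")
        if name.isdigit() and session_of(int(name)) == sid
    )

def kill_session(sid):
    """Send SIGKILL to every process still in the session; return pids signalled."""
    killed = []
    for pid in pids_in_session(sid):
        try:
            os.kill(pid, signal.SIGKILL)
            killed.append(pid)
        except ProcessLookupError:
            pass  # exited between the scan and the kill
    return killed
```

A sweep like this makes no distinction between tasks that are already on their way out and tasks that deliberately ignored SIGTERM, which is why sigtest dies as soon as bash does.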
Does anyone out there know if this is a regression? In an effort to make the
pbs_mom tidier, did we inadvertently break kill_delay's intended
functionality? Or am I perhaps missing something? Are there cluster admins out
there who use kill_delay successfully?
BTW, this test was done in TORQUE 2.3.x on Linux.
Cluster Resources, Inc.