[torquedev] Is kill_delay broken?

Josh Butikofer josh at clusterresources.com
Tue Mar 31 13:13:42 MDT 2009


We've had a customer report a possible regression in how the pbs_server
attribute "kill_delay" is supposed to work. This is what the man page says about it:

     The amount of the time delay between the sending of SIGTERM and
     SIGKILL when a qdel command is issued against a running job. This
     is overridden by the execution queue attribute of the same name.
     Format: integer seconds; default value: 2 seconds.

In other words, kill_delay controls when the pbs_server sends a SIGKILL to a
job. For example, when qdel is used on a running job, the pbs_server sends a
SIGTERM to the job immediately. The server then adds an internal task to send
the SIGKILL, scheduled to run <kill_delay> seconds in the future.
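
To make the intended behavior concrete, here is a minimal sketch of the
"SIGTERM now, SIGKILL after kill_delay" pattern. It is not TORQUE code: the
real pbs_server relays signal requests through the MOM rather than signaling
processes directly, and the names delete_job() and spawn_term_ignoring_child()
are hypothetical helpers made up for this illustration.

```python
import signal
import subprocess
import sys


def spawn_term_ignoring_child():
    """Start a child that ignores SIGTERM (a stand-in for a stubborn job)."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    child = subprocess.Popen(
        [sys.executable, "-c", code], stdout=subprocess.PIPE, text=True
    )
    child.stdout.readline()  # block until the child has installed SIG_IGN
    return child


def delete_job(proc, kill_delay=2.0):
    """Send SIGTERM at once; send SIGKILL only if proc outlives kill_delay."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=kill_delay)  # grace period for a clean shutdown
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()
    return proc.returncode
```

With a child that ignores SIGTERM, delete_job() should not return until
roughly kill_delay seconds have passed. That grace period is what the job in
the test described in this message does not get.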

When the MOM gets the SIGTERM request, it passes that signal on to all of the
tasks in the job's session. For example, our typical test job has three tasks in 
the job's session:

root     11147     1  0 Mar20 ?        00:01:08 pbs_mom
josh     26482 11147  0 12:59 ?        00:00:00 -bash
josh     26483 26482  0 12:59 ?        00:00:00 -bash
josh     26484 26483  0 12:59 ?        00:00:00 /home/josh/sigtest

The sigtest task/process has a handler to catch and ignore the SIGTERM, but that
is not true for bash. This means bash is killed immediately.
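
For anyone who wants to reproduce this, a rough stand-in for sigtest could look
like the sketch below. This is not the actual sigtest source (it is not
included in this message); the function name install_sigterm_catcher() is an
assumption made for this example.

```python
import signal


def install_sigterm_catcher():
    """Catch SIGTERM and record it instead of exiting.

    Returns the list that the handler appends each received signal to.
    """
    received = []

    def on_sigterm(signum, frame):
        # Deliberately do nothing fatal: the process keeps running.
        received.append(signum)

    signal.signal(signal.SIGTERM, on_sigterm)
    return received
```

A process that calls install_sigterm_catcher() and then sleeps in a loop
survives SIGTERM and can only be stopped by SIGKILL, which is the situation
kill_delay exists for.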

Next, the MOM runs scan_for_terminated(), sees the -bash task terminate, and
does several things, one of which is to call kill_task() with a SIGKILL.
kill_task() then issues a SIGKILL for any pid that is still in the /proc table
and matches the session ID. This kills sigtest *early*. In other words,
kill_delay is subverted because the pbs_mom sends a SIGKILL before the server
tells it to. This seems to make kill_delay, well, useless. :)
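
The session-wide kill described above can be approximated with the sketch
below. It is only an illustration of "signal every pid in /proc that matches
the session ID" and not the MOM's actual kill_task() implementation; the names
pids_in_session() and signal_session() are hypothetical, and the /proc scan
assumes Linux.

```python
import os


def pids_in_session(sid):
    """Return the PIDs listed in /proc whose session ID equals sid."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        pid = int(entry)
        try:
            if os.getsid(pid) == sid:
                matches.append(pid)
        except OSError:
            continue  # process exited or is not visible to us
    return sorted(matches)


def signal_session(sid, signum):
    """Send signum to every process in session sid; return the PIDs reached."""
    reached = []
    for pid in pids_in_session(sid):
        try:
            os.kill(pid, signum)
            reached.append(pid)
        except OSError:
            continue  # process vanished or we lack permission
    return reached
```

Calling signal_session(sid, signal.SIGKILL) as soon as the session leader
exits would take out sigtest right away, no matter how large kill_delay is.
Passing 0 as signum only checks that each pid exists, which is a safe way to
try the scan.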

Does anyone out there know if this is a regression? In an effort to make the
pbs_mom more tidy, did we inadvertently break kill_delay's intended
functionality? Or am I perhaps missing something? Are there cluster admins out
there that use kill_delay successfully?

BTW, this test was done in TORQUE 2.3.x on Linux.

Josh Butikofer
Cluster Resources, Inc.
