[torquedev] Is kill_delay broken?

Josh Butikofer josh at clusterresources.com
Tue Mar 31 15:22:39 MDT 2009


Michael,

Thanks for the response. See my comments below:

Michael Barnes wrote:
> On Tue, Mar 31, 2009 at 01:13:42PM -0600, Josh Butikofer wrote:
>> Everyone,
>>
>> We've had a customer report a possible regression in how the pbs_server
>> attribute "kill_delay" is supposed to work. This is what the man page says 
>> about it:
>>
>> kill_delay
>>     The amount of the time delay between the sending of SIGTERM and
>>     SIGKILL when a qdel command is issued against a running job. This
>>     is overridden by the execution queue attribute of the same name.
>>     Format: integer seconds; default value: 2 seconds.
>>
>> In other words, kill_delay controls when the pbs_server sends a SIGKILL to a
>> job. For example, when qdel is used on a running job, the pbs_server sends a
>> SIGTERM to the job immediately. The server then adds an internal task to
>> send the SIGKILL, but schedules it <kill_delay> seconds in the future.
>>
>> When the MOM gets the SIGTERM request, it passes that signal on to all of
>> the tasks in the job's session. For example, our typical test job has three
>> tasks in the job's session:
>>
>> root     11147     1  0 Mar20 ?        00:01:08 pbs_mom
>> ...
>> josh     26482 11147  0 12:59 ?        00:00:00 -bash
>> josh     26483 26482  0 12:59 ?        00:00:00 -bash
>> josh     26484 26483  0 12:59 ?        00:00:00 /home/josh/sigtest
>>
>>
>> The sigtest task/process has a handler to catch and ignore the SIGTERM, but 
>> that is not true for bash. This means bash is killed immediately.
>>
>> Next, the MOM runs scan_for_terminated(), sees the -bash task terminate,
>> and then does several things, one of which is to call kill_task() with a
>> SIGKILL. kill_task() then issues a SIGKILL for any pid that is still in the
>> /proc table and matches the session ID. This then kills sigtest *early*. In
>> other words, kill_delay is subverted because the pbs_mom sends a SIGKILL
>> before the server tells it to. This seems to make kill_delay, well,
>> useless. :)
>>
>> Does anyone out there know if this is a regression? In an effort to make the
>> pbs_mom more tidy, did we inadvertently break kill_delay's intended
>> functionality? Or am I perhaps missing something? Are there cluster admins
>> out there that use kill_delay successfully?
>>
>> BTW, this test was done in TORQUE 2.3.x on Linux.
> 
> AFAIK, the session ID is the process ID of the program that the pbs_mom
> runs on behalf of the user, and the pbs_mom does not walk down the
> process tree beyond the session ID when it sends a signal to that ID.

Looking at kill_task() in src/resmom/linux/mom_mach.c, it does appear to walk 
through all processes in the system via /proc, and any process that is part of 
the given session is issued the signal passed to kill_task().
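
To make the /proc walk concrete, here is a rough Python sketch of the idea. It is not TORQUE's C code: session_of(), pids_in_session(), and signal_session() are hypothetical names, and it assumes the Linux /proc/<pid>/stat layout, where the session ID is the fourth field after the parenthesized command name.

```python
import os


def session_of(stat_text):
    """Return the session ID from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' before reading the fields.
    fields = stat_text[stat_text.rindex(")") + 1:].split()
    # After the command name: state, ppid, pgrp, session, ...
    return int(fields[3])


def pids_in_session(sid, proc_root="/proc"):
    """List every PID under proc_root whose session ID equals sid."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # Skip non-process entries such as "self" or "meminfo".
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                if session_of(handle.read()) == sid:
                    matches.append(int(entry))
        except (OSError, ValueError, IndexError):
            # The process exited (or is unreadable) while we were scanning.
            continue
    return matches


def signal_session(sid, sig):
    """Send sig to every process found in session sid; return the PIDs hit."""
    signaled = []
    for pid in pids_in_session(sid):
        try:
            os.kill(pid, sig)
            signaled.append(pid)
        except ProcessLookupError:
            pass  # Already gone between the scan and the kill.
    return signaled
```

With a walk like this, a SIGKILL reaches every surviving member of the session, including a grandchild such as sigtest, regardless of where it sits in the process tree.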

> So, in the above example we have pbs_mom (11147) -> bash (26482) -> bash
> (26483) -> sigtest (26484)
> 
> When pbs_mom sends a TERM signal to 26482, the signal is sent to all of
> its children as well.

AFAIK, this depends on how kill() is called. TORQUE uses kill_task() to send 
the original SIGTERM, and it calls kill() with a positive PID to signify that 
only process 26482 should be signaled, and not any of the other members of its 
process group. (In all the cases in my test, kill_task() is called with 
(ptask,SIGTERM,0) where the last parameter tells TORQUE to not kill the entire 
process group.)
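
The single-process versus process-group distinction can be sketched in Python, whose os.kill() and os.killpg() wrap the same system calls. This is only an illustration: send_signal() and spawn_sleeper() are hypothetical helpers, not anything in TORQUE.

```python
import os
import signal
import subprocess
import sys


def send_signal(pid, sig, whole_group=False):
    """Signal one process, or its whole process group if whole_group is set."""
    if whole_group:
        # Like kill(-pgid, sig) in C: every member of the group gets sig.
        os.killpg(os.getpgid(pid), sig)
    else:
        # Like kill(pid, sig) with a positive pid: only that process gets sig.
        os.kill(pid, sig)


def spawn_sleeper():
    """Start a throwaway child in its own session and process group."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        start_new_session=True,
    )
```

For example, send_signal(child.pid, signal.SIGTERM) touches only the child, while passing whole_group=True would also reach anything the child had started in the same group.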

> So, the bash process being killed makes sense.

Yeah, I agree. It makes sense that it dies.

> And then sigtest would be a stray process, which is a common problem on
> clusters.

The sigtest process is owned by the second bash process (26483), so it isn't an 
orphan per UNIX standards, but the second bash process does become orphaned ... 
if you meant stray to mean "orphaned process."

> That is why there is a delay between TERM and KILL, because
> KILL cannot be trapped or passed down the process group, and once a
> process is KILLed, the child processes are now not under the pbs_mom's
> control, but under init's control.

What I'm seeing is the pbs_mom sending a SIGTERM when I issue a qdel 
(process_request() -> dispatch_request() -> req_signaljob() -> kill_job()  -> 
kill_task() -> kill()).

I then see, IMMEDIATELY thereafter, the pbs_mom sends a SIGKILL 
(scan_for_terminated() -> kill_task() -> kill()).

The parameter kill_delay is a pbs_server config option and only affects how the 
pbs_server sends signaljob messages to the pbs_mom. In our tests, using TORQUE 
2.1.x - 2.3.x and even 2.4.x, no matter what kill_delay is set to, the job will 
still immediately be killed with a SIGKILL. This is the problem at hand. I can 
see this in the log files for the pbs_mom--no communication with the server is 
even necessary.
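
For comparison, the sequence I would expect kill_delay to produce can be sketched as a standalone Python script. This is not TORQUE code and it collapses the server and MOM into one process; spawn_term_ignorer() and term_then_kill() are hypothetical names, and the child is just a stand-in for a program like sigtest that ignores SIGTERM.

```python
import signal
import subprocess
import sys
import time

# Stand-in for a sigtest-like program: ignore SIGTERM, announce readiness, idle.
IGNORES_TERM = (
    "import signal, sys, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def spawn_term_ignorer():
    """Start a child that ignores SIGTERM and wait until that is in effect."""
    child = subprocess.Popen(
        [sys.executable, "-c", IGNORES_TERM], stdout=subprocess.PIPE
    )
    child.stdout.readline()  # Blocks until the child prints 'ready'.
    child.stdout.close()
    return child


def term_then_kill(child, kill_delay):
    """Send SIGTERM, give the child kill_delay seconds, then send SIGKILL.

    Returns (seconds the child survived after SIGTERM, its return code).
    """
    start = time.monotonic()
    child.send_signal(signal.SIGTERM)
    try:
        child.wait(timeout=kill_delay)
    except subprocess.TimeoutExpired:
        child.send_signal(signal.SIGKILL)  # Cannot be caught or ignored.
        child.wait()
    return time.monotonic() - start, child.returncode
```

A process that ignores SIGTERM should survive for roughly kill_delay seconds here before the SIGKILL lands, which is the window that is missing when the SIGKILL follows the SIGTERM immediately.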

> With the kill delay, I've tested this on older TORQUE versions, and it
> worked fine.  I believe it's also in the mom logs.  It says something
> like sending TERM signal, then sending KILL signal.

Which older versions of TORQUE are you talking about? Older than 2.0?

--Josh Butikofer

