[torqueusers] SIGTERM and pbsdsh

Tim Freeman tfreeman at mcs.anl.gov
Thu Nov 29 14:06:13 MST 2007


On Tue, 27 Nov 2007 09:52:36 -0600
Tim Freeman <tfreeman at mcs.anl.gov> wrote:

> I am starting the same executable on N nodes using pbsdsh -n.  During a qdel,
> SIGTERM signals do not look like they are propagating to each process, only a
> SIGKILL from the initial looks of it (there's a SIGTERM handler in the
> executable that is not getting invoked).
> 
> The application I'm running greatly benefits from getting to run a cleanup
> routine if cancelled.  Is there an option to pbsdsh or some technique to use
> where I can make this happen? 
> 
> Thanks,
> Tim
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 

I'm looking at the src/cmds/pbsdsh.c source (2.2.1). It looks like
wait_for_task() will see "fire_phasers" set to a signal number and pass that
signal on to the nodes via tm_kill().  So the intent is that SIGTERM should be
sent to the executables on each node, if I'm reading it correctly.

grep'ing the source tree shows this is the only use of tm_kill.
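
To make that reading concrete, here is a minimal sketch of the pattern, not
the actual pbsdsh code. It assumes a handler that only records the signal
number in "fire_phasers" and a wait loop that forwards it to every task. The
names on_signal, forward_signal, poll_once, NTASKS and forwarded are made up
for illustration, and forward_signal is a stand-in for the real tm_kill() call
so the sketch builds without the TORQUE libraries.

```c
#include <signal.h>
#include <stdio.h>

#define NTASKS 3   /* hypothetical number of spawned tasks */

/* Set by the handler; the name is borrowed from pbsdsh.c. */
static volatile sig_atomic_t fire_phasers = 0;

/* Counts forwarded signals so the behaviour can be inspected. */
static int forwarded = 0;

/* Handler: only record which signal arrived. */
static void on_signal(int sig)
{
    fire_phasers = sig;
}

/* Hypothetical stand-in for tm_kill(); it only logs and counts. */
static int forward_signal(int task, int sig)
{
    printf("task %d: forwarding signal %d\n", task, sig);
    forwarded++;
    return 0;
}

/* One pass of the wait loop: forward a pending signal, then clear it. */
static void poll_once(void)
{
    if (fire_phasers != 0) {
        int sig = fire_phasers;
        int t;

        fire_phasers = 0;
        for (t = 0; t < NTASKS; t++)
            forward_signal(t, sig);
    }
}
```

Installing on_signal for SIGTERM with signal(), raising SIGTERM and then
calling poll_once() forwards the signal once per task. Whether the MOM leaves
pbsdsh enough time to do this before the SIGKILL arrives is exactly the open
question.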

I tried to find where the normal SIGTERM + SIGKILL behavior comes from, and
this looks like it:

From src/resmom/mom_main.c:

      if (c & JOB_SVFLG_OVERLMT2) 
        {
        kill_job(pjob,SIGKILL);

        continue;
        }

      if (c & JOB_SVFLG_OVERLMT1) 
        {
        kill_job(pjob,SIGTERM);

        pjob->ji_qs.ji_svrflags |= JOB_SVFLG_OVERLMT2;

        continue;
        }
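
Read as a small state machine, the excerpt seems to escalate over two passes:
the first pass sends SIGTERM and sets the second flag, and the next pass sees
that flag and sends SIGKILL. The sketch below only simulates that sequence.
The flag values, struct sketch_job, sketch_kill_job and sketch_check_job are
placeholders invented for illustration, not the real TORQUE definitions.

```c
#include <signal.h>
#include <stdio.h>

/* Placeholder flag values; the real ones live in the TORQUE headers. */
#define SKETCH_OVERLMT1 0x1
#define SKETCH_OVERLMT2 0x2

struct sketch_job {
    int svrflags;   /* stands in for pjob->ji_qs.ji_svrflags */
    int last_sig;   /* last signal "sent", recorded for inspection */
};

/* Hypothetical stand-in for kill_job(): records instead of signalling. */
static void sketch_kill_job(struct sketch_job *job, int sig)
{
    job->last_sig = sig;
    printf("would send signal %d\n", sig);
}

/* One pass over a job, mirroring the two checks in the excerpt. */
static void sketch_check_job(struct sketch_job *job)
{
    int c = job->svrflags;

    if (c & SKETCH_OVERLMT2) {
        /* Second pass: the job was already asked to stop. */
        sketch_kill_job(job, SIGKILL);
        return;
    }

    if (c & SKETCH_OVERLMT1) {
        /* First pass: ask politely and arm the second stage. */
        sketch_kill_job(job, SIGTERM);
        job->svrflags |= SKETCH_OVERLMT2;
        return;
    }
}
```

Calling sketch_check_job() twice on a job that starts with only the first flag
set records SIGTERM and then SIGKILL, so the gap between the two signals would
be one iteration of whatever loop contains these checks.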


So the question (for me at least) is whether tm_kill (with a SIGTERM argument)
triggers the same thing or not.  But looking at the implementation of
tm_kill, I cannot see how it works.  The "sig" parameter is passed to
"diswsi": diswsi(local_conn, sig), and I don't immediately follow from that
point how the message gets to the MOM (or, more importantly, what this message
is exactly, whether it contains the "sig" code, and how/where the MOM
interprets it).

I'll keep digging, but any insight is appreciated.  Unfortunately I don't have
access to the nodes right now, so I will have to wait to run tests with more
information recorded to see if I can narrow down what is happening.
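
For those tests, something along these lines could be dropped into the
executable to record whether SIGTERM ever arrives. It is only a sketch: the
names on_sigterm, install_sigterm_logger and run_cleanup_if_requested are
hypothetical, and the cleanup body is left as a comment because it is
application-specific.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler, polled by the main loop. */
static volatile sig_atomic_t got_sigterm = 0;

/* Handler: note the signal and log it using only async-signal-safe calls. */
static void on_sigterm(int sig)
{
    static const char msg[] = "received SIGTERM\n";
    ssize_t n;

    (void)sig;
    got_sigterm = 1;
    n = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)n;   /* nothing useful can be done if the write fails */
}

/* Install the logging handler; returns 0 on success, -1 on failure. */
static int install_sigterm_logger(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Returns 1 if SIGTERM was seen (cleanup would run here), else 0. */
static int run_cleanup_if_requested(void)
{
    if (!got_sigterm)
        return 0;
    /* application-specific cleanup would go here */
    return 1;
}
```

If the "received SIGTERM" line never shows up in the job's stderr after a
qdel, the signal is not reaching the process at all. If it does show up but
the cleanup does not finish, the SIGKILL is probably following too quickly.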

Thank you,
Tim

