[torqueusers] SIGTERM and pbsdsh
tfreeman at mcs.anl.gov
Thu Nov 29 15:13:10 MST 2007
On Thu, 29 Nov 2007 13:43:42 -0800
Garrick Staples <garrick at usc.edu> wrote:
> On Tue, Nov 27, 2007 at 09:52:36AM -0600, Tim Freeman alleged:
> > I am starting the same executable on N nodes using pbsdsh -n. During a
> > qdel, SIGTERM signals do not look like they are propagating to each
> > process, only a SIGKILL from the initial looks of it (there's a SIGTERM
> > handler in the executable that is not getting invoked).
> > The application I'm running greatly benefits from getting to run a cleanup
> > routine if cancelled. Is there an option to pbsdsh or some technique to use
> > where I can make this happen?
> There are 2 common things here. The first is "kill_delay", the queue attribute
> that specifies the time between the initial TERM and the later KILL. The
> default is too short.
> The second is that your top-level shell is catching the TERM signal and
> exiting. You need to ignore the TERM in your batch script.
Hi, thanks. Can you elaborate? The executable passed to pbsdsh is run on each
node. There is no script run first, and this executable has a SIGTERM handler.
Without pbsdsh all is well because the executable does get SIGTERM from the MOM
(this is as a result of qdel) and that's what I want for every copy on each
node when run via pbsdsh.
It is my understanding that the MOM on node #1 treats pbsdsh as the "job" and
pbsdsh turns around and fans out, invoking the real job on each of the nodes in
the group using the task mgmt API (from tm.h).
So pbsdsh gets SIGTERM from the regular MOM job_kill. pbsdsh (in my mind at
least) should then propagate this to each node. The source looks like that is
the attempt (because the specific caught signal int is passed to tm_kill).
But the evidence shows the executables on each node just dying (their SIGTERM
handlers do not get invoked), suggesting to me that tm_kill simply causes
all of them to get a SIGKILL?
[[Also related, no matter the signal sent to pbsdsh, it just exits right after
invoking tm_kill, but this should at the very least (in my mind) cause SIGTERM
to be sent to each node's copy of the executable if that is what is happening.]]