[torqueusers] SIGTERM and pbsdsh
pw at osc.edu
Fri Nov 30 11:07:26 MST 2007
csamuel at vpac.org wrote on Fri, 30 Nov 2007 13:50 +1100:
> On Fri, 30 Nov 2007, Tim Freeman wrote:
> > Hi, thanks. Can you elaborate? The executable passed to pbsdsh is
> > run on each node. There is no script run first, and this
> > executable has a SIGTERM handler. Without pbsdsh all is well
> > because the executable does get SIGTERM from the MOM (this is as a
> > result of qdel) and that's what I want for every copy on each node
> > when run via pbsdsh.
> It would be interesting to compare running the program under pbsdsh to
> running it with Pete Wyckoff's "mpiexec" program using the -comm=none
> option (to stop it doing any MPI setup work).
> i.e. replace:
> pbsdsh /path/to/foo $ARGS
> with:
> mpiexec -comm=none /path/to/foo $ARGS
> I have a memory of Pete having to do some work around this in the last
> 6 months or so (but then my memory does suffer from random bitrot, so
> it's always good to check..)
There is an ancient thread here:
where I complain about the need to ignore SIGTERM in the job script
so that mpiexec can reap over-limit jobs properly. There is/was a
tight loop in Torque's kill_task() that seemed to be the root of the
problem.
Some more hand-wringing over this troublesome loop here:
I have not gone back to the code to see if anything has changed.
Would love to hear that this problem has been fixed. It is the
source of a long-standing regression for mpiexec users.
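
For anyone unfamiliar with the workaround, ignoring SIGTERM in the job
script can be sketched roughly as follows. This is only a minimal
illustration under my own assumptions, not text from the threads
mentioned above: the inner "sh -c" stands in for the job script so the
sketch can be run on its own, and the mpiexec line is the one quoted
earlier in this message, left commented out.

```shell
#!/bin/sh
# Stand-in for a job script: the inner shell ignores SIGTERM, signals
# itself, and carries on. In a real job the signal would come from the
# MOM (e.g. as a result of qdel), not from the script itself.
out=$(sh -c '
    # Ignore SIGTERM in the script shell so the first signal does not
    # kill it before the launcher has dealt with its tasks.
    trap "" TERM

    # Demonstration only: deliver SIGTERM to this shell.
    kill -TERM $$

    # Reached only because the signal was ignored.
    echo survived

    # A real job script would run the launcher at this point, e.g.:
    #   mpiexec -comm=none /path/to/foo $ARGS
')
echo "$out"
```

Note that an ignored signal stays ignored in child processes unless
they install their own handler, so this only helps when the launched
program sets up a SIGTERM handler itself.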