[torqueusers] SIGTERM and pbsdsh

Pete Wyckoff pw at osc.edu
Fri Nov 30 11:07:26 MST 2007


csamuel at vpac.org wrote on Fri, 30 Nov 2007 13:50 +1100:
> On Fri, 30 Nov 2007, Tim Freeman wrote:
[..]
> > Hi, thanks.  Can you elaborate?  The executable passed to pbsdsh is
> > run on each node.  There is no script run first, and this
> > executable has a SIGTERM handler. Without pbsdsh all is well
> > because the executable does get SIGTERM from the MOM (this is as a
> > result of qdel) and that's what I want for every copy on each node
> > when run via pbsdsh.
> 
> It would be interesting to compare running the program under pbsdsh to 
> running it with Pete Wyckoff's "mpiexec" program using the -comm=none 
> option (to stop it doing any MPI setup work).
> 
> i.e. replace:
> 
> pbsdsh /path/to/foo $ARGS
> 
> with
> 
> mpiexec -comm=none /path/to/foo $ARGS
> 
> I have a memory of Pete having to do some work around this in the last 
> 6 months or so (but then my memory does suffer from random bitrot, so 
> it's always good to check..)

There is an ancient thread here:

http://www.supercluster.org/pipermail/torqueusers/2006-November/004714.html

where I complain about the need to ignore SIGTERM in the job script
so that mpiexec can reap over-limit jobs properly.  There is/was a
tight loop in Torque's kill_task() that seemed to be the root of the
problem.
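
For anyone who has not seen the workaround, here is a minimal sketch of
what "ignore SIGTERM in the job script" means at the shell level.  It is
a standalone demonstration, not Torque or mpiexec code: the two "sh -c"
children stand in for a job script, and the variable names are made up
for illustration.

```shell
#!/bin/sh
# Hypothetical demonstration only; the child shells stand in for a job script.

# Without a trap, the shell dies on SIGTERM before it reaches "echo".
without=$(sh -c 'kill -TERM $$; echo survived' 2>/dev/null || true)

# With SIGTERM ignored, the shell keeps running after the signal arrives.
with=$(sh -c 'trap "" TERM; kill -TERM $$; echo survived')

echo "without trap: '${without}'"
echo "with trap:    '${with}'"
```

In a real job script the equivalent would presumably be a single
trap '' TERM line placed before the mpiexec line.  Note that an ignored
signal stays ignored in child processes unless they install their own
handler, so this says nothing about whether copies started by pbsdsh
ever see the SIGTERM.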

Some more hand-wringing over this troublesome loop here:

http://www.supercluster.org/pipermail/torqueusers/2007-February/005111.html

I have not gone back to the code to see if anything has changed.
Would love to hear that this problem has been fixed.  It is the
source of a long-standing regression for mpiexec users.

		-- Pete

