[torqueusers] Signalling on multi node jobs.
garrick at usc.edu
Tue Sep 20 13:38:55 MDT 2005
We're talking about 2 different things here: killing a job and enabling
What does the PBS specification say about this?
For killing a job, I think we need to get rid of the extra SIGKILLs that
MOM automatically sends after a SIGTERM. The follow-up DeleteJob
request from pbs_server should then SIGKILL everything it can find on
all nodes. With that policy, I think sending SIGTERMs to process groups
on all nodes will work out fine. The top-level script can easily trap
it and stick around long enough to copy output files.
This is a common script on my cluster that doesn't do what the user expects
when the walltime limit is reached:
cp $inputfiles /tmp
cp $outputfiles ~/
The user wants the MPI processes to be killed, letting the script
continue executing long enough to copy the output files.
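For illustration, here is a minimal sketch of a job script that traps
SIGTERM and still copies its output. It is not taken from any real job.
Everything named in it is an assumption made up for the demo: job.sh, the
scratch and home directories, and "sleep 30" standing in for the MPI
launch. The final "kill -TERM" plays the part of the SIGTERM that MOM
would send at the walltime limit.

```shell
#!/bin/sh
# Demo harness: all paths are hypothetical and live in a temp directory.
demo=$(mktemp -d)
mkdir "$demo/scratch" "$demo/home"

# Write the (hypothetical) job script.
cat > "$demo/job.sh" <<'EOF'
#!/bin/sh
# Usage: job.sh SCRATCH_DIR SAVE_DIR
scratch=$1
save=$2
child=

# On SIGTERM, stop the compute step but keep the script itself alive.
on_term() {
    echo "caught SIGTERM" >> "$save/job.log"
    [ -n "$child" ] && kill "$child" 2>/dev/null
}
trap on_term TERM

echo "partial result" > "$scratch/output.dat"

sleep 30 &          # stand-in for the real MPI launch (assumption)
child=$!
wait "$child"       # returns early once the trapped signal arrives

# Still running after the signal, so the output can be saved.
cp "$scratch/output.dat" "$save/"
EOF

# Run the job, then deliver a SIGTERM as a stand-in for MOM's signal.
sh "$demo/job.sh" "$demo/scratch" "$demo/home" &
job=$!
sleep 1             # give the job time to install its trap and start waiting
kill -TERM "$job"
wait "$job"

ls "$demo/home"
```

The handler kills the child itself, so the script behaves the same
whether the signal reaches only the top-level process or the whole
process group. It only helps if nothing sends a SIGKILL before the copy
finishes.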
What if user processes could tell MOM what they want through the TM
interface? Maybe by default a suspend is sent to all process groups,
but then (hypothetically) mpiexec could tell MOM, "please just tell me
about a suspend request, I'll handle it myself."
On Tue, Sep 20, 2005 at 12:59:44PM -0600, Dave Jackson alleged:
> It is a single line change to kill the process group but there was
> some discussion against it so this was shelved for the time being. I
> think one issue was that if MOM signals a process's children, it may prevent
> the parent process from cleanly shutting them down using its own custom
> Happy to roll it in or make it a configurable option.
> On Tue, 2005-09-20 at 11:35 -0700, Garrick Staples wrote:
> > On Mon, Sep 19, 2005 at 10:39:26PM +0200, Roy Dragseth alleged:
> > > Hi.
> > >
> > > On the mpiexec list we have been discussing how to get suspend/resume to work
> > > with mpiexec. I thought that if you send a signal using qsig or whatever it
> > > gets forwarded to all nodes in a job, but that does not seem to be the case.
> > > Only the mother superior receives the signal, is this the intended behaviour?
> > That is the expected behaviour currently. Only MS signals processes.
> > Historically only the "top level" process is signalled (the user's
> > script). Dave was talking about changing that to kill() the entire
> > process group, but I'm not sure if that happened.
Garrick Staples, Linux/HPCC Administrator
University of Southern California