[torqueusers] Signalling on multi node jobs.

garrick garrick at usc.edu
Tue Sep 20 14:20:39 MDT 2005


On Tue, Sep 20, 2005 at 04:07:16PM -0400, Stewart.Samuels at sanofi-aventis.com alleged:
> Garrick,
> 
> We have seen this behaviour with jobs using PVM under TORQUE as well.  Would your solution apply here too?  It sounds like the very same issue.

If you mean that your users want their scripts to live a bit longer
after the job is killed, then yes.  The change would apply to all types
of jobs.
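
A minimal sketch of the trap idea from my earlier message (quoted below),
assuming MOM sends SIGTERM to the script and then leaves it a grace period
instead of following up with SIGKILL right away.  That is the proposed
policy, not what MOM does today.  The sleep command stands in for
"mpiexec blah"; the temporary directories, output.dat, and the self-sent
signal are made up so the example runs on its own.

```shell
#!/bin/sh
# Sketch only: keep the job script alive after SIGTERM so it can copy results.
# Assumption: SIGTERM reaches the script and no SIGKILL follows immediately.

got_term=0
# Install a real handler instead of ignoring the signal (trap '' TERM):
# ignored signals are inherited by children, caught ones are reset to default,
# so the parallel program can still be terminated.
trap 'got_term=1' TERM

scratch=$(mktemp -d)    # hypothetical: stands in for /tmp in the original script
results=$(mktemp -d)    # hypothetical: stands in for ~/

cd "$scratch" || exit 1

# Stand-in for "mpiexec blah": write partial output, then run until killed.
sh -c 'echo partial > output.dat; exec sleep 30' &
child=$!

# Simulate the walltime limit: after 1 second, SIGTERM this script and the child.
( sleep 1; kill -TERM "$$" "$child" ) &

# wait returns non-zero when the trapped signal interrupts it.
wait "$child" || true

# The script survived the signal, so the copy step still runs.
cp output.dat "$results"/
echo "got_term=$got_term copied=$(cat "$results/output.dat")"
```

In a real job script only the trap line is new; the cp, cd, and mpiexec
lines stay as they are.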

 
> 	Stewart
> 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org]On Behalf Of Garrick
> Staples
> Sent: Tuesday, September 20, 2005 3:39 PM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Signalling on multi node jobs.
> 
> 
> We're talking about 2 different things here: killing a job and enabling
> suspend/resume.
> 
> What does the PBS specification say about this?
> 
> For killing a job, I think we need to get rid of the extra SIGKILLs that
> MOM automatically sends after a SIGTERM.  The follow-up DeleteJob
> request from pbs_server should then SIGKILL everything it can find on
> all nodes.  With that policy, I think sending SIGTERMs to process groups
> on all nodes will work out fine.  The top-level script can easily trap
> it and stick around long enough to copy output files.
> 
> This is a common script on my cluster that doesn't do what the user expects
> when the walltime limit is reached:
> 
> cp $inputfiles /tmp
> cd /tmp
> mpiexec blah
> cp $outputfiles ~/
> 
> The user wants the MPI processes to be killed, letting the script
> continue executing long enough to copy the output files.
> 
> 
> What if user processes could tell MOM what they want through the TM
> interface?  Maybe by default a suspend is sent to all process groups,
> but then (hypothetically) mpiexec could tell MOM, "please just tell me
> about a suspend request, I'll handle it myself."
> 
> 
> On Tue, Sep 20, 2005 at 12:59:44PM -0600, Dave Jackson alleged:
> > Garrick,
> > 
> >   It is a single line change to kill the process group but there was
> > some discussion against it so this was shelved for the time being.  I
> > think one issue was that if MOM signals a process's children, it may prevent
> > the parent process from cleanly shutting them down using its own custom
> > method.
> > 
> >   Happy to roll it in or make it a configurable option.
> > 
> > Dave
> > 
> > On Tue, 2005-09-20 at 11:35 -0700, Garrick Staples wrote:
> > > On Mon, Sep 19, 2005 at 10:39:26PM +0200, Roy Dragseth alleged:
> > > > Hi.
> > > > 
> > > > On the mpiexec list we have been discussing how to get suspend/resume to work
> > > > with mpiexec.  I thought that a signal sent using qsig or whatever
> > > > gets forwarded to all nodes in a job, but that does not seem to be the case.
> > > > Only the mother superior receives the signal.  Is this the intended behaviour?
> > > 
> > > That is the expected behaviour currently.  Only MS signals processes.
> > > Historically only the "top level" process is signalled (the user's
> > > script).  Dave was talking about changing that to kill() the entire
> > > process group, but I'm not sure if that happened.
> > > 
> > 
> 
> -- 
> Garrick Staples, Linux/HPCC Administrator
> University of Southern California

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California