[torqueusers] Re: kill_delay

Roy Dragseth roy.dragseth at cc.uit.no
Tue Feb 27 14:20:34 MST 2007


On Tuesday 27 February 2007, Pete Wyckoff wrote:
> Roy.Dragseth at cc.uit.no wrote on Tue, 27 Feb 2007 10:50 +0100:
> > After some tinkering with the code I've come to the conclusion that the
> > kill loop makes a lot of sense for parallel jobs, as you want to give an
> > mpi launcher the time to clean up before it is killed with an untrappable
> > signal. The loop is only executed on a SIGKILL.  The annoying delay
> > should be fixed by doing a fork.
>
> Apologies in advance if I'm not paying enough attention to Torque
> development lately.  The kill loop in mom appears to be the source
> of a regression in Torque that affects mpiexec users:
>
> http://www.supercluster.org/pipermail/torqueusers/2006-November/004714.html
>
> Do things work properly now so that a parallel job launcher gets
> the obit signals and can clean up?  If so, I'll be happy to remove
> that issue from the list.  Thanks,

I haven't touched the kill_task() code.

I'm voting for removing the signal loop in kill_task() altogether (or ifdef-ing 
it out). We should rely on the server side to take care of the correct delays 
for job cleanup. We should let SIGKILL be just that: an untrappable, certain 
death with no cleanup, as it is intended to be.

But should the mom's watchdog functionality be removed too?  If we do not 
remove it, we definitely need (in my opinion) a configurable mom_kill_delay 
parameter.
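
To make the alternative concrete, here is a rough sketch of a forked kill 
with a configurable grace period. This is not the kill_task() code and I have 
not tried it inside mom; the function names are made up for illustration, and 
delay_sec only stands in for the proposed mom_kill_delay value. The point is 
that the caller returns immediately while a helper process does the waiting.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, NOT Torque's kill_task(): send SIGTERM, poll for up
 * to delay_sec seconds, then fall back to SIGKILL.
 * Returns 0 if the target was gone before the delay expired, 1 if SIGKILL
 * had to be sent, -1 on error.
 * Caveat: kill(pid, 0) still succeeds for an unreaped zombie, so this
 * assumes the target's parent reaps exited children promptly. */
static int term_then_kill(pid_t pid, int delay_sec)
{
    assert(delay_sec >= 0);

    if (kill(pid, SIGTERM) == -1)
        return -1;

    for (int i = 0; i < delay_sec; i++) {
        if (kill(pid, 0) == -1)
            return 0;           /* target no longer exists */
        sleep(1);
    }

    if (kill(pid, 0) == -1)
        return 0;

    return kill(pid, SIGKILL) == -1 ? -1 : 1;
}

/* Hypothetical wrapper: run the delay in a forked helper so the caller
 * does not block. Returns the helper's pid, or -1 if fork() failed.
 * The helper's exit status carries the result of term_then_kill()
 * (an error shows up as exit status 2). */
static pid_t kill_in_background(pid_t pid, int delay_sec)
{
    pid_t helper = fork();

    if (helper == 0) {
        int rc = term_then_kill(pid, delay_sec);
        _exit(rc < 0 ? 2 : rc);
    }
    return helper;
}
```

The caller would still have to reap the helper (waitpid() or a SIGCHLD 
handler), and a real implementation would need to guard against the pid 
being reused during the delay.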

r.

-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
              phone:+47 77 64 41 07, fax:+47 77 64 41 00
     Roy Dragseth, High Performance Computing System Administrator
         Direct call: +47 77 64 62 56. email: royd at cc.uit.no

