[torqueusers] Re: kill_delay
Roy Dragseth
roy.dragseth at cc.uit.no
Tue Feb 27 14:20:34 MST 2007
On Tuesday 27 February 2007, Pete Wyckoff wrote:
> Roy.Dragseth at cc.uit.no wrote on Tue, 27 Feb 2007 10:50 +0100:
> > After some tinkering with the code I've come to the conclusion that the
> > kill loop makes a lot of sense for parallel jobs, as you want to give an
> > mpi launcher the time to clean up before it is killed with an untrappable
> > signal. The loop is only executed on a SIGKILL. The annoying delay
> > should be fixed by doing a fork.
>
> Apologies in advance if I'm not paying enough attention to Torque
> development lately. The kill loop in mom appears to be the source
> of a regression in Torque that affects mpiexec users:
>
> http://www.supercluster.org/pipermail/torqueusers/2006-November/004714.html
>
> Do things work properly now so that a parallel job launcher gets
> the obit signals and can clean up? If so, I'll be happy to remove
> that issue from the list. Thanks,
I haven't touched the kill_task() code.
I'm voting for removing the signal loop in kill_task altogether (or ifdef-ing
it out). We should rely on the server side to take care of the correct delays
for job cleanup. We should let SIGKILL be just that: an untrappable, certain
death with no cleanup, as it is intended to be.
But should the mom's watchdog functionality be removed too? If we do not
remove it, we (in my opinion) definitely need a configurable mom_kill_delay
parameter.
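
To make the idea concrete, here is a rough sketch of what a configurable
delay combined with a fork could look like. This is not the kill_task() code
from TORQUE. The function names term_then_kill and kill_in_background are
made up for illustration, and delay_sec stands in for a mom_kill_delay
setting. It assumes plain POSIX signals and that the parent of the target
process reaps it, since an unreaped zombie still counts as existing.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not TORQUE code: ask pid to exit with SIGTERM,
 * poll for up to delay_sec seconds, then fall back to SIGKILL.
 * Returns 0 if the process was gone before SIGKILL was needed,
 * 1 if SIGKILL was sent, -1 on any other error. */
static int term_then_kill(pid_t pid, int delay_sec)
{
    struct timespec tick = { 0, 100 * 1000 * 1000 }; /* 100 ms per poll */
    int i;

    if (kill(pid, SIGTERM) == -1)
        return (errno == ESRCH) ? 0 : -1;

    for (i = 0; i < delay_sec * 10; i++) {
        /* Signal 0 sends nothing; it only checks that pid still exists. */
        if (kill(pid, 0) == -1 && errno == ESRCH)
            return 0;
        nanosleep(&tick, NULL);
    }

    if (kill(pid, SIGKILL) == -1)
        return (errno == ESRCH) ? 0 : -1;
    return 1;
}

/* Run the wait in a forked helper so the caller (for example a daemon's
 * main loop) is not blocked for delay_sec seconds.
 * Returns the helper's pid, or -1 if fork failed. */
static pid_t kill_in_background(pid_t pid, int delay_sec)
{
    pid_t helper = fork();

    if (helper == 0)
        _exit(term_then_kill(pid, delay_sec) < 0 ? 2 : 0);
    return helper;
}
```

The caller still has to reap the helper with waitpid() at some point. With
delay_sec set to 0 the helper sends SIGTERM and then SIGKILL with no wait in
between.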
r.
--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, High Performance Computing System Administrator
Direct call: +47 77 64 62 56. email: royd at cc.uit.no