[torqueusers] killing over limit jobs is unfriendly to mpiexec

Åke Sandgren ake.sandgren at hpc2n.umu.se
Fri Nov 24 00:45:03 MST 2006


On Fri, 2006-11-24 at 08:19 +0100, Åke Sandgren wrote:
> On Thu, 2006-11-23 at 15:48 -0500, Pete Wyckoff wrote:
> > Mpiexec catches this second SIGTERM and just exits, abandoning any
> > tasks.  The thought was that when users hit ctrl-c, it tries to
> > clean tasks up nicely, but if the batch system has hosed itself, a
> > second tap of ctrl-c will force mpiexec to exit.  If I were to
> > ignore future SIGTERMs, users would have to hit ctrl-z, then "kill
> > -9" the process to get it to go away.
> 
> BTW, ctrl-c sends SIGINT, not SIGTERM...
> 
> The only way out of this that I can see so far is if TM-based processes
> that need to do things, like mpiexec, could be registered in mom in such a
> way that as long as any such processes are left, it refrains from
> running scan_for_terminated when it sees termin_child.

OpenMPI should face the same type of problem...

So fixing this in torque somehow would be the preferred way...
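
For anyone following along, here is a rough sketch of the two-stage
signal handling Pete describes above: the first signal asks for a clean
shutdown, a second one forces an exit and abandons the tasks. This is
not mpiexec's actual code. The class name, the force_exit callback and
the cleanup_requested flag are all made up for illustration.

```python
import signal


class TwoStageShutdown:
    """First signal requests a clean shutdown; a second one forces exit."""

    def __init__(self, force_exit):
        # force_exit is a hypothetical callback, e.g. os._exit in real use.
        self.force_exit = force_exit
        self.cleanup_requested = False

    def handle(self, signum, frame):
        if not self.cleanup_requested:
            # First signal: ask the main loop to clean up tasks nicely.
            self.cleanup_requested = True
        else:
            # Second signal: give up on cleanup and exit, abandoning tasks.
            self.force_exit(1)

    def install(self):
        # ctrl-c delivers SIGINT; the case discussed here is SIGTERM.
        for sig in (signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self.handle)
```

With a handler shaped like this, a second SIGTERM that arrives before
cleanup has finished takes the forced-exit branch, which is the
behaviour being discussed in this thread.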

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se