[torqueusers] killing over limit jobs is unfriendly to mpiexec

Garrick Staples garrick at clusterresources.com
Fri Nov 24 19:47:45 MST 2006

On Thu, Nov 23, 2006 at 03:48:30PM -0500, Pete Wyckoff alleged:
> I'm trying to figure out why mpiexec isn't catching the exit
> statuses of the tasks when a job goes over a limit, like walltime.
> Mom sends SIGTERM to all the processes in the job.  Mpiexec catches
> the signal, sends tm_kill() to all tasks and waits for them to exit.
> The top-level shell, meanwhile, does not catch the signal, and
> exits.  This triggers code in scan_for_terminated to mark the task
> TI_STATE_EXITED and to send another SIGTERM to all the remaining
> tasks.
> Mpiexec catches this second SIGTERM and just exits, abandoning any
> tasks.  The thought was that when users hit ctrl-c, it tries to
> clean tasks up nicely, but if the batch system has hosed itself, a
> second tap of ctrl-c will force mpiexec to exit.  If I were to
> ignore future SIGTERMs, users would have to hit ctrl-z, then "kill
> -9" the process to get it to go away.

mpiexec can keep trapping SIGTERM, however many times it arrives, and
still exit on a second SIGINT.  ctrl-c sends SIGINT, not SIGTERM, so
users keep their double-tap escape.
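
To illustrate the split, here is a rough sketch in shell.  It is only a
stand-in for the policy, not mpiexec's actual code, and every name in it
(term_seen, int_seen, force_exit, on_term, on_int) is made up for the
example.

```shell
#!/bin/sh
# Sketch only: a shell stand-in for the signal policy, not mpiexec's code.
# All variable and function names below are hypothetical.

term_seen=0    # number of SIGTERMs received so far
int_seen=0     # number of SIGINTs received so far
force_exit=0   # becomes 1 once the user has hit ctrl-c twice

# SIGTERM (from mom, possibly more than once): count it and keep waiting
# for the tasks to report their exit statuses.
on_term() {
    term_seen=$((term_seen + 1))
}

# SIGINT (ctrl-c): the first one asks for a clean shutdown, the second
# one gives up on the batch system and forces an exit.
on_int() {
    int_seen=$((int_seen + 1))
    if [ "$int_seen" -ge 2 ]; then
        force_exit=1
    fi
}

trap on_term TERM
trap on_int INT
```

A wait loop would then keep collecting task exits until either all tasks
are done or force_exit is set, so repeated SIGTERMs from mom never cause
tasks to be abandoned.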

> However, I can hack/fix mpiexec to keep waiting across the second
> SIGTERM, but it still does not get the proper TM obit messages,
> because mom's scan_for_exiting() sets ptask->ti_fd to -1.  This
> causes task_check() to complain "cannot tm_reply to task 1" rather
> than send the TM message.  Commenting out that set of ti_fd does
> not change the behavior, because kill_task() sits in a tight loop
> for 4 sec waiting for the task to die rather than delivering the
> queued up obits.  Eventually everything dies with SIGKILL.

I'm not happy with that tight loop either, but mostly because it becomes
painfully slow when you have very large numbers of tasks and each one
exits slowly.

And of course, the eventual SIGKILL can be delayed with the kill_delay
attribute on the queue or server.
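
For reference, kill_delay is set through qmgr.  A sketch, assuming a
queue named "batch" and a 30 second delay (both are made-up values for
illustration, pick whatever suits the site):

```shell
# Server-wide: wait 30 seconds between SIGTERM and SIGKILL.
qmgr -c "set server kill_delay = 30"

# Or per queue ("batch" is a hypothetical queue name).
qmgr -c "set queue batch kill_delay = 30"
```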

> Everything does work nicely, though, if I ignore the SIGTERM in the
> top-level shell:
>     trap "echo Job shell caught TERM, ignoring >&2" TERM
>     mpiexec a.out
> Works brilliantly, unmodified.  But I'd hate to force users to do
> this to get the right behavior.
> Any ideas how to fix this in torque?  That loop in kill_task() is
> new compared to good-old PBS.  I'm fishing for thoughts at this
> point.  The behavior can always be papered over by not reporting
> exit values when they are missing, but a clean solution would be
> better.

This is tough because the current behaviour was added to solve a real
problem: it kills uncooperative jobs more quickly.
