[torqueusers] killing over limit jobs is unfriendly to mpiexec

Pete Wyckoff pw at osc.edu
Thu Nov 23 13:48:30 MST 2006

I'm trying to figure out why mpiexec isn't catching the exit
statuses of the tasks when a job goes over a limit, like walltime.

Mom sends SIGTERM to all the processes in the job.  Mpiexec catches
the signal, sends tm_kill() to all tasks and waits for them to exit.

The top-level shell, meanwhile, does not catch the signal, and
exits.  This triggers code in scan_for_terminated to mark the task
TI_STATE_EXITED and to send another SIGTERM to all the remaining

Mpiexec catches this second SIGTERM and just exits, abandoning any
tasks.  The thought was that when users hit ctrl-c, it tries to
clean tasks up nicely, but if the batch system has hosed itself, a
second tap of ctrl-c will force mpiexec to exit.  If I were to
ignore future SIGTERMs, users would have to hit ctrl-z, then "kill
-9" the process to get it to go away.
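
The two-stage policy described above can be sketched in a few lines
of shell. This is only an illustration of the idea, not mpiexec's
actual code: the names on_term, term_count, and action are made up
here, and the "cleanup" branch merely stands in for calling
tm_kill() on every task and continuing to wait.

```shell
#!/bin/sh
# Hypothetical sketch: first TERM means "clean up and keep waiting",
# second TERM means "give up and abandon the tasks".
term_count=0
action=""

on_term() {
    term_count=$((term_count + 1))
    if [ "$term_count" -eq 1 ]; then
        action="cleanup"    # stand-in for tm_kill() on all tasks, then wait
    else
        action="abandon"    # stand-in for exiting without waiting
    fi
}
trap on_term TERM

# Deliver two TERMs to this shell, as mom ends up doing in the scenario above.
kill -TERM $$
echo "after first TERM:  $action"
kill -TERM $$
echo "after second TERM: $action"
```

The second signal is indistinguishable from a user's second ctrl-c,
which is why the sketch (like mpiexec) gives up at that point.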

I can hack mpiexec to keep waiting across the second SIGTERM, but
even then it does not get the proper TM obit messages, because
mom's scan_for_exiting() sets ptask->ti_fd to -1.  This
causes task_check() to complain "cannot tm_reply to task 1" rather
than send the TM message.  Commenting out that set of ti_fd does
not change the behavior, because kill_task() sits in a tight loop
for 4 sec waiting for the task to die rather than delivering the
queued up obits.  Eventually everything dies with SIGKILL.

Everything does work nicely, though, if I ignore the SIGTERM in the
top-level shell:

    trap "echo Job shell caught TERM, ignoring >&2" TERM
    mpiexec a.out

With that trap in place, an unmodified mpiexec works brilliantly.
But I'd hate to force users to do this to get the right behavior.
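
The reason the trap helps can be seen without Torque at all. In the
sketch below, which is only a stand-in and not a real job script, a
foreground child plays the role of "mpiexec a.out": it sends TERM to
the parent shell (as mom would), takes a moment to finish, and exits
with an arbitrary status of 7. Because the shell traps TERM instead
of dying, it is still there to collect that status.

```shell
#!/bin/sh
# The job shell survives TERM instead of exiting.
trap 'echo "Job shell caught TERM, ignoring" >&2' TERM

# Hypothetical stand-in for "mpiexec a.out": signal the parent shell,
# pretend to clean up for a second, then exit with status 7.
sh -c 'kill -TERM "$1"; sleep 1; exit 7' sh $$
status=$?

echo "child exit status: $status"
```

The shell runs the trap only after the foreground command has
finished, so the child's exit status is preserved in $?.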

Any ideas how to fix this in torque?  That loop in kill_task() is
new compared to good-old PBS.  I'm fishing for thoughts at this
point.  The behavior can always be papered over by not reporting
exit values when they are missing, but a clean solution would be

		-- Pete
