[torqueusers] sending SIGINT to all nodes in a multi-node job
wyckoff at yahoo-inc.com
Wed Jul 18 21:44:13 MDT 2007
Unfortunately, we're not running MPI, but rather Hadoop, an open source
implementation of Map/Reduce. And it has no such capability :(
From: Garrick Staples [mailto:garrick at usc.edu]
Sent: Wednesday, July 18, 2007 8:21 PM
To: Peter Wyckoff
Cc: torqueusers at supercluster.org; Mahadev Konar; Christopher Zimmerman
Subject: Re: [torqueusers] sending SIGINT to all nodes in a multi-node job
On Wed, Jul 18, 2007 at 07:50:37PM -0700, Peter Wyckoff alleged:
> We want all the nodes in a multi-node computation to let the parent
> process for a job get a SIGINT (at least) kill_delay seconds before
> SIGKILL. Right now it seems that only the MS (aka head) node gets
> I don't know how we can then send a signal to all the sister nodes?
> Moab, maybe pbsdsh mjobctl -s 15 <jobid> from the MS. But, without
> there's no way to signal a to the parent processes on the other nodes,
> is there?
> This thread in 05 said this was expected behavior. I just don't know
> given this I can cleanly shut down the sister moms ???
> Thanks, pete
Everything in that email was hypothetical, except for where I said,
"That is the expected behaviour currently. Only MS signals processes."
For your case, just have the batch script trap SIGINT and then do
whatever you want. A 'pbsdsh kill $pid' wouldn't work out because you
won't have a single pid number.
But mpiexec could certainly forward the SIGINT to its TM tasks.
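
As a rough illustration of the "trap SIGINT in the batch script" suggestion, here is a minimal bash sketch. Nothing in it comes from TORQUE itself: the names start_workload, on_sigint and WORKLOAD_PID are hypothetical, and the comment inside on_sigint marks where site-specific cleanup for the other nodes would go. It assumes the workload is started in the background so the shell stays free to run the trap handler.

```shell
#!/bin/bash
# Sketch only: start_workload, on_sigint and WORKLOAD_PID are
# hypothetical names, not part of TORQUE.

WORKLOAD_PID=""

start_workload() {
    # Run the given command in the background and remember its pid, so
    # the shell itself remains able to react to signals.
    "$@" &
    WORKLOAD_PID=$!
}

on_sigint() {
    echo "SIGINT received, shutting down" >&2
    # Site-specific cleanup for the other nodes would go here.
    if [ -n "$WORKLOAD_PID" ]; then
        # Background children of a non-interactive shell ignore SIGINT,
        # so pass SIGTERM on to the workload instead.
        kill -TERM "$WORKLOAD_PID" 2>/dev/null || true
        # Reap the child; its non-zero exit status is expected here.
        wait "$WORKLOAD_PID" 2>/dev/null || true
    fi
    return 0
}

# Install the handler for SIGINT.
trap on_sigint INT
```

A script built on this would call start_workload with its real launcher command and then run wait "$WORKLOAD_PID". The wait builtin returns as soon as a trapped signal arrives, which lets the handler run promptly instead of only after the workload exits on its own.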
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California
Please avoid sending me Word or PowerPoint attachments.