[torqueusers] sending SIGINT to all nodes in a multi-node job

Peter Wyckoff wyckoff at yahoo-inc.com
Wed Jul 18 21:44:13 MDT 2007


Hi Garrick,

Unfortunately, we're not running MPI, but rather Hadoop, an open source
implementation of Map/Reduce.  And it has no such capability :(

Thanks, pete

-----Original Message-----
From: Garrick Staples [mailto:garrick at usc.edu] 
Sent: Wednesday, July 18, 2007 8:21 PM
To: Peter Wyckoff
Cc: torqueusers at supercluster.org; Mahadev Konar; Christopher Zimmerman
Subject: Re: [torqueusers] sending SIGINT to all nodes in a multi-node job

On Wed, Jul 18, 2007 at 07:50:37PM -0700, Peter Wyckoff alleged:
> 
> We want all the nodes in a multi-node computation to let the parent
> process for a job get a SIGINT (at least) kill_delay seconds before the
> SIGKILL. Right now it seems that only the MS (aka head) node gets this.
> 
> I don't know how we can then send a signal to all the sister nodes.
> With Moab, maybe pbsdsh mjobctl -s 15 <jobid> from the MS. But without
> Moab, there's no way to signal the parent processes on the other nodes,
> is there?
> 
> This thread in 05 said this was expected behavior. I just don't know
> how, given this, I can cleanly shut down the sister moms.
> 
> Thanks, pete
> 
> http://www.clusterresources.com/pipermail/torqueusers/2005-September/002156.html

Everything in that email was hypothetical, except for where I said,
"That is the expected behaviour currently.  Only MS signals processes."

For your case, just have the batch script trap SIGINT and then do
whatever you want.  A 'pbsdsh kill $pid' wouldn't work out because you
won't have a single pid number.
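
A minimal sketch of that trap approach, not taken from any existing
script. FANOUT and STOP_CMD are hypothetical variables invented for this
example: on a real job FANOUT might be set to "pbsdsh" so the stop command
runs on the sister nodes, and STOP_CMD would be whatever stops your
per-node daemons by name or pid file rather than by a single pid. The
defaults just echo locally so the script can be tried outside a job.

```shell
#!/bin/sh
# Sketch: batch script that traps SIGINT and runs a cleanup step.

# Hypothetical: set FANOUT=pbsdsh inside a real TORQUE job to run the
# stop command on the job's nodes. Empty means "run locally".
FANOUT=${FANOUT:-}

# Hypothetical placeholder for the command that stops per-node workers.
STOP_CMD=${STOP_CMD:-"echo stopping workers"}

cleaned_up=0

on_sigint() {
    cleaned_up=1
    # Unquoted on purpose so both variables split into command words.
    $FANOUT $STOP_CMD
}

# Run the handler when the batch script itself receives SIGINT.
trap on_sigint INT

# Stand-in for the real workload. Running it in the background and
# waiting lets the shell act on the trap as soon as the signal arrives.
sleep 1 &
work_pid=$!
wait "$work_pid"
```

The workload runs in the background because a shell only runs a trap
handler once the current foreground command has finished, whereas a
signal interrupts wait right away.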

But mpiexec could certainly forward the SIGINT to its TM tasks.

-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html

