[torqueusers] Removing processes after a job is killed
Corey Ferrier coreyf at CLEMSON.EDU
Fri Jul 4 06:54:11 MDT 2008
On Fri, Jul 04, 2008 at 01:05:26AM -0400, Glen Beane wrote:
>On Fri, Jul 4, 2008 at 1:00 AM, Garrick Staples <garrick at usc.edu> wrote:
>> On Fri, Jul 04, 2008 at 12:27:45AM -0400, Glen Beane alleged:
>> > On Wed, Jul 2, 2008 at 2:07 PM, Corey Ferrier <coreyf at clemson.edu> wrote:
>> > > On Wed, Jul 02, 2008 at 10:39:23AM -0700, David Sheen wrote:
>> > > >The parallel programming environments we use (e.g. MPICH) use SSH to
>> > > >create processes on the sister nodes. If these jobs fail (are
>> > > >deleted, the mother node crashes, etc), the spawned processes remain
>> > > >on the sisters and eventually someone has to go and clean them out.
>> > > >Is there any way to use epilogue scripts to keep track of these
>> > > >processes and make sure they get killed properly if they need to be?
>> > > >
>> > >
>> > > Because we do not place execution limits on nodes
>> > > (users can have multiple jobs running on the same node
>> > > and multiple users can be using the same node),
>> > > we use an epilogue script which calls another script
>> > > to clean up leftover processes based on the jobid.
>> > >
>> > such a script should be unnecessary if you use a TM-based job launcher for
>> > whatever flavor of MPI you use, but I guess it doesn't hurt
>> Would the script even work? Remote shell processes won't have PBS_JOBID in their environment.
>That's true, it probably only finds PBS_JOBID in the environment if it was
>launched with TM in the first place!
Yes, this script cleans up leftovers based on the jobid. I didn't read
the original post closely enough to see that the jobid wouldn't be
there when using ssh. Sorry about that!
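
For reference, here is a minimal sketch of the kind of cleanup helper an
epilogue could call. It assumes TORQUE passes the job id as the first
epilogue argument and that leftover processes still carry PBS_JOBID in
/proc/<pid>/environ; the details are illustrative, not our exact
production script:

    #!/bin/sh
    # Hypothetical cleanup helper invoked from the TORQUE epilogue.
    # $1 is the job id that TORQUE passes as the first epilogue argument.
    JOBID="$1"
    [ -n "$JOBID" ] || exit 0

    for envfile in /proc/[0-9]*/environ; do
        pid=$(echo "$envfile" | cut -d/ -f3)
        # don't kill ourselves
        [ "$pid" = "$$" ] && continue
        # environ is NUL-separated; look for PBS_JOBID=<this job>
        if tr '\0' '\n' < "$envfile" 2>/dev/null | \
           grep -q "^PBS_JOBID=${JOBID}"; then
            kill -TERM "$pid" 2>/dev/null
        fi
    done

As Garrick points out above, processes started over plain ssh never
inherit PBS_JOBID, so a check like this only catches processes launched
through TM; on nodes dedicated to a single job you would have to fall
back to matching on the user name instead.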
Corey Ferrier coreyf at clemson.edu
HPC Group, CCIT, Clemson University 864-656-2790
340 Computer Court, Anderson, SC, USA 29625