[torqueusers] Removing processes after a job is killed

Corey Ferrier coreyf at CLEMSON.EDU
Fri Jul 4 06:54:11 MDT 2008


On Fri, Jul 04, 2008 at 01:05:26AM -0400, Glen Beane wrote:
>On Fri, Jul 4, 2008 at 1:00 AM, Garrick Staples <garrick at usc.edu> wrote:
>
>> On Fri, Jul 04, 2008 at 12:27:45AM -0400, Glen Beane alleged:
>> > On Wed, Jul 2, 2008 at 2:07 PM, Corey Ferrier <coreyf at clemson.edu>
>> wrote:
>> >
>> > > On Wed, Jul 02, 2008 at 10:39:23AM -0700, David Sheen wrote:
>> > > >The parallel programming environments we use (e.g. MPICH) use SSH to
>> > > >create processes on the sister nodes.  If these jobs fail (are
>> > > >deleted, the mother node crashes, etc), the spawned processes remain
>> > > >on the sisters and eventually someone has to go and clean them out.
>> > > >Is there any way to use epilogue scripts to keep track of these
>> > > >processes and make sure they get killed properly if they need to be?
>> > > >
>> > >
>> > > Because we do not place execution limits on nodes
>> > > (users can have multiple jobs running on the same node
>> > > and multiple users can be using the same node),
>> > > we use an epilogue script which calls another script
>> > > to clean up leftover processes based on the jobid.
>> > >
>> > such a script should be unnecessary if you use a TM-based job launcher for
>> > whatever flavor of MPI you use, but I guess it doesn't hurt
>>
>> Would the script even work?  Remote shell processes won't have PBS_JOBID in
>> the
>> environment.
>
>That's true, it probably only finds PBS_JOBID in the environment if it was
>launched with TM in the first place!

Yes, this script cleans up leftovers based on the jobid.  I didn't read
the original post closely enough to see that the jobid wouldn't be
there when using ssh.  Sorry about that!
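For the archives, here is a minimal sketch of the kind of helper an
epilogue could call (hypothetical names, not our actual script). Per
Glen's point above, it can only find processes that actually carry
PBS_JOBID in their environment, i.e. TM-launched ones -- ssh-spawned
ranks will slip through:

```shell
# job_leftovers <jobid> -- print PIDs of processes whose environment
# contains PBS_JOBID=<jobid>. An epilogue could pipe this to kill.
job_leftovers() {
    jobid="$1"
    for env in /proc/[0-9]*/environ; do
        pid="${env#/proc/}"
        pid="${pid%/environ}"
        # /proc/<pid>/environ is NUL-separated; make it greppable
        if tr '\0' '\n' < "$env" 2>/dev/null \
                | grep -Fqx "PBS_JOBID=$jobid"; then
            echo "$pid"
        fi
    done
}

# In the epilogue itself, something like:
#   job_leftovers "$1" | xargs -r kill -9
```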

- Corey

--
Corey Ferrier                               coreyf at clemson.edu
HPC Group, CCIT, Clemson University               864-656-2790
340 Computer Court, Anderson, SC, USA 29625         

