[torqueusers] Removing processes after a job is killed

Glen Beane glen.beane at gmail.com
Thu Jul 3 23:05:26 MDT 2008


On Fri, Jul 4, 2008 at 1:00 AM, Garrick Staples <garrick at usc.edu> wrote:

> On Fri, Jul 04, 2008 at 12:27:45AM -0400, Glen Beane alleged:
> > On Wed, Jul 2, 2008 at 2:07 PM, Corey Ferrier <coreyf at clemson.edu> wrote:
> >
> > > On Wed, Jul 02, 2008 at 10:39:23AM -0700, David Sheen wrote:
> > > >The parallel programming environments we use (e.g. MPICH) use SSH to
> > > >create processes on the sister nodes.  If these jobs fail (are
> > > >deleted, the mother node crashes, etc), the spawned processes remain
> > > >on the sisters and eventually someone has to go and clean them out.
> > > >Is there any way to use epilogue scripts to keep track of these
> > > >processes and make sure they get killed properly if they need to be?
> > > >
> > >
> > > Because we do not place execution limits on nodes
> > > (users can have multiple jobs running on the same node
> > > and multiple users can be using the same node),
> > > we use an epilogue script which calls another script
> > > to clean up leftover processes based on the jobid.
> > >
> > > Here is the epilogue script, which runs on the mother superior node
> > > and executes as root.
> > >
> > >   #!/bin/bash
> > >   JOBID=$1
> > >   JOBUSER=$2
> > >
> > >   # get nodes involved in this job
> > >   nodelist=/var/spool/torque/aux/$JOBID
> > >   if [ -r "$nodelist" ] ; then
> > >     nodes=$(sort -u "$nodelist")
> > >   else
> > >     nodes=localhost
> > >   fi
> > >
> > >   # for each node involved in the job
> > >   # kill any pids leftover from that job
> > >
> > >   for i in $nodes ; do
> > >     ssh $i "su -c '/var/spool/torque/mom_priv/cleanup $JOBID' $JOBUSER"
> > >   done
> > >
> > >
> > > Here is the 'cleanup' script:
> > >
> > >   #!/bin/bash
> > >
> > >   # look in the /proc process structure and
> > >   # kill all pids associated with the passed in $JOBID
> > >   # this script is run as the user, not root
> > >
> > >   TOKILL=$1
> > >   [ -z "${TOKILL}" ] && exit 1
> > >   ME=$(whoami)
> > >   cd /
> > >   find /proc -noleaf -maxdepth 2 -name environ -user "$ME" 2>/dev/null |
> > >   while read -r x; do
> > >     # the process may have exited since find listed it
> > >     [ -r "$x" ] || continue
> > >     pid=$(basename "$(dirname "$x")")
> > >     # environ is NUL-separated; extract the value without eval
> > >     jobid=$(tr '\0' '\n' 2>/dev/null < "$x" | sed -n 's/^PBS_JOBID=//p' | head -n 1)
> > >     if [ "$jobid" = "$TOKILL" ]; then
> > >       kill -9 "$pid"
> > >     fi
> > >   done
> > >
> > >
> > > - Corey
> >
> >
> >
> > Such a script should be unnecessary if you use a TM-based job launcher for
> > whatever flavor of MPI you use, but I guess it doesn't hurt.
>
> Would the script even work?  Remote shell processes won't have PBS_JOBID in
> the environment.



That's true. It probably only finds PBS_JOBID in the environment if the
process was launched with TM in the first place!
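
One way to check this on a sister node is to look at a process's
environment directly. The following is only a minimal sketch, assuming a
Linux /proc filesystem; jobid_from_environ and check_pid are names made up
for illustration and are not part of TORQUE.

```shell
#!/bin/bash
# Sketch: print the PBS_JOBID recorded in a NUL-separated environ file.
# jobid_from_environ is a hypothetical helper name, not a TORQUE command.
jobid_from_environ() {
  # $1: path to an environ-style file, e.g. /proc/<pid>/environ
  # Only a variable named exactly PBS_JOBID is matched (anchored with ^).
  tr '\0' '\n' 2>/dev/null < "$1" | sed -n 's/^PBS_JOBID=//p' | head -n 1
}

# Report whether a given pid carries a job ID in its environment.
# check_pid is likewise a hypothetical name.
check_pid() {
  local jobid
  jobid=$(jobid_from_environ "/proc/$1/environ")
  if [ -n "$jobid" ]; then
    echo "pid $1: PBS_JOBID=$jobid"
  else
    echo "pid $1: no PBS_JOBID"
  fi
}
```

Running check_pid against a process started over ssh and against one
started through a TM launcher such as pbsdsh should show the difference, if
the reasoning above is right: I would expect only the TM-launched one to
report a job ID.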