[torqueusers] Removing processes after a job is killed

Garrick Staples garrick at usc.edu
Thu Jul 3 23:00:54 MDT 2008


On Fri, Jul 04, 2008 at 12:27:45AM -0400, Glen Beane alleged:
> On Wed, Jul 2, 2008 at 2:07 PM, Corey Ferrier <coreyf at clemson.edu> wrote:
> 
> > On Wed, Jul 02, 2008 at 10:39:23AM -0700, David Sheen wrote:
> > >The parallel programming environments we use (e.g. MPICH) use SSH to
> > >create processes on the sister nodes.  If these jobs fail (are
> > >deleted, the mother node crashes, etc), the spawned processes remain
> > >on the sisters and eventually someone has to go and clean them out.
> > >Is there any way to use epilogue scripts to keep track of these
> > >processes and make sure they get killed properly if they need to be?
> > >
> >
> > Because we do not place execution limits on nodes
> > (users can have multiple jobs running on the same node
> > and multiple users can be using the same node),
> > we use an epilogue script which calls another script
> > to clean up leftover processes based on the jobid.
> >
> > Here is the epilogue script, which runs on the mother superior node
> > and executes as root.
> >
> >   #!/bin/bash
> >   JOBID=$1
> >   JOBUSER=$2
> >
> >   # get nodes involved in this job
> >   nodelist=/var/spool/torque/aux/$JOBID
> >   if [ -r "$nodelist" ] ; then
> >     nodes=$(sort -u "$nodelist")
> >   else
> >     nodes=localhost
> >   fi
> >
> >   # for each node involved in the job
> >   # kill any pids leftover from that job
> >
> >   for i in $nodes ; do
> >     ssh $i "su -c '/var/spool/torque/mom_priv/cleanup $JOBID' $JOBUSER"
> >   done
> >
> >
> > Here is the 'cleanup' script:
> >
> >   #!/bin/bash
> >
> >   # look in the /proc process structure and
> >   # kill all pids associated with the passed in $JOBID
> >   # this script is run as the user, not root
> >
> >   TOKILL=$1
> >   [ -z "${TOKILL}" ] && exit 1
> >   ME=`whoami`
> >   cd /
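> >   find /proc -noleaf -maxdepth 2 -name environ -user "$ME" |
> >   while read -r x; do
> >     # the process may have exited since find listed it
> >     [ -r "$x" ] || continue
> >     pid=$(basename "$(dirname "$x")")
> >     # read PBS_JOBID directly instead of eval'ing environment text,
> >     # and match the exact variable name only
> >     PBS_JOBID=$(tr '\0' '\n' 2>/dev/null < "$x" |
> >                 sed -n 's/^PBS_JOBID=//p' | head -n 1)
> >     if [ "${PBS_JOBID}" = "${TOKILL}" ]; then
> >       kill -9 "$pid"
> >     fi
> >   done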
> >
> >
> > - Corey
> 
> 
> 
> Such a script should be unnecessary if you use a TM-based job launcher for
> whatever flavor of MPI you use, but I guess it doesn't hurt.

Would the script even work?  Remote shell processes won't have PBS_JOBID in the
environment.
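
One way to check this on a compute node is to look at what a process's
environment actually contains. The following is only a minimal sketch,
assuming a Linux /proc filesystem; the function name jobid_from_environ is
made up for illustration and is not part of TORQUE. It prints the PBS_JOBID
value stored in a NUL-separated environ file, or nothing if the variable is
absent, which is what one would expect to see for processes started through
a plain remote shell.

```shell
#!/bin/bash
# Hypothetical helper (not a TORQUE tool): print the value of PBS_JOBID
# found in a NUL-separated environ file such as /proc/<pid>/environ.
# Prints nothing if the variable is absent or the file cannot be read.
jobid_from_environ() {
  tr '\0' '\n' 2>/dev/null < "$1" |
    sed -n 's/^PBS_JOBID=//p' |   # exact name only, not e.g. PBS_JOBID_FOO
    head -n 1
}

# Example: inspect the current shell. Inside a job started by pbs_mom this
# should print the job id; in an ordinary login or ssh shell it prints nothing.
jobid_from_environ "/proc/$$/environ"
```

Running it against the pid of a leftover MPI process on a sister node (as
that process's owner or as root) would show whether a cleanup keyed on
PBS_JOBID could ever match it.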

-- 
Garrick Staples, GNU/Linux HPCC SysAdmin
University of Southern California

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html