[torqueusers] Removing processes after a job is killed

Glen Beane glen.beane at gmail.com
Thu Jul 3 22:27:45 MDT 2008


On Wed, Jul 2, 2008 at 2:07 PM, Corey Ferrier <coreyf at clemson.edu> wrote:

> On Wed, Jul 02, 2008 at 10:39:23AM -0700, David Sheen wrote:
> >The parallel programming environments we use (e.g. MPICH) use SSH to
> >create processes on the sister nodes.  If these jobs fail (are
> >deleted, the mother node crashes, etc), the spawned processes remain
> >on the sisters and eventually someone has to go and clean them out.
> >Is there any way to use epilogue scripts to keep track of these
> >processes and make sure they get killed properly if they need to be?
> >
>
> Because we do not place execution limits on nodes
> (users can have multiple jobs running on the same node
> and multiple users can be using the same node),
> we use an epilogue script which calls another script
> to clean up leftover processes based on the jobid.
>
> Here is the epilogue script, which runs on the mother superior node
> and executes as root.
>
>   #!/bin/bash
>   JOBID=$1
>   JOBUSER=$2
>
>   # get nodes involved in this job
>   nodelist=/var/spool/torque/aux/$JOBID
>   if [ -r "$nodelist" ] ; then
>     nodes=$(sort -u "$nodelist")
>   else
>     nodes=localhost
>   fi
>
>   # for each node involved in the job
>   # kill any pids leftover from that job
>
>   for i in $nodes ; do
>     ssh $i "su -c '/var/spool/torque/mom_priv/cleanup $JOBID' $JOBUSER"
>   done
>
>
> Here is the 'cleanup' script:
>
>   #!/bin/bash
>
>   # look in the /proc process structure and
>   # kill all pids associated with the passed in $JOBID
>   # this script is run as the user, not root
>
>   TOKILL=$1
>   [ -z "${TOKILL}" ] && exit 1
>   ME=$(whoami)
>   cd /
>   find /proc -noleaf -maxdepth 2 -name environ -user "$ME" |
>   while read -r x; do
>     # the process may have exited since find listed it
>     [ -r "$x" ] || continue
>     pid=$(basename "$(dirname "$x")")
>     # read PBS_JOBID from the NUL-separated environment without eval
>     PBS_JOBID=$(tr '\0' '\n' < "$x" | sed -n 's/^PBS_JOBID=//p' | head -n 1)
>     if [ "${PBS_JOBID}" = "${TOKILL}" ]; then
>       kill -9 "$pid"
>     fi
>   done
>
>
> - Corey
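
Before trusting a cleanup script with kill -9, the matching logic can be
checked with a dry run. What follows is a minimal sketch, not part of the
scripts quoted above: list_job_pids is a hypothetical helper name, and its
optional second argument (a directory to scan instead of /proc) is an
assumption added only so the function can be tried against a fake directory
tree. It prints the matching PIDs and kills nothing.

```shell
#!/bin/bash
# list_job_pids is a hypothetical helper, not part of TORQUE.
# Usage: list_job_pids JOBID [PROC_ROOT]
# Prints the PID of every process under PROC_ROOT (default /proc) whose
# initial environment contains exactly PBS_JOBID=JOBID. Nothing is killed.
list_job_pids() {
  local jobid=$1 proc_root=${2:-/proc} envfile pid
  [ -n "$jobid" ] || return 1
  for envfile in "$proc_root"/[0-9]*/environ; do
    # Skip entries we cannot read (other users' processes, or processes
    # that exited after the glob was expanded).
    [ -r "$envfile" ] || continue
    # environ holds NUL-separated NAME=value pairs; turn them into lines
    # and require a whole-line, fixed-string match.
    if tr '\0' '\n' < "$envfile" | grep -qxF "PBS_JOBID=$jobid"; then
      pid=${envfile%/environ}
      echo "${pid##*/}"
    fi
  done
}
```

Run as the job owner on a sister node, list_job_pids "$JOBID" should show
what a cleanup pass would target. Unlike the cleanup script, this sketch
does not filter by owner; it simply skips environ files it cannot read.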



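Such a script should be unnecessary if you use a TM-based job launcher for
whatever flavor of MPI you use, since processes started through TM are
spawned by pbs_mom on each sister node and are killed along with the job.
Still, I guess it doesn't hurt.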