[torqueusers] Removing processes after a job is killed

Corey Ferrier coreyf at CLEMSON.EDU
Wed Jul 2 12:07:54 MDT 2008


On Wed, Jul 02, 2008 at 10:39:23AM -0700, David Sheen wrote:
>The parallel programming environments we use (e.g. MPICH) use SSH to
>create processes on the sister nodes.  If these jobs fail (are
>deleted, the mother node crashes, etc), the spawned processes remain
>on the sisters and eventually someone has to go and clean them out.
>Is there any way to use epilogue scripts to keep track of these
>processes and make sure they get killed properly if they need to be?
>

Our nodes are not allocated exclusively (a user can have several
jobs running on the same node, and several users can share a node),
so we cannot simply kill everything a user owns on a node when a
job ends.  Instead, we use an epilogue script which calls a second
script to clean up leftover processes based on the job ID.

Here is the epilogue script, which runs on the mother superior node 
and executes as root.  

   #!/bin/bash
   # TORQUE passes the job ID as $1 and the job owner as $2
   JOBID=$1
   JOBUSER=$2

   # get nodes involved in this job
   nodelist=/var/spool/torque/aux/$JOBID
   if [ -r "$nodelist" ] ; then
     nodes=$(sort -u "$nodelist")
   else
     nodes=localhost
   fi

   # for each node involved in the job,
   # kill any pids left over from that job

   for i in $nodes ; do
     ssh "$i" "su -c '/var/spool/torque/mom_priv/cleanup $JOBID' $JOBUSER"
   done
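
To check the node list step without contacting any nodes, a dry run
along these lines may help.  This is only a sketch: unique_nodes is a
hypothetical helper name and the node file contents are made up.  It
assumes a node may be listed more than once in the file, which is why
the list is de-duplicated.

```shell
#!/bin/bash
# unique_nodes is a hypothetical helper, not part of TORQUE.
# Print each hostname in a node file once; fall back to
# localhost if the file is missing or unreadable.
unique_nodes() {
  if [ -r "$1" ]; then
    sort -u "$1"
  else
    echo localhost
  fi
}

# Dry run against a made-up node file with repeated entries.
sample=$(mktemp)
printf 'node02\nnode01\nnode02\nnode01\n' > "$sample"
for n in $(unique_nodes "$sample"); do
  echo "would clean up on $n"
done
rm -f "$sample"
```

Replacing the echo with the ssh call gives the real loop.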


Here is the 'cleanup' script:

   #!/bin/bash

   # look in the /proc process structure and
   # kill all pids associated with the passed-in job ID
   # this script is run as the user, not root

   TOKILL=$1
   [ -z "${TOKILL}" ] && exit 1
   ME=$(whoami)
   cd /
   find /proc -noleaf -maxdepth 2 -name environ -user "$ME" |
   while read -r x; do
     # the process may have exited since find listed it
     [ -r "$x" ] || continue
     pid=$(basename "$(dirname "$x")")
     # environ holds NUL-separated NAME=value entries; match the exact
     # variable name instead of eval'ing the process environment
     jobid=$(tr '\0' '\n' < "$x" | grep -m 1 '^PBS_JOBID=' | cut -d= -f2-)
     if [ "${jobid}" = "${TOKILL}" ]; then
       kill -9 "$pid"
     fi
   done
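
The per-process check depends on /proc/<pid>/environ being a list of
NUL-separated NAME=value entries.  The lookup can be tried in
isolation against a fake file, as in the sketch below.  The helper
name jobid_from_environ and the sample values are hypothetical; the
sketch assumes GNU grep (for -m) and bash's builtin printf.

```shell
#!/bin/bash
# jobid_from_environ is a hypothetical helper name.
# Print the value of PBS_JOBID from a NUL-separated environ file,
# or nothing if the variable is not set there.
jobid_from_environ() {
  tr '\0' '\n' < "$1" | grep -m 1 '^PBS_JOBID=' | cut -d= -f2-
}

# Fake environ file standing in for /proc/<pid>/environ.
fake=$(mktemp)
printf 'HOME=/home/alice\0OLD_PBS_JOBID=1.old\0PBS_JOBID=1234.master\0' > "$fake"
jobid_from_environ "$fake"    # prints 1234.master
rm -f "$fake"
```

Anchoring the pattern with ^ keeps a variable such as OLD_PBS_JOBID
from being mistaken for the job ID.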


- Corey

--
Corey Ferrier                               coreyf at clemson.edu
HPC Group, CCIT, Clemson University               864-656-2790
340 Computer Court, Anderson, SC, USA 29625         


