[torqueusers] Removing processes after a job is killed
Corey Ferrier
coreyf at CLEMSON.EDU
Wed Jul 2 12:07:54 MDT 2008
On Wed, Jul 02, 2008 at 10:39:23AM -0700, David Sheen wrote:
>The parallel programming environments we use (e.g. MPICH) use SSH to
>create processes on the sister nodes. If these jobs fail (are
>deleted, the mother node crashes, etc), the spawned processes remain
>on the sisters and eventually someone has to go and clean them out.
>Is there any way to use epilogue scripts to keep track of these
>processes and make sure they get killed properly if they need to be?
>
We do not place execution limits on nodes: a user can have
multiple jobs running on the same node, and multiple users
can be using the same node. So we cannot simply kill everything
a user owns on a node when one of their jobs ends.
Instead, we use an epilogue script that calls a second script
to clean up leftover processes based on the jobid.

Here is the epilogue script, which runs on the mother superior node
and executes as root.

#!/bin/bash
JOBID=$1
JOBUSER=$2

# get nodes involved in this job
nodelist=/var/spool/torque/aux/$JOBID
if [ -r "$nodelist" ] ; then
    nodes=$(sort -u "$nodelist")
else
    nodes=localhost
fi

# for each node involved in the job
# kill any pids leftover from that job
# ($nodes is left unquoted on purpose so it splits into one word per node)
for i in $nodes ; do
    ssh "$i" "su -c '/var/spool/torque/mom_priv/cleanup $JOBID' $JOBUSER"
done

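One thing to watch with the loop above: if a sister node is down, ssh can hang or prompt and stall the epilogue. Below is a minimal sketch of a guarded variant, not something we run as shown. It assumes OpenSSH (the -n, BatchMode and ConnectTimeout options) and an arbitrary 10 second limit. The function name cleanup_nodes and the SSH_CMD override are hypothetical, added only so the loop can be exercised without touching real nodes.

```shell
#!/bin/bash
# Hypothetical variant of the epilogue loop; the function name, the SSH_CMD
# override and the 10 second timeout are assumptions, not TORQUE settings.
SSH_CMD=${SSH_CMD:-ssh}
CLEANUP=/var/spool/torque/mom_priv/cleanup

# usage: cleanup_nodes JOBID JOBUSER node [node ...]
cleanup_nodes() {
    local jobid=$1 jobuser=$2 node failed=0
    shift 2
    for node in "$@"; do
        # -n keeps ssh from reading the caller's stdin,
        # BatchMode=yes fails instead of prompting for a password,
        # ConnectTimeout bounds the wait on an unreachable node.
        if ! $SSH_CMD -n -o BatchMode=yes -o ConnectTimeout=10 "$node" \
            "su -c '$CLEANUP $jobid' $jobuser"; then
            echo "cleanup failed on $node" >&2
            failed=1
        fi
    done
    return $failed
}
```

In the epilogue this would be called as: cleanup_nodes "$JOBID" "$JOBUSER" $nodes
The function returns non-zero if any node could not be cleaned, so the failure can be logged for a later manual sweep.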
Here is the 'cleanup' script:

#!/bin/bash
# look in the /proc process structure and
# kill all pids associated with the passed in $JOBID
# this script is run as the user, not root

TOKILL=$1
[ -z "${TOKILL}" ] && exit 1

ME=$(whoami)
cd /

find /proc -noleaf -maxdepth 2 -name environ -user "$ME" |
while IFS= read -r x; do
    # the process may have exited since find listed it
    [ -e "$x" ] || continue
    pid=$(basename "$(dirname "$x")")
    # read PBS_JOBID from the NUL-separated environment;
    # nothing from the process environment is passed to eval
    PBS_JOBID=$(tr '\0' '\n' 2>/dev/null < "$x" | sed -n 's/^PBS_JOBID=//p' | head -n 1)
    if [ "${PBS_JOBID}" = "${TOKILL}" ]; then
        kill -9 "$pid"
    fi
done

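Before trusting kill -9 on a production node, it can help to see which pids the matching step would pick. The following is a minimal dry-run sketch under the same assumption as the cleanup script, namely that job processes carry PBS_JOBID in their environment. The function names job_id_from_environ and list_job_pids are hypothetical and not part of TORQUE.

```shell
#!/bin/bash
# Hypothetical helpers for a dry run; neither function kills anything.

# Print the PBS_JOBID stored in a NUL-separated environ file,
# such as /proc/<pid>/environ. Prints nothing if the variable is absent.
job_id_from_environ() {
    local environ_file=$1
    # Convert NUL separators to newlines, keep only the exact PBS_JOBID
    # entry and strip its "PBS_JOBID=" prefix. No eval is involved.
    tr '\0' '\n' 2>/dev/null < "$environ_file" |
        sed -n 's/^PBS_JOBID=//p' | head -n 1
}

# Print the pids of readable processes whose environment has the given jobid.
list_job_pids() {
    local jobid=$1 f
    for f in /proc/[0-9]*/environ; do
        [ -r "$f" ] || continue
        if [ "$(job_id_from_environ "$f")" = "$jobid" ]; then
            basename "$(dirname "$f")"
        fi
    done
}
```

Run as the job's user, list_job_pids followed by the jobid prints one pid per line. Piping that through ps (for example with xargs -r ps -fp on GNU systems) shows what would be removed.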
- Corey
--
Corey Ferrier coreyf at clemson.edu
HPC Group, CCIT, Clemson University 864-656-2790
340 Computer Court, Anderson, SC, USA 29625