[torqueusers] Removing processes after a job is killed
Glen Beane
glen.beane at gmail.com
Thu Jul 3 23:05:26 MDT 2008
On Fri, Jul 4, 2008 at 1:00 AM, Garrick Staples <garrick at usc.edu> wrote:
> On Fri, Jul 04, 2008 at 12:27:45AM -0400, Glen Beane alleged:
> > On Wed, Jul 2, 2008 at 2:07 PM, Corey Ferrier <coreyf at clemson.edu> wrote:
> >
> > > On Wed, Jul 02, 2008 at 10:39:23AM -0700, David Sheen wrote:
> > > >The parallel programming environments we use (e.g. MPICH) use SSH to
> > > >create processes on the sister nodes. If these jobs fail (are
> > > >deleted, the mother node crashes, etc), the spawned processes remain
> > > >on the sisters and eventually someone has to go and clean them out.
> > > >Is there any way to use epilogue scripts to keep track of these
> > > >processes and make sure they get killed properly if they need to be?
> > > >
> > >
> > > Because we do not place execution limits on nodes
> > > (users can have multiple jobs running on the same node
> > > and multiple users can be using the same node),
> > > we use an epilogue script which calls another script
> > > to clean up leftover processes based on the jobid.
> > >
> > > Here is the epilogue script, which runs on the mother superior node
> > > and executes as root.
> > >
> > > #!/bin/bash
> > > JOBID=$1
> > > JOBUSER=$2
> > >
> > > # get nodes involved in this job
> > > nodelist=/var/spool/torque/aux/$JOBID
> > > if [ -r "$nodelist" ] ; then
> > >     nodes=$(sort -u "$nodelist")
> > > else
> > >     nodes=localhost
> > > fi
> > >
> > > # for each node involved in the job
> > > # kill any pids leftover from that job
> > >
> > > for i in $nodes ; do
> > >     ssh "$i" "su -c '/var/spool/torque/mom_priv/cleanup $JOBID' $JOBUSER"
> > > done
> > >
> > >
> > > Here is the 'cleanup' script:
> > >
> > > #!/bin/bash
> > >
> > > # look in the /proc process structure and
> > > # kill all pids associated with the passed in $JOBID
> > > # this script is run as the user, not root
> > >
> > > TOKILL=$1
> > > [ -z "${TOKILL}" ] && exit 1
> > > ME=`whoami`
> > > cd /
> > > find /proc -noleaf -maxdepth 2 -name environ -user "$ME" |
> > > while read -r x; do
> > >     # the process may have exited since find listed it
> > >     [ -r "$x" ] || continue
> > >     pid=$(basename "$(dirname "$x")")
> > >     # read the value directly rather than eval'ing another
> > >     # process's environment; match only the exact variable name
> > >     PBS_JOBID=$(tr '\0' '\n' < "$x" | sed -n 's/^PBS_JOBID=//p' | head -n 1)
> > >     if [ "${PBS_JOBID}" = "${TOKILL}" ]; then
> > >         kill -9 "$pid"
> > >     fi
> > > done
> > >
> > >
> > > - Corey
> >
> >
> >
> > such a script should be unnecessary if you use a TM-based job launcher for
> > whatever flavor of MPI you use, but I guess it doesn't hurt
>
> Would the script even work? Remote shell processes won't have PBS_JOBID in
> the environment.
That's true, it probably only finds PBS_JOBID in the environment if it was
launched with TM in the first place!
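
One way to check this on a compute node is to list your own processes and see
which of them actually carry PBS_JOBID. The following is only a rough sketch,
assuming bash and a Linux /proc filesystem; the function names are made up for
illustration and are not part of TORQUE.

```shell
#!/bin/bash
# Hypothetical helpers, not part of TORQUE.

# Print the PBS_JOBID value found in a NUL-separated environ file
# (for example /proc/<pid>/environ); print nothing if it is absent
# or the file cannot be read.
pbs_jobid_from_environ() {
    { tr '\0' '\n' < "$1"; } 2>/dev/null | sed -n 's/^PBS_JOBID=//p' | head -n 1
}

# List each process owned by the calling user, followed by its
# PBS_JOBID or a marker when the variable is not in its environment.
report_job_processes() {
    local dir pid jobid
    for dir in /proc/[0-9]*; do
        [ -O "$dir" ] || continue            # skip other users' processes
        [ -r "$dir/environ" ] || continue    # process may have exited
        pid=${dir#/proc/}
        jobid=$(pbs_jobid_from_environ "$dir/environ")
        echo "$pid ${jobid:-<no PBS_JOBID>}"
    done
}
```

Running report_job_processes on a sister node while a job is active should
show whether the ranks started over SSH have the variable set. If they print
the marker, the cleanup script quoted above has nothing to match on for them.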