[torqueusers] restarting a previously checkpointed job

Michael Wheatley michaelgw at gmail.com
Mon Mar 7 17:15:19 MST 2005


Hello all,
I have a small Linux cluster that operates only 14 hours a day; for the
remaining 10 hours it reverts to an MS teaching lab. As a result,
checkpoint/restart is vital to normal operation.  My code checkpoints
itself with relative ease, and when it is re-run it checks for the
checkpoint file and picks up from there.  At the moment I kill and
restart it by hand, but I need some help with my PBS script.

My current script is fairly rudimentary.  You can see that I am using
the ssh boot schema for LAM/MPI rather than tm; that is a story for
another time.  What I think I need in this script is a section that
recognises that the job is a restart and runs the code again.
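
A minimal sketch of such a section, assuming the code writes its
checkpoint as a single file in the directory it runs from. The file
name `checkpoint.dat` and the helpers `is_restart` and `describe_run`
are hypothetical, not part of the existing script; substitute whatever
demMP actually writes.

```shell
#!/bin/bash
# Hypothetical checkpoint file name (an assumption, not taken from demMP).
CHECKPOINT_FILE=${CHECKPOINT_FILE:-checkpoint.dat}

# Succeeds (exit status 0) if a non-empty checkpoint file exists in directory $1.
is_restart() {
    [ -s "$1/$CHECKPOINT_FILE" ]
}

# Prints which kind of run this is; the caller then launches the code as usual,
# since the code itself picks up from the checkpoint when one is present.
describe_run() {
    if is_restart "$1"; then
        echo "restart: resuming from $1/$CHECKPOINT_FILE"
    else
        echo "fresh start: no checkpoint in $1"
    fi
}
```

In the PBS script this could be called as `describe_run "$EXECUTABLEDIR"`
just before lamboot, so the job log records whether the run resumed.
Because the code already resumes by itself, the run section would
otherwise stay the same; using `mkdir -p` rather than `mkdir` keeps a
rerun from failing if the scratch directory is still there. For the
batch system to run the job again it probably also needs to be marked
rerunnable (`#PBS -r y`).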

Cheers

Mike

***************************************************************************
#!/bin/bash
EXECUTABLE=demMP
EXECUTABLEDIR=/home/michaelw/code

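# this bit will need to be rerun on a restart ####################

# per-job scratch directory; -p so a rerun does not fail if it already exists
WORKDIR=/scratch/PBS_$PBS_JOBID
mkdir -p "$WORKDIR"

# save the node list, boot LAM on those nodes, run the code, shut LAM down
cat "$PBS_NODEFILE" > "$WORKDIR/hosts"
cd "$EXECUTABLEDIR" || exit 1
lamboot -v -ssi boot rsh "$WORKDIR/hosts"
mpirun C "./$EXECUTABLE"
lamhalt
# end of this bit ##########################################

rm -r -f "$WORKDIR"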

