[torqueusers] Detecting node flush
Kevin Van Workum
vanw at tticluster.com
Thu Feb 22 13:51:48 MST 2007
You could do something like this in a prologue and/or epilogue script.
The UNTESTED script below would do it, but not very cleanly or robustly.
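TMAX=4        # threshold time between jobs (seconds)
NMAX=5        # number of rapid jobs before we shut down mom
TFILE=N.dat   # file to store N
if [ ! -f "$TFILE" ]; then
    echo 0 > "$TFILE"
fi
# get current time (seconds since the epoch)
NOW=`date +%s`
# check mod time of our $TFILE file (GNU stat)
MTIME=`stat -c %Y "$TFILE"`
DT=$((NOW - MTIME))
N=`cat "$TFILE"`
# count jobs that arrive within TMAX seconds of the previous one;
# resetting to 0 otherwise is an assumption about the intended logic
if [ "$DT" -lt "$TMAX" ]; then
    N=$((N + 1))
else
    N=0
fi
if [ "$N" -gt "$NMAX" ]; then
    momctl -h localhost -s   # shut down pbs_mom on this node
fi
echo "$N" > "$TFILE"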
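
If you want to check the counting logic without touching a running
pbs_mom, it could be pulled out into small functions, roughly as in the
sketch below. The function names next_count and should_stop are made up
for illustration, and resetting the count to 0 after a "slow" job is my
assumption about the intended behaviour, not something TORQUE defines.

```shell
# next_count DT N TMAX
# Print the updated count of rapid jobs: increment when the time since
# the previous job (DT) is under the threshold, otherwise reset to 0
# (the reset is an assumption).
next_count() {
    if [ "$1" -lt "$3" ]; then
        echo $(($2 + 1))
    else
        echo 0
    fi
}

# should_stop N NMAX
# Succeed (exit status 0) when the count exceeds the limit.
should_stop() {
    [ "$1" -gt "$2" ]
}
```

In the prologue you would then do something like
N=`next_count "$DT" "$N" "$TMAX"` and call momctl only when
should_stop "$N" "$NMAX" succeeds.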
On 2/22/07, Chris Evert <chris at geodev.com> wrote:
> Is there a setting in torque to recognize (and take offline) a node that
> completes jobs in rapid succession? I'm assuming that there is
> something wrong with the node and all the jobs are failing immediately
> and no more jobs should be submitted to the node until the problem is resolved.
> Is the healthcheck and experimental "please take me offline" mom
> configuration the only way to do this?
> Chris Evert
> Geophysical Development Corporation
> Houston, TX
Kevin Van Workum, Ph.D.
Senior System Administrator
ONLINE COMPUTER CLUSTERS