[torqueusers] Detecting node flush
Kevin Van Workum
vanw at tticluster.com
Thu Feb 22 13:51:48 MST 2007
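You could do something in a prologue and/or epilogue script. The UNTESTED
script below would do it, but not very cleanly or robustly. It counts jobs
that start within TMAX seconds of the previous one and shuts down the mom
once more than NMAX of them have run back to back.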
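#!/bin/bash
TMAX=4        # threshold time between jobs (seconds)
NMAX=5        # number of rapid jobs tolerated before we shut down mom
TFILE=N.dat   # file to store N (relative path; an absolute path is safer)

if [ ! -f "$TFILE" ]; then
    echo 0 > "$TFILE"
fi

# get current time (seconds since the epoch)
CTIME=$(date +%s)

# check mod time of our $TFILE file (GNU stat)
MTIME=$(stat -c %Y "$TFILE")

DT=$((CTIME - MTIME))
if [ "$DT" -lt "$TMAX" ]; then
    # previous job ran less than TMAX seconds ago: count it
    N=$(cat "$TFILE")
    N=$((N + 1))
    if [ "$N" -gt "$NMAX" ]; then
        # shut down pbs_mom on this node
        momctl -h localhost -s
    fi
else
    # enough time has passed: reset the counter
    N=0
fi

# store N; this also updates the file's mod time for the next run
echo "$N" > "$TFILE"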
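
If you want to try the counting logic without touching pbs_mom, something
like the sketch below might help. It is only an illustration of the same
idea: rapid_job_count is a made-up helper name, not a TORQUE command, and it
assumes GNU stat. Unlike the script above, it treats a missing state file as
a count of 0 rather than creating the file first.

```shell
#!/bin/bash
# Sketch only: rapid_job_count is a hypothetical helper, not part of TORQUE.
# Usage: rapid_job_count STATE_FILE TMAX
# Prints the number of consecutive jobs seen less than TMAX seconds apart.
rapid_job_count() {
    local state=$1 tmax=$2 n=0 now mtime
    if [ -f "$state" ]; then
        now=$(date +%s)
        mtime=$(stat -c %Y "$state")    # GNU stat: mod time in epoch seconds
        if [ $((now - mtime)) -lt "$tmax" ]; then
            # last job was recent: bump the stored count
            n=$(( $(cat "$state") + 1 ))
        fi
    fi
    # rewriting the file also refreshes its mod time for the next call
    echo "$n" > "$state"
    echo "$n"
}
```

A prologue could then call N=$(rapid_job_count /some/state/file 4) and run
momctl -h localhost -s only when N exceeds its limit; the state file path
here is a placeholder.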
On 2/22/07, Chris Evert <chris at geodev.com> wrote:
> Is there a setting in torque to recognize (and take offline) a node that
> completes jobs in rapid succession? I'm assuming that there is
> something wrong with the node and all the jobs are failing immediately
> and no more jobs should be submitted to the node until the problem is resolved.
>
> Is the healthcheck and experimental "please take me offline" mom
> configuration the only way to do this?
>
> Thanks,
> Chris
> --
> Chris Evert
> Geophysical Development Corporation
> Houston, TX
--
Kevin Van Workum, Ph.D.
Vice President
Senior System Administrator
www.clusterondemand.com
ONLINE COMPUTER CLUSTERS