[torqueusers] Detecting node flush

Kevin Van Workum vanw at tticluster.com
Thu Feb 22 13:51:48 MST 2007


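You could do something like this in a prologue and/or epilogue script.
It counts jobs that arrive less than TMAX seconds after the previous one
and shuts down the mom once more than NMAX of them occur in a row. The
UNTESTED script below would do it, but not very cleanly or robustly.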

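#!/bin/bash

TMAX=4      # threshold time between jobs (seconds)
NMAX=5      # number of rapid jobs tolerated before we shut down the mom
TFILE=N.dat # file to store N (relative path; an absolute path is safer)

# create the counter file on first use
if [ ! -f "$TFILE" ]; then
    echo 0 > "$TFILE"
fi

# get current time (seconds since the epoch)
CTIME=$(date +%s)

# check mod time of $TFILE, i.e. when the previous job ran this script
# (GNU stat; less fragile than cutting a field out of "stat -t")
MTIME=$(stat -c %Y "$TFILE")

# seconds elapsed since the previous job
DT=$((CTIME - MTIME))

if [ "$DT" -lt "$TMAX" ]; then
    # previous job was too recent: bump the counter
    N=$(cat "$TFILE")
    N=$((N + 1))
    if [ "$N" -gt "$NMAX" ]; then
        # too many rapid jobs in a row: shut down the mom on this node
        momctl -h localhost -s
    fi
else
    # normal spacing between jobs: reset the counter
    N=0
fi

# store the counter; writing the file also refreshes its mod time
echo "$N" > "$TFILE"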
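
To try the counting logic without touching the mom, the same steps can be
wrapped in a function that only reports whether the limit was exceeded.
This is a rough sketch, not a tested production script: the function name
count_rapid_jobs and its argument order are made up for illustration, and
it assumes GNU stat (for "stat -c %Y").

```shell
#!/bin/bash
# Hypothetical helper (not part of TORQUE).
# Usage: count_rapid_jobs STATE_FILE TMAX NMAX
# Updates the counter stored in STATE_FILE and returns 0 (true) when more
# than NMAX jobs in a row arrived less than TMAX seconds apart.
count_rapid_jobs() {
    local tfile=$1 tmax=$2 nmax=$3
    local now mtime n=0

    now=$(date +%s)
    if [ -f "$tfile" ]; then
        # GNU stat: modification time in seconds since the epoch
        mtime=$(stat -c %Y "$tfile")
        if [ $((now - mtime)) -lt "$tmax" ]; then
            # previous job was too recent: bump the stored counter
            n=$(( $(cat "$tfile") + 1 ))
        fi
    fi

    # store the counter; writing also refreshes the file's mod time
    echo "$n" > "$tfile"

    # exit status tells the caller whether the limit was exceeded
    [ "$n" -gt "$nmax" ]
}
```

In a prologue it could then be used as
"count_rapid_jobs N.dat 4 5 && momctl -h localhost -s", which keeps the
decision separate from the action and makes it easy to swap in something
gentler than shutting down the mom.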


On 2/22/07, Chris Evert <chris at geodev.com> wrote:
> Is there a setting in torque to recognize (and take offline) a node that
> completes jobs in rapid succession?  I'm assuming that there is
> something wrong with the node and all the jobs are failing immediately
> and no more jobs should be submitted to the node until the problem is resolved.
>
> Are the healthcheck and the experimental "please take me offline" mom
> configuration the only way to do this?
>
> Thanks,
> Chris
> --
> Chris Evert
> Geophysical Development Corporation
> Houston, TX



-- 
Kevin Van Workum, Ph.D.
Vice President
Senior System Administrator
www.clusterondemand.com
ONLINE COMPUTER CLUSTERS

