[torqueusers] RE: Health check script failure and offlining

Garrick Staples garrick at usc.edu
Fri Dec 9 00:38:33 MST 2005

On Sat, Dec 03, 2005 at 12:53:55PM -0700, Smith, Jerry Don II alleged:
> Richard,
> We wrote a cron script that takes care of this.  But yes, MOAB handles this on its own, and even allows "triggers" to adjust many things (node state, reservations, etc.).

There's also the "down_on_error" server parameter.  It would be pretty
easy to make an "offline_on_error" (or maybe something more flexible) if
people want it.  I recall Chris Samuel requesting a way to prevent jobs
from being scheduled on nodes that had rebooted.

I had down_on_error enabled for the last few months, but recently
turned it off because Maui was killing off too many jobs.  We
opted for a cron job too.
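
For anyone who wants a starting point, here is a rough sketch of what such
a cron job could look like. It is not the script either of us runs. It
assumes `pbsnodes -a` prints each node name at column 0 followed by
indented "key = value" lines, and that the health check text shows up
either as its own "message" attribute or as a "message=" field inside the
"status" string; check the output of your own version first. The function
names are made up for illustration.

```python
import subprocess


def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style output into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current node record
        elif not line[0].isspace():
            current = nodes.setdefault(line.strip(), {})  # node name line
        elif current is not None and " = " in line:
            key, value = line.strip().split(" = ", 1)
            current[key] = value
    return nodes


def failed_nodes(nodes):
    """Return nodes whose health check reported ERROR and that are not yet offline."""
    failed = []
    for name, attrs in nodes.items():
        message = attrs.get("message", "")
        # Assumption: some versions fold the message into the status string.
        for field in attrs.get("status", "").split(","):
            if field.startswith("message="):
                message = field[len("message="):]
        states = attrs.get("state", "").split(",")
        if message.startswith("ERROR") and "offline" not in states:
            failed.append(name)
    return failed


def offline_failed_nodes():
    """List all nodes, then mark each failed one offline with `pbsnodes -o`."""
    listing = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    ).stdout
    for name in failed_nodes(parse_pbsnodes(listing)):
        subprocess.run(["pbsnodes", "-o", name], check=True)
```

Calling offline_failed_nodes() from cron every few minutes on the server
host would do the offlining. Note that it never clears the offline flag
again; bringing a node back (pbsnodes -c) is left as a manual step so a
flapping node does not bounce in and out of service.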

> Jerry
> All,
> I have set up a health check script in $PBS/mom_priv/config.  It works
> fine in that it sets the 'message' attribute for the problem mom/node when
> there is a failure, but how can I get the node's state set to 'offline'
> (pbsnodes -o nodeXXX) when the failure occurs?  The manual says that:
>   "Cluster schedulers can be configured to adjust a given node's state
>    based on this [ERROR message] information."
> Perhaps this is only a MOAB feature.


Garrick Staples, Linux/HPCC Administrator
University of Southern California