[torqueusers] Re: [Mauiusers] node health check
garrick at clusterresources.com
Wed Nov 15 12:35:34 MST 2006
> > I used the MOM config on my own cluster at first, but maui always kills
> > jobs with "down" nodes;
> Really? Is it configurable? Why not leave them alone? Let them either
> succeed or fail. Or maybe rerun on some other node?
No, it is hardcoded into maui. I believe a job kill is sent once a node
has been down for 5 minutes.
> > and it was happening too often. Later I chose
> > to use a cronjob that sets nodes "offline".
> Maybe something like $on_health_check_error down|offline then?
There is just so much more you can do in the cronjob, like pattern
matching for specific messages and taking the appropriate action.
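
As a rough illustration of that kind of cronjob, here is a minimal sketch
in Python. The patterns, the action names, and the script's command-line
interface are all hypothetical and would need adapting to whatever your
own checks print; the only TORQUE piece it relies on is "pbsnodes -o",
which marks a node offline.

```python
import re
import subprocess
import sys

# Hypothetical rules: (pattern, action). Adjust to the messages your own
# health checks actually emit. The first matching rule wins.
RULES = [
    (re.compile(r"read-only file system", re.I), "offline"),
    (re.compile(r"(disk|/tmp|/scratch).*full", re.I), "offline"),
    (re.compile(r"high load", re.I), "ignore"),
]


def classify(message):
    """Return the action of the first rule that matches, else 'ignore'."""
    for pattern, action in RULES:
        if pattern.search(message):
            return action
    return "ignore"


def build_command(node, action):
    """Map an action to a pbsnodes command line, or None if nothing to do."""
    if action == "offline":
        return ["pbsnodes", "-o", node]
    return None


def main(argv):
    # Hypothetical usage: check_node.py NODE "message from health check"
    if len(argv) != 3:
        print("usage: check_node.py NODE MESSAGE", file=sys.stderr)
        return 2
    node, message = argv[1], argv[2]
    cmd = build_command(node, classify(message))
    if cmd is not None:
        subprocess.run(cmd, check=True)
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Because the node is only marked "offline" rather than "down", running
jobs are left alone and the scheduler simply stops placing new work
there. Further actions (mailing an admin, clearing the flag again with
"pbsnodes -c") could be added as extra rules in the same way.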