[torqueusers] Re: [Mauiusers] node health check

'Garrick Staples' garrick at clusterresources.com
Wed Nov 15 12:35:34 MST 2006


> > I used the MOM config on my own cluster at first, but maui always kills
> > jobs with "down" nodes;
> 
> Really? Is it configurable? Why not leave them alone? Let them either
> succeed or fail. Or maybe rerun on some other node?

No, it is hardcoded into maui.  I believe a job kill is sent once a
node has been down for 5 minutes.

 
> > and it was happening too often.  Later I chose
> > to use a cronjob that sets nodes "offline".
> 
> Maybe something like $on_health_check_error down|offline then?

There is just so much more you can do in the cronjob, like pattern
matching for specific messages and taking the appropriate action.
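
As a rough illustration of that kind of cronjob, here is a minimal sketch
in Python. The message patterns, the rule table, and the function names
are all hypothetical, and how the health messages get collected from each
node is left out. The only real command assumed is "pbsnodes -o <node>",
which marks a node offline in TORQUE.

```python
import re
import subprocess

# Hypothetical rules: (pattern to look for in a health message, action).
# Adjust the patterns to whatever your own health check script prints.
RULES = [
    (re.compile(r"read-only file system", re.I), "offline"),
    (re.compile(r"disk .* full", re.I), "offline"),
    (re.compile(r"clock skew", re.I), "ignore"),
]


def choose_action(message, rules=RULES):
    """Return the action of the first rule whose pattern matches."""
    for pattern, action in rules:
        if pattern.search(message):
            return action
    return "ignore"


def build_command(node, action):
    """Map an action to a command line, or None if nothing should run."""
    if action == "offline":
        # pbsnodes -o marks the node offline so no new jobs start on it.
        return ["pbsnodes", "-o", node]
    return None


def handle(node, message, run=subprocess.call):
    """Pick an action for one message and run the matching command."""
    cmd = build_command(node, choose_action(message))
    if cmd is not None:
        run(cmd)
    return cmd
```

Passing a different "run" callable (for example a list's append method)
makes it easy to dry-run the rules before letting the cronjob touch real
nodes.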


