[torqueusers] Re: [Mauiusers] node health check

'Garrick Staples' garrick at clusterresources.com
Wed Nov 15 08:50:49 MST 2006


On Wed, Nov 15, 2006 at 07:29:31AM +0100, Åke Sandgren alleged:
> On Tue, 2006-11-14 at 22:38 -0700, 'Garrick Staples' wrote:
> > I used the MOM config on my own cluster at first, but maui always kills
> > jobs with "down" nodes, and it was happening too often.  Later I chose
> > to use a cronjob that sets nodes "offline".
> 
> So why not have an option to let the mom put itself offline instead of
> down?
 
So far, nothing inside of TORQUE sets offline; it is always done by a
sysadmin (or a script written by a sysadmin).  My resistance to changing
that is one reason.

The other reason is the problem of clearing the offline bit.  If we had
pbs_server set offline, it couldn't clear it again because we'd have no
way to distinguish between a server-set offline and a sysadmin-set
offline.  It would be one-way.

The one-way setting is also complicated by possible transient errors.
Large numbers of nodes could be set offline for some silly transient
thing and require lots of cleanup from the sysadmin.

With these issues, a cronjob is nice and easy :)
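
For illustration only, here is a minimal sketch of how such a cronjob
could get around the one-way problem.  It is not the script I actually
run.  The idea is that the script remembers which nodes it set offline
itself and only ever clears those.  The function names, the "owned" set,
and the shape of the health results are all assumptions made up for this
example; the only real commands are "pbsnodes -o" (mark offline) and
"pbsnodes -c" (clear offline).

```python
import subprocess


def plan_actions(health, owned, already_offline):
    """Decide which nodes to mark offline and which to clear.

    health          -- dict of node name -> True if the (hypothetical)
                       health check passed
    owned           -- set of nodes this script set offline earlier
    already_offline -- set of nodes that are offline right now

    Nodes that are offline but not in 'owned' are assumed to have been
    set by a sysadmin and are never touched.
    """
    to_offline = sorted(
        node for node, ok in health.items()
        if not ok and node not in owned and node not in already_offline
    )
    # Only clear nodes we set ourselves, and only once they pass again.
    to_clear = sorted(node for node in owned if health.get(node, False))
    return to_offline, to_clear


def apply_actions(to_offline, to_clear, owned, run=subprocess.check_call):
    """Run pbsnodes for each planned change and return the updated set."""
    owned = set(owned)
    for node in to_offline:
        run(["pbsnodes", "-o", node])   # mark offline
        owned.add(node)
    for node in to_clear:
        run(["pbsnodes", "-c", node])   # clear offline
        owned.discard(node)
    return owned
```

The "owned" set would have to be saved somewhere between cron runs (a
small file on the server, for example), and the health check itself is
left out because it is site-specific.  Transient errors are still a
problem in this sketch: a node that fails once gets set offline until the
next run that finds it healthy.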



