[torqueusers] Re: [Mauiusers] node health check

'Garrick Staples' garrick at clusterresources.com
Wed Nov 15 12:37:08 MST 2006


On Wed, Nov 15, 2006 at 09:35:47AM -0800, Alexander Saydakov alleged:
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> > bounces at supercluster.org] On Behalf Of 'Garrick Staples'
> > Sent: Tuesday, November 14, 2006 9:41 PM
> > To: torqueusers at supercluster.org
> > Subject: Re: [torqueusers] Re: [Mauiusers] node health check
> > 
> > On Tue, Nov 14, 2006 at 03:53:11PM -0800, Alexander Saydakov alleged:
> > > > -----Original Message-----
> > > > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> > > > bounces at supercluster.org] On Behalf Of 'Garrick Staples'
> > > > Sent: Tuesday, November 14, 2006 11:52 AM
> > > > To: torqueusers at supercluster.org
> > > > Subject: [torqueusers] Re: [Mauiusers] node health check
> > > >
> > > > In MOM's config, $down_on_error can be used to have the MOM set itself
> > > > as "down" if there is an ERROR message from the health check script.
> > >
> > > I would suggest taking advantage of the exit code instead of relying on
> > > the message to begin with ERROR. What if things are so out of hand that
> > > the health script cannot even execute? I understand that the server can
> > > only read the message from MOM, but MOM is in a better position because
> > > it has the exit code of the script. Why not take a non-zero exit code as
> > > an indication of a problem?
> > 
> > That is probably a good idea.  We keep the current behaviour, but in
> > addition a non-zero exit will create an "ERROR health check failed"
> > message?
> 
> We can still use the output of the script (STDERR or STDOUT?) as a message,
> adding ERROR if it does not start with it.

We can do that.  I'd say that on a zero exit code we take whatever
"ERROR message" is printed on stdout as-is (without adding an ERROR), and
on a non-zero exit we read the message from stderr instead.
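
To make the rule concrete, here is a minimal sketch in Python.  It is
only an illustration of the behaviour discussed in this thread, not
TORQUE code (MOM is written in C).  The function names, the fallback
text for an empty stderr, and the "could not run" message are all
assumptions made for the example.

```python
import subprocess

# Fallback text suggested earlier in the thread for a failing script.
FALLBACK = "ERROR health check failed"


def health_message(exit_code, stdout, stderr):
    """Pick the message to report for one health check run.

    Hypothetical helper: returns the text that would be passed on,
    or an empty string when the script printed nothing on success.
    """
    if exit_code == 0:
        # Current behaviour: take stdout as-is, never add ERROR.
        return stdout.strip()

    # Non-zero exit: the message comes from stderr instead.
    msg = stderr.strip()
    if not msg:
        # Assumption: a silent failure still yields an ERROR message.
        return FALLBACK
    if not msg.startswith("ERROR"):
        msg = "ERROR " + msg
    return msg


def run_health_check(argv):
    """Run the script given as an argv list and return its message."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except OSError as exc:
        # The script could not even be executed (missing, not executable).
        return "ERROR health check could not run: %s" % exc
    return health_message(proc.returncode, proc.stdout, proc.stderr)
```

With this, a script that exits 0 and prints "ERROR disk full" is reported
unchanged, while a script that exits 1 and writes "disk full" to stderr
is reported as "ERROR disk full".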



