[torqueusers] Re: [Mauiusers] node health check

Sam Rash srash at yahoo-inc.com
Wed Nov 15 12:46:26 MST 2006


In all this I'm a bit lost: how can I rely on a host that's not on (say it
burnt to a crisp) to set itself as 'down'?  The server clearly has to have
some logic of its own to set hosts as down (even if it is a simple TCP
connect/ping).  Having a node that is 'there', but whose health check script
fails, set itself 'down' is fine as an option, but why separate out the
logic by default?  Shouldn't a 'resource manager' manage the resources?
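
To make the "simple TCP connect" idea concrete, here is a rough sketch of a
server-side reachability probe. It is not TORQUE code: the function name is
made up, and the default port 15002 is only my assumption about where
pbs_mom usually listens, so adjust it for your site.

```python
import socket


def is_node_reachable(host: str, port: int = 15002, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: the port default is an assumption, not a
    value taken from this thread.
    """
    try:
        # create_connection resolves the host and connects, honouring timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

A node that has burnt to a crisp fails this probe no matter what its own
health check would have reported, which is the case the node cannot cover
by itself.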

I concur that using the exit code to indicate pass/fail, with stderr output
starting with ERROR to carry the details, is ideal.
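
As an illustration of that convention from the script's side, a health
check could look roughly like the sketch below. The check itself (free
space under /tmp) and all the names are invented for the example; only the
"non-zero exit plus ERROR on stderr" shape comes from this thread.

```python
import shutil
import sys


def check_scratch_space(path="/tmp", min_free_bytes=1 << 30):
    """Return None if healthy, else a short description of the problem."""
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        return f"only {free} bytes free on {path}"
    return None


def run_checks(checks=(check_scratch_space,)):
    """Run every check; print one ERROR line per problem; return an exit code."""
    problems = [msg for msg in (check() for check in checks) if msg]
    for msg in problems:
        # The exit code carries the yes/no answer, stderr carries the detail.
        print(f"ERROR {msg}", file=sys.stderr)
    return 1 if problems else 0
```

The script would finish with sys.exit(run_checks()), so that a caller only
has to look at the exit status to know whether the node is healthy.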

I also had some questions about pbsdsh that weren't clear to me.  Is this
what most people use to run parallel tasks?  It does not seem to support any
mechanism for staging, i.e. copying a program/binary or its input/output...



Sam Rash
srash at yahoo-inc.com
408-349-7312
vertigosr37

-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Alexander
Saydakov
Sent: Wednesday, November 15, 2006 9:36 AM
To: 'Garrick Staples'; torqueusers at supercluster.org
Subject: RE: [torqueusers] Re: [Mauiusers] node health check

> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of 'Garrick Staples'
> Sent: Tuesday, November 14, 2006 9:41 PM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Re: [Mauiusers] node health check
> 
> On Tue, Nov 14, 2006 at 03:53:11PM -0800, Alexander Saydakov alleged:
> > > -----Original Message-----
> > > From: torqueusers-bounces at supercluster.org
> > > [mailto:torqueusers-bounces at supercluster.org] On Behalf Of 'Garrick Staples'
> > > Sent: Tuesday, November 14, 2006 11:52 AM
> > > To: torqueusers at supercluster.org
> > > Subject: [torqueusers] Re: [Mauiusers] node health check
> > >
> > > In MOM's config, $down_on_error can be used to have the MOM set itself
> > > as "down" if there is an ERROR message from the health check script.
> >
> > I would suggest taking advantage of the exit code instead of relying on
> > the message to begin with ERROR. What if things are so out of hand that
> > health script can not even execute? I understand that server can only
> > read the message from mom, but mom is in a better position because it has
> > the exit code of the script. Why not take non-zero exit code as an
> > indication of the problem?
> 
> That is probably a good idea.  We keep the current behaviour, but in
> addition a non-zero exit will create an "ERROR health check failed"
> message?

We can still use the output of the script (STDERR or STDOUT?) as a message,
adding ERROR if it does not start with it.
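
A rough sketch of how the MOM side of this proposal might behave, written
in Python purely for illustration. It is not TORQUE's implementation; the
function names are hypothetical, and since the STDERR-or-STDOUT question is
still open the sketch simply prefers stderr and falls back to stdout.

```python
import subprocess


def health_message(returncode: int, output: str) -> str:
    """Return "" if the node is healthy, else a message starting with ERROR."""
    text = output.strip()
    if returncode != 0:
        # Proposed addition: a non-zero exit always counts as a failure.
        if not text:
            return "ERROR health check failed"
        return text if text.startswith("ERROR") else f"ERROR {text}"
    # Current behaviour: a message beginning with ERROR marks a failure.
    return text if text.startswith("ERROR") else ""


def run_health_check(cmd):
    """Run the health check command (a list of arguments) and report on it."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:
        # The script could not even be executed.
        return f"ERROR health check could not be run: {exc}"
    # Assumption: prefer stderr, fall back to stdout.
    return health_message(proc.returncode, proc.stderr or proc.stdout)
```

With this shape a script that cannot be started at all still yields an
ERROR message, which is the situation the exit-code suggestion was meant to
cover.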


_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers



