[torqueusers] Re: [Mauiusers] node health check
srash at yahoo-inc.com
Wed Nov 15 12:46:26 MST 2006
In all this I'm a bit lost: how can I rely on a host that is not on (it burnt
to a crisp) to set itself as 'down'? The server clearly has to have some
logic to set hosts as down (even if it is a simple TCP connect/ping).
Having a node that is 'there', but whose health check script fails, set
itself 'down' is fine as an option, but by default why separate out the
logic? Shouldn't a 'resource manager' manage the resources?
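
To make the "simple TCP connect" idea concrete, here is a rough sketch of the
kind of server-side probe I have in mind. It is not how pbs_server works
today; the function name is_reachable and its parameters are made up for
illustration, and the port to probe is left to the caller.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: a dead or powered-off node cannot report itself
    as down, so the server side has to notice the failed connect.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A server loop could call this for each node and mark the node down after a
few consecutive failures, independent of anything the node says about itself.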
I concur: using the exit code to indicate yes/no, and then stderr with ERROR
to carry the details, is ideal.
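
On the script side, that convention could look roughly like the sketch below.
The function name report and the example problem strings are hypothetical;
the only things taken from this thread are the non-zero exit code and the
ERROR prefix on the message.

```python
import sys


def report(problems, stream=sys.stderr):
    """Return the exit code for a health check script.

    Writes nothing and returns 0 when the list of problems is empty;
    otherwise writes a single line starting with ERROR and returns 1.
    """
    if not problems:
        return 0
    stream.write("ERROR " + "; ".join(problems) + "\n")
    return 1
```

A script would finish with something like sys.exit(report(problems)), so the
exit code gives the yes/no answer and the ERROR line gives the reason.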
I also had questions about pbsdsh that weren't clear: is this what most people
use to run parallel tasks? It does not seem to support any mechanism to copy
a program/binary/input/output for staging...
srash at yahoo-inc.com
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of Alexander
Sent: Wednesday, November 15, 2006 9:36 AM
To: 'Garrick Staples'; torqueusers at supercluster.org
Subject: RE: [torqueusers] Re: [Mauiusers] node health check
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> bounces at supercluster.org] On Behalf Of 'Garrick Staples'
> Sent: Tuesday, November 14, 2006 9:41 PM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Re: [Mauiusers] node health check
> On Tue, Nov 14, 2006 at 03:53:11PM -0800, Alexander Saydakov alleged:
> > > -----Original Message-----
> > > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
> > > bounces at supercluster.org] On Behalf Of 'Garrick Staples'
> > > Sent: Tuesday, November 14, 2006 11:52 AM
> > > To: torqueusers at supercluster.org
> > > Subject: [torqueusers] Re: [Mauiusers] node health check
> > >
> > > In MOM's config, $down_on_error can be used to have the MOM set itself
> > > as "down" if there is an ERROR message from the health check script.
> > I would suggest taking advantage of the exit code instead of relying on
> > the message to begin with ERROR. What if things are so out of hand that
> > the script cannot even execute? I understand that the server can only
> > read the message from mom, but mom is in a better position because it has
> > the exit code of the script. Why not take a non-zero exit code as an
> > indication of a problem?
> That is probably a good idea. We keep the current behaviour, but in
> addition a non-zero exit will create an "ERROR health check failed"
> message.
We can still use the output of the script (STDERR or STDOUT?) as a message,
adding ERROR if it does not start with it.
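
Roughly, the combined rule could be sketched as follows. This is only an
illustration of the behaviour discussed in this thread, not MOM's actual
code: the names health_message and run_health_check are hypothetical, and
since the STDOUT-versus-STDERR question is still open, the sketch simply
assumes stdout is preferred and falls back to stderr.

```python
import subprocess

DEFAULT_ERROR = "ERROR health check failed"


def health_message(exit_code, output):
    """Turn a script's exit code and output into the message to report."""
    text = output.strip()
    if text.startswith("ERROR"):
        # Current behaviour: an ERROR message always counts as a failure.
        return text
    if exit_code != 0:
        # Proposed addition: non-zero exit is a failure; prefix the output
        # with ERROR, or use the default message when there is no output.
        return "ERROR " + text if text else DEFAULT_ERROR
    return text


def run_health_check(argv, timeout=30.0):
    """Run the script; a script that cannot execute at all is a failure."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return DEFAULT_ERROR
    # Assumption: prefer stdout, fall back to stderr when stdout is empty.
    return health_message(proc.returncode, proc.stdout or proc.stderr)
```

This way a script that is missing, not executable, or hangs still ends up
reported with an ERROR message, which covers the "so out of hand that the
script cannot even execute" case.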