[torqueusers] Re: [Mauiusers] node health check

Alexander Saydakov saydakov at yahoo-inc.com
Tue Nov 14 15:50:30 MST 2006


> -----Original Message-----
> From: torqueusers-bounces at supercluster.org
> [mailto:torqueusers-bounces at supercluster.org] On Behalf Of 'Garrick Staples'
> Sent: Tuesday, November 14, 2006 11:52 AM
> To: torqueusers at supercluster.org
> Subject: [torqueusers] Re: [Mauiusers] node health check
> 
> On Tue, Nov 14, 2006 at 10:45:43AM -0800, Alexander Saydakov alleged:
> > > -----Original Message-----
> > > From: mauiusers-bounces at supercluster.org
> > > [mailto:mauiusers-bounces at supercluster.org] On Behalf Of Garrick Staples
> > > Sent: Monday, November 13, 2006 5:02 PM
> > > To: mauiusers at supercluster.org
> > > Subject: Re: [Mauiusers] node health check
> > >
> > > On Mon, Nov 13, 2006 at 04:10:43PM -0800, Alexander Saydakov alleged:
> > > > Hi!
> > > >
> > > > Does Maui support node health checks done by setting
> > > > $node_check_script in Torque MOM's config?
> > >
> > > Does it read the ERROR message?  no.
> > >
> > > At USC, we have a 10 minute cronjob that makes those nodes offline.
> > > You can also have MOM or the server set nodes "down" with any ERROR.
> >
> > You mean patching pbs_mom or pbs_server, don't you? I don't see any
> > existing support for this.
> >
> > By the way, I don't see why the decision of setting a node down should
> > be delegated to the server or scheduler. In this case node knows that it
> > is incapable of executing any jobs. In my view, it would be great if
> > pbs_mom could report itself as down if node check script returned
> > non-zero exit code. STDERR output could be used to report a reason.
> > Using STDOUT is less clean. All that could be configurable (like
> > "down_on_node_check_failure true" or something).
> 
> In MOM's config, $down_on_error can be used to have the MOM set itself
> as "down" if there is an ERROR message from the health check script.

Oh, really? I was looking at
http://www.clusterresources.com/wiki/doku.php?id=torque:appendix:c_mom_configuration,
which does not have this parameter.

> Alternatively, you can set the down_on_error server attribute in qmgr to
> do the same thing on the server.

Again, I was looking online at
http://www.clusterresources.com/wiki/doku.php?id=torque:appendix:b_server_parameters,
which does not have this parameter.

> These are documented in the pbs_mom and pbs_server_attributes manpages.

You are right, both are documented in the manpages, with the note: "This
feature is EXPERIMENTAL and likely to be removed in the future".
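
For reference, a health check script of the kind discussed in this thread
might look like the sketch below. It is written in Python purely for
illustration. The /tmp path, the 1 GiB threshold and the name check_node are
assumptions of mine, not anything taken from the Torque documentation. The
only convention taken from the thread is that the script signals a problem by
printing a line that starts with ERROR on STDOUT.

```python
import shutil


def check_node(path="/tmp", min_free_bytes=1 << 30, usage=shutil.disk_usage):
    """Return an 'ERROR ...' line if the node looks unhealthy, else ''.

    path and min_free_bytes are illustrative defaults (hypothetical);
    usage is injectable so the check can be exercised without a real disk.
    """
    free = usage(path).free
    if free < min_free_bytes:
        return f"ERROR low disk space on {path}: {free} bytes free"
    return ""


if __name__ == "__main__":
    message = check_node()
    if message:
        # The ERROR keyword on STDOUT is what the thread says gets reported.
        print(message)
```

MOM would be pointed at such a script through $node_check_script in its
config, and with $down_on_error set it should then mark itself "down" on that
message, as Garrick describes above.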

Are they mutually exclusive, or do they supplement each other?

Thanks a lot.
