[torqueusers] Re: [Mauiusers] node health check

Alexander Saydakov saydakov at yahoo-inc.com
Wed Nov 15 10:31:08 MST 2006


> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of 'Garrick Staples'
> Sent: Tuesday, November 14, 2006 9:39 PM
> To: torqueusers at supercluster.org
> Subject: Re: [torqueusers] Re: [Mauiusers] node health check
> 
> On Tue, Nov 14, 2006 at 02:50:30PM -0800, Alexander Saydakov alleged:
> > > -----Original Message-----
> > > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of 'Garrick Staples'
> > > Sent: Tuesday, November 14, 2006 11:52 AM
> > > To: torqueusers at supercluster.org
> > > Subject: [torqueusers] Re: [Mauiusers] node health check
> > >
> > > On Tue, Nov 14, 2006 at 10:45:43AM -0800, Alexander Saydakov alleged:
> > > > > -----Original Message-----
> > > > > From: mauiusers-bounces at supercluster.org [mailto:mauiusers-bounces at supercluster.org] On Behalf Of Garrick Staples
> > > > > Sent: Monday, November 13, 2006 5:02 PM
> > > > > To: mauiusers at supercluster.org
> > > > > Subject: Re: [Mauiusers] node health check
> > > > >
> > > > > On Mon, Nov 13, 2006 at 04:10:43PM -0800, Alexander Saydakov alleged:
> > > > > > Hi!
> > > > > >
> > > > > > Does Maui support node health checks done by setting $node_check_script in
> > > > > > Torque MOM's config?
> > > > >
> > > > > Does it read the ERROR message?  no.
> > > > >
> > > > > At USC, we have a 10 minute cronjob that makes those nodes offline. You
> > > > > can also have MOM or the server set nodes "down" with any ERROR.
> > > >
> > > > You mean patching pbs_mom or pbs_server, don't you? I don't see any existing
> > > > support for this.
> > > >
> > > > By the way, I don't see why the decision of setting a node down should be
> > > > delegated to the server or scheduler. In this case the node knows that it is
> > > > incapable of executing any jobs. In my view, it would be great if pbs_mom
> > > > could report itself as down if the node check script returned a non-zero exit
> > > > code. STDERR output could be used to report a reason. Using STDOUT is less
> > > > clean. All that could be configurable (like "down_on_node_check_failure
> > > > true" or something).
> > >
> > > In MOM's config, $down_on_error can be used to have the MOM set itself
> > > as "down" if there is an ERROR message from the health check script.
> >
> > Oh, really? I was looking at
> > http://www.clusterresources.com/wiki/doku.php?id=torque:appendix:c_mom_configuration,
> > which does not have this parameter.
> >
> > > Alternatively, you can set the down_on_error server attribute in qmgr to
> > > do the same thing on the server.
> >
> > Again, I was looking online at
> > http://www.clusterresources.com/wiki/doku.php?id=torque:appendix:b_server_parameters,
> > which does not have this parameter.
> >
> > > These are documented in the pbs_mom and pbs_server_attributes manpages.
> >
> > You are right, it is documented in the manpages with the note: "This feature
> > is EXPERIMENTAL and likely to be removed in the future"
> 
> I announced them to the list quite some time ago, but I never got any
> feedback :)
> 
> 
> > Are they mutually exclusive, or do they supplement each other?
> 
> They aren't strictly mutually exclusive, but using both would be silly.
> At the time, I didn't really know which method would be better.
> 
> I used the MOM config on my own cluster at first, but maui always kills
> jobs with "down" nodes;

Really? Is it configurable? Why not leave them alone? Let them either
succeed or fail. Or maybe rerun them on some other node?

> and it was happening too often.  Later I chose
> to use a cronjob that sets nodes "offline".

Maybe something like $on_health_check_error down|offline then?
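
For reference, the cronjob approach mentioned above could look roughly like the sketch below. It is only an illustration, not the script used at USC. It assumes that a failed health check shows up as "message=ERROR ..." inside the status line of `pbsnodes -a` output; that format is an assumption and should be checked against your Torque version. The function names (`nodes_with_errors`, `mark_offline`, `run_once`) are made up for this example. `pbsnodes -o` is the usual way to mark a node offline.

```python
import subprocess


def nodes_with_errors(pbsnodes_output):
    """Return names of nodes whose attributes contain a health-check ERROR.

    Assumes the `pbsnodes -a` layout: a node name at column 0, followed by
    indented "key = value" attribute lines, with the health-check text
    appearing as "message=ERROR ..." (an assumption, see the note above).
    """
    failing = []
    current = None
    for line in pbsnodes_output.splitlines():
        if line and not line[0].isspace():
            # A non-indented, non-empty line starts a new node record.
            current = line.strip()
        elif current and "message=ERROR" in line:
            if current not in failing:
                failing.append(current)
    return failing


def mark_offline(nodes):
    """Mark each node offline so the scheduler stops placing new jobs on it."""
    for node in nodes:
        subprocess.run(["pbsnodes", "-o", node], check=True)


def run_once():
    """One pass of the check; intended to be called from a cron wrapper."""
    listing = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    ).stdout
    mark_offline(nodes_with_errors(listing))
```

Marking nodes "offline" instead of "down" only keeps new jobs away, so running jobs are left to succeed or fail on their own, which is the behavior asked about above.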



