[torqueusers] Re: [Mauiusers] node health check

'Garrick Staples' garrick at clusterresources.com
Tue Nov 14 22:38:52 MST 2006


On Tue, Nov 14, 2006 at 02:50:30PM -0800, Alexander Saydakov alleged:
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org]
> > On Behalf Of 'Garrick Staples'
> > Sent: Tuesday, November 14, 2006 11:52 AM
> > To: torqueusers at supercluster.org
> > Subject: [torqueusers] Re: [Mauiusers] node health check
> > 
> > On Tue, Nov 14, 2006 at 10:45:43AM -0800, Alexander Saydakov alleged:
> > > > -----Original Message-----
> > > > From: mauiusers-bounces at supercluster.org [mailto:mauiusers-bounces at supercluster.org]
> > > > On Behalf Of Garrick Staples
> > > > Sent: Monday, November 13, 2006 5:02 PM
> > > > To: mauiusers at supercluster.org
> > > > Subject: Re: [Mauiusers] node health check
> > > >
> > > > On Mon, Nov 13, 2006 at 04:10:43PM -0800, Alexander Saydakov alleged:
> > > > > Hi!
> > > > >
> > > > > Does Maui support node health checks done by setting $node_check_script
> > > > > in Torque MOM's config?
> > > >
> > > > Does it read the ERROR message?  no.
> > > >
> > > > At USC, we have a 10 minute cronjob that makes those nodes offline. You
> > > > can also have MOM or the server set nodes "down" with any ERROR.
> > >
> > > You mean patching pbs_mom or pbs_server, don't you? I don't see any existing
> > > support for this.
> > >
> > > By the way, I don't see why the decision of setting a node down should be
> > > delegated to the server or scheduler. In this case node knows that it is
> > > incapable of executing any jobs. In my view, it would be great if pbs_mom
> > > could report itself as down if node check script returned non-zero exit
> > > code. STDERR output could be used to report a reason. Using STDOUT is less
> > > clean. All that could be configurable (like "down_on_node_check_failure
> > > true" or something).
> > 
> > In MOM's config, $down_on_error can be used to have the MOM set itself
> > as "down" if there is an ERROR message from the health check script.
> 
> Oh, really? I was looking at
> http://www.clusterresources.com/wiki/doku.php?id=torque:appendix:c_mom_configuration,
> which does not have this parameter.
> 
> > Alternatively, you can set the down_on_error server attribute in qmgr to
> > do the same thing on the server.
> 
> Again, I was looking online at
> http://www.clusterresources.com/wiki/doku.php?id=torque:appendix:b_server_parameters,
> which does not have this parameter.
> 
> > These are documented in the pbs_mom and pbs_server_attributes manpages.
> 
> You are right, it is documented in the manpages with the note: This feature
> is EXPERIMENTAL and likely to be removed in the future

I announced them to the list quite some time ago, but I never got any
feedback :)

 
> Are they mutually exclusive, or do they supplement each other?

They aren't strictly mutually exclusive, but using both would be silly.
At the time, I didn't really know which method would be better.

I used the MOM config on my own cluster at first, but Maui always kills
jobs that have "down" nodes, and that was happening too often.  Later I
chose to use a cronjob that sets nodes "offline" instead.
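
For readers who want to try the cronjob approach, here is a minimal
sketch of what such a job might look like. It is not the script used at
USC. It assumes that "pbsnodes -a" prints one record per node, with the
node name on an unindented line and indented "key = value" lines below
it, and that a failed health check shows up as a "message=ERROR ..."
item inside the "status" value. The function names are hypothetical;
check the record layout against your own Torque version before relying
on it.

```python
import subprocess
import sys


def nodes_with_error(pbsnodes_output):
    """Return names of nodes whose status carries a message=ERROR item.

    Assumed layout: an unindented node name, followed by indented
    'key = value' attribute lines.
    """
    failed = []
    node = None
    for line in pbsnodes_output.splitlines():
        if not line.strip():
            continue  # blank line between node records
        if not line[0].isspace():
            node = line.strip()  # unindented line starts a new record
        elif node is not None:
            key, _, value = line.strip().partition(" = ")
            if key == "status" and any(
                item.startswith("message=ERROR") for item in value.split(",")
            ):
                failed.append(node)
    return failed


def offline_commands(nodes):
    """Build one 'pbsnodes -o <node>' command per failed node."""
    return [["pbsnodes", "-o", name] for name in nodes]


def main():
    # Query all nodes, then mark the ones reporting ERROR as offline.
    result = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    )
    for command in offline_commands(nodes_with_error(result.stdout)):
        subprocess.run(command, check=True)


# Only touch the cluster when explicitly asked to, e.g. from the cron entry.
if __name__ == "__main__" and "--apply" in sys.argv[1:]:
    main()
```

Run from cron every 10 minutes with "--apply", this marks failing nodes
"offline" rather than "down", so running jobs are left alone and only
new work is kept off the node. Bringing a node back once it is healthy
is left to the admin in this sketch.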


