[torqueusers] Re: [Mauiusers] node health check

'Garrick Staples' garrick at clusterresources.com
Tue Nov 14 12:51:32 MST 2006


On Tue, Nov 14, 2006 at 10:45:43AM -0800, Alexander Saydakov alleged:
> > -----Original Message-----
> > From: mauiusers-bounces at supercluster.org [mailto:mauiusers-bounces at supercluster.org] On Behalf Of Garrick Staples
> > Sent: Monday, November 13, 2006 5:02 PM
> > To: mauiusers at supercluster.org
> > Subject: Re: [Mauiusers] node health check
> > 
> > On Mon, Nov 13, 2006 at 04:10:43PM -0800, Alexander Saydakov alleged:
> > > Hi!
> > >
> > > Does Maui support node health checks done by setting $node_check_script in
> > > Torque MOM's config?
> > 
> > Does it read the ERROR message?  no.
> > 
> > At USC, we have a cron job that runs every 10 minutes and marks nodes
> > reporting an ERROR offline.  You can also have MOM or the server set
> > nodes "down" with any ERROR.
> 
> You mean patching pbs_mom or pbs_server, don't you? I don't see any existing
> support for this.
> 
> By the way, I don't see why the decision to set a node down should be
> delegated to the server or scheduler. In this case the node knows that it is
> incapable of executing any jobs. In my view, it would be great if pbs_mom
> could report itself as down if the node check script returned a non-zero exit
> code. STDERR output could be used to report a reason. Using STDOUT is less
> clean. All of that could be configurable (like "down_on_node_check_failure
> true" or something).

In MOM's config, $down_on_error can be used to have the MOM set itself
as "down" if there is an ERROR message from the health check script.

Alternatively, you can set the down_on_error server attribute in qmgr to
do the same thing on the server.

These are documented in the pbs_mom and pbs_server_attributes manpages.
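
For illustration only, here is a minimal sketch of what a health check
script could look like. Nothing in it ships with TORQUE: the script path,
the /tmp check, the 1 GiB threshold, and the function names are all
hypothetical, and the exact value syntax for $down_on_error in the comment
is an assumption, so check the manpages before copying it. The only
behaviour it relies on is the one discussed above: the script reports a
problem by printing a message that begins with ERROR.

```python
#!/usr/bin/env python3
"""Hypothetical node health check for pbs_mom (a sketch, not a TORQUE script).

Assumed MOM config lines (verify exact syntax in the pbs_mom manpage):
    $node_check_script /usr/local/sbin/node_health.py   # path is an assumption
    $down_on_error true                                  # value syntax is an assumption
"""
import os
import shutil

MIN_FREE_BYTES = 1 * 1024**3  # hypothetical threshold: 1 GiB


def check_scratch(path="/tmp", min_free=MIN_FREE_BYTES):
    """Return a reason string if `path` is unusable for jobs, else None."""
    if not os.path.isdir(path):
        return f"{path} is missing"
    if not os.access(path, os.W_OK):
        return f"{path} is not writable"
    free = shutil.disk_usage(path).free
    if free < min_free:
        return f"{path} has only {free} bytes free"
    return None


def health_message(checks):
    """Run checks in order; return 'ERROR <reason>' for the first failure, or ''."""
    for check in checks:
        problem = check()
        if problem:
            return f"ERROR {problem}"
    return ""


def main():
    message = health_message([check_scratch])
    if message:
        # A line beginning with ERROR is what marks the node as unhealthy.
        print(message)


if __name__ == "__main__":
    main()
```

A healthy node prints nothing, so the MOM has no ERROR message to act on.
Additional checks can be added as further functions in the list passed to
health_message().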



