[torqueusers] RE: [Mauiusers] node health check
saydakov at yahoo-inc.com
Tue Nov 14 11:45:43 MST 2006
> -----Original Message-----
> From: mauiusers-bounces at supercluster.org [mailto:mauiusers-
> bounces at supercluster.org] On Behalf Of Garrick Staples
> Sent: Monday, November 13, 2006 5:02 PM
> To: mauiusers at supercluster.org
> Subject: Re: [Mauiusers] node health check
> On Mon, Nov 13, 2006 at 04:10:43PM -0800, Alexander Saydakov alleged:
> > Hi!
> > Does Maui support node health checks done by setting $node_check_script
> > in Torque MOM's config?
> Does it read the ERROR message? no.
> At USC, we have a 10 minute cronjob that makes those nodes offline. You
> can also have MOM or the server set nodes "down" with any ERROR.
You mean patching pbs_mom or pbs_server, don't you? I don't see any existing
support for this.
By the way, I don't see why the decision to set a node down should be
delegated to the server or scheduler. In this case the node itself knows that
it is incapable of executing any jobs. In my view, it would be great if
pbs_mom could report itself as down when the node check script returns a
non-zero exit code. STDERR output could be used to report the reason; using
STDOUT is less clean. All of that could be configurable (like
"down_on_node_check_failure true" or something).
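
For reference, the cron-based workaround Garrick describes could look
roughly like the sketch below. This is only an illustration, not what USC
actually runs. It assumes that "pbsnodes -a" prints each node name
unindented, followed by indented attribute lines, and that a failed health
check shows up in those lines as "message=ERROR ..."; the exact output
format may differ between Torque versions. The function names are
hypothetical.

```python
import re
import subprocess

# Assumption: a failed health check appears as "message=ERROR ..." somewhere
# in the node's indented attribute lines. Adjust the pattern to your output.
ERROR_RE = re.compile(r"message\s*=\s*ERROR")


def nodes_with_errors(pbsnodes_output):
    """Return the names of nodes whose attribute lines carry an ERROR message."""
    failed = []
    current = None
    for line in pbsnodes_output.splitlines():
        if not line.strip():
            # A blank line ends the current node's record.
            current = None
            continue
        if not line[0].isspace():
            # An unindented line is taken to be a node name.
            current = line.strip()
        elif current and ERROR_RE.search(line) and current not in failed:
            failed.append(current)
    return failed


def offline_failed_nodes():
    """Mark every node that reports an ERROR as offline (needs admin rights)."""
    listing = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    ).stdout
    for node in nodes_with_errors(listing):
        # "pbsnodes -o <node>" marks the node offline.
        subprocess.run(["pbsnodes", "-o", node], check=True)
```

Calling offline_failed_nodes() from a cron entry every few minutes would
approximate the setup described above. It still leaves the decision outside
pbs_mom, which is exactly the delegation I would like to avoid.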