[torqueusers] Remote node health check?

Gabe Turner gabe at msi.umn.edu
Tue Oct 19 08:23:03 MDT 2010


On Tue, Oct 19, 2010 at 08:54:38AM -0500, Chris Evert wrote:
> Torqueusers,
> 
> I have some nodes which intermittently stop accepting queued jobs.  I 
> get messages like
> 
> 10/18/2010 17:12:31;0080;PBS_Server;Req;req_reject;Reject reply 
> code=15041(Execution server rejected request REJHOST=badnode MSG=cannot 
> send job to badnode, state=PRERUN), aux=0, type=RunJob, from root at doler
> 
> in the server log.
> 
> Multiple jobs seem to be assigned to this node (since it is not running 
> anything) and they languish waiting to be accepted.
> 
> When I take the node offline, the jobs go find good nodes and run.
> 
> Is there a way to specify a remote node health check so I can take this 
> node down when this bad behavior is detected?  Is there a way to detect 
> this bad behavior other than scanning the log?  (Is this really a maui 
> question, since it has to do with scheduling jobs?)

There is the node health check, which runs on the node itself.  We find it
extremely useful for setting a 'sick' node offline (for known, definable
symptoms of 'sick').

http://www.clusterresources.com/products/torque/docs/10.2healthcheck.shtml
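
As a rough illustration of what such a check can look like, here is a
minimal sketch in Python.  It assumes the MOM is pointed at the script with
$node_check_script (and optionally $node_check_interval) in mom_priv/config,
and that a failure is reported by printing a line beginning with ERROR.  The
path list and free-space threshold are hypothetical, not anything we run;
how a reported ERROR turns into an offline node depends on your scheduler
and site tooling.

```python
#!/usr/bin/env python3
"""Sketch of a node health check script meant to be run by pbs_mom."""
import os
import shutil
import sys

# Hypothetical thresholds and paths; adjust for your own nodes.
MIN_FREE_BYTES = 1 * 1024**3
CHECK_PATHS = ["/tmp"]


def check_free_space(path, min_free_bytes):
    """Return an error message if path is missing or low on space, else None."""
    if not os.path.isdir(path):
        return f"{path} is missing"
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        return f"{path} has only {free} bytes free"
    return None


def run_checks(paths, min_free_bytes):
    """Return the first failure message found, or None if all checks pass."""
    for path in paths:
        problem = check_free_space(path, min_free_bytes)
        if problem:
            return problem
    return None


if __name__ == "__main__":
    problem = run_checks(CHECK_PATHS, MIN_FREE_BYTES)
    if problem:
        # The MOM looks for output that begins with the keyword ERROR.
        print(f"ERROR {problem}")
    sys.exit(0)
```

Keep the checks cheap and local to the node; the script runs on every check
interval and should not call resource manager client commands itself.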

I'm afraid, though, that we haven't gotten away from scanning logs for
external signs of node illness.  We use syslog-ng on the management node
(or log host, if separate) of a cluster and send some logs to a pipe, which
a perl daemon monitors for signs of illness.  If something is seen (such as
an OOM killer invocation, which would have prevented the node health check
from running), we still set the node offline (just using pbsnodes).
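
The same idea can be sketched in a few lines.  This is not our perl daemon,
just a hedged Python illustration of the approach: read log lines from a
pipe on stdin, pick out a node name, and mark it offline with "pbsnodes -o".
The two patterns are assumptions: one matches the REJHOST= field from the
server log line quoted above, the other assumes traditional syslog lines
("Mon DD HH:MM:SS host kernel: ...") for OOM killer messages.  The --apply
switch is made up for this example.

```python
#!/usr/bin/env python3
"""Sketch: watch log lines on stdin and set sick nodes offline."""
import re
import subprocess
import sys

# Assumed pattern: the rejecting host as it appears in the pbs_server log.
REJHOST_RE = re.compile(r"Execution server rejected request REJHOST=(\S+)")
# Assumed pattern: traditional syslog format, kernel OOM killer messages.
OOM_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+kernel:.*"
    r"(?:invoked oom-killer|Out of memory)"
)


def find_sick_node(line):
    """Return the node name if the line shows a known symptom, else None."""
    for pattern in (REJHOST_RE, OOM_RE):
        match = pattern.search(line)
        if match:
            return match.group(1)
    return None


def build_offline_command(node):
    """Build the pbsnodes invocation that marks a node offline."""
    return ["pbsnodes", "-o", node]


def monitor(stream, dry_run=True):
    """Scan lines from stream; offline each sick node once. Returns the set."""
    offlined = set()
    for line in stream:
        node = find_sick_node(line)
        if node is None or node in offlined:
            continue
        offlined.add(node)
        command = build_offline_command(node)
        if dry_run:
            print("would run:", " ".join(command))
        else:
            subprocess.run(command, check=False)
    return offlined


if __name__ == "__main__":
    # Hypothetical switch: only touch nodes when --apply is given.
    monitor(sys.stdin, dry_run="--apply" not in sys.argv)
```

A single rejection may be transient, so in practice you would probably want
to count occurrences per node before acting rather than offlining on the
first match.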

I can think of other, more elegant ways to cover many signs of illness,
though, such as Nagios.  However, setting up Nagios simply for checking for
a few signs of node illness might be a bit of overkill.

I recommend digging into the MOM logs on that node to figure out why it is
rejecting jobs (it is the REJHOST in your server log).  It may be indicative
of a failure state that you can correct, or at least mitigate.

HTH,

Gabe

-- 
Gabe Turner                                             gabe at msi.umn.edu
HPC Systems Administrator,
University of Minnesota
Supercomputing Institute                          http://www.msi.umn.edu

