[torqueusers] Remote node health check?

Chris Evert chris.evert at geokinetics.com
Tue Oct 19 07:54:38 MDT 2010


I have some nodes which intermittently stop accepting queued jobs.  I 
get messages like

10/18/2010 17:12:31;0080;PBS_Server;Req;req_reject;Reject reply code=15041(Execution server rejected request REJHOST=badnode MSG=cannot send job to badnode, state=PRERUN), aux=0, type=RunJob, from root@doler

in the server log.

Because the node is not running anything, multiple jobs seem to get 
assigned to it, and they languish waiting to be accepted.

When I take the node offline, the jobs go find good nodes and run.

Is there a way to specify a remote node health check so I can take this 
node down when this bad behavior is detected?  Is there a way to detect 
this bad behavior other than scanning the log?  (Is this really a maui 
question, since it has to do with scheduling jobs?)
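
For reference, the log-scanning stopgap I would like to avoid could look 
roughly like the sketch below. It is only an illustration under 
assumptions: the regular expression is inferred from the single log line 
quoted in this message, the function names and the threshold of 3 are 
made up, and it assumes `pbsnodes -o` is the command used to mark a node 
offline. It builds that command but does not run it.

```python
import re
from collections import Counter

# Pattern inferred from the one sample line in this message; other
# TORQUE versions may format the reject message differently.
REJECT_RE = re.compile(r"Reject reply.*?code=15041\(.*?REJHOST=(\S+)")


def count_rejects(lines):
    """Return a Counter mapping node name -> number of code=15041 rejects."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def nodes_to_offline(lines, threshold=3):
    """Return nodes rejected at least `threshold` times (arbitrary cutoff)."""
    return sorted(n for n, c in count_rejects(lines).items() if c >= threshold)


def offline_command(node):
    """Build, but do not execute, the command that marks a node offline."""
    return ["pbsnodes", "-o", node]
```

In use, the lines of the current server log would be passed to 
nodes_to_offline(), and each returned node name would be handed to 
offline_command() for an administrator or a cron job to run.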

torque 2.3.10
maui 3.3

Thanks for any advice,
Chris Evert
Geokinetics, Inc.
Houston, TX
