[torqueusers] Remote node health check?
chris.evert at geokinetics.com
Tue Oct 19 07:54:38 MDT 2010
I have some nodes which intermittently stop accepting queued jobs. I
get messages like
10/18/2010 17:12:31;0080;PBS_Server;Req;req_reject;Reject reply
code=15041(Execution server rejected request REJHOST=badnode MSG=cannot
send job to badnode, state=PRERUN), aux=0, type=RunJob, from root at doler
in the server log.
Multiple jobs seem to get assigned to this node (presumably because it
is not running anything), and they languish waiting to be accepted.
When I take the node offline, the jobs go find good nodes and run.
Is there a way to specify a remote node health check so I can take this
node down when this bad behavior is detected? Is there a way to detect
this bad behavior other than scanning the log? (Is this really a Maui
question, since it has to do with scheduling jobs?)
Thanks for any advice,
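
For reference, the log-scanning approach mentioned in the question could be automated roughly as follows. This is only a minimal sketch under stated assumptions: each server log entry sits on a single line (the entry quoted above is wrapped by the mail client), the function names and the threshold of 3 rejections are hypothetical, and the log path is passed in by the caller. It only prints the `pbsnodes -o <node>` command that would mark a node offline; it does not run it.

```python
import re
import sys
from collections import Counter

# Matches the rejected host in req_reject entries such as the one quoted above.
REJHOST_RE = re.compile(r"req_reject;.*REJHOST=(\S+)")


def count_rejections(log_lines):
    """Count req_reject entries per rejected host (REJHOST=...)."""
    counts = Counter()
    for line in log_lines:
        match = REJHOST_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspect_nodes(log_lines, threshold=3):
    """Return hosts rejected at least `threshold` times (threshold is an assumption)."""
    counts = count_rejections(log_lines)
    return sorted(node for node, n in counts.items() if n >= threshold)


def offline_command(node):
    """Build the command that marks a node offline; the caller decides whether to run it."""
    return ["pbsnodes", "-o", node]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python reject_scan.py /path/to/server_log
    with open(sys.argv[1]) as log:
        for node in suspect_nodes(log):
            print(" ".join(offline_command(node)))
```

Run from cron against the current server log, this would list candidate nodes to take offline; clearing the offline state afterwards (`pbsnodes -c <node>`) would remain a manual step.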