[torqueusers] RE: Health check script failure and offlining

Lennart Karlsson Lennart.Karlsson at nsc.liu.se
Fri Dec 9 01:48:35 MST 2005


Garrick,

You wrote:
> There's also the "down_on_error" server parameter.  It would be pretty
> easy to make an "offline_on_error" (or maybe something more flexible) if
> people want it.  I recall Chris Samuel requesting a way to prevent jobs
> from being scheduled on nodes that had rebooted.
> 
> I've had down_on_error enabled for the last few months, but just
> recently turned it off because maui was killing off too many jobs.  We
> opted for a cronjob too.


I would like to try an "offline_on_error" server parameter (or "maybe
something more flexible") together with a health check script.

What is the algorithm you use in your cron script and at what interval do
you run it? Are you basically only reading the 'message' attribute and
acting on it?

I would also like to prevent jobs from being scheduled on newly rebooted
nodes. Before letting them run jobs I would like to have a health check on
the node and also a check from the server side (e.g., does the interconnect
to and from this node still work?), but I am not sure about the best model
to implement this. Of course a check of the node could be initiated from the
server side.

I am not up-to-date on the health check implementation. After a restart,
does it automatically run the health check script before reconnecting
to the pbs server, meaning that the 'message' attribute can be read by
the server before deciding on offline or ready status for the compute node?
In that case, would it perhaps be better to let an admin-written script or
program on the server side read the 'message' attribute and make the
decision, based on the attribute value and on whatever other checks the
admin wants to add?
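
To make the server-side idea concrete, here is a rough sketch in Python of
what such an admin-written script might look like. It is only an
illustration under assumptions: I assume the node listing arrives as text in
the "node name, then indented 'attribute = value' lines" layout, that a
failed health check shows up as a 'message' attribute starting with ERROR,
and that 'pbsnodes -o' is how the node gets marked offline. The function
names are made up for this example, and the script only builds the commands
rather than running them.

```python
def parse_node_listing(text):
    """Parse a node listing into {node_name: {attribute: value}}.

    Assumed layout (not verified against any particular TORQUE version):
    an unindented node name, followed by indented 'attribute = value'
    lines, with blank lines between nodes.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None          # blank line ends the current node
            continue
        if not line[0].isspace():
            current = line.strip()  # unindented line starts a new node
            nodes[current] = {}
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def nodes_to_offline(nodes):
    """Return names of nodes reporting an ERROR message that are not
    already offline. An admin could add further checks here, e.g. an
    interconnect test run from the server side."""
    selected = []
    for name, attrs in sorted(nodes.items()):
        states = attrs.get("state", "").split(",")
        message = attrs.get("message", "")
        if message.startswith("ERROR") and "offline" not in states:
            selected.append(name)
    return selected


def offline_commands(names):
    """Build (but do not run) one 'pbsnodes -o <node>' command per node."""
    return [["pbsnodes", "-o", name] for name in names]
```

Run from cron, something like this would take the output of 'pbsnodes -a',
pass it through parse_node_listing() and nodes_to_offline(), and hand each
resulting command to subprocess. Keeping the decision in one function means
the site-specific rules stay in a single place.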

-- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
   National Supercomputer Centre in Linkoping, Sweden
   http://www.nsc.liu.se



