[torqueusers] RE: Health check script failure and offlining

Garrick Staples garrick at usc.edu
Fri Dec 9 10:35:57 MST 2005

On Fri, Dec 09, 2005 at 09:48:35AM +0100, Lennart Karlsson alleged:
> Garrick,
> You wrote:
> > There's also the "down_on_error" server parameter.  It would be pretty
> > easy to make an "offline_on_error" (or maybe something more flexible) if
> > people want it.  I recall Chris Samuel requesting a way to prevent jobs
> > from being scheduled on nodes that had rebooted.
> > 
> > I've had down_on_error enabled for the last few months, but just
> > recently turned it off because maui was killing off too many jobs.  We
> > opted for a cronjob too.
> I would like to try an "offline_on_error" server parameter (or "maybe
> something more flexible") together with a health check script.
> What is the algorithm you use in your cron script and at what interval do
> you run it? Are you basically only reading the 'message' attribute and
> acting on it?

for node in $(pbsnodes -a); do
   if node is already offline: skip to the next node
   if node reports an error: mark it offline and send an email
done
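
To make the loop above concrete, here is a minimal Python sketch of such a cron job. It is not the script we actually run. It assumes that `pbsnodes -a` prints each node name in column 0 followed by indented `key = value` lines, and that a failed health check shows up either as a `message` attribute starting with ERROR or as `message=ERROR...` inside the `status` attribute; check that against your own output. The function names are made up for illustration, and the email step is left as a comment.

```python
import subprocess


def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style output into {node_name: {attribute: value}}.

    Assumed layout: node name in column 0, attributes indented as
    'key = value', blank line between nodes.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None          # blank line ends the current node block
            continue
        if not line[0].isspace():
            current = line.strip()  # unindented line starts a new node
            nodes[current] = {}
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


def has_error(attrs):
    """True if the node's health check message reports an error (assumed format)."""
    if attrs.get("message", "").startswith("ERROR"):
        return True
    return "message=ERROR" in attrs.get("status", "")


def nodes_to_offline(text):
    """Names of nodes that report an error and are not already offline."""
    result = []
    for name, attrs in parse_pbsnodes(text).items():
        if "offline" in attrs.get("state", "").split(","):
            continue                # already offline: skip to the next node
        if has_error(attrs):
            result.append(name)
    return result


def main():
    """Cron entry point: query the server and offline every erroring node."""
    out = subprocess.run(["pbsnodes", "-a"], capture_output=True,
                         text=True, check=True).stdout
    for name in nodes_to_offline(out):
        subprocess.run(["pbsnodes", "-o", name], check=True)
        # A real cron job would also mail the admin here.
        print(f"marked {name} offline")
```

Calling `main()` from cron every few minutes would reproduce the behaviour described; the parsing is kept separate from the `pbsnodes` calls so it can be tried against saved output first.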

> I would also like to be able to prevent jobs from being scheduled on newly rebooted

We need to think this one through.  This could be a real management
headache.  I, for one, don't want to online nodes every time I reboot
them.

> nodes. Before letting them run jobs I would like to have a health check on
> the node and also a check from the server side (e.g., does the interconnect
> to and from this node still work?), but I am not sure about the best model
> to implement this. Of course a check of the node could be initiated from the
> server side.
> I am not up-to-date on the health check implementation. After a restart,
> does it automatically run the health check script before reconnecting
> to the pbs server, meaning that the 'message' attribute can be read by
> the server before deciding on offline or ready status for the compute node?
> In that case, it would perhaps be better to let an admin-written script/program
> on the server side read the 'message' attribute and make the decision
> depending on the attribute value and on all other algorithms that the admin
> can think of?

pbs_server assumes nodes are down until they report in.  pbs_mom won't
report in until the health check script has run.  So jobs can't be
scheduled on a node before the server has seen its error message.
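
For illustration, a minimal sketch of a node-side health check that would fit this sequence. It assumes the usual convention that the script prints a line beginning with ERROR to stdout when something is wrong and prints nothing otherwise. The disk check, the /tmp path, the 1 GiB threshold and the function names are placeholders, not anything we run.

```python
import shutil


def check_disk(path="/tmp", min_free_bytes=1 << 30, usage=shutil.disk_usage):
    """Return an 'ERROR ...' string if free space on path is too low, else None.

    `usage` is injectable so the check can be exercised without a real disk.
    """
    free = usage(path).free
    if free < min_free_bytes:
        return f"ERROR low disk space on {path}: {free} bytes free"
    return None


def run_checks(checks):
    """Run each check in order and return the first error message, or ''."""
    for check in checks:
        message = check()
        if message:
            return message
    return ""


def main():
    """Print the first error, if any, for pbs_mom to pick up."""
    message = run_checks([check_disk])
    if message:
        print(message)
```

A script like this would end with a call to `main()` and be configured as the MOM health check script; further checks (interconnect, mounts, and so on) can be added to the list passed to `run_checks`.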

Garrick Staples, Linux/HPCC Administrator
University of Southern California