[torqueusers] incorrect nodes file causes pbs_server startup failure

Michael Gutteridge mgutteri at fhcrc.org
Mon Nov 26 11:17:08 MST 2007


On Wed, 2007-11-21 at 21:19 -0800, Garrick Staples wrote:
> We couldn't bring it up down,offline because it really is the name lookup
> really is a failure to create the node.  We'd have to add in a ton of extra
> code to retry the process later.
> 
> I'm having a hard time with the original request to let pbs_server continue
> running.  On the one hand, it isn't really fatal to the rest of the cluster.
> On the other, such misconfigs can easily go unnoticed.  I think that
> personally, on my own cluster, I'd rather have the hard failure to point out
> the obvious error.

I don't know that I have a truly compelling argument in favor of letting
pbs_server continue running in this particular scenario.  The two points
I might make about that are:
      * What happens when a lookup failure like this occurs "in
        flight", while pbs_server is already running?  IIRC, pbs_server
        continues happily along, although I don't recall what state the
        node ends up in (offline? down?).
      * If the preferred mechanism for managing nodes is qmgr rather
        than editing the nodes file, you've got a chicken-and-egg
        problem (you can't get pbs_server to run, so you can't use it
        to expunge the bad node).

Not strong arguments, I'll grant you. For what it's worth, the error
message is clear, the resolution is simple, so this is certainly nothing
more than personal preference.
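
To illustrate the kind of check that would catch this before startup, here
is a minimal sketch of a pre-flight test for the nodes file. It is not part
of TORQUE; the function names are hypothetical, and it assumes that the
first whitespace-separated token on each non-blank, non-comment line is the
node's hostname and that comment lines begin with "#".

```python
import socket


def node_hostnames(nodes_text):
    """Yield the first token of each non-blank, non-comment line.

    Assumption: that token is the node's hostname.
    """
    for line in nodes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        yield line.split()[0]


def unresolvable_nodes(nodes_text, resolve=socket.gethostbyname):
    """Return the hostnames for which the resolver raises OSError.

    socket.gaierror and socket.herror are both subclasses of OSError,
    so a failed name lookup ends up in the returned list.
    """
    bad = []
    for host in node_hostnames(nodes_text):
        try:
            resolve(host)
        except OSError:
            bad.append(host)
    return bad
```

Reading the nodes file into a string and passing it to unresolvable_nodes()
before starting pbs_server would list any entries to fix by hand, which
sidesteps the qmgr chicken-and-egg problem. The resolve parameter is only
there so the lookup can be swapped out for testing.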

Thanks

Michael

-- 
Before enlightenment, chop wood, haul water. 
After enlightenment, chop wood, haul water.
   - Zen Proverb


