[torqueusers] incorrect nodes file causes pbs_server startup
garrick at usc.edu
Mon Nov 26 15:31:08 MST 2007
On Mon, Nov 26, 2007 at 10:17:08AM -0800, Michael Gutteridge alleged:
> On Wed, 2007-11-21 at 21:19 -0800, Garrick Staples wrote:
> > We couldn't bring it up down,offline because a failed name lookup
> > really is a failure to create the node. We'd have to add in a ton of
> > extra code to retry the process later.
> > I'm having a hard time with the original request to let pbs_server
> > continue running. On the one hand, it isn't really fatal to the rest
> > of the cluster. On the other, such misconfigs can easily go unnoticed.
> > I think that personally, on my own cluster, I'd rather have the hard
> > failure to point out the obvious error.
> I don't know that I have a truly compelling argument in favor of letting
> pbs_server continue running in this particular scenario. The two points
> I might make about that are:
> * what happens when something like this happens "in flight"?
> IIRC, pbs_server continues happily along, although I don't
> recall what the node looks like. (offline? down?)
Nothing. If the node is up, then everything will continue working despite
lack of name resolution. A file copy will probably break somewhere, but that
should raise an obvious error message as well.
> * if the preferred mechanism for managing nodes is qmgr versus
> editing the nodes file, you've got a chicken-egg problem (can't
> get pbs_server to run, so can't get it to expunge the bad node).
I don't have a preferred mechanism. If pbs_server is running, use qmgr.
Otherwise, use vi. (Both are tasty, why choose one?)
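For the vi case, a rough pre-flight check along these lines could flag
names that won't resolve before pbs_server is started. This is only a
sketch: it assumes the first field on each line of the nodes file is the
node name and that "#" starts a comment, and unresolvable_nodes() is a
made-up helper, not anything that ships with TORQUE.

```python
import socket


def unresolvable_nodes(nodes_text, resolve=socket.gethostbyname):
    """Return node names from nodes-file text that fail name resolution.

    Assumes (not verified against TORQUE's parser) that the first
    whitespace-separated field on each line is the node name and that
    '#' begins a comment.
    """
    bad = []
    for line in nodes_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue  # skip blank and comment-only lines
        host = line.split()[0]  # ignore np=... and property fields
        try:
            resolve(host)
        except OSError:  # socket.gaierror and socket.herror subclass OSError
            bad.append(host)
    return bad
```

Feed it the contents of the nodes file and fix whatever names it
returns before starting pbs_server.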
> Not strong arguments, I'll grant you. For what it's worth, the error
> message is clear, the resolution is simple, so this is certainly nothing
> more than personal preference.
Certainly debatable. I think this is a case for my imaginary veto-power to
keep the status quo.
If someone wants to write a patch that partially creates the node in a down
state and periodically retries the IP lookup to finish the creation, and
doesn't make it racy and obnoxious, then we'll certainly have something more
concrete to consider.
Silently dropping the node isn't an option.
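To make the "create it down, retry the lookup later" idea concrete, here
is a very rough single-threaded sketch of the shape such a thing might
take. It is not pbs_server code and says nothing about the racy parts;
PendingNode, retry_pending() and retry_until_resolved() are all
hypothetical names, and the 60-second interval is an arbitrary
assumption.

```python
import socket
import time


class PendingNode:
    """A node that exists by name but has no resolved address yet."""

    def __init__(self, name):
        self.name = name
        self.state = "down"   # stays down until creation can finish
        self.address = None   # filled in once the lookup succeeds


def retry_pending(nodes, resolve=socket.gethostbyname):
    """Try each lookup once; return the nodes that are still unresolved."""
    still_pending = []
    for node in nodes:
        try:
            node.address = resolve(node.name)
        except OSError:  # lookup failed again; keep the node, keep it down
            still_pending.append(node)
    return still_pending


def retry_until_resolved(nodes, interval=60, max_rounds=10,
                         resolve=socket.gethostbyname, sleep=time.sleep):
    """Retry periodically, giving up after max_rounds; return leftovers."""
    pending = list(nodes)
    for _ in range(max_rounds):
        pending = retry_pending(pending, resolve)
        if not pending:
            break
        sleep(interval)
    return pending
```

Note that a node that never resolves is handed back to the caller rather
than dropped, so it remains visible as down.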