[torqueusers] incorrect nodes file causes pbs_server startup failure

Garrick Staples garrick at usc.edu
Mon Nov 26 15:31:08 MST 2007


On Mon, Nov 26, 2007 at 10:17:08AM -0800, Michael Gutteridge alleged:
> On Wed, 2007-11-21 at 21:19 -0800, Garrick Staples wrote:
> > We couldn't bring it up as down,offline because the name lookup
> > failure really is a failure to create the node.  We'd have to add in a
> > ton of extra code to retry the process later.
> > 
> > I'm having a hard time with the original request to let pbs_server
> > continue running.  On the one hand, it isn't really fatal to the rest
> > of the cluster.  On the other, such misconfigs can easily go unnoticed.
> > I think that personally, on my own cluster, I'd rather have the hard
> > failure to point out the obvious error.
> 
> I don't know that I have a truly compelling argument in favor of letting
> pbs_server continue running in this particular scenario.  The two points
> I might make about that are:
>       * what happens when something like this happens "in flight"?
>         IIRC, pbs_server continues happily along, although I don't
>         recall what the node looks like. (offline? down?)

Nothing.  If the node is up, then everything will continue working despite
lack of name resolution.  A file copy will probably break somewhere, but that
should raise an obvious error message as well.


>       * if the preferred mechanism for managing nodes is qmgr versus
>         editing the nodes file, you've got a chicken-egg problem (can't
>         get pbs_server to run, so can't get it to expunge the bad node).

I don't have a preferred mechanism.  If pbs_server is running, use qmgr.
Otherwise, use vi.  (Both are tasty, why choose one?)
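
For the vi route, a small pre-flight check can point at the offending line
before pbs_server is started.  The following is only a sketch in Python and
not part of TORQUE.  It assumes the nodes file holds one node per line with
the hostname as the first field and `#` starting a comment;
`unresolvable_nodes` is a hypothetical helper name.

```python
import socket


def unresolvable_nodes(lines, resolve=socket.gethostbyname):
    """Return (line_number, hostname) pairs whose hostname does not resolve.

    `lines` is any iterable of nodes-file lines.  `resolve` can be swapped
    out for testing; by default it performs a real lookup.
    """
    bad = []
    for lineno, line in enumerate(lines, start=1):
        text = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not text:
            continue  # blank or comment-only line
        host = text.split()[0]  # assumed: first field is the node name
        try:
            resolve(host)
        except OSError:  # socket.gaierror and socket.herror subclass OSError
            bad.append((lineno, host))
    return bad
```

Passing it the lines of the nodes file (often server_priv/nodes under the
TORQUE home directory, though that depends on the install) gives the line
numbers to fix in the editor.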

 
> Not strong arguments, I'll grant you. For what it's worth, the error
> message is clear, the resolution is simple, so this is certainly nothing
> more than personal preference.

Certainly debatable.  I think this is a case for my imaginary veto-power to
keep the status quo.

If someone wants to write a patch that partially creates the node in a down
state and periodically retries the IP lookup to finish the creation; and
doesn't make it racy and obnoxious, then we'll certainly have something more
concrete to consider.  

Silently dropping the node isn't an option.
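
To make the shape of such a patch concrete, here is a minimal sketch in
Python of "create the node down, retry the lookup later".  It is not TORQUE
code and sidesteps the hard part (doing this inside a running server without
races).  The `Node` class, `try_resolve`, and `retry_pending` are
hypothetical names invented for illustration, and the retry count and
interval are arbitrary.

```python
import socket
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    state: str = "down"            # partially created nodes start out down
    address: Optional[str] = None  # filled in once the lookup succeeds


def try_resolve(node, resolve=socket.gethostbyname):
    """Attempt one lookup; return True once the node has an address."""
    if node.address is not None:
        return True
    try:
        node.address = resolve(node.name)
    except OSError:
        return False  # leave the node in place, still down
    return True


def retry_pending(nodes, resolve=socket.gethostbyname,
                  rounds=3, interval=60.0, sleep=time.sleep):
    """Retry lookups for unresolved nodes; return names still unresolved.

    No node is ever removed from `nodes`: failures stay visible as down.
    """
    pending = [n for n in nodes if n.address is None]
    for attempt in range(rounds):
        pending = [n for n in pending if not try_resolve(n, resolve)]
        if not pending:
            break
        if attempt < rounds - 1:
            sleep(interval)  # wait before the next round of lookups
    return [n.name for n in pending]
```

The point of returning the leftover names rather than discarding them is
that an unresolved node remains listed and down, so the misconfiguration
stays visible.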

