[torqueusers] incorrect nodes file causes pbs_server startup failure

Garrick Staples garrick at usc.edu
Wed Nov 21 22:19:00 MST 2007


On Wed, Nov 14, 2007 at 06:34:20AM +1100, Chris Samuel alleged:
> On Wed, 14 Nov 2007, Michael Gutteridge wrote:
> 
> > We figured out the problem, and manually excising the node from the
> > list did the trick.  However, IMVHO, this shouldn't cause
> > pbs_server to exit completely, rather I'd expect a warning, but I'd
> > expect it could continue to set up the remaining nodes.  The
> > rationalization being that the remainder of the cluster could still
> > function even if there's a single misconfiguration.
> 
> I guess it could just mark the node down,offline instead, though I 
> don't know if that would stop pbs_server from trying to poll it..

We couldn't bring it up as down,offline because a failed name lookup really
is a failure to create the node.  We'd have to add in a ton of extra code to
retry the process later.

I'm having a hard time with the original request to let pbs_server continue
running.  On the one hand, it isn't really fatal to the rest of the cluster.
On the other, such misconfigs can easily go unnoticed.  I think that
personally, on my own cluster, I'd rather have the hard failure to point out
the obvious error.
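
A check run before starting pbs_server could catch this kind of entry up
front.  What follows is only a minimal sketch, not anything that ships with
TORQUE.  It assumes each nodes file entry begins with the hostname as its
first whitespace-separated field and that '#' starts a comment.  The helper
names first_fields and unresolvable are hypothetical.

```python
import socket


def first_fields(nodes_text):
    """Return the first whitespace-separated field of each non-empty entry.

    Assumption: that field is the hostname, and '#' starts a comment.
    """
    names = []
    for line in nodes_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            names.append(line.split()[0])
    return names


def unresolvable(names, resolve=socket.gethostbyname):
    """Return the names for which the resolver raises an error."""
    bad = []
    for name in names:
        try:
            resolve(name)
        except OSError:  # socket.gaierror and socket.herror are subclasses
            bad.append(name)
    return bad
```

Feeding the text of the nodes file through first_fields and then unresolvable
gives the list of names that fail lookup, which is presumably the same
condition that stops pbs_server here.  The resolver is a parameter so the
check can be exercised without touching DNS.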
