[torqueusers] incorrect nodes file causes pbs_server startup
garrick at usc.edu
Wed Nov 21 22:19:00 MST 2007
On Wed, Nov 14, 2007 at 06:34:20AM +1100, Chris Samuel alleged:
> On Wed, 14 Nov 2007, Michael Gutteridge wrote:
> > We figured out the problem, and manually excising the node from the
> > list did the trick. However, IMVHO, this shouldn't cause
> > pbs_server to exit completely, rather I'd expect a warning, but I'd
> > expect it could continue to set up the remaining nodes. The
> > rationalization being that the remainder of the cluster could still
> > function even if there's a single misconfiguration.
> I guess it could just mark the node down,offline instead, though I
> don't know if that would stop pbs_server from trying to poll it..
We couldn't bring it up down,offline because the name lookup failure really
is a failure to create the node. We'd have to add a ton of extra code to
retry the process later.
I'm having a hard time with the original request to let pbs_server continue
running. On the one hand, it isn't really fatal to the rest of the cluster.
On the other, such misconfigs can easily go unnoticed. I think that
personally, on my own cluster, I'd rather have the hard failure to point out
the obvious error.
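
For anyone who wants to catch this before pbs_server does, here is a minimal
sketch of a pre-flight check, not anything that ships with TORQUE. It assumes
the hostname is the first whitespace-separated field on each line of the nodes
file and that lines starting with '#' are comments; the function name
unresolvable_nodes and its resolve parameter are made up for illustration.

```python
import socket


def unresolvable_nodes(lines, resolve=socket.gethostbyname):
    """Return hostnames from nodes-file lines that fail name lookup.

    Assumption: the hostname is the first whitespace-separated field,
    and lines beginning with '#' are treated as comments.
    """
    bad = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and (assumed) comment lines.
        if not fields or fields[0].startswith("#"):
            continue
        host = fields[0]
        try:
            resolve(host)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            bad.append(host)
    return bad
```

Feeding it the lines of your nodes file (wherever that lives on your install)
and refusing to start pbs_server when the returned list is non-empty gives the
same hard failure, just with the offending names spelled out.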