[torqueusers] incorrect nodes file causes pbs_server startup failure

Michael Gutteridge mgutteri at fhcrc.org
Tue Nov 13 11:37:31 MST 2007


We ran into a situation recently where we could not get pbs_server to
start up because the nodes file referenced a node that had been removed
from our naming systems:

[root ~]# /usr/sbin/pbs_server 
PBS_Server: process_host_name_part, host node04 not found
PBS_Server: pbsd_init(setup_nodes), could not create node "node04", error = 15062
PBS_Server: PBS_Server, pbsd_init failed
[root ~]# host node04
Host node04 not found: 3(NXDOMAIN)

We figured out the problem, and manually removing the node from the
nodes file did the trick.  However, IMVHO, this shouldn't cause
pbs_server to exit completely.  I'd expect it to log a warning and
continue setting up the remaining nodes.  The rationale is that the
rest of the cluster could still function even if there's a single
misconfigured entry.

Or is it not as simple as all that?
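
In the meantime, a pre-flight check along the following lines could
catch a stale entry before pbs_server is started.  This is only a
minimal sketch, not anything that ships with TORQUE: the function names
are hypothetical, and it assumes that each nodes file line begins with
the host name followed by optional attributes such as np=N, and that
"#" starts a comment.

```python
import socket


def parse_node_names(text):
    """Return the host names found in nodes-file text.

    Assumption: the host name is the first whitespace-separated token
    on each line, and anything after '#' is a comment.
    """
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            names.append(line.split()[0])
    return names


def unresolved_nodes(names, resolve=socket.gethostbyname):
    """Return the names that the resolver cannot look up."""
    bad = []
    for name in names:
        try:
            resolve(name)
        except OSError:  # socket.gaierror and socket.herror are subclasses
            bad.append(name)
    return bad


def check_nodes_file(path):
    """Read a nodes file and return the entries that do not resolve."""
    with open(path) as fh:
        return unresolved_nodes(parse_node_names(fh.read()))
```

Calling check_nodes_file() on the server's nodes file (commonly
server_priv/nodes under the TORQUE spool directory, though the location
depends on the install) would have returned ["node04"] in our case,
assuming the other entries resolve.  An empty list means every entry
resolved.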



