[torqueusers] Multi-homed Question

Adam Emerich aemerich at us.ibm.com
Mon Jul 28 10:00:53 MDT 2008


I have a very large cluster (3060 Opteron-based nodes).  The compute nodes
are separated into 17 groups of 180 nodes.  I built the server with the
following options:

./configure --enable-docs --disable-gui --enable-syslog --with-scp \
    --disable-rpp --disable-spool

The management node (where the server is running) has two VLANs configured,
CVLAN and MVLAN.  We need the nodes to communicate back to the master on the
CVLAN.  The management node has a hostname of mn (MVLAN), but its CVLAN
interface is known as mnc to the compute nodes.  Hostname mn (MVLAN) is not
pingable from the compute nodes.  We are seeing a failure, in interactive
submission only, that looks like this:

Jul 28 10:08:23 rrp001a pbs_mom: Interrupted system call (4) in
TMomFinalizeChild, cannot open interactive qsub socket to host mn:39914 -
'cannot bind to port 1023 in client_to_svr - connection refused' - check
routing tables/multi-homed host issues

Here is the breakdown of the connections:

Management node (hostname mn):
      MVLAN             11.16.0.1
      CVLAN             11.15.0.1
Compute Nodes:
      MVLAN (hostname mn)                 not accessible
      CVLAN (hostname mnc)                11.15.0.1
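
To make the layout above concrete, here is a minimal sketch of the
reachability it implies.  The /16 prefix length is an assumption (no netmask
is given above), as is the premise that nothing routes between the two VLANs;
the function name is purely illustrative.

```python
import ipaddress

# Management-node interfaces, taken from the breakdown above.
SERVER_ADDRS = {"mn": "11.16.0.1", "mnc": "11.15.0.1"}

# ASSUMPTION: a /16 netmask on the CVLAN; the real netmask is not stated.
CVLAN = ipaddress.ip_network("11.15.0.0/16")


def reachable_names(client_net, server_addrs):
    """Return the server names whose address lies inside client_net.

    Assumes there is no routing between VLANs, so a compute node can only
    reach addresses on the network it is attached to.
    """
    return sorted(
        name
        for name, addr in server_addrs.items()
        if ipaddress.ip_address(addr) in client_net
    )


print(reachable_names(CVLAN, SERVER_ADDRS))  # prints ['mnc']
```

Under those assumptions only mnc is reachable from a compute node, which is
consistent with mn not being pingable from there.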

The compute node is trying to open a qsub connection to hostname mn.  We
have set the server_name file on the compute nodes to mnc and added the
following to the mom_priv/config file:

$clienthost mnc
$restricted mnc

We also set server_name in qmgr as follows:

set server server_name=mnc

If we add an alias to the /etc/hosts file on a compute node to make
hostname mn point to mnc, everything works fine, but this is not the
solution we would like to use.  Is there a way to tell the mom clients to
respond to a different hostname than what is being sent from the management
node?
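
A minimal sketch, assuming Python is available on a compute node, of checking
how the two names resolve there.  The helper names resolve_all and report are
hypothetical; only the standard socket module is used.

```python
import socket


def resolve_all(names, resolver=socket.gethostbyname):
    """Map each hostname to its IPv4 address, or None if it does not resolve."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


def report(names, resolver=socket.gethostbyname):
    """Return one human-readable line per hostname."""
    resolved = resolve_all(names, resolver)
    return [f"{name}: {resolved[name] or 'does not resolve'}" for name in names]
```

Calling print("\n".join(report(["mn", "mnc"]))) on a compute node shows what
each name maps to locally.  With the /etc/hosts alias described above in
place, both names would be expected to map to 11.15.0.1.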

Thanks
Adam


