[torqueusers] pbs_mom binding to a given IP?
garrick at usc.edu
Mon Dec 5 12:33:05 MST 2005
On Sun, Dec 04, 2005 at 02:12:50PM -0500, Juan Gallego alleged:
> greetings all,
> i'm deploying torque (v2.0p2) on our cluster and i have a problem stemming
> from our awkward network configuration: each node has 2 CPUs, and 2 NICs,
> with each NIC connected to a separate network, so each CPU is
> `virtually bound' to one of the NICs. to the user, it looks like
> one cpu, one `virtual node': node0-0, node0-1 (each of the CPUs on node0),
> node1-0, node1-1 (on node1), etc.
> the pbs_server node file contains the following:
> node0-0 np=1
> node0-1 np=1
> node1-0 np=1
> node1-1 np=1
> node2-0 np=1
> node2-1 np=1
> the pbs_mom config file contains 2 $pbsserver entries, one for each interface
> on the server (this was necessary to get pbs_server to see the status
> from both the *-0 and *-1 nodes).
> the problem is that pbs_mom seems to get confused when it gets what it
> considers duplicate requests from its peers (it's really getting one
> from say node1-0 directed to `node0-0' and another directed to `node0-1').
> this could be avoided by running 2 pbs_mom per node (or one per `virtual
> node' if you will), but now it can't be done because pbs_mom binds to the
> INADDR_ANY address, so the second pbs_mom's bind fails. if there was an
> option to bind a pbs_mom to a particular IP (one to the -0 and the other to
> -1), then our weird setup would work.
> thoughts? suggestions? options? alternatives?
I don't think the bind address is going to be your only problem with 2
MOMs on 1 node. You'll probably need 2 different PBS_SERVER_HOME directories.
Why have this awkward setup? TORQUE handles SMP nodes just fine. Is
there something lacking in the SMP support? I get the feeling you are
solving a problem that should be solved with the scheduler, or maybe
full OS virtualization.
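For reference, the bind failure described in the quoted message is ordinary socket behaviour and is not specific to TORQUE. Below is a minimal Python sketch of it, not pbs_mom code: a second TCP socket cannot bind to the wildcard address (INADDR_ANY, 0.0.0.0) on a port that is already taken, while two sockets bound to two distinct local addresses can share the same port number. The helper names (try_bind, bind_demo) are made up for illustration, and the sketch assumes a Linux host where both 127.0.0.1 and 127.0.0.2 are usable loopback addresses, standing in for the two NIC addresses.

```python
import errno
import socket


def try_bind(addr, port):
    """Return a TCP socket bound to (addr, port), or None if already in use."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((addr, port))
    except OSError as e:
        s.close()
        if e.errno == errno.EADDRINUSE:
            return None
        raise  # any other error (e.g. address not available) is unexpected
    return s


def bind_demo():
    """Compare two wildcard binds with two per-address binds on one port."""
    results = {}

    # Case 1: both sockets use INADDR_ANY, like two pbs_mom processes would.
    first = try_bind("0.0.0.0", 0)          # port 0: let the kernel pick
    port = first.getsockname()[1]
    second = try_bind("0.0.0.0", port)      # same port, same wildcard address
    results["wildcard_twice"] = second is not None
    if second is not None:
        second.close()
    first.close()

    # Case 2: each socket is bound to its own local address.
    a = try_bind("127.0.0.1", 0)
    port = a.getsockname()[1]
    b = try_bind("127.0.0.2", port)         # same port, different address
    results["per_address"] = b is not None
    if b is not None:
        b.close()
    a.close()

    return results


if __name__ == "__main__":
    print(bind_demo())
```

On such a host the first case should report a failed second bind and the second case should succeed. That only shows the socket layer would allow one daemon per address; it says nothing about the rest of what two MOMs on one node would need.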
Garrick Staples, Linux/HPCC Administrator
University of Southern California