[torquedev] [patch] bind to ip on multihomed pbs_servers

Toni L. Harbaugh-Blackford [Contr] harbaugh at ncifcrf.gov
Fri Feb 8 01:01:26 MST 2008


I agree this is a much-needed feature.

I also have a patch, but mine is more invasive, modifying the svr_connect() 
and client_to_svr() functions by adding the ip address to bind to as a passed 
argument.  Your patch is much simpler, so I hope it makes it in.
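
For readers comparing the two designs, here is a minimal sketch of the
"pass the address as an argument" idea. It is not the patch mentioned above
and does not reproduce the real svr_connect() or client_to_svr() signatures;
connect_from() and its parameters are hypothetical names chosen for
illustration, and the local port is left to the kernel as a simplification.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper, NOT a TORQUE function.
 * Opens a TCP connection to remote_ip:remote_port with local_ip as the
 * source address. Both addresses are in network byte order; the port is in
 * host byte order. Passing htonl(INADDR_ANY) as local_ip lets the kernel
 * pick the source address, i.e. the behaviour without an explicit bind.
 * Returns the connected socket, or -1 on error.
 */
int connect_from(in_addr_t local_ip, in_addr_t remote_ip,
                 unsigned short remote_port)
{
    struct sockaddr_in local, remote;
    int fd;

    assert(remote_port != 0);           /* a real destination port is required */

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&local, 0, sizeof local);
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = local_ip;   /* the address to send from */
    local.sin_port = 0;                 /* any free local port (simplification) */

    memset(&remote, 0, sizeof remote);
    remote.sin_family = AF_INET;
    remote.sin_addr.s_addr = remote_ip;
    remote.sin_port = htons(remote_port);

    /* bind() before connect() is what fixes the source address */
    if (bind(fd, (struct sockaddr *)&local, sizeof local) < 0 ||
        connect(fd, (struct sockaddr *)&remote, sizeof remote) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The cost of this design is that every caller has to know the address and
hand it down, which is why it touches more code than a single global.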

We actually do the same thing with the 'mobile' alias.

Toni

On Thu, 7 Feb 2008, Henning Glawe wrote:

  > Hello,
  > pbs_server does not bind to its assigned hostname/IP when a hostname is
  > given on the command line, as in
  > '/usr/sbin/pbs_server -a T -h torque.cluster'.
  > 
  > This is true both for incoming connections:
  > 
  > root@n030:~> lsof -p `pidof pbs_server`
  > COMMAND    PID USER   FD   TYPE  DEVICE    SIZE    NODE NAME
  > pbs_serve 1818 root    6u  IPv4 1550253             TCP *:15001 (LISTEN)
  > pbs_serve 1818 root    7u  IPv4 1550254             UDP *:15001
  > pbs_serve 1818 root    8u  IPv4 1550255             UDP *:1023
  > 
  > and, even worse, for outgoing ones: the source IP address of outgoing
  > packets does not appear to be set to the one derived from the -h option.
  > The pbs_moms don't like to talk to the server if it uses the wrong
  > source IP.
  > 
  > Background:
  > I intend to set up torque in our Linux cluster so that the pbs_server is
  > always reachable as hostname "torque" under IP 172.16.128.8, regardless of
  > which physical host it is running on.
  > As is common and useful in such cases, I use an IP alias, i.e. I assign a
  > second IP to the server's cluster-communication interface (both IPs on the
  > same subnet, so there is only a single route pointing to the interface):
  > 
  > root@n030:~> ip addr
  > 3: ethNFS: <BROADCAST,MULTICAST,UP,10000> mtu 1500 qdisc pfifo_fast qlen 1000
  >     link/ether 00:e0:81:2a:0e:1a brd ff:ff:ff:ff:ff:ff
  >     inet 172.16.0.30/16 brd 172.16.255.255 scope global ethNFS
  > 	...
  >     inet 172.16.128.8/16 brd 172.16.255.255 scope global secondary ethNFS:8
  > 
  > So the source IP of the torque UDP connections to the moms is usually the
  > first IP of the interface to which 172.16.128.8 is also bound.
  > 
  > 
  > Solution:
  > The attached patch binds pbs_server's connections explicitly to the IP given
  > with the -h option. If none is given, the server's behaviour is unchanged.
  > The problem is that the bind() calls are deep inside Libnet/Libifl, and as
  > these are part of the libtorque public API, a clean solution would require
  > an API change (communicating the IP to bind to from pbsd_main.c to
  > rpp.c, net_server.c, net_client.c and rm.c).
  > For my patch (which is a proof of concept), I added one global variable
  > containing the network-byte-order representation of the host's IP to
  > rpp.c, which is then also accessed by libnet. This variable is initialized
  > to INADDR_ANY, so unless it is set to something else, pbs_server's
  > behaviour is unchanged.
  > In the -h getopt handler of pbs_server, I added a statement setting this
  > variable to the given IP.
  > 
  > Could you please integrate something in the spirit of this patch into future
  > versions of torque? Or advise me on how to rewrite it so that it can be
  > accepted directly?
  > -- 
  > c u
  > henning
  > 
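
The global-variable approach described in the quoted message can be
sketched roughly as follows. This is an illustration under stated
assumptions, not the attached patch: bind_addr, set_bind_addr() and
open_bound_udp() are hypothetical names, and the real change lives in
rpp.c and libnet rather than in a single file.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical global (the patch's actual variable name is not shown in
 * this thread). Holds the address to bind to in network byte order.
 * INADDR_ANY is all zeros, so it is the same in either byte order and
 * leaves the old "bind to every address" behaviour in place by default.
 */
static in_addr_t bind_addr = INADDR_ANY;

/* What a -h option handler might call. Returns 0 on success, -1 on a bad IP. */
int set_bind_addr(const char *dotted_quad)
{
    struct in_addr a;

    assert(dotted_quad != NULL);
    if (inet_pton(AF_INET, dotted_quad, &a) != 1)
        return -1;                      /* not a valid IPv4 literal */
    bind_addr = a.s_addr;               /* already in network byte order */
    return 0;
}

/* Opens a UDP socket bound to bind_addr:port. Returns the fd, or -1. */
int open_bound_udp(unsigned short port)
{
    struct sockaddr_in local;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&local, 0, sizeof local);
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = bind_addr;  /* INADDR_ANY unless -h changed it */
    local.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&local, sizeof local) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

In this sketch a -h handler would call set_bind_addr("172.16.128.8") once,
before any socket is opened; datagrams sent from the resulting socket then
carry that address as their source instead of the interface's first IP.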

-------------------------------------------------------------------
Toni Harbaugh-Blackford                       harbaugh at ncifcrf.gov
System Administrator
Advanced Biomedical Computing Center (ABCC)
National Cancer Institute
Contractor - SAIC/Frederick

