[torquedev] [patch] bind to ip on multihomed pbs_servers
Toni L. Harbaugh-Blackford [Contr]
harbaugh at ncifcrf.gov
Fri Feb 8 01:01:26 MST 2008
I agree this is a much-needed feature.
I also have a patch, but mine is more invasive, modifying the svr_connect()
and client_to_svr() functions by adding the ip address to bind to as a passed
argument. Your patch is much simpler, so I hope it makes it in.
We actually do the same thing with the 'mobile' alias.
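To illustrate the argument-passing idea in general terms (this is not the patch mentioned above, and connect_from() is a hypothetical helper, not a torque function), a minimal sketch that takes the local address as a parameter and binds to it before connecting might look like this. Addresses are assumed to be in network byte order and the port in host byte order.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a TCP connection to remote_ip:remote_port
 * whose source address is local_ip. Passing INADDR_ANY as local_ip lets
 * the kernel pick the source address, as it would without the bind().
 * Returns the connected socket, or -1 on error.
 */
int connect_from(in_addr_t local_ip, in_addr_t remote_ip,
                 unsigned short remote_port)
{
    struct sockaddr_in local, remote;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    /* Pin the source address; port 0 means "any free local port". */
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = local_ip;
    local.sin_port = 0;
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        close(fd);
        return -1;
    }

    memset(&remote, 0, sizeof(remote));
    remote.sin_family = AF_INET;
    remote.sin_addr.s_addr = remote_ip;
    remote.sin_port = htons(remote_port);
    if (connect(fd, (struct sockaddr *)&remote, sizeof(remote)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The drawback is the one already noted: every caller has to be changed to supply the extra argument.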
On Thu, 7 Feb 2008, Henning Glawe wrote:
> pbs_server does not bind correctly to its assigned hostname/IP (with a
> hostname on the command line like in
> '/usr/sbin/pbs_server -a T -h torque.cluster').
> This is true both for incoming connections:
> root@n030:~> lsof -p `pidof pbs_server`
> COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
> pbs_serve 1818 root 6u IPv4 1550253 TCP *:15001 (LISTEN)
> pbs_serve 1818 root 7u IPv4 1550254 UDP *:15001
> pbs_serve 1818 root 8u IPv4 1550255 UDP *:1023
> and, even worse, for the outgoing ones, i.e. the source ip address of
> outgoing ip packets seems not to be correctly set to the one extracted from
> the -h option. The pbs_moms don't like to talk to the server if it uses the
> wrong source ip.
> I intend to set up torque in our linux cluster in such a way that the
> pbs_server is always reachable as hostname "torque" under ip 172.16.128.8,
> regardless of which physical host it is running on.
> As is common and useful in such cases, I use an IP alias, i.e. I assign a
> second ip to the server's cluster-communication-interface (both ips on the
> same subnet, so there is only a single route pointing to the interface):
> root@n030:~> ip addr
> 3: ethNFS: <BROADCAST,MULTICAST,UP,10000> mtu 1500 qdisc pfifo_fast qlen 1000
> link/ether 00:e0:81:2a:0e:1a brd ff:ff:ff:ff:ff:ff
> inet 172.16.0.30/16 brd 172.16.255.255 scope global ethNFS
> inet 172.16.128.8/16 brd 172.16.255.255 scope global secondary ethNFS:8
> So the source ip of the torque udp connections to the moms is usually the
> first ip of the interface (here 172.16.0.30) rather than the alias
> 172.16.128.8 that is also bound to it.
> The attached patch binds pbs_server's connections explicitly to the IP given
> with the -h option. If none is given, the server's behaviour is unchanged.
> The problem is that the bind() calls are deep inside Libnet/Libifl, and as
> these are part of the libtorque public API, a clean solution would cause a
> change in the API (communicate the IP-to-be-bound-to from pbsd_main.c to
> rpp.c, net_server.c, net_client.c and rm.c).
> For my patch (which is a proof of concept), I added one global variable
> containing the network-byte-order representation of the host's IP to
> rpp.c, which is then also accessed by libnet. This variable is initialized
> to INADDR_ANY, so unless it is set to something else, pbs_server's
> behaviour is unchanged.
> In the -h getopt handler of pbs_server, I added a statement setting this
> variable to the given ip.
> Can you please integrate something in the spirit of this patch into future
> versions of torque? Or give me advice on how to rewrite it so that it can
> be accepted directly?
> c u
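For readers without the attachment, the global-variable approach described in the quoted message can be sketched roughly as follows. This is not the actual patch: local_bind_addr, set_local_bind_addr() and open_bound_udp_socket() are hypothetical names, and the sketch only assumes what the message states, namely a variable in network byte order that defaults to INADDR_ANY and is set from the -h argument.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the global the patch adds to rpp.c.
 * Kept in network byte order. INADDR_ANY is zero, so byte order does
 * not matter for the default, and the old behaviour is preserved.
 */
static in_addr_t local_bind_addr = INADDR_ANY;

/*
 * What a -h option handler might call once it has a dotted-quad address.
 * Returns 0 on success, -1 if the string is not a valid IPv4 address
 * (in which case the global is left unchanged).
 */
int set_local_bind_addr(const char *ip)
{
    struct in_addr a;

    if (inet_pton(AF_INET, ip, &a) != 1)
        return -1;
    local_bind_addr = a.s_addr;
    return 0;
}

/*
 * Create a UDP socket bound to local_bind_addr. The port is given in
 * host byte order; 0 lets the kernel choose. Datagrams sent from this
 * socket carry the bound address as their source address.
 * Returns the socket, or -1 on error.
 */
int open_bound_udp_socket(unsigned short port)
{
    struct sockaddr_in local;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = local_bind_addr;
    local.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

With this shape, calling set_local_bind_addr("172.16.128.8") before any sockets are opened would pin both the listening address and the source address of outgoing datagrams to the alias, and not calling it leaves everything bound to the wildcard address as before.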
Toni Harbaugh-Blackford harbaugh at ncifcrf.gov
Advanced Biomedical Computing Center (ABCC)
National Cancer Institute
Contractor - SAIC/Frederick