[torqueusers] nodes won't connect to pbs_server outside of chassis
DuChene, StevenX A
stevenx.a.duchene at intel.com
Tue Aug 2 17:34:15 MDT 2011
I have a compact high density chassis I am working with that has 256 "servers" in a 10U chassis. It is sort of like a blade chassis.
I have RHEL-6.1 installed on all of these 256 servers as well as a separate standard 2U server outside the chassis.
The internal chassis has a back-plane network that all 256 of the systems are connected to. There are 16 external Gig-E ports for communication outside the chassis.
I built torque-2.5.7 as RPMs from the spec file included in the gzipped tarball I downloaded from the CRI web site.
I installed the torque and torque-client RPMs on my 256 nodes and the torque-server RPM on the external 2U system outside the chassis. This external system also provides dhcpd, named, PXE boot, and other services. It has a 10 Gig-E card for the connection to the chassis and a 1 Gig-E connection to the other systems in the data center.
The internal nodes are on the same /23 network as the 10 Gig-E interface of the external server.
I configured all of the client nodes with the name of the server in both /var/spool/torque/server_name and /var/spool/torque/mom_priv/config. I put the names of all the client nodes in the /var/spool/torque/server_priv/nodes file on the server.
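To make the client-side setup concrete, here is a minimal Python sketch that checks whether the name in server_name matches a $pbsserver entry in mom_priv/config. It is only an illustration: the helper names are hypothetical, the host name edmin01 is taken from the syslog excerpt in this message and assumed to be the external server, and the sketch assumes '#' starts a comment in the MOM config.

```python
def parse_server_name(text):
    """Return the first non-empty line of a server_name file, or None."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    return None


def parse_pbsserver(text):
    """Return the hosts named on $pbsserver lines of a mom_priv/config file."""
    servers = []
    for line in text.splitlines():
        # Assumption: '#' starts a comment in the MOM config.
        line = line.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "$pbsserver":
            servers.append(parts[1])
    return servers


def config_is_consistent(server_name_text, mom_config_text):
    """True if the server_name host also appears as a $pbsserver entry."""
    server = parse_server_name(server_name_text)
    return server is not None and server in parse_pbsserver(mom_config_text)


if __name__ == "__main__":
    # On a real node the two strings would be read from
    # /var/spool/torque/server_name and /var/spool/torque/mom_priv/config.
    print(config_is_consistent("edmin01\n", "$pbsserver edmin01\n"))
```

This only compares the two files as text. It does not check that the name resolves to the address the server actually uses to reach the nodes.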
If I run pbsnodes after starting all of the mom daemons and the server daemon, all of the nodes are shown as down. If I run momctl -d 3 -f somehost it tells me:
WARNING: no hello/cluster-addrs messages received from server
WARNING: no messages received from server
And in my /var/log/messages file I get things like:
Jul 29 10:37:57 edmin01 PBS_Server: LOG_ERROR::stream_eof, connection to node002 is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (Premature end of message). setting node state to down
So as a test I installed the torque-server rpm on one of the internal nodes and did the same configuration steps on that system as I did on the external server system. I then altered the mom_priv/config and server_name files across all of the nodes to point to this system inside the chassis instead. I restarted all of the mom daemons across the cluster and now when I run pbsnodes everything works just fine. All 256 nodes are free and alive.
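Since the same configuration works from inside the chassis, one basic check is whether the TORQUE ports are reachable across the chassis boundary. The following Python sketch is one way to do that, not a definitive diagnostic. It assumes the default ports (15001 for pbs_server, 15002 for pbs_mom); the function name is hypothetical, and host names are passed on the command line. It tests TCP only. As far as I know TORQUE 2.x also exchanges status messages over UDP on these ports, and the sketch does not cover that.

```python
import socket
import sys

# Default TORQUE ports (assumed unchanged in this installation).
PBS_SERVER_PORT = 15001
PBS_MOM_PORT = 15002


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Run on the external server with node names as arguments to test the
    # pbs_mom port, e.g.:  python check_ports.py node002
    for node in sys.argv[1:]:
        print(node, PBS_MOM_PORT, tcp_port_open(node, PBS_MOM_PORT))
```

Run from a node against the server name with PBS_SERVER_PORT, the same function tests the opposite direction.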
Are there any TORQUE experts here who can suggest additional troubleshooting steps I can try to see what might be going on with the connection to the external server?