[torqueusers] nodes won't connect to pbs_server outside of chassis

Chandler Wilkerson chwilk at rice.edu
Wed Aug 3 10:38:28 MDT 2011


Check that the nodes resolve the server's host name to its 10-Gig 
address, and vice versa: the server should resolve each node's name to 
an address on the network its 10-Gig interface is on. Then check that 
ping works in both directions.
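
If it helps, here is a rough Python sketch of the resolution half of that check. It only asks what address a name resolves to and whether that address falls inside the network you expect; it does not replace the ping test. The helper name resolves_into and the 10.0.0.0/23 network in the usage note are my own placeholders, not anything from your setup.

```python
import ipaddress
import socket


def resolves_into(hostname, network, resolver=socket.gethostbyname):
    """Return (ip, ok) for hostname.

    ip is the IPv4 address the name resolves to (None if lookup fails);
    ok is True only when that address lies inside `network`.
    `resolver` is injectable so the check can be exercised without DNS.
    """
    try:
        ip = resolver(hostname)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError
        return None, False
    return ip, ipaddress.ip_address(ip) in ipaddress.ip_network(network)
```

On a node you would call something like resolves_into("edmin01", "10.0.0.0/23"), and on the server resolves_into("node002", "10.0.0.0/23"), substituting your real 10-Gig network. A False result on either side would suggest the name is resolving to an address on a different interface, or not resolving at all.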

--
Chandler Wilkerson
Rice University

On 08/02/2011 06:34 PM, DuChene, StevenX A wrote:
> I have a compact high density chassis I am working with that has 256
> “servers” in a 10U chassis. It is sort of like a blade chassis.
>
> I have RHEL-6.1 installed on all of these 256 servers as well as a
> separate standard 2U server outside the chassis.
>
> The internal chassis has a back-plane network that all 256 of the
> systems are connected to. There are 16 external Gig-E ports for
> communication outside the chassis.
>
> I have built torque-2.5.7 as rpms from the spec file enclosed in the
> gzipped tar ball I downloaded from the CRI web site.
>
> I installed the torque and torque-client rpms on my 256 nodes and the
> torque-server rpm on the external 2U system outside the chassis. This
> external system also provides dhcpd, named, PXE boot, and similar services.
> This external system has a 10Gig-E card for the connection to the
> chassis and a 1 Gig-E connection to the other systems in the data center.
>
> The internal nodes are on the same /23 network as the 10Gig-E
> interface of the external server.
>
> I configured all of the client nodes with the name of the server in the
> /var/spool/torque/server_name and in /var/spool/torque/mom_priv/config
> files. I put all of the names of the client nodes in
> /var/spool/torque/server_priv/nodes file.
>
> If I run pbsnodes after starting all of the mom daemons and the server
> daemon, all of the nodes are shown as down. If I run momctl -d 3 -f
> somehost it tells me:
>
> WARNING: no hello/cluster-addrs messages received from server
>
> And
>
> WARNING: no messages received from server
>
> And in my /var/log/messages file I get things like:
>
> Jul 29 10:37:57 edmin01 PBS_Server: LOG_ERROR::stream_eof, connection to
> node002 is bad, remote service may be down, message may be corrupt, or
> connection may have been dropped remotely (Premature end of message).
> setting node state to down
>
> So as a test I installed the torque-server rpm on one of the internal
> nodes and did the same configuration steps on that system as I did on
> the external server system. I then altered the mom_priv/config and
> server_name files across all of the nodes to point to this system inside
> the chassis instead. I restarted all of the mom daemons across the
> cluster and now when I run pbsnodes everything works just fine. All 256
> nodes are free and alive.
>
> Are there any Torque experts here who can suggest some additional
> troubleshooting steps I can try to see what might be going on with the
> connection to the external server?
>
> --
>
> Steven DuChene

