[torqueusers] scalability of 2.1.2?
somewhere_or_other at byu.edu
Fri Aug 11 10:19:44 MDT 2006
Hello, all. I work with Tom (original poster), and while he's on
vacation, I've taken over this problem. I thought everyone would like
to know the resolution, though (and Mr. Dave Jackson at ClusterResources
requested it on the phone too).
The problem was a network connectivity problem, not a problem with
Torque itself. There seems to be an MTU mismatch between the machine
running our PBS server and the clients running the pbs_moms. The MTU
was set a few bytes higher on the server, and the brand-new compute
nodes use a different driver, so it seems that path MTU negotiation was
failing and the server's Ethernet hardware was sending fragmented
packets that were larger than the clients could receive. They were
dropped, or corrupted, or something; I'm not exactly sure. Personally,
I think it's a problem with the driver on the new clients, but that's
just speculation.
Anyway, Torque's packets kept growing as we added nodes (they tell
every node which hosts it can trust). Once a packet hit the server's
MTU, the server began fragmenting, and the clients quit receiving: they
dropped the fragments that were too big, and then couldn't reassemble
the packet since some fragments were missing. For us, the server had an
MTU of 4004, the clients 4000, and the threshold fell between 281 nodes
and 282 nodes.
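To make the failure mode above concrete, here is a minimal sketch of how
an IPv4 sender with an MTU of 4004 would split a datagram, and which of
the resulting packets exceed a receiver limit of 4000. It assumes a
20-byte IPv4 header with no options, and it models the receiver as
simply dropping anything larger than its own MTU. The function names
and the sample payload sizes are made up for illustration; they are not
taken from Torque or from our actual traffic.

```python
IP_HEADER_BYTES = 20  # assumption: IPv4 header without options


def fragment_sizes(ip_payload_len, mtu, header=IP_HEADER_BYTES):
    """Return the on-wire size of each IPv4 packet the sender would emit."""
    if ip_payload_len + header <= mtu:
        # Fits in one packet, so no fragmentation happens.
        return [ip_payload_len + header]
    # Every fragment except the last carries a multiple of 8 payload bytes.
    per_fragment = (mtu - header) // 8 * 8
    sizes = []
    remaining = ip_payload_len
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(chunk + header)
        remaining -= chunk
    return sizes


def dropped_by_receiver(sizes, receiver_mtu):
    """Assumed model: the receiver discards packets above its own MTU."""
    return [s for s in sizes if s > receiver_mtu]


SERVER_MTU, CLIENT_MTU = 4004, 4000  # values from this thread

for payload in (3000, 3982, 5000):  # hypothetical payload sizes
    sizes = fragment_sizes(payload, SERVER_MTU)
    print(payload, sizes, dropped_by_receiver(sizes, CLIENT_MTU))
```

Under these assumptions a small message passes untouched, while any
message big enough to fill the server's MTU produces at least one packet
the clients cannot accept, and losing one fragment loses the whole
message.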
We reset the server interface's MTU to 4000, and now both large pings
and Torque traffic are getting through. We're still working with the
hardware vendor to figure out why this happened. It is a pretty new
chipset and driver, though (Broadcom NetXtreme II 5708 - bnx2 driver, if
anyone is interested).
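For anyone who wants to repeat the large-ping test, the sketch below
works out the largest ICMP echo payload that fits in a single
unfragmented packet for a given MTU. It assumes IPv4 with a 20-byte IP
header and an 8-byte ICMP header; the helper names are made up for
illustration.

```python
IP_HEADER_BYTES = 20    # assumption: IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one packet of size mtu."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def fits_unfragmented(payload, mtu):
    """True if a ping with this payload needs no fragmentation at mtu."""
    return payload <= max_ping_payload(mtu)


for mtu in (4000, 4004):  # client and server MTUs from this thread
    print(mtu, max_ping_payload(mtu))
```

On Linux, a payload size from this calculation can be passed to ping
with -s, and -M do asks it not to fragment. A payload that fits the
server's MTU but not the clients' is the interesting case to try.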
Thanks go to everyone on the list who posted suggestions, and the people
at ClusterResources for all their help as well.
BYU Supercomputing Lab
Donald Tripp wrote:
> Could this be an Ethernet saturation problem? The clients may be timing
> - Don
> On Aug 8, 2006, at 8:57 AM, Thomas G. Raisor wrote:
>> Hi all,
>> I am trying to get torque 2.1.2 running on our cluster, but it seems
>> once I reach a certain number of compute-nodes, pbs_server gives up
>> and marks everything down.
>> Our cluster is in 4 pieces, 4 subnets: a, b, c, and d. The server runs
>> on a host on the b network. I add all the hosts in b and things work
>> great.
>> I add a few compute nodes from the a network - things work great. I
>> add all the rest of the nodes for a, and everything gets marked down -
>> including all nodes on the b network, and they never become free again.
>> I have compiled both with rpp and with disable-rpp - same problem.
>> momctl -d 3 reports no problems when things are working. When I scale
>> up past my b network, the output on added nodes gives the warning: no
>> hello/cluster-addrs messages received from server
>> To this point I have only been able to successfully get up to about
>> 200 nodes of my 618 node cluster working. I have had it all working on
>> previous releases of torque and on RHEL 4U2 - I just upgraded to RHEL
>> 4U3 (platform rocks) and was updating torque at the same time and so
>> far have failed to get it working.
>> pbs_server log contents:
>> On startup:
>> 12:29:36;0040;PBS_Server;Req;ping_nodes;successful ping to node
>> m4a-4-1i (stream 347)
>> A few seconds later:
>> 08/08/2006 12:39:06;0001;PBS_Server;Svr;PBS_Server;stream_eof,
>> connection to m4a-4-1i is bad, remote service may be down, message may
>> be corrupt, or connection may have been dropped remotely (Premature
>> end of message). setting node state to down
>> AND later still:
>> 08/08/2006 12:41:32;0004;PBS_Server;Svr;check_nodes;node m4a-4-1i not
>> detected in 1155062492 seconds, marking node down
>> Mom logs:
>> 08/08/2006 12:44:53;0002; pbs_mom;Svr;Log;Log opened
>> 08/08/2006 12:44:53;0002; pbs_mom;Svr;usecp;*:/ibrix/home /ibrix/home
>> 08/08/2006 12:44:53;0002;
>> pbs_mom;Svr;usecp;*:/state/partition1/home /home
>> 08/08/2006 12:44:53;0002; pbs_mom;n/a;initialize;independent
>> 08/08/2006 12:44:53;0002; pbs_mom;Svr;pbs_mom;Is up
>> 08/08/2006 12:44:53;0002; pbs_mom;Svr;mom_main;MOM executable path
>> and mtime at launch: /usr/local/sbin/pbs_mom 1155061646
>> 08/08/2006 12:44:53;0002; pbs_mom;n/a;mom_main;hello sent to server
>> Tom Raisor
>> HPC Systems Administrator
>> Brigham Young University
>> torqueusers mailing list
>> torqueusers at supercluster.org