[torqueusers] scalability of 2.1.2?

Thomas G. Raisor thunder at et.byu.edu
Tue Aug 8 12:57:40 MDT 2006

Hi all,

I am trying to get torque 2.1.2 running on our cluster, but it seems 
once I reach a certain number of compute-nodes, pbs_server gives up and 
marks everything down.

Our cluster is in 4 pieces, 4 subnets: a, b, c and d. The server runs on a 
host on the b network. I add all the hosts in b and things work great. I 
add a few compute nodes from the a network - things work great. I add 
all the rest of the nodes for a, and everything gets marked down - 
including all nodes on the b network, and they never become free again.

I have compiled both with rpp and with disable-rpp - same problem.

momctl -d 3 reports no problems when things are working. When I scale up 
past my b network, the output on the added nodes gives the warning: no 
hello/cluster-addrs messages received from server

So far I have only been able to get about 200 nodes of my 618-node 
cluster working. I had it all working on previous releases of torque 
on RHEL 4U2. I just upgraded to RHEL 4U3 (Platform Rocks) and updated 
torque at the same time, and so far I have failed to get it working.



pbs_server log contents:

On startup:
12:29:36;0040;PBS_Server;Req;ping_nodes;successful ping to node m4a-4-1i (stream 347)

A few seconds later:
08/08/2006 12:39:06;0001;PBS_Server;Svr;PBS_Server;stream_eof, connection to m4a-4-1i is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (Premature end of message).  setting node state to down

And later still:
08/08/2006 12:41:32;0004;PBS_Server;Svr;check_nodes;node m4a-4-1i not detected in 1155062492 seconds, marking node down
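
A side note on that last entry: the "seconds" figure looks like a Unix timestamp rather than an interval. Below is a minimal Python sketch to check this, assuming the log times are in MDT (UTC-6) as in the mail header. The helper name epoch_to_mdt is only illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the server logs local time in MDT, which is UTC-6.
MDT = timezone(timedelta(hours=-6), "MDT")


def epoch_to_mdt(seconds: int) -> str:
    """Format a Unix timestamp the way the pbs_server log prints times."""
    return datetime.fromtimestamp(seconds, tz=MDT).strftime("%m/%d/%Y %H:%M:%S")


# The "not detected in ... seconds" value from the check_nodes entry.
print(epoch_to_mdt(1155062492))  # prints 08/08/2006 12:41:32
```

That matches the timestamp on the log entry itself. One possible reading (a guess, not something the logs confirm) is that the server is computing "now minus last-heard time" with a last-heard time of zero, i.e. it has no record of hearing from the node at all.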

Mom logs:
08/08/2006 12:44:53;0002;   pbs_mom;Svr;Log;Log opened
08/08/2006 12:44:53;0002;   pbs_mom;Svr;usecp;*:/ibrix/home   /ibrix/home
08/08/2006 12:44:53;0002;   pbs_mom;Svr;usecp;*:/state/partition1/home 
08/08/2006 12:44:53;0002;   pbs_mom;n/a;initialize;independent
08/08/2006 12:44:53;0002;   pbs_mom;Svr;pbs_mom;Is up
08/08/2006 12:44:53;0002;   pbs_mom;Svr;mom_main;MOM executable path and mtime at launch: /usr/local/sbin/pbs_mom 1155061646
08/08/2006 12:44:53;0002;   pbs_mom;n/a;mom_main;hello sent to server m4bi
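
For anyone sifting through the full logs: the entries above appear to be semicolon-separated with six fields. Here is a rough Python sketch for splitting them, so that entries can be filtered by node or routine. The field names (timestamp, event, daemon, obj_class, obj_name, message) are my own labels, not official TORQUE terminology, and parse_line is a hypothetical helper.

```python
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    # Field names are illustrative labels, not TORQUE's own terms.
    timestamp: str
    event: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one log line into six fields; return None if it does not fit."""
    # maxsplit=5 keeps any ';' inside the message text intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    # pbs_mom pads the daemon field with spaces, so strip each field.
    return LogRecord(*(p.strip() for p in parts))


rec = parse_line(
    "08/08/2006 12:44:53;0002;   pbs_mom;n/a;mom_main;hello sent to server m4bi"
)
if rec is not None:
    print(rec.daemon, rec.obj_name, rec.message)
```

Lines that do not split into six fields (wrapped continuations, blank lines) come back as None and can simply be skipped.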

Tom Raisor
HPC Systems Administrator
Brigham Young University

