[torqueusers] Re: scalability of 2.1.2?

Garrick Staples garrick at clusterresources.com
Tue Aug 8 12:27:12 MDT 2006


On Tue, Aug 08, 2006 at 12:57:40PM -0600, Thomas G. Raisor alleged:
> Hi all,
> 
> I am trying to get torque 2.1.2 running on our cluster, but it seems 
> once I reach a certain number of compute-nodes, pbs_server gives up and 
> marks everything down.
> 
> Our cluster is in 4 pieces, 4 subnets: a, b, c and d. The server runs on a 
> host on the b network. I add all the hosts in b and things work great. I 
> add a few compute nodes from the a network - things work great. I add 
> all the rest of the nodes for a, and everything gets marked down - 
> including all nodes on the b network, and they never become free again.
> 
> I have compiled using rpp and disable-rpp - same problem.
> 
> momctl -d 3 reports no problems when things are working, when I scale up 
> past my b network, the output on added nodes gives the warning: no 
> hello/cluster-addrs messages received from server
> 
> To this point I have only been able to successfully get up to about 200 
> nodes of my 618 node cluster working. I have had it all working on 
> previous releases of torque and on RHEL 4U2 - I just upgraded to RHEL 
> 4U3 (platform rocks) and was updating torque at the same time and so far 
> have failed to get it working.
> 
> Suggestions?

I assure you that TORQUE works well with thousands of nodes.

You probably just need to tune it a bit.  Increase pbs_mom's
status_update_time variable (details in the pbs_mom manpage), and
increase the server's job_stat_rate and node_check_rate attributes
(details in the pbs_server_attributes manpage).
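
A rough sketch of what that tuning could look like.  The numbers below
are illustrative assumptions, not tested recommendations; check the
manpages for the defaults and units in your version before picking
values:

    # On each compute node, in mom_priv/config under the TORQUE home
    # directory (restart pbs_mom afterwards so it rereads the file).
    # Assumed value: seconds between mom-to-server status updates.
    $status_update_time 120

    # On the server host.  Assumed values, in seconds.
    qmgr -c "set server job_stat_rate = 120"
    qmgr -c "set server node_check_rate = 300"

Roughly, the idea is to reduce how often status traffic hits pbs_server
and to give it more slack before it marks nodes down.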

 

