[torqueusers] Re: scalability of 2.1.2?
Thomas G. Raisor
thunder at et.byu.edu
Tue Aug 8 17:30:45 MDT 2006
Ok,
I have narrowed this down further: I can have exactly 281 nodes in my
nodes file. Adding number 282 breaks things.
If I add #282 to the nodes file and then start pbs_server, all nodes stay
in state down forever. If I have 281 nodes in the nodes file and start
pbs_server, everything is fine. If I then add another node with qmgr
create node, pbsnodes -l immediately starts showing all nodes as down
again, and they stay down until I restart the server (with 281 nodes in
the nodes file). #282 can be any node in my cluster. This appears to be
a bug. Has anyone else seen this kind of behavior?
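
For anyone who wants to reproduce the threshold, here is a rough sketch of
how nodes files of a given length could be generated. It assumes one node
entry per line and that blank lines and lines starting with '#' are not
node entries (that is an assumption, not something I have confirmed). The
function names and file names are made up for illustration.

```python
def first_n_nodes(lines, n):
    """Return the first n node entries from an iterable of nodes-file lines.

    Blank lines and lines starting with '#' are skipped. Treating '#' as a
    comment marker is an assumption, not confirmed TORQUE behaviour.
    """
    entries = [
        line.rstrip("\n")
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return entries[:n]


def write_nodes_file(src_path, dst_path, n):
    """Copy the first n node entries of src_path into dst_path.

    Returns the number of entries actually written, which is smaller
    than n when the source file holds fewer entries.
    """
    with open(src_path) as src:
        kept = first_n_nodes(src, n)
    with open(dst_path, "w") as dst:
        dst.write("".join(entry + "\n" for entry in kept))
    return len(kept)
```

Writing one file with 281 entries and one with 282 from the same full list
(for example write_nodes_file("nodes.full", "nodes.281", 281)) gives the two
cases described above; pbs_server still has to be restarted by hand for
each file.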
Tom
--
Garrick Staples wrote:
> On Tue, Aug 08, 2006 at 12:57:40PM -0600, Thomas G. Raisor alleged:
>> Hi all,
>>
>> I am trying to get torque 2.1.2 running on our cluster, but it seems
>> once I reach a certain number of compute-nodes, pbs_server gives up and
>> marks everything down.
>>
>> Our cluster is in 4 pieces, 4 subnets: a, b, c and d. The server runs on a
>> host on the b network. I add all the hosts in b and things work great. I
>> add a few compute nodes from the a network - things work great. I add
>> all the rest of the nodes for a, and everything gets marked down -
>> including all nodes on the b network, and they never become free again.
>>
>> I have compiled using rpp and disable-rpp - same problem.
>>
>> momctl -d 3 reports no problems when things are working. When I scale up
>> past my b network, the output on the added nodes gives the warning: no
>> hello/cluster-addrs messages received from server
>>
>> To this point I have only been able to successfully get up to about 200
>> nodes of my 618 node cluster working. I have had it all working on
>> previous releases of torque and on RHEL 4U2 - I just upgraded to RHEL
>> 4U3 (platform rocks) and was updating torque at the same time and so far
>> have failed to get it working.
>>
>> Suggestions?
>
> I assure you that TORQUE works well with thousands of nodes.
>
> Probably just need to tune it a bit. Increase pbs_mom's
> status_update_time variable (details in pbs_mom manpage), increase
> server's job_stat_rate and node_check_rate (details in
> pbs_server_attributes manpage).
>
>
--
~-~-~-~-~-
Tom Raisor
HPC Systems Administrator
Brigham Young University