[torqueusers] Problems with maui scalability
wyckoff at yahoo-inc.com
Wed Sep 12 11:36:22 MDT 2007
I think we run jobs in the 900-node range daily with no problems, although
our average job size is smaller. I haven't looked in the logs to see whether
we're having communication problems that are masked by retries or something similar.
On 9/12/07 12:14 AM, "Lennart Karlsson" <Lennart.Karlsson at nsc.liu.se> wrote:
> meo at intrinsity.com said:
>> Peter Wyckoff said...
>> |I'm wondering how big you've gotten maui and torque to scale, mostly
>> |interested in number of nodes?
>> |The docs say something like 1,000 but I think it scales well beyond that,
>> That's what I've heard. Right now we're at about 300 nodes.
> Are you able to start a parallel job spanning all of these 300 nodes
> or is the mom-to-mom communication setup breaking down?
> We have problems starting jobs wider than about 100 nodes, because
> that many moms have difficulty synchronizing among themselves
> at startup.
> -- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
> National Supercomputer Centre in Linkoping, Sweden