[torqueusers] Problems with maui scalability

Peter Wyckoff wyckoff at yahoo-inc.com
Wed Sep 12 11:36:22 MDT 2007


I think we run jobs in the 900-node range daily with no problems, although
our average job size is smaller.  I haven't looked in the logs, though, to
see whether we're having communication problems that are masked by retries
or something similar.

-- pete


On 9/12/07 12:14 AM, "Lennart Karlsson" <Lennart.Karlsson at nsc.liu.se> wrote:

> meo at intrinsity.com said:
>> Peter Wyckoff said...
>> 
>> |I'm wondering how big you've gotten maui and torque to scale, mostly
>> |interested in number of nodes?
>> |
>> |The docs say something like 1,000 but I think it scales well beyond that,
>> |no?
>> 
>> That's what I've heard.  Right now we're at about 300 nodes.
> 
> Are you able to start a parallel job spanning all of these 300 nodes
> or is the mom-to-mom communication setup breaking down?
> 
> We have problems starting jobs wider than about 100 nodes, because
> that many moms have difficulty synchronizing among themselves
> at startup.
> 
> -- Lennart Karlsson <Lennart.Karlsson at nsc.liu.se>
>    National Supercomputer Centre in Linkoping, Sweden
>    http://www.nsc.liu.se
> 
> 


