[torquedev] Trunk And Multithreading
glen.beane at gmail.com
Mon Dec 13 10:25:45 MST 2010
On Mon, Dec 13, 2010 at 12:16 PM, David Beer
<dbeer at adaptivecomputing.com> wrote:
> ----- Original Message -----
>> 2010/12/11 "Mgr. Šimon Tóth" <SimonT at mail.muni.cz>:
>> >> Thank you for your comments. We understand the words of caution from
>> >> you and Simon. We are also conscious of the fact that we are moving
>> >> forward with a lot of changes and not always getting input from the
>> >> community. There is a bit of urgency to get TORQUE to the point it can
>> >> scale. SLURM is moving forward with their scaling ability in response
>> >> to users who are currently creating clusters with 10,000 plus nodes
>> >> and multiple cores.
>> > Hmm, the problem I see is that making Torque suitable for 10k+ clusters
>> > doesn't make much sense when you don't have a scheduler capable of
>> > working with such large systems.
>> > Correct me if I'm wrong, but the only schedulers I know about that are
>> > capable of managing clusters of that size are grid middleware
>> > schedulers, which don't need the clusters to be managed by a single
>> > batch server.
>> > Wouldn't it make more sense to make pbs_sched usable first?
>> I think you are the only person using pbs_sched for a non-trivial
>> setup. Everyone else uses Maui (which is free) or Moab. Moab is
>> capable of managing clusters of that size; I don't know how well Maui
>> scales, but I'm sure people are using it for 1k+ nodes.
> Just a reminder as well, we at Adaptive Computing do not support pbs_sched. To us this software doesn't exist.
>> >> We are hearing of plans to build systems with over 100,000 nodes, and
>> >> right now TORQUE cannot manage such a system. I have published on this
>> >> list and at SC'10 what we plan for TORQUE 4.0 (3.1 in my SC'10
>> >> presentation). We are 1) making TORQUE multi-threaded, 2) adding a
>> >> hierarchical job launch, and 3) changing the way Server-to-MOM and
>> >> MOM-to-MOM communication works. Any and all ideas about how to improve
>> >> these are welcomed and encouraged.
>> > Is the presentation available somewhere?
>> > We are solving this problem from the opposite direction: we are
>> > extending the distributed system support in Torque. Basically the idea
>> > is to split the grid into smaller sites that still externally behave as
>> > a single server. Our current system scales easily over tens of sites
>> > (with each site handling whatever one server can handle). We are
>> > currently researching and designing a system that will scale over
>> > thousands of sites.
>> > As for the threading support, well, the problem is that this is about
>> > the last change I would consider implementing. The code base is a mess,
>> > and introducing threads into the code will make further modifications
>> > extremely hard.
>> I agree the code is a mess, and adding more #ifdef'd out features
>> makes it even worse.
> The #ifdefs aren't going to be there when the code goes into production. These will be removed eventually. We aren't making completely new functions because we aren't going to maintain a non-threadsafe version of TORQUE.
What is the timeline for releasing 4.0? Forcing the new
multi-threading code means that in all likelihood 4.0 will have some
stability issues. TORQUE 2.2 suffered from many problems, and many of
them were introduced by innocent code refactoring.