[torquedev] Trunk And Multithreading
dbeer at adaptivecomputing.com
Mon Dec 13 10:16:05 MST 2010
----- Original Message -----
> 2010/12/11 "Mgr. Šimon Tóth" <SimonT at mail.muni.cz>:
> >> Thank you for your comments. We understand the words of caution from
> >> you and Simon. We are also conscious of the fact that we are moving
> >> forward with a lot of changes and not always getting input from the
> >> community. There is a bit of urgency to get TORQUE to the point it can
> >> scale. SLURM is moving forward with their scaling ability in response
> >> to users who are currently creating clusters with 10,000 plus nodes
> >> and multiple cores.
> > Hmm, the problem I see is that making Torque suitable for 10k+ clusters
> > doesn't make much sense when you don't have a scheduler capable of
> > working with such large systems.
> > Correct me if I'm wrong, but the only schedulers I know of that are
> > capable of managing clusters of that size are grid middleware
> > schedulers, which don't need the clusters to be managed by a single
> > batch server.
> > Wouldn't it make more sense to make pbs_sched usable first?
> I think you are the only person using pbs_sched for a non-trivial
> setup. Everyone else uses Maui (which is free) or Moab. Moab is
> capable of managing clusters of that size, I don't know how well Maui
> scales, but I'm sure people are using it for 1k+ nodes.
Just a reminder as well, we at Adaptive Computing do not support pbs_sched. To us this software doesn't exist.
> >> We are hearing of plans to build systems with over 100,000 nodes, and
> >> right now TORQUE cannot manage such a system. I have published on this
> >> list and at SC'10 what we plan for TORQUE 4.0 (3.1 in my SC'10
> >> presentation). We are 1) making TORQUE multi-threaded, 2) adding a
> >> hierarchical job launch, and 3) changing the way Server-to-MOM and
> >> MOM-to-MOM communication works. Any and all ideas about how to improve
> >> these are welcomed and encouraged.
> > Is the presentation available somewhere?
> > We are solving this problem from the opposite direction: we are
> > extending the distributed system support in Torque. Basically, the idea
> > is to split the grid into smaller sites that still externally behave as
> > a single server. Our current system scales easily over tens of sites
> > (with each site handling whatever one server can handle). We are
> > currently researching and designing a system that will scale over
> > thousands of sites.
> > As for the threading support, well the problem is that this is like the
> > last change I would consider implementing. The code base is a mess and
> > introducing threads into the code will make further modifications
> > extremely hard.
> I agree the code is a mess, and adding more #ifdef'd out features
> makes it even worse
The #ifdefs aren't going to be there when the code goes into production. These will be removed eventually. We aren't making completely new functions because we aren't going to maintain a non-threadsafe version of TORQUE.
> > It's already very hard to implement new features since there are many
> > implied protocols in the server and mom. I bumped into many problems in
> > both server and mom simply by increasing the amount of handling code
> > (which caused a delay and a cascade failure).
> > I still have to look through the threading code in more depth, but I
> > certainly don't like what I have seen so far. You are wrapping threads
> > around old, non-thread-safe code. What should have been done instead is
> > the creation of thread-safe versions of the old functions.
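To make the distinction above concrete, here is a minimal sketch of the two styles. It is not taken from the TORQUE source: both function names and the "seq.server" string format are hypothetical and chosen only for illustration. The first function keeps its result in a static buffer, so a lock wrapped around the call does not help once the caller uses the returned pointer after the lock is released. The second is a reentrant variant in the style of the POSIX `_r` functions, where the caller owns the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical legacy-style helper: the result lives in a static buffer,
 * so concurrent callers overwrite each other's output. */
const char *format_job_id(int seq, const char *server)
{
    static char buf[64];
    snprintf(buf, sizeof(buf), "%d.%s", seq, server);
    return buf;
}

/* Hypothetical thread-safe variant: the caller supplies the buffer and
 * there is no shared state. Returns 0 on success, -1 if the formatted
 * string did not fit in len bytes. */
int format_job_id_r(int seq, const char *server, char *buf, size_t len)
{
    int n;

    assert(server != NULL && buf != NULL);
    n = snprintf(buf, len, "%d.%s", seq, server);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

A caller of the second form allocates its own buffer (for example on its own stack) and checks the return value, so no locking is needed around the call itself.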
> >> We have chosen to put this work into trunk with the knowledge that it
> >> will create instability. But we also have confidence that we will be
> >> able to address the stability problems as the new version is deployed.
> >> In the meantime we have the 2.4, 2.5 and 3.0 branches, which are
> >> available for use. 2.5 and 3.0 can also be improved with minor feature
> >> changes.
> > It's more of a problem with following the changes: if there were a
> > feature branch I could just check the diff, but now it's mixed with all
> > the other stuff :-/
> this is a valid point, a branch that isolates the threading changes
> makes it easier to see which changes are specifically to address
I'm not sure that it would. Any new branch would have to be kept in sync with the trunk, and so there would be lots of other things in there anyway. I think that creating a new branch would create tons of work when it has to be merged back in for at most a marginal benefit.