[torquedev] Trunk And Multithreading

"Mgr. Šimon Tóth" SimonT at mail.muni.cz
Sat Dec 11 08:43:54 MST 2010


> Thank you for your comments. We understand the words of caution from you 
> and Simon. We are also conscious of the fact that we are moving forward 
> with a lot of changes and not always getting input from the community. 
> There is a bit of urgency to get TORQUE to the point it can scale. SLURM 
> is moving forward with their scaling ability in response to users who 
> are currently creating clusters with 10,000 plus nodes and multiple cores.

Hmm, the problem I see is that making Torque suitable for 10k+ node
clusters doesn't make much sense when you don't have a scheduler capable
of working with such large systems.

Correct me if I'm wrong, but the only schedulers I know of that are
capable of managing clusters of that size are grid middleware schedulers,
which don't need the clusters to be managed by a single batch server.

Wouldn't it make more sense to make pbs_sched usable first?

> We are hearing of plans to build systems with over 100,000 nodes and 
> right now TORQUE cannot manage such a system. I have published on this 
> list and at SC'10 what we plan for TORQUE 4.0 (3.1 in my SC'10 
> presentation). We are 1) making TORQUE multi-threaded, 2) we are adding a 
> hierarchical job launch and 3) we will be changing the way Server-to-MOM 
> and MOM-to-MOM communication works. Any and all ideas about how to 
> improve these are welcomed and encouraged.

Is the presentation available somewhere?

We are solving this problem from the opposite direction: we are
extending the distributed system support in Torque. Basically, the idea
is to split the grid into smaller sites that still behave externally as
a single server. Our current system scales easily to tens of sites
(with each site handling whatever one server can handle). We are
currently researching and designing a system that will scale to
thousands of sites.

As for the threading support, well, the problem is that this is pretty
much the last change I would consider implementing. The code base is a
mess, and introducing threads into it will make further modifications
extremely hard.

It's already very hard to implement new features since there are many
implied protocols in the server and mom. I bumped into many problems in
both server and mom simply by increasing the amount of handling code
(which caused a delay and a cascade failure).

I still have to look through the threading code in more depth, but I
certainly don't like what I have seen so far. You are wrapping threads
around old code that is not thread safe. What should have been done
instead is the creation of thread-safe versions of the old functions.
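
To illustrate what I mean by a thread-safe version, here is a minimal
sketch. The function names are hypothetical and this is not actual
Torque code; it only assumes the common legacy pattern of a helper that
returns a pointer to a static buffer, next to a reentrant "_r" variant
where the caller owns the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical legacy-style helper (not actual Torque code). It returns
 * a pointer to a static buffer, so every caller shares one result and
 * concurrent calls from two threads race on the same memory. */
const char *format_job_id(int seq, const char *server)
{
    static char buf[64];

    snprintf(buf, sizeof(buf), "%d.%s", seq, server);
    return buf;
}

/* Hypothetical thread-safe variant. The caller supplies the buffer, so
 * the function touches no shared state.
 * Returns 0 on success, -1 if the result did not fit into the buffer. */
int format_job_id_r(int seq, const char *server, char *out, size_t len)
{
    int n;

    assert(out != NULL && server != NULL);
    n = snprintf(out, len, "%d.%s", seq, server);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Putting a lock around every call to the first function keeps the shared
buffer and merely serializes access to it. The second form removes the
shared state, so callers don't need to know about any locking rules.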

> We have chosen to put this work into trunk with the knowledge that it 
> will create instability. But we also have confidence that we will be 
> able to address the stability problems as the new version is deployed. 
> In the mean time we have the 2.4, 2.5 and 3.0 branches which are 
> available for use. 2.5 and 3.0 can also be improved with minor feature 
> changes as well.

It's more of a problem with following the changes. If there were a
feature branch I could just check the diff; now it's mixed with all the
other stuff :-/

--
Mgr. Šimon Tóth

