[torquedev] Trunk And Multithreading
knielson at adaptivecomputing.com
Fri Dec 10 12:40:18 MST 2010
On 12/10/2010 10:11 AM, Glen Beane wrote:
> I like the model we used for the major job-array changes where they
> were developed in a branch and folded back in when they were stable,
> but if TORQUE 4.0 will be delayed until multi-threading is rock solid
> then I can live with it in trunk. I think rushing this will be a big
> mistake, and I'm sure you have no such plans, but sometimes there are
> pressures to get a release out the door before it is really ready...
> I'd like to think a little about the resizable array and out of order
> jobs issue, although that is minor at this point - stability is
> definitely higher priority.
Thank you for your comments. We understand the words of caution from you
and Simon. We are also conscious of the fact that we are moving forward
with a lot of changes and not always getting input from the community.
There is some urgency to get TORQUE to the point where it can scale. SLURM
is moving forward with its scaling ability in response to users who
are currently creating clusters with 10,000-plus nodes and multiple cores.
We are hearing of plans to build systems with over 100,000 nodes and
right now TORQUE cannot manage such a system. I have published on this
list and at SC'10 what we plan for TORQUE 4.0 (3.1 in my SC'10
presentation). We are 1) making TORQUE multi-threaded, 2) adding a
hierarchical job launch and 3) changing the way Server-to-MOM
and MOM-to-MOM communication works. Any and all ideas about how to
improve these are welcome and encouraged.
We have chosen to put this work into trunk with the knowledge that it
will create instability. But we also have confidence that we will be
able to address the stability problems as the new version is deployed.
In the meantime, the 2.4, 2.5 and 3.0 branches are available for use.
2.5 and 3.0 can also be improved with minor feature changes.
Contributions to the code and conversation are encouraged. We want to be
open while at the same time moving ahead with the changes needed to keep
TORQUE relevant as a resource manager.
Thanks again for your comments.