[torquedev] Versioning Issues & Development Roadmap

Josh Butikofer josh at clusterresources.com
Thu May 28 09:41:32 MDT 2009


As you know, we've been discussing how best to give TORQUE a saner release
schedule and versioning scheme. I think most of us agree that we want to move
away from the way TORQUE currently handles versioning. We want the stable
branch of TORQUE to accept only bug fixes, the next minor version to receive
new features, and major versions to receive major refactoring, scary features,
changes in default behavior, etc.
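
To make the intended rules concrete, here is a minimal sketch of the policy
as a lookup table. The branch role names ("stable", "next-minor",
"next-major") and the helper function are hypothetical and only for
illustration; the sketch also assumes that the less restrictive branches keep
accepting bug fixes, which is not spelled out above.

```python
from enum import Enum


class Change(Enum):
    BUG_FIX = "bug fix"
    FEATURE = "new feature"
    BREAKING = "refactoring, scary feature, or default-behavior change"


# Hypothetical branch roles mapped to the kinds of change each would accept.
# Assumption: minor and major branches also take bug fixes and features.
ALLOWED = {
    "stable": {Change.BUG_FIX},
    "next-minor": {Change.BUG_FIX, Change.FEATURE},
    "next-major": {Change.BUG_FIX, Change.FEATURE, Change.BREAKING},
}


def accepts(branch_role, change):
    """Return True if a branch with this role would take this kind of change."""
    return change in ALLOWED[branch_role]
```

Under this reading, `accepts("stable", Change.FEATURE)` is false, so a new
feature would have to wait for the next minor version.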

I also think we can all agree that 2.3.x should be locked down and no new
features will be allowed into it from now on. Only bug fixes will be allowed.
Same goes for 2.1.x, but I think this was implied, since Garrick has been taking
care of that branch and that appears to be his philosophy.

Now here comes the tricky part--what to do with 2.4.x. How do we shift gears
mid-development? Any way I look at it, there will be a little bit of pain and
imperfection as we move to the new model.

There has been a proposal on this list that we should turn 2.4.x into
3.x and then make a new 2.4.x off of 2.3.x. This is not a bad idea, but
some users (who talked to CRI outside of the list) don't like it because it
will cause confusion. TORQUE 2.4.x has been released as beta to these users. It
has been called 2.4.x for a while. Those users expect certain features and
capabilities to be in 2.4.x. If we get rid of 2.4 without properly releasing
it, and make a new 2.4 without the same feature set, it will be confusing. We
agree with them on this point.

An alternate proposal is that we get 2.4.x ready to release. It still needs more
testing and polish before it is ready for general use, but we believe that it
can be ready in several months. The refactoring that was done in 2.4
is not bad per se, or too large--but it did introduce bugs, many of which have
been eliminated. There are surely more, but users have been running 2.4 for a
while now without major problems. There are some backwards compatibility issues
and changes in default behavior that cause concern. We could remove these and 
slate them for 3.x--this would make releasing 2.4 more palatable.

After releasing 2.4.x, we could branch off 2.5.x and start our new rules. A
better TORQUE website could help explain the difference between 2.1, 2.3, and
2.4 and recommend which version should be used.

TORQUE 3.x could be branched off at any time. I would like to get rid of "trunk"
and just call the branches what they are: "3.0", etc. It might be less confusing.

Anyway, those are the two options right now. What does everyone think?


Also, as promised, here is a roadmap proposal. As mentioned a few days ago, this 
is only a rough draft and it addresses some issues that have been mentioned both 
in the mailing list and via other channels. Everything is open to discussion, 
but I think it would do us all good to come to a consensus and even try to 
attach release dates so that users of TORQUE have a good feel for when they can 
expect new features and versions.

Note also that this roadmap assumes we release TORQUE 2.4.x as it stands and
do not create a "new" 2.4.

TORQUE 2.3.7 - release in the next few weeks (June 15th?)

    * Only bug fixes allowed from now on.

TORQUE 2.4 - possible release around August 30th?

    * Complete 2.3-fixes merge
    * A single new feature: CPU affinity (very basic implementation)
    * Start code lockdown soon and prepare for release
    * Get early-adopters to install and increase internal CRI testing
    * Get docs ready for this release (improve BLCR explanation, update MPI docs,

TORQUE 2.5 - possible release before winter (November 1st?)

    * TORQUE testing framework (multi pbs_mom's model)
    * Eliminate need for privileged ports (configurable)
    * CPUsets improvements
    * Get job arrays out of beta
    * Job array dependencies

TORQUE 3.0 - release sometime next year?

    * Alternate communication model between pbs_server, MOMs, and sisters to
      improve scalability on very large systems with large MPI jobs
    * Closer integration with MPI wireups?
    * Improve TORQUE's high-availability feature
    * Refactor code to make it easier to maintain and work with
    * Add code to help better support GPUs in clusters
    * Continue improvement of documentation
    * Make job save format more flexible and less brittle

Again, this is in no way something we have set in stone or are handing down as 
doctrine. We are interested in your additions or recommendations.

Please let us know what you think.


Josh Butikofer
Cluster Resources, Inc.

