[torquedev] Versioning Issues & Development Roadmap
jbernstein at penguincomputing.com
Mon Jun 1 17:16:03 MDT 2009
I generally like what I see here, but I'd really like to see some work done on
TORQUE's HA functionality. Perhaps it's just a documentation issue, but we
weren't able to get it to work properly, and we had numerous issues with the
restartable flag and with properly restarting jobs once we implemented failover
outside of TORQUE's own HA service. Our tale of failover issues can be found
here:
Josh Butikofer wrote:
> As you know we've been discussing how best to get TORQUE to have a more sane
> release schedule and versioning scheme. I think most of us agree that the
> current way that TORQUE does things pertaining to versioning is something we
> want to move away from. We want the stable branch of TORQUE to only accept bug
> fixes, the next minor version of TORQUE can receive features, and the major
> versions of TORQUE can receive major refactoring, scary features, changes in
> default behavior, etc.
> I also think we can all agree that 2.3.x should be locked down and no new
> features will be allowed into it from now on. Only bug fixes will be allowed.
> Same goes for 2.1.x, but I think this was implied, since Garrick has been taking
> care of that branch and that appears to be his philosophy.
> Now here comes the tricky part--what to do with 2.4.x. How do we shift gears mid
> development? Any way I look at it, there will be a little bit of pain and
> imperfection as we move to the new model.
> There has been the proposal given on this list that we should turn 2.4.x into
> 3.x and then make a new 2.4.x off of 2.3.x. This is not a bad idea, but
> some users (talked to CRI outside of the list) don't like this idea because it
> will cause confusion. TORQUE 2.4.x has been released as beta to these users. It
> has been called 2.4.x for a while. Those users expect certain features and
> capabilities to be in 2.4.x. If we get rid of 2.4, without properly releasing
> it, and make a new 2.4 without the same feature-set it will be confusing. We
> also agree that this would be confusing.
> An alternate proposal is that we get 2.4.x ready to release. It still needs more
> testing and polish before it is ready for general use, but we believe that it
> can be ready in several months. The refactoring that was done in 2.4
> is not bad, per se, or too large--but it did introduce bugs, many of which have
> been eliminated. For sure there are more, but users have been running 2.4 for a
> while now without major problems. There are some backwards compatibility issues
> and changes in default behavior that cause concern. We could remove these and
> slate them for 3.x--this would make releasing 2.4 more palatable.
> After releasing 2.4.x, we could branch off 2.5.x and start our new rules. A
> better TORQUE website could help explain the difference between 2.1, 2.3, and
> 2.4 and give recommendations on which version should be used.
> TORQUE 3.x could be branched off at any time. I would like to get rid of "trunk"
> and just call the branches what they are: "3.0", etc. It might be less confusing.
> Anyway, those are the two options right now. What does everyone think?
> Also, as promised, here is a roadmap proposal. As mentioned a few days ago, this
> is only a rough draft and it addresses some issues that have been mentioned both
> in the mailing list and via other channels. Everything is open to discussion,
> but I think it would do us all good to come to a consensus and even try to
> attach release dates so that users of TORQUE have a good feel for when they can
> expect new features and versions.
> Note also that this roadmap assumes we do release TORQUE 2.4.x as it stands and
> not create a "new" 2.4.
> TORQUE 2.3.7 - release in the next few weeks (June 15th?)
> * Only bug fixes allowed from now on.
> TORQUE 2.4 - possible release around August 30th?
> * Complete 2.3-fixes merge
> * A single new feature: CPU affinity (very basic implementation)
> * Start code lockdown soon and prepare for release
> * Get early-adopters to install and increase internal CRI testing
> * Get docs ready for this release (improve BLCR explanation, update MPI docs,
> TORQUE 2.5 - possible release before winter (November 1st?)
> * TORQUE testing framework (multi pbs_mom's model)
> * Eliminate need for privileged ports (configurable)
> * CPUsets improvements
> * Get job arrays out of beta
> * Job array dependencies
> TORQUE 3.0 - release sometime next year?
> * Alternate communication model between pbs_server, MOMs, and sisters to
> improve scalability on very large systems with large MPI jobs
> * Closer integration with MPI wireups?
> * Improve TORQUE's high-availability feature
> * Refactor code to make it easier to maintain and work with
> * Add code to help better support GPUs in clusters
> * Continue improvement of documentation
> * Make job save format more flexible and less brittle
> Again, this is in no way something we have set in stone or are handing down as
> doctrine. We are interested in your additions or recommendations.
> Please let us know what you think.