[torquedev] Versioning Issues & Development Roadmap
josh at clusterresources.com
Mon Jun 1 20:09:44 MDT 2009
Yeah, we agree that TORQUE's HA can be improved. We are actually working on making it more tolerant of NFS/network problems right now for a customer. We will be rolling these enhancements in when they are ready.
I personally haven't looked into the restartable job issues you were seeing--I'll have to follow up with others and see if they've experienced something similar.
Cluster Resources, Inc.
----- "Joshua Bernstein" <jbernstein at penguincomputing.com> wrote:
> I generally like what I see here,
> But I'd really like to see some sort of work done on TORQUE's HA
> Perhaps it's just a documentation issue, but we weren't able to get it
> to work
> properly and had numerous issues with the restartable flag, properly
> jobs once we implemented failover outside of TORQUE's own HA service.
> Our tale
> of failover issues can be found here:
> -Joshua Bernstein
> Software Engineer
> Penguin Computing
> Josh Butikofer wrote:
> > Everyone,
> > As you know we've been discussing how best to get TORQUE to have a
> more sane
> > release schedule and versioning scheme. I think most of us agree
> that the
> > current way that TORQUE does things pertaining to versioning is
> something we
> > want to move away from. We want the stable branch of TORQUE to only
> accept bug
> > fixes, the next minor version of TORQUE can receive features, and
> the major
> > versions of TORQUE can receive major refactoring, scary features,
> changes in
> > default behavior, etc.
> > I also think we can all agree that 2.3.x should be locked down and
> no new
> > features will be allowed into it from now on. Only bug fixes will be
> > Same goes for 2.1.x, but I think this was implied, since Garrick has
> been taking
> > care of that branch and that appears to be his philosophy.
> > Now here comes the tricky part--what to do with 2.4.x. How do we
> shift gears mid
> > development? Any way I look at it, there will be a little bit of pain and
> > imperfection as we move to the new model.
> > There has been the proposal given on this list that we should turn
> 2.4.x into
> > 3.x and then make a new 2.4.x off of 2.3.x. This is not a bad idea, but
> > some users (talked to CRI outside of the list) don't like this idea
> because it
> > will cause confusion. TORQUE 2.4.x has been released as beta to
> these users. It
> > has been called 2.4.x for a while. Those users expect certain
> features and
> > capabilities to be in 2.4.x. If we get rid of 2.4, without properly
> > it, and make a new 2.4 without the same feature-set it will be
> confusing. We
> > also agree that this would be confusing.
> > An alternate proposal is that we get 2.4.x ready to release. It
> still needs more
> > testing and polish before it is ready for general use, but we
> believe that it
> > can be ready in several months. The refactoring that was done in
> > is not bad, per se, or too large--but it did introduce bugs, many of
> which have
> > been eliminated. For sure there are more, but users have been running
> 2.4 for a
> > while now without major problems. There are some backwards
> compatibility issues
> > and changes in default behavior that cause concern. We could remove
> these and
> > slate them for 3.x--this would make releasing 2.4 more palatable.
> > After releasing 2.4.x, we could branch off 2.5.x and start our new
> rules. A
> > better TORQUE website could help explain the difference between 2.1,
> 2.3, and
> > 2.4 and give recommendations which version should be used.
> > TORQUE 3.x could be branched off at any time. I would like to get
> rid of "trunk"
> > and just call the branches what they are: "3.0", etc. It might be
> less confusing.
> > Anyway, those are the two options right now. What does everyone
> > -----
> > Also, as promised, here is a roadmap proposal. As mentioned a few
> days ago, this
> > is only a rough draft and it addresses some issues that have been
> mentioned both
> > in the mailing list and via other channels. Everything is open to
> > but I think it would do us all good to come to a consensus and even
> try to
> > attach release dates so that users of TORQUE have a good feel for
> when they can
> > expect new features and versions.
> > Note also that this roadmap assumes we do release TORQUE 2.4.x as it
> stands and
> > not create a "new" 2.4.
> > TORQUE 2.3.7 - release in the next few weeks (June 15th?)
> > * Only bug fixes allowed from now on.
> > TORQUE 2.4 - possible release around August 30th?
> > * Complete 2.3-fixes merge
> > * A single new feature: CPU affinity (very basic
> > * Start code lockdown soon and prepare for release
> > * Get early-adopters to install and increase internal CRI
> > * Get docs ready for this release (improve BLCR explanation,
> update MPI docs,
> > etc.)
> > TORQUE 2.5 - possible release before winter (November 1st?)
> > * TORQUE testing framework (multi pbs_mom's model)
> > * Eliminate need for privileged ports (configurable)
> > * CPUsets improvements
> > * Get job arrays out of beta
> > * Job array dependencies
> > TORQUE 3.0 - release sometime next year?
> > * Alternate communication model between pbs_server, MOMs, and
> sisters to
> > improve scalability on very large systems with large MPI jobs
> > * Closer integration with MPI wireups?
> > * Improve TORQUE's high-availability feature
> > * Refactor code to make it easier to maintain and work with
> > * Add code to help better support GPUs in clusters
> > * Continue improvement of documentation
> > * Make job save format more flexible and less brittle
> > Again, this is in no way something we have set in stone or are
> handing down as
> > doctrine. We are interested in your additions or recommendations.
> > Please let us know what you think.
> > Thanks,