[torquedev] Versioning Issues & Development Roadmap

Josh Butikofer josh at clusterresources.com
Mon Jun 1 20:09:44 MDT 2009


Josh,

Yeah, we agree that TORQUE's HA can be improved. We are actually working on making it more tolerant of NFS/network problems right now for a customer. We will be rolling these enhancements in when they are ready.

I personally haven't looked into the restartable job issues you were seeing--I'll have to follow up with others and see if they've experienced something similar.

Josh Butikofer
Cluster Resources, Inc.
#############################

----- "Joshua Bernstein" <jbernstein at penguincomputing.com> wrote:

> I generally like what I see here,
> 
> 	But I'd really like to see some sort of work done on TORQUE's HA
> functionality. Perhaps it's just a documentation issue, but we weren't
> able to get it to work properly and had numerous issues with the
> restartable flag, properly restarting jobs once we implemented failover
> outside of TORQUE's own HA service. Our tale of failover issues can be
> found here:
> 
> http://www.clusterresources.com/pipermail/torquedev/2009-April/001472.html
> 
> -Joshua Bernstein
> Software Engineer
> Penguin Computing
> 
> Josh Butikofer wrote:
> > Everyone,
> > 
> > As you know we've been discussing how best to get TORQUE to have a
> more sane
> > release schedule and versioning scheme. I think most of us agree
> that the 
> > current way that TORQUE does things pertaining to versioning is
> something we 
> > want to move away from. We want the stable branch of TORQUE to only
> accept bug 
> > fixes, the next minor version of TORQUE can receive features, and
> the major 
> > versions of TORQUE can receive major refactoring, scary features,
> changes in 
> > default behavior, etc.
> > 
> > I also think we can all agree that 2.3.x should be locked down and
> no new
> > features will be allowed into it from now on. Only bug fixes will be
> allowed.
> > Same goes for 2.1.x, but I think this was implied, since Garrick has
> been taking
> > care of that branch and that appears to be his philosophy.
> > 
> > Now here comes the tricky part--what to do with 2.4.x. How do we
> shift gears mid
> > development? Any way I look at it, there will be a little bit of pain
> and
> > imperfection as we move to the new model.
> > 
> > There has been the proposal given on this list that we should turn
> 2.4.x into
> > 3.x and then make a new 2.4.x off of 2.3.x. This is not a bad idea,
> but
> > some users (who talked to CRI outside of the list) don't like this idea
> because it
> > will cause confusion. TORQUE 2.4.x has been released as beta to
> these users. It
> > has been called 2.4.x for a while. Those users expect certain
> features and
> > capabilities to be in 2.4.x. If we get rid of 2.4, without properly
> releasing 
> > it, and make a new 2.4 without the same feature-set it will be
> confusing. We 
> > also agree that this would be confusing.
> > 
> > An alternate proposal is that we get 2.4.x ready to release. It
> still needs more
> > testing and polish before it is ready for general use, but we
> believe that it
> > can be ready in several months. The refactoring that was done in
> 2.4
> > is not bad, per se, or too large--but it did introduce bugs, many of
> which have 
> > been eliminated. For sure there are more, but users have been running
> 2.4 for a 
> > while now without major problems. There are some backwards
> compatibility issues 
> > and changes in default behavior that cause concern. We could remove
> these and 
> > slate them for 3.x--this would make releasing 2.4 more palatable.
> > 
> > After releasing 2.4.x, we could branch off 2.5.x and start our new
> rules. A 
> > better TORQUE website could help explain the difference between 2.1,
> 2.3, and 
> > 2.4 and give recommendations on which version should be used.
> > 
> > TORQUE 3.x could be branched off at any time. I would like to get
> rid of "trunk"
> > and just call the branches what they are: "3.0", etc. It might be
> less confusing.
> > 
> > Anyway, those are the two options right now. What does everyone
> think?
> > 
> > -----
> > 
> > Also, as promised, here is a roadmap proposal. As mentioned a few
> days ago, this 
> > is only a rough draft and it addresses some issues that have been
> mentioned both 
> > in the mailing list and via other channels. Everything is open to
> discussion, 
> > but I think it would do us all good to come to a consensus and even
> try to 
> > attach release dates so that users of TORQUE have a good feel for
> when they can 
> > expect new features and versions.
> > 
> > Note also that this roadmap assumes we do release TORQUE 2.4.x as it
> stands and
> > do not create a "new" 2.4.
> > 
> > TORQUE 2.3.7 - release in the next few weeks (June 15th?)
> > 
> >     * Only bug fixes allowed from now on.
> > 
> > TORQUE 2.4 - possible release around August 30th?
> > 
> >     * Complete 2.3-fixes merge
> >     * A single new feature: CPU affinity (very basic
> implementation)
> >     * Start code lockdown soon and prepare for release
> >     * Get early-adopters to install and increase internal CRI
> testing
> >     * Get docs ready for this release (improve BLCR explanation,
> update MPI docs,
> > etc.)
> > 
> > TORQUE 2.5 - possible release before winter (November 1st?)
> > 
> >     * TORQUE testing framework (multi pbs_mom's model)
> >     * Eliminate need for privileged ports (configurable)
> >     * CPUsets improvements
> >     * Get job arrays out of beta
> >     * Job array dependencies
> > 
> > TORQUE 3.0 - release sometime next year?
> > 
> >     * Alternate communication model between pbs_server, MOMs, and
> sisters to
> > improve scalability on very large systems with large MPI jobs
> >     * Closer integration with MPI wireups?
> >     * Improve TORQUE's high-availability feature
> >     * Refactor code to make it easier to maintain and work with
> >     * Add code to help better support GPUs in clusters
> >     * Continue improvement of documentation
> >     * Make job save format more flexible and less brittle
> > 
> > Again, this is in no way something we have set in stone or are
> handing down as 
> > doctrine. We are interested in your additions or recommendations.
> > 
> > Please let us know what you think.
> > 
> > Thanks,
> >

