[torquedev] Versioning Issues & Development Roadmap

Joshua Bernstein jbernstein at penguincomputing.com
Wed Jun 3 12:42:04 MDT 2009



Josh Butikofer wrote:
> Josh,
> 
> Yeah, we agree that TORQUE's HA can be improved. We are actually working on
> making it more tolerant of NFS/network problems right now for a customer. We
> will be rolling these enhancements in when they are ready.
> 
> I personally haven't looked into the restartable job issues you were
> seeing--I'll have to follow up with others and see if they've experienced
> something similar.

I would very much appreciate it. I believe we suggested a fix, and it does seem 
to work, but we haven't spent too much time making sure it is indeed the right fix.

-Joshua Bernstein
Software Engineer
Penguin Computing

> Josh Butikofer Cluster Resources, Inc. #############################
> 
> ----- "Joshua Bernstein" <jbernstein at penguincomputing.com> wrote:
> 
>> I generally like what I see here,
>> 
>> But I'd really like to see some sort of work done on TORQUE's HA
>> functionality. Perhaps it's just a documentation issue, but we weren't able
>> to get it to work properly and had numerous issues with the restartable
>> flag and with properly restarting jobs once we implemented failover outside
>> of TORQUE's own HA service. Our tale of failover issues can be found here:
>> 
>> http://www.clusterresources.com/pipermail/torquedev/2009-April/001472.html
>> 
>> -Joshua Bernstein
>> Software Engineer
>> Penguin Computing
>> 
>> Josh Butikofer wrote:
>>> Everyone,
>>> 
>>> As you know, we've been discussing how best to get TORQUE to have a more
>>> sane release schedule and versioning scheme. I think most of us agree that
>>> the current way that TORQUE does things pertaining to versioning is
>>> something we want to move away from. We want the stable branch of TORQUE to
>>> only accept bug fixes, the next minor version of TORQUE to receive features,
>>> and the major versions of TORQUE to receive major refactoring, scary
>>> features, changes in default behavior, etc.
>>> 
>>> I also think we can all agree that 2.3.x should be locked down and no new
>>> features will be allowed into it from now on. Only bug fixes will be
>>> allowed. Same goes for 2.1.x, but I think this was implied, since Garrick
>>> has been taking care of that branch and that appears to be his philosophy.
>>> 
>>> Now here comes the tricky part--what to do with 2.4.x. How do we shift gears
>>> mid-development? Any way I look at it, there will be a little bit of pain
>>> and imperfection as we move to the new model.
>>> 
>>> There has been the proposal given on this list that we should turn 2.4.x
>>> into 3.x and then make a new 2.4.x off of 2.3.x. This is not a bad idea, but
>>> some users (who talked to CRI outside of the list) don't like this idea
>>> because it will cause confusion. TORQUE 2.4.x has been released as beta to
>>> these users. It has been called 2.4.x for a while. Those users expect
>>> certain features and capabilities to be in 2.4.x. If we get rid of 2.4
>>> without properly releasing it, and make a new 2.4 without the same feature
>>> set, it will be confusing. We also agree that this would be confusing.
>>> 
>>> An alternate proposal is that we get 2.4.x ready to release. It still needs
>>> more testing and polish before it is ready for general use, but we believe
>>> that it can be ready in several months. The refactoring that was done in 2.4
>>> is not bad, per se, or too large--but it did introduce bugs, many of which
>>> have been eliminated. For sure there are more, but users have been running
>>> 2.4 for a while now without major problems. There are some backwards
>>> compatibility issues and changes in default behavior that cause concern. We
>>> could remove these and slate them for 3.x--this would make releasing 2.4
>>> more palatable.
>>> 
>>> After releasing 2.4.x, we could branch off 2.5.x and start our new rules. A
>>> better TORQUE website could help explain the difference between 2.1, 2.3,
>>> and 2.4 and give recommendations on which version should be used.
>>> 
>>> TORQUE 3.x could be branched off at any time. I would like to get rid of
>>> "trunk" and just call the branches what they are: "3.0", etc. It might be
>>> less confusing.
>>> 
>>> Anyway, those are the two options right now. What does everyone think?
>>> 
>>> -----
>>> 
>>> Also, as promised, here is a roadmap proposal. As mentioned a few days ago,
>>> this is only a rough draft and it addresses some issues that have been
>>> mentioned both in the mailing list and via other channels. Everything is
>>> open to discussion, but I think it would do us all good to come to a
>>> consensus and even try to attach release dates so that users of TORQUE have
>>> a good feel for when they can expect new features and versions.
>>> 
>>> Note also that this roadmap assumes we do release TORQUE 2.4.x as it stands
>>> and do not create a "new" 2.4.
>>> 
>>> TORQUE 2.3.7 - release in the next few weeks (June 15th?)
>>> 
>>> * Only bug fixes allowed from now on.
>>> 
>>> TORQUE 2.4 - possible release around August 30th?
>>> 
>>> * Complete 2.3-fixes merge
>>> * A single new feature: CPU affinity (very basic implementation)
>>> * Start code lockdown soon and prepare for release
>>> * Get early-adopters to install and increase internal CRI testing
>>> * Get docs ready for this release (improve BLCR explanation, update MPI
>>>   docs, etc.)
>>> 
>>> TORQUE 2.5 - possible release before winter (November 1st?)
>>> 
>>> * TORQUE testing framework (multi pbs_mom's model)
>>> * Eliminate need for privileged ports (configurable)
>>> * CPUsets improvements
>>> * Get job arrays out of beta
>>> * Job array dependencies
>>> 
>>> TORQUE 3.0 - release sometime next year?
>>> 
>>> * Alternate communication model between pbs_server, MOMs, and sisters to
>>>   improve scalability on very large systems with large MPI jobs
>>> * Closer integration with MPI wireups?
>>> * Improve TORQUE's high-availability feature
>>> * Refactor code to make it easier to maintain and work with
>>> * Add code to help better support GPU's in clusters
>>> * Continue improvement of documentation
>>> * Make job save format more flexible and less brittle
>>> 
>>> Again, this is in no way something we have set in stone or are handing down
>>> as doctrine. We are interested in your additions or recommendations.
>>> 
>>> Please let us know what you think.
>>> 
>>> Thanks,
>>> 

