[torquedev] Versioning Issues & Development Roadmap
jbernstein at penguincomputing.com
Wed Jun 3 12:42:04 MDT 2009
Josh Butikofer wrote:
> Yeah, we agree that TORQUE's HA can be improved. We are actually working on
> making it more tolerant of NFS/network problems right now for a customer. We
> will be rolling these enhancements in when they are ready.
> I personally haven't looked into the restartable job issues you were
> seeing--I'll have to follow up with others and see if they've experienced
> something similar.
I would very much appreciate it. I believe we suggested a fix, and it does seem
to work, but we haven't spent too much time making sure it is indeed the right fix.
> Josh Butikofer Cluster Resources, Inc.
> ----- "Joshua Bernstein" <jbernstein at penguincomputing.com> wrote:
>> I generally like what I see here,
>> But I'd really like to see some sort of work done on TORQUE's HA
>> functionality. Perhaps it's just a documentation issue, but we weren't able
>> to get it to work properly and had numerous issues with the restartable
>> flag, properly restarting jobs once we implemented failover outside of
>> TORQUE's own HA service. Our tale of failover issues can be found here:
>> -Joshua Bernstein Software Engineer Penguin Computing
>> Josh Butikofer wrote:
>>> As you know we've been discussing how best to get TORQUE to have a
>> more sane
>>> release schedule and versioning scheme. I think most of us agree
>> that the
>>> current way that TORQUE does things pertaining to versioning is
>> something we
>>> want to move away from. We want the stable branch of TORQUE to only
>> accept bug
>>> fixes, the next minor version of TORQUE can receive features, and
>> the major
>>> versions of TORQUE can receive major refactoring, scary features,
>> changes in
>>> default behavior, etc.
>>> I also think we can all agree that 2.3.x should be locked down and
>> no new
>>> features will be allowed into it from now on. Only bug fixes will be allowed.
>>> Same goes for 2.1.x, but I think this was implied, since Garrick has
>> been taking
>>> care of that branch and that appears to be his philosophy.
>>> Now here comes the tricky part--what to do with 2.4.x. How do we
>> shift gears mid
>>> development? Any way I look at it, there will be a little bit of pain and
>>> imperfection as we move to the new model.
>>> There has been the proposal given on this list that we should turn
>> 2.4.x into
>>> 3.x and then make a new 2.4.x off of 2.3.x. This is not a bad idea, but
>>> some users (talked to CRI outside of the list) don't like this idea
>> because it
>>> will cause confusion. TORQUE 2.4.x has been released as beta to
>> these users. It
>>> has been called 2.4.x for a while. Those users expect certain
>> features and
>>> capabilities to be in 2.4.x. If we get rid of 2.4, without properly
>>> it, and make a new 2.4 without the same feature-set it will be
>> confusing. We
>>> also agree that this would be confusing.
>>> An alternate proposal is that we get 2.4.x ready to release. It
>> still needs more
>>> testing and polish before it is ready for general use, but we
>> believe that it
>>> can be ready in several months. The refactoring that was done in
>>> is not bad, per se, or too large--but it did introduce bugs, many of
>> which have
>>> been eliminated. For sure there are more, but users have been running
>> 2.4 for a
>>> while now without major problems. There are some backwards
>> compatibility issues
>>> and changes in default behavior that cause concern. We could remove
>> these and
>>> slate them for 3.x--this would make releasing 2.4 more palatable.
>>> After releasing 2.4.x, we could branch off 2.5.x and start our new
>> rules. A
>>> better TORQUE website could help explain the difference between 2.1,
>> 2.3, and
>>> 2.4 and give recommendations on which version should be used.
>>> TORQUE 3.x could be branched off at any time. I would like to get
>> rid of "trunk"
>>> and just call the branches what they are: "3.0", etc. It might be
>> less confusing.
>>> Anyway, those are the two options right now. What does everyone think?
>>> Also, as promised, here is a roadmap proposal. As mentioned a few
>> days ago, this
>>> is only a rough draft and it addresses some issues that have been
>> mentioned both
>>> in the mailing list and via other channels. Everything is open to
>>> but I think it would do us all good to come to a consensus and even
>> try to
>>> attach release dates so that users of TORQUE have a good feel for
>> when they can
>>> expect new features and versions.
>>> Note also that this roadmap assumes we do release TORQUE 2.4.x as it
>> stands and
>>> not create a "new" 2.4.
>>> TORQUE 2.3.7 - release in the next few weeks (June 15th?)
>>> * Only bug fixes allowed from now on.
>>> TORQUE 2.4 - possible release around August 30th?
>>> * Complete 2.3-fixes merge
>>> * A single new feature: CPU affinity (very
>>> * Start code lockdown soon and prepare for release
>>> * Get early-adopters to install and increase internal CRI
>>> * Get docs ready for this release (improve BLCR explanation, update MPI docs,
>>> TORQUE 2.5 - possible release before winter (November 1st?)
>>> * TORQUE testing framework (multi pbs_mom's model)
>>> * Eliminate need for privileged ports (configurable)
>>> * CPUsets improvements
>>> * Get job arrays out of beta
>>> * Job array dependencies
>>> TORQUE 3.0 - release sometime next year?
>>> * Alternate communication model between pbs_server, MOMs, and sisters
>>>   to improve scalability on very large systems with large MPI jobs
>>> * Closer integration with MPI wireups?
>>> * Improve TORQUE's high-availability feature
>>> * Refactor code to make it easier to maintain and work with
>>> * Add code to help better support GPUs in clusters
>>> * Continue improvement of documentation
>>> * Make job save format more flexible and less brittle
>>> Again, this is in no way something we have set in stone or are
>> handing down as
>>> doctrine. We are interested in your additions or recommendations.
>>> Please let us know what you think.