[torqueusers] what version of torque to upgrade to?
dbeer at adaptivecomputing.com
Mon Feb 11 09:26:25 MST 2013
On Fri, Feb 8, 2013 at 3:24 PM, John Valdes <valdes at mcs.anl.gov> wrote:
> We've been using Torque 2.3.x and Maui 3.2.6px on our modest size
> (~350 nodes), production, commodity cluster successfully now for the
> last 3 years or so, and while we have encountered minor bugs every now
> and then, for the most part it has been very stable and reliable.
> Nevertheless, we're thinking that we should upgrade to a current
> version of Torque and Maui (3.3.1), partly so that we're using an
> actively maintained codebase, but also to get cgroup and better GPU
> support. However, there are so many branches of torque available now,
> I'm not sure what version we should upgrade to. We don't need any of
> the NUMA or scalability features of the 3.0 and 4.x branches, so
> should we stick to the 2.5.x branch? That's getting pretty old now
> too, so maybe we should just go directly to one of the 4.x branches;
> if so, which one?
> Some more background, in case it factors into the decision:
> 1) This is a commodity cluster, using multicore CPUs (eg, Intel
> Nehalem and Sandy Bridge) and an IB interconnect. While the nodes
> are technically NUMA architecture, the scale is much smaller than
> what I believe the NUMA support in torque >= 3 intends to address,
> so I don't think we would need the NUMA features of torque(?).
You are correct that the NUMA support from TORQUE 3 is intended for larger
> 2) As I said, this is a production cluster, so stability and proper
> operation are critical. Issues like the one in this thread:
> make me nervous about upgrading. :)
I have little to no experience with Maui, so hopefully someone else can
offer some advice on this point.
> 3) We use QOS and classes fairly heavily (eg, for job prioritization
> and for associating nodes with queues); while technically, those are
> maui features, torque needs to cooperate properly w/ maui for those
> to work as intended.
All versions of TORQUE should be good for this requirement.
> Any recommendations? I can provide more information if needed.
Here's how we are developing the different branches:
2.5.x - at this point, this is a legacy branch that will only get critical
fixes.
3.0.x - end of life.
4.1.x - primarily a bug fix branch, but all bugs reported against it need
to be fixed.
4.2.x - the latest and greatest. Currently 4.2.0 is marked EA (early
access) as it has a few known issues. A better release of 4.2.0 should be
available this week.
It sounds like you don't require the features that are in the 4 series, so
the only real consideration for whether to move to it is future upgrades.
Any upgrade from a version below 4 to 4 or higher is a complete cluster
upgrade: the protocol the moms use to talk to the server has changed, so
moms from before 4 can't communicate with a server from 4 or later. This
may be a really small consideration for you if you don't plan to upgrade
again, but hopefully this can inform your decision.
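
As a rough illustration of the upgrade rule described above, here is a
minimal Python sketch. It encodes only what this message states: the
mom/server protocol changed at the 4 series, so crossing that boundary means
upgrading the server and every mom together. The function names are
hypothetical and not part of TORQUE, and the sketch says nothing about
compatibility between versions on the same side of the boundary.

```python
# Hypothetical helper, not part of TORQUE: models only the rule that the
# mom/server protocol changed at the 4 series.
PROTOCOL_CHANGE_MAJOR = 4


def major(version: str) -> int:
    """Return the leading major number of a version string such as '2.5.12'."""
    return int(version.split(".")[0])


def crosses_protocol_boundary(mom_version: str, server_version: str) -> bool:
    """True when one side is pre-4 and the other side is 4 or later."""
    mom_is_new = major(mom_version) >= PROTOCOL_CHANGE_MAJOR
    server_is_new = major(server_version) >= PROTOCOL_CHANGE_MAJOR
    return mom_is_new != server_is_new


def needs_full_cluster_upgrade(current: str, target: str) -> bool:
    """True when going from `current` to `target` crosses the 4 boundary."""
    return major(current) < PROTOCOL_CHANGE_MAJOR <= major(target)
```

For example, needs_full_cluster_upgrade("2.3.6", "4.1.4") is true, while
needs_full_cluster_upgrade("2.3.6", "2.5.12") is false, which is the
trade-off between staying on 2.5.x and moving to a 4.x branch now.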
David Beer | Senior Software Engineer