[torqueusers] what version of torque to upgrade to?

David Beer dbeer at adaptivecomputing.com
Mon Feb 11 09:26:25 MST 2013


On Fri, Feb 8, 2013 at 3:24 PM, John Valdes <valdes at mcs.anl.gov> wrote:

> All,
>
> We've been using Torque 2.3.x and Maui 3.2.6px on our modest size
> (~350 nodes), production, commodity cluster successfully now for the
> last 3 years or so, and while we have encountered minor bugs every now
> and then, for the most part it has been very stable and reliable.
> Nevertheless, we're thinking that we should upgrade to a current
> version of Torque and Maui (3.3.1), partly so that we're using an
> actively maintained codebase, but also to get cgroup and better GPU
> support.  However, there are so many branches of torque available now,
> I'm not sure what version we should upgrade to.  We don't need any of
> the NUMA or scalability features of the 3.0 and 4.x branches, so
> should we stick to the 2.5.x branch?  That's getting pretty old now
> too, so maybe we should just go directly to one of the 4.x branches;
> if so, which one?
>
> Some more background, in case it factors into the decision:
>
> 1) This is a commodity cluster, using multicore CPUs (eg, Intel
>    Nehalem and Sandy Bridge) and an IB interconnect.  While the nodes
>    are technically NUMA architecture, the scale is much smaller than
>    what I believe the NUMA support in torque >= 3 intends to address,
>    so I don't think we would need the NUMA features of torque(?).
>
>
You are correct that the NUMA support from TORQUE 3 is intended for larger
NUMA machines.


> 2) As I said, this is a production cluster, so stability and proper
>    operation are critical.  Issues like the one in this thread:
>
> http://www.clusterresources.com/pipermail/torqueusers/2012-November/015236.html
>    make me nervous about upgrading. :)
>
>
I have little to no experience with Maui, so hopefully someone else can
offer some advice on this point.


> 3) We use QOS and classes fairly heavily (eg, for job prioritization
>    and for associating nodes with queues); while technically, those are
>    maui features, torque needs to cooperate properly w/ maui for those
>    to work as intended.
>
>
All versions of TORQUE should be good for this requirement.


> Any recommendations?  I can provide more information if needed.
>
>
Here's how we are developing the different branches:

2.5.x - at this point, this is a legacy branch that will only get critical
bug fixes.
3.0.x - end of life.
4.1.x - primarily a bug fix branch, but all bugs reported against it need
to be fixed.
4.2.x - the latest and greatest. Currently 4.2.0 is marked EA (early
access) as it has a few known issues. A better release of 4.2.0 should be
available this week.

It sounds like you don't require the features that are in the 4 series, so
the main consideration for whether to move to it now is how you expect to
upgrade in the future. Any upgrade from a version below 4 to 4 or higher is
a complete cluster upgrade: the protocol the moms use to talk to the server
has changed, so moms from before 4 can't communicate with a server from 4
or later. This may be a small consideration for you if you don't plan to
upgrade again, but hopefully it can inform your decision a bit.
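
As a rough illustration of the compatibility rule above, here is a minimal
Python sketch. It is not part of TORQUE; the function names and the node
inventory format are hypothetical. It also assumes that versions on the same
side of the 4.0 boundary can talk to each other, which you should confirm
against the release notes for the specific versions involved.

```python
# Illustrative only: these helpers are hypothetical and not shipped with TORQUE.

def major_version(version: str) -> int:
    """Return the leading numeric component of a version string such as '4.2.0'."""
    return int(version.split(".")[0])


def can_communicate(mom_version: str, server_version: str) -> bool:
    """Apply the rule from this message: the mom/server protocol changed at 4.0,
    so the mom and the server must be on the same side of that boundary.

    Assumption: versions on the same side of the boundary interoperate.
    """
    mom_is_new = major_version(mom_version) >= 4
    server_is_new = major_version(server_version) >= 4
    return mom_is_new == server_is_new


def moms_needing_upgrade(mom_versions: dict, server_version: str) -> list:
    """Return the sorted node names whose mom cannot talk to the given server."""
    return sorted(
        node
        for node, version in mom_versions.items()
        if not can_communicate(version, server_version)
    )
```

For example, given an inventory that maps node names to the mom version each
one runs, moms_needing_upgrade(inventory, "4.2.0") would list every node still
running a pre-4 mom, which is the set that has to be upgraded together with
the server.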

-- 
David Beer | Senior Software Engineer
Adaptive Computing