[torqueusers] what version of torque to upgrade to?

John Valdes valdes at mcs.anl.gov
Fri Feb 8 15:24:09 MST 2013


We've been using Torque 2.3.x and Maui 3.2.6px on our modest-sized
(~350 nodes), production, commodity cluster successfully now for the
last 3 years or so, and while we have encountered minor bugs every now
and then, for the most part it has been very stable and reliable.
Nevertheless, we're thinking that we should upgrade to a current
version of Torque and Maui (3.3.1), partly so that we're using an
actively maintained codebase, but also to get cgroup and better GPU
support.  However, there are so many branches of torque available now,
I'm not sure what version we should upgrade to.  We don't need any of
the NUMA or scalability features of the 3.0 and 4.x branches, so
should we stick to the 2.5.x branch?  That's getting pretty old now
too, so maybe we should just go directly to one of the 4.x branches;
if so, which one?

Some more background, in case it factors into the decision:

1) This is a commodity cluster, using multicore CPUs (e.g., Intel
   Nehalem and Sandy Bridge) and an IB interconnect.  While the nodes
   are technically NUMA architecture, the scale is much smaller than
   what I believe the NUMA support in torque >= 3 intends to address,
   so I don't think we would need the NUMA features of torque(?).

2) As I said, this is a production cluster, so stability and proper
   operation are critical.  Issues like the one in this thread:
   make me nervous about upgrading. :)

3) We use QOS and classes fairly heavily (e.g., for job prioritization
   and for associating nodes with queues); while technically, those are
   maui features, torque needs to cooperate properly w/ maui for those
   to work as intended.

Any recommendations?  I can provide more information if needed.

Thanks in advance.


John Valdes                  Mathematics and Computer Science Division
valdes at mcs.anl.gov                         Argonne National Laboratory

