[torqueusers] How to upgrade 3.x -> 4.x
d-ulrick at comcast.net
Thu Jun 28 08:44:29 MDT 2012
I'm the sysadmin for a new Linux-based HPC cluster that consists of a login
node, two storage nodes, and 60 compute nodes. The hardware vendor
preinstalled it with TORQUE 3.0.4 with Moab 6.1.5 as the scheduler. I've
been tasked with upgrading TORQUE to a newer release, hopefully 4.0.2.
Also, I've been asked to install the TORQUE PAM module on the compute
nodes so that non-root users cannot log in interactively to a node unless
they have a reservation on that node.
At a high level the upgrade path makes sense, but I'm a bit at sea on the
details. I have 10+ years of experience with Linux and 20+ years as an IBM
mainframe sysprog, but I'm brand-new to HPC as well as to TORQUE and Moab.
I'm therefore hoping you folks can offer some feedback on how I might most
easily accomplish the upgrade.
Right now TORQUE 3.0.4 and Moab 6.1.5 are running on the first storage
node with pbs_mom (also TORQUE 3.0.4) running on all 60 compute nodes. To
convert to TORQUE 4.0.2, I've cooked up this plan:
1. Clone the Moab directory tree to prepare for running a test Moab on
storage node 1. The new Moab will be connected to the new TORQUE.
2. Modify the production TORQUE and Moab to use only 58 of the compute
nodes. The remaining 2 nodes will be used by the new TORQUE.
3. Install a TORQUE 4.0.2 pbs_server on storage node 2. Configure this
server to use the 2 compute nodes.
4. Install the TORQUE 4.0.2 pbs_mom and trqauthd services and PAM module
on the 2 compute nodes.
5. Modify the configuration under /var/spool/torque on the 2 compute nodes
to point to storage node 2.
6. Modify the configuration under the test Moab directory tree to listen
on an alternate port and use the 2 compute nodes.
7. Install the new TORQUE commands to an alternate directory on the login
node.
8. At this point, I should have the production TORQUE and Moab continuing
to manage 58 of the compute nodes, with the new TORQUE and test Moab
managing the other 2 nodes.
9. Test the new TORQUE in conjunction with the test Moab, most likely from
a 'bash' shell whose environment includes the new TORQUE binaries in the
PATH and points to the alternate Moab server via the appropriate
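The node split in steps 2, 3, and 5 above could be sketched roughly as follows. This is only a sketch under stated assumptions: the hostnames (node01 through node60) and the core count per node are hypothetical, since neither is given here, and the script merely renders text in the "hostname np=N" form used by TORQUE's server_priv/nodes file. It does not touch any running server.

```python
# Hypothetical values: neither hostnames nor core counts are stated in the post.
NODE_COUNT = 60
TEST_NODE_COUNT = 2
CORES_PER_NODE = 8


def split_nodes(hostnames, test_count):
    """Return (production, test); the last `test_count` hosts go to the test side."""
    if not 0 < test_count < len(hostnames):
        raise ValueError("test_count must leave at least one node on each side")
    return hostnames[:-test_count], hostnames[-test_count:]


def render_nodes_file(hostnames, cores):
    """Render one 'hostname np=N' line per node, as in server_priv/nodes."""
    return "".join(f"{host} np={cores}\n" for host in hostnames)


hostnames = [f"node{i:02d}" for i in range(1, NODE_COUNT + 1)]
production, test = split_nodes(hostnames, TEST_NODE_COUNT)

# One nodes file per pbs_server: production (storage node 1) and test (storage node 2).
production_nodes_file = render_nodes_file(production, CORES_PER_NODE)
test_nodes_file = render_nodes_file(test, CORES_PER_NODE)

print(f"production: {len(production)} nodes, test: {len(test)} nodes")
print(test_nodes_file, end="")
```

Generating both files from one host list keeps the two sets disjoint, so no compute node ends up claimed by both pbs_server instances during the test phase.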
Assuming that testing confirms that I have a stable configuration for the
new TORQUE version, I'd then take over the whole cluster for a few hours
and do the following:
1. Shut down both production and test Moab and TORQUE (pbs_server and all
2. On the login node, storage node 1, and the 58 compute nodes, move the old
TORQUE binaries, etc., to an alternate directory and copy the new TORQUE
files into their place. With this done, all 60 compute nodes should have
identical TORQUE directory trees.
3. Configure the production Moab and TORQUE to manage all 60 nodes.
4. Start the production pbs_server and Moab on storage node 1, then enable
the PAM module and start all needed TORQUE services on the 58 compute nodes.
5. At this point all nodes should be using TORQUE 4.0.2.
6. Test before returning the cluster to the users.
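For step 5 of the cutover, one way to confirm that every host really ended up on the new release is to compare the version each host reports against the target. The sketch below assumes the version strings have already been collected by whatever means is convenient; the host names and sample data are hypothetical, and None stands for a host that did not answer.

```python
TARGET_VERSION = "4.0.2"


def find_stragglers(reported, target):
    """Return {host: version} for every host not reporting `target`.

    A version of None means the host gave no answer at all.
    """
    return {host: ver for host, ver in sorted(reported.items()) if ver != target}


# Hypothetical sample data; in practice this would be gathered from each host.
reported = {
    "storage1": "4.0.2",
    "login": "4.0.2",
    "node01": "4.0.2",
    "node17": "3.0.4",  # still running the old pbs_mom
    "node42": None,     # did not respond
}

stragglers = find_stragglers(reported, TARGET_VERSION)
for host, version in stragglers.items():
    print(f"{host}: {version or 'no response'}")
```

An empty result would suggest the cluster is consistent before it is handed back to the users.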
FYI, I'm also contemplating an upgrade to Moab 7.0.2, but if at all
possible I'd rather do that as a separate phase. If I understand
correctly, Moab mostly runs on a single server (storage node 1 on our
cluster) with end-user binaries copied over to the login node, so after I
get some much-needed Moab training I'm hoping that this would be a fairly
Does this seem to be a workable plan? Could it be streamlined? Frankly,
I'm not particularly thrilled to have to do this kind of an upgrade so
early in my TORQUE experience, so if there's any possible way to reduce
the pain and suffering involved in the upgrade I'm all ears!!!