[torqueusers] How to upgrade 3.x -> 4.x

Dave Ulrick d-ulrick at comcast.net
Thu Jun 28 08:44:29 MDT 2012


Hi,

I'm sysadmin for a new Linux-based HPC cluster that consists of a login 
node, two storage nodes, and 60 compute nodes. The hardware vendor 
preinstalled it with TORQUE 3.0.4 with Moab 6.1.5 as the scheduler. I've 
been tasked with upgrading TORQUE to a newer release, hopefully 4.0.2. 
Also, I've been asked to install the TORQUE PAM module on the compute 
nodes in order to prevent non-root users from logging in interactively to 
a node unless they have a reservation on that node.
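
For reference, my understanding is that the TORQUE PAM module is 
pam_pbssimpleauth (built when TORQUE is configured with --with-pam), and 
that it gets stacked into each compute node's sshd PAM configuration. A 
minimal sketch of what I'm picturing, with the caveat that I haven't 
tested it and the paths may vary by distro:

    # /etc/pam.d/sshd (fragment) -- assumes TORQUE built with --with-pam
    # pam_pbssimpleauth should deny the login unless the user has a job
    # (i.e. a reservation) on this node; I believe root passes through,
    # but that's something I'd verify before rolling it out.
    account    required     pam_pbssimpleauth.so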

At a high level the upgrade path makes sense, but as for the details I'm a 
bit at sea. I have 10+ years of experience with Linux and 20+ years as an 
IBM mainframe sysprog, but I'm brand-new to HPC as well as to TORQUE and 
Moab. I'm therefore hoping you folks can offer some feedback on how I 
might most easily accomplish the upgrade.

Right now TORQUE 3.0.4 and Moab 6.1.5 are running on the first storage 
node with pbs_mom (also TORQUE 3.0.4) running on all 60 compute nodes. To 
convert to TORQUE 4.0.2, I've cooked up this plan:

1. Clone the Moab directory tree to prepare for running a test Moab on 
storage node 1. The new Moab will be connected to the new TORQUE.

2. Modify the production TORQUE and Moab to use only 58 of the compute 
nodes. The remaining 2 nodes will be used by the new TORQUE.

3. Install a TORQUE 4.0.2 pbs_server on storage node 2. Configure this 
server to use the 2 compute nodes. (Steps 3-5 are sketched in the first 
example after this list.)

4. Install the TORQUE 4.0.2 pbs_mom and trqauthd services and PAM module 
on the 2 compute nodes.

5. Modify the configuration under /var/spool/torque on the 2 compute nodes 
to point to storage node 2.

6. Modify the configuration under the test Moab directory tree to listen 
on an alternate port and use the 2 compute nodes.

7. Install the new TORQUE commands to an alternate directory on the login 
node.

8. At this point, I should have the production TORQUE and Moab continuing 
to manage 58 of the compute nodes, with the new TORQUE and test Moab 
managing the other 2 nodes.

9. Test the new TORQUE in conjunction with the test Moab, most likely from 
a 'bash' shell whose environment puts the new TORQUE binaries first in the 
PATH and points to the alternate Moab server via the appropriate 
environment variable (sketched in the second example after this list).
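
To make steps 3-5 concrete, the sort of thing I have in mind is below. 
This is only a sketch: the --prefix, the hostnames (storage2, node59, 
node60), and the exact package name are my assumptions, not something 
I've run yet.

    # Build host: build 4.0.2 into its own prefix and generate the
    # self-extracting node packages ('make packages' is, as I understand
    # it, the supported way to produce the torque-package-*.sh installers)
    tar xzf torque-4.0.2.tar.gz && cd torque-4.0.2
    ./configure --prefix=/opt/torque-4.0.2 --with-pam
    make && make packages

    # Storage node 2 (step 3): install the server, declare the 2 nodes
    # (plus, I gather, the usual first-time 'pbs_server -t create' setup)
    make install
    echo storage2 > /var/spool/torque/server_name
    printf 'node59\nnode60\n' > /var/spool/torque/server_priv/nodes

    # The 2 test compute nodes (steps 4-5): mom + PAM, pointed at storage2
    ./torque-package-mom-linux-x86_64.sh --install
    echo storage2 > /var/spool/torque/server_name
    echo '$pbs_server storage2' > /var/spool/torque/mom_priv/config
    trqauthd                        # the new 4.x authorization daemon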
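
And for step 9, the test shell environment I'm imagining looks like this. 
PBS_DEFAULT is the variable I believe the TORQUE clients honor for the 
default server, and MOABHOMEDIR is how I understand the Moab clients find 
an alternate moab.cfg; the paths and port are made up:

    # Test shell: new TORQUE clients + test Moab (assumed paths/variables)
    export PATH=/opt/torque-4.0.2/bin:$PATH  # new qsub/qstat first in PATH
    export PBS_DEFAULT=storage2              # TORQUE clients -> test server
    export MOABHOMEDIR=/opt/moab-test        # Moab clients -> test moab.cfg,
                                             # whose SCHEDCFG line (e.g.
                                             # SCHEDCFG[Moab] SERVER=storage1:42600)
                                             # carries the alternate port from step 6
    qstat -B    # sanity check: should report the 4.0.2 test server
    showq       # should show the test Moab's two-node view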

Assuming that testing confirms that I have a stable configuration for the 
new TORQUE version, I'd then take over the whole cluster for a few hours 
and do the following:

1. Shut down both the production and test Moab and TORQUE (pbs_server and 
all pbs_mom services). A sketch of the stop/start commands follows this 
list.

2. On the login node, storage node 1, and 58 compute nodes, move the old 
TORQUE binaries, etc., to an alternate directory and copy in the new 
TORQUE files in their place. With this done, all 60 compute nodes should 
have identical TORQUE directory trees.

3. Configure the production Moab and TORQUE to manage all 60 nodes.

4. Start the production pbs_server and Moab on storage node 1, plus the 
PAM module and all needed TORQUE services on the 58 compute nodes.

5. At this point all nodes should be using TORQUE 4.0.2.

6. Test before returning the cluster to the users.
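
For the stop/start in steps 1 and 4, what I'd probably script is roughly 
the following. qterm and momctl are the commands I'm aware of for this; 
the init-script names, the pdsh usage, and the node list syntax are 
assumptions about our setup:

    # Step 1: quiesce everything (after jobs have drained or been held)
    service moab stop                     # production Moab on storage node 1
    qterm -t quick                        # stop pbs_server, leave running jobs be
    pdsh -w node[01-60] 'momctl -s'       # ask every pbs_mom to shut down

    # Step 4: bring the 4.0.2 stack up in dependency order
    pdsh -w node[01-60] 'service trqauthd start && service pbs_mom start'
    service pbs_server start              # on storage node 1
    service moab start                    # Moab reconnects to pbs_server
    pbsnodes -a | grep -c 'state = free'  # hope for something close to 60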

FYI, I'm also contemplating an upgrade to Moab 7.0.2, but if at all 
possible I'd rather do that as a separate phase. If I understand 
correctly, Moab mostly runs on a single server (storage node 1 on our 
cluster) with end-user binaries copied over to the login node, so after I 
get some much-needed Moab training I'm hoping that this would be a fairly 
straightforward upgrade.

Does this seem to be a workable plan? Could it be streamlined? Frankly, 
I'm not particularly thrilled to have to do this kind of upgrade so early 
in my TORQUE experience, so if there's any possible way to reduce the pain 
and suffering involved I'm all ears!!!

Thanks,
Dave
-- 
Dave Ulrick
d-ulrick at comcast.net

