[torqueusers] Torque in a high-availability setting

Prakash Velayutham Prakash.Velayutham at cchmc.org
Fri Feb 22 09:23:20 MST 2008

Hello All,

I am trying to set up Torque (2.3.0) in a high-availability mode (NOT
with the built-in HA feature that you enable by passing the --ha flag to
pbs_server, but with heartbeat and shared storage on OCFS2).

Here is the setup (two servers, torqueserver1 and torqueserver2):

torqueserver1:
	NIC eth0 - a.a.a.a
	NIC eth1 - b.b.b.b

torqueserver2:
	NIC eth0 - c.c.c.c
	NIC eth1 - d.d.d.d

Both eth1 interfaces are connected to the cluster's private network, and
both eth0 interfaces are connected to the public network. I do not yet
have a separate heartbeat link between the servers, but will soon
establish a serial link. For now, heartbeat also runs over eth1.

The HA resources being failed over are:

IP address - e.e.e.e (on the public network)
IP address - f.f.f.f (on the cluster private network)

I want the DNS name torqueserver to resolve to e.e.e.e (the public IP),
and that is the address I want recognized as the server_name.

So essentially, when torqueserver1 goes down (scheduled or
unscheduled), I would like e.e.e.e and f.f.f.f to fail over to
torqueserver2, with the DNS entry remaining valid (as with any
heartbeat-managed IP resource).
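
To make the intended behaviour concrete, here is a minimal sketch of the
failover I have in mind. It is only a toy model in Python, not heartbeat
configuration: the Node class and the fail_over function are hypothetical
names made up for illustration, and the addresses are the placeholders
used above.

```python
from dataclasses import dataclass, field

# Floating resources from the description above (placeholder addresses).
FLOATING_IPS = ("e.e.e.e", "f.f.f.f")


@dataclass
class Node:
    """Hypothetical stand-in for one of the two Torque servers."""
    name: str
    alive: bool = True
    resources: set = field(default_factory=set)


def fail_over(active: Node, standby: Node) -> Node:
    """Return the node that should own the floating resources.

    If the active node is down, every resource it held is moved to the
    standby node; otherwise nothing changes.
    """
    if active.alive:
        return active
    if not standby.alive:
        raise RuntimeError("no healthy node left to host the resources")
    standby.resources |= active.resources
    active.resources.clear()
    return standby


primary = Node("torqueserver1", resources=set(FLOATING_IPS))
secondary = Node("torqueserver2")

primary.alive = False  # scheduled or unscheduled outage
owner = fail_over(primary, secondary)
```

After the outage both addresses sit on torqueserver2, and the name
torqueserver still resolves to e.e.e.e, so clients should not notice
which physical host is answering.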

What should my configuration files (server_name on the server and the
MOMs, mom_priv/config, etc.) look like for this case? And does anyone
already have such a setup working?
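
My working assumption, which I would like confirmed or corrected, is
that every file should name the floating hostname rather than either
physical host. The Python sketch below just renders the two files under
that assumption; render_configs and write_configs are hypothetical
helper names, and whether the MOMs on the private network should
instead point at f.f.f.f is exactly what I am unsure about.

```python
from pathlib import Path

# Assumed floating DNS name for e.e.e.e, per the description above.
FLOATING_NAME = "torqueserver"


def render_configs(server_name: str) -> dict:
    """Map each config file (relative to the Torque home) to its content."""
    return {
        # server_name holds the hostname that clients and MOMs contact.
        "server_name": server_name + "\n",
        # $pbsserver tells pbs_mom which host runs pbs_server.
        "mom_priv/config": f"$pbsserver {server_name}\n",
    }


def write_configs(root: Path, server_name: str) -> list:
    """Write the rendered files under root and return the paths written."""
    written = []
    for relative, content in render_configs(server_name).items():
        path = root / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)
        written.append(path)
    return written
```

Calling write_configs(Path("/tmp/torque-test"), FLOATING_NAME) would
produce both files in a scratch directory for inspection before
anything is copied to a real node.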

I came across this page while googling, but its status section warns
that the setup it describes is not working:
http://www.gridpp.ac.uk/wiki/High_Availabilty_Torque

I am also planning to do the same with Moab, but that seems to be
more difficult than this.

Thanks a lot,
