[torqueusers] ha torque
prakash.velayutham at cchmc.org
Wed Apr 9 12:44:18 MDT 2008
As you indicate, I would also like to do an IP takeover, with the
Torque/Moab services failing over to the redundant node using Heartbeat.
I tried to implement that setup at my site and failed.
My thread at http://www.clusterresources.com/pipermail/torqueusers/2008-February/006883.html
never got any useful response and I am still waiting.
Someone else replied off list, which also implies that this does not
"I have been working on this very same model for almost a year, in
bits and pieces. I have gone to the extent described at the site
posted in the URL you provided
(http://www.gridpp.ac.uk/wiki/High_Availabilty_Torque) using DRBD. The
problem you eventually face is that TORQUE uses the gethostbyname
routine to identify the host, and therefore, when failing over, the
failover node must eventually be identified as the original system. In
other words, at some point, you are forced to rename the server being
failed over to and reboot it.
Additionally, I have created a system startup script (hpcnetwork)
that performs all the tasks needed to start dhcpd, named, and
pbs_server, based on a ping to the primary master host
(essentially what Linux HA code does). My goal was to have the HA
script invoke the hpcnetwork script as a resource to perform the tasks
for failover when the event was detected.
I continue to work on this as my own little project and have set up a
VMware cluster (2 masters, 1 compute node) on my laptop to test it.
In addition to trying to get this to work, I have been trying
to integrate it into our kickstart scripts as part of an automated
build procedure to create our HA clusters. I have most pieces working
in some state or other, but none of it is what I would consider robust
enough to consider placing in our production environments yet.
Additionally, I have downloaded a 2.3.snapshot and am now testing with
this. The ability to have TORQUE and maui using a shared device
rather than a block device that needs to be shared with something like
DRBD makes life much simpler from a configuration perspective. It
certainly alleviates the necessity to configure something like DRBD on
the fly. But for sites that don't have a redundant NAS device or for
people that simply would rather use DRBD, it is still a worthy effort."
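
The ping-driven approach described in the quoted reply can be sketched roughly as follows. This is only an illustration under stated assumptions, not the hpcnetwork script itself, which was never posted: the host name "master1", the `service <name> start` command form, the start order, and the three-attempt threshold are all hypothetical, and "dhcpd" is assumed to be the DHCP daemon meant in the message. Note that a failed ping alone cannot tell a dead primary from a network partition, which is part of why this kind of check is usually left to Heartbeat.

```python
import subprocess
from typing import Callable, List, Sequence

# Services named in the quoted message; the start order is an assumption.
SERVICES = ("dhcpd", "named", "pbs_server")


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo to host succeeds (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def primary_is_down(host: str,
                    probe: Callable[[str], bool] = ping_once,
                    attempts: int = 3) -> bool:
    """Treat the primary as down only if every probe attempt fails."""
    return not any(probe(host) for _ in range(attempts))


def failover_commands(services: Sequence[str] = SERVICES) -> List[List[str]]:
    """Build the (hypothetical) commands that start each service locally."""
    return [["service", name, "start"] for name in services]


def run_failover(host: str,
                 probe: Callable[[str], bool] = ping_once,
                 attempts: int = 3,
                 dry_run: bool = True) -> List[List[str]]:
    """Return the takeover commands if the primary is down, else an empty list.

    With dry_run=False the commands are also executed in order.
    """
    if not primary_is_down(host, probe, attempts):
        return []
    commands = failover_commands()
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_failover("master1")` on the standby node would only report what it would start; passing `dry_run=False` actually runs the commands. Passing the probe as a parameter keeps the decision logic testable without touching the network.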
On Apr 9, 2008, at 2:30 PM, Daniel Bourque wrote:
> how much disk space does /var/spool/torque/server_priv typically use ?
> how about the maui scheduler ? should it be running on both
> headnodes, trying to communicate with localhost ?
> I'm a little confused by the example, where the scheduler runs on
> the hosts as pbs_mom and not pbs_server... is the intent to also
> failover the scheduler along with the shared file system ?
> thanks again.
> Daniel Bourque
> Sr. Systems Engineer
> WeatherData Service Inc
> An Accuweather Company
> Steve Snelgrove wrote:
>> The 2.3 release of Torque has support for HA by allowing two head
>> node servers to access the server_priv files on a shared file
>> system. See http://www.clusterresources.com/torquedocs21/4.3high-availability.shtml
>> for more details.
>> Daniel Bourque wrote:
>>> We're planning on setting up a torque/Maui cluster. I'm planning
>>> on making the head node also be a worker node, and for a 2nd worker
>>> node to be a failover headnode.
>>> My intent is to use heartbeat to control the state of torque, Maui
>>> and a service IP.
>>> Is this possible ?
>>> what files need to be kept in sync ?
>>> if the headnode fails, what happens to running jobs ?
>>> if the headnode fails, when Maui starts on the new headnode, will
>>> it query the pbs_mom daemons on the worker nodes to get usage info ?
Programmer / Analyst
Cincinnati Children's Hospital Medical Center