[torqueusers] ha torque

Prakash Velayutham prakash.velayutham at cchmc.org
Wed Apr 9 12:44:18 MDT 2008

Hi Daniel,

As you indicate, I would also like to do an IP takeover, with the
Torque/Moab services failing over to the redundant node using Heartbeat.
I tried to implement that setup at my site and failed.

My thread at http://www.clusterresources.com/pipermail/torqueusers/2008-February/006883.html
never got a useful response, and I am still waiting.

Someone else replied off list, which also implies that this does not  

"I have been working on this very same model for almost a year, in  
bits and pieces.  I have gone to the extent described at the site  
posted in the url you provided (http://www.gridpp.ac.uk/wiki/High_Availabilty_Torque)
using DRBD.  The problem you eventually face is that TORQUE uses the
gethostbyname routine to identify the host and therefore, when
failing over, you eventually must have the failover node be identified
as the original system.  In other words, at some point, you are forced  
to rename the server being failed over to and reboot it.   
Additionally, what I have done is created a system startup script  
(hpcnetwork) that performs all the tasks to start or turn on dhcpd,
named, pbs_server based on a ping to the primary master host  
(essentially what Linux HA code does).  My goal was to have the HA  
script invoke the hpcnetwork script as a resource to perform the tasks  
for failover when the event was detected.

I continue to work on this as my own little project and have created a  
VMware cluster (2 masters, 1 compute node) on my laptop to
test.  In addition to trying to get this to work, I have been trying
to integrate it into our kickstart scripts as part of an automated
build procedure to create our HA clusters.  I have most pieces working  
in some state or other, but none of it is what I would consider robust  
enough to consider placing in our production environments yet.

Additionally, I have downloaded a 2.3.snapshot and am now testing with  
this.  The ability to have TORQUE and maui using a shared device  
rather than a block device that needs to be shared with something like  
DRBD, makes life much simpler from a configuration perspective.  It  
certainly alleviates the necessity to configure something like DRBD on
the fly.  But for sites that don't have a redundant NAS device or for  
people that simply would rather use DRBD, it is still a worthy effort."
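
For illustration only, the ping-then-start logic that the reply describes
for its hpcnetwork script might look roughly like the Python sketch below.
The actual script was not posted, so everything here is an assumption: the
function names, the Linux ping flags, the number of attempts, the service
start order, and the use of `service NAME start` to bring services up.

```python
import subprocess

# Services named in the reply; the start order here is an assumption.
SERVICES = ["dhcpd", "named", "pbs_server"]


def ping_once(host):
    """Return True if one ICMP echo to host succeeds (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def start_service(name):
    """Start a service through the init wrapper (assumed: `service NAME start`)."""
    subprocess.run(["service", name, "start"], check=True)


def failover_if_needed(primary_host, is_alive=ping_once, start=start_service, attempts=3):
    """Start the head-node services only if every ping to the primary fails.

    Returns the list of services started (empty if the primary answered).
    """
    for _ in range(attempts):
        if is_alive(primary_host):
            return []
    started = []
    for name in SERVICES:
        start(name)
        started.append(name)
    return started
```

A Heartbeat resource script could call failover_if_needed("master1"), where
"master1" is a placeholder host name. The sketch does nothing about the
gethostbyname issue described in the quote, and a ping check alone does not
prevent both nodes from running pbs_server at the same time.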


On Apr 9, 2008, at 2:30 PM, Daniel Bourque wrote:

> thanks
> how much disk space does /var/spool/torque/server_priv typically use ?
> how about the maui scheduler ? should it be running on both  
> headnodes, trying to communicate with localhost ?
> I'm a little confused by the example, where the scheduler runs on  
> the same hosts as pbs_mom and not pbs_server... is the intent to also
> failover the scheduler along with the shared file system ?
> thanks again.
> Daniel Bourque
> Sr. Systems Engineer
> WeatherData Service Inc
> An Accuweather Company
> Steve Snelgrove wrote:
>> The 2.3 release of Torque has support for HA by allowing two head  
>> node servers to access the server_priv files on a shared file
>> system.  See http://www.clusterresources.com/torquedocs21/4.3high-availability.shtml 
>>  for more details.
>> Daniel Bourque wrote:
>>> Hi,
>>>   We're planning on setting up a torque/Maui cluster. I'm planning  
>>> on making the head node also be worker nodes, and for a 2nd worker  
>>> node to be a failover headnode.
>>> My intent is to use heartbeat to control the state of torque, Maui  
>>> and a service IP.
>>> Is this possible ?
>>> what files need to be kept in sync ?
>>> if the headnode fails, what happens to running jobs ?
>>> if the headnode fails, when Maui start on the new headnode, will  
>>> it query the pbs_mom daemons on the worker nodes to get usage info ?
>>> Thanks

Prakash Velayutham
Programmer / Analyst
Cincinnati Children's Hospital Medical Center
