[torqueusers] Torque HA w/Maui + BLCR [SEC=UNCLASSIFIED]
crl at ansto.gov.au
Tue Dec 14 16:59:01 MST 2010
Thanks for your quick responses.
- /data1 is an NFS volume shared across all my nodes
- I have also created /var/spool/torque/checkpoint and mounted an NFS volume there so it's accessible from all nodes and head nodes ...
- It seems that the HA is working fine - the config is syncing etc.
I was just unsure whether Torque versions >2.4 still need the same shared /var/spool/torque/server_priv filesystem, or whether each machine can write to its own local /var ...
I am testing with, and will be using, Torque 2.5.3 in production (updating from 2.1.8 in non-HA mode - this is what Oscar originally installed 2+ years ago ... I have been lazy there - bad admin).
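For reference, the shared-storage layout described above could look like the following fstab entries on each node (the NFS server name and export paths are illustrative, not taken from this thread):

```
# /etc/fstab - hypothetical NFS server and export names
nfsserver:/export/data1              /data1                        nfs  defaults  0 0
nfsserver:/export/torque_checkpoint  /var/spool/torque/checkpoint  nfs  defaults  0 0
```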
Cooper Ry Lees
HPC / UNIX Administrator
Information Management Services (IMS)
Australian Nuclear Science and Technology Organisation (ANSTO)
[p] +61 2 9717 3853
[m] +61 403 739 446
[e] crl at ansto.gov.au
From: torqueusers-bounces at supercluster.org on behalf of David Beer
Sent: Wed 12/15/2010 10:02 AM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Torque HA w/Maui + BLCR [SEC=UNCLASSIFIED]
----- Original Message -----
> Hi All,
> I know I am shooting high here, but has anyone got Torque running in
> HA mode (v2.5.3) using both Maui (3.3) and BLCR (I think this
> one may be my show stopper for HA)? I am currently evaluating the power
> of HA with Torque. What I wish to achieve is the ability to submit
> jobs from 2 nominated 'head nodes' for my cluster, in case someone is
> special enough to run a compile, or even a job, that goes 'wild'
> and crashes my head node (this has happened). The machine comes back and
> jobs continue to run, but the node status gets into a bit of a mess
> when the box crashes, as I would expect (Torque does handle it quite
> well). I am currently testing the following on two Scientific Linux 5.5
> x86_64 VMs to see how far I should go in my production environment. I
> would love to get this all working and help the Torque devs if this
> does not work.
> I have built each VM with BLCR and installed RPMs of 2.5.3 built with
> '--enable-blcr --enable-high-availability'. I then built Maui, started
> both servers after setting server_name to 'server1,server2', and set the
> following (it seems that when I started the second node all the conf was
> already there - so I took that as a good thing):
> qmgr -c "set server acl_hosts += c-test.ansto.gov.au"
> qmgr -c "set server acl_hosts += f-test.ansto.gov.au"
> qmgr -c "set server lock_file = /data1/var/run/pbs_server.lock"
> qmgr -c "set server lock_file_check_time = 5"
> I have also mounted (and plan to mount on my test compute nodes) a
> common /var/spool/torque/checkpoint NFS mountpoint. I could not work
> out whether I need a common /var/spool/torque/server_priv on each
> 'head node' ... Is that the case?
Both of the nodes running pbs_server (I assume that's what you mean by 'head node') need to be checking the same lock file, or HA won't work at all. Also, be sure to start pbs_server with the --ha option.
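Putting the advice above together, a minimal sketch of the HA-relevant steps might look like this (the lock-file path is the one from Cooper's message; run these on both head nodes, and note this is illustrative rather than a complete HA recipe):

```shell
# Point both pbs_servers at the same lock file on shared storage (/data1 is NFS),
# and have them re-check it every 5 seconds.
qmgr -c "set server lock_file = /data1/var/run/pbs_server.lock"
qmgr -c "set server lock_file_check_time = 5"

# Start pbs_server in high-availability mode on each head node.
# Whichever server holds the lock is active; the other waits to take over.
pbs_server --ha
```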
Direct Line: 801-717-3386 | Fax: 801-717-3738
1656 S. East Bay Blvd. Suite #300
Provo, UT 84606
torqueusers mailing list
torqueusers at supercluster.org