[torqueusers] Torque HA w/Maui + BLCR [SEC=UNCLASSIFIED]

David Beer dbeer at adaptivecomputing.com
Tue Dec 14 16:02:49 MST 2010



----- Original Message -----
> Hi All,
> 
> I know I am shooting high here, but has anyone got Torque running in
> HA mode (v. 2.5.3) using both Maui (3.3) and BLCR? (I think BLCR may
> be my show stopper for HA.) I am currently evaluating the power of HA
> with Torque, and what I wish to achieve is the ability to submit jobs
> from 2 nominated ‘head nodes’ for my cluster, in case someone is
> special enough to send a compile or even run a job that goes ‘wild’
> and crashes my head node (it has happened). The machine comes back and
> jobs continue to run, but the node status gets into a bit of a mess
> when the box crashes, as I would expect (Torque does handle it quite
> well).
> 
> I am currently testing the following on two Scientific Linux 5.5
> x86_64 VMs to see how far I should go in my production environment. I
> would love to get this all working and help the devs of torque if this
> does not work.
> 
> I have built each VM with BLCR and installed RPMs of 2.5.3 with
> ‘--enable-blcr --enable-high-availability’. I then built Maui, started
> both after setting server_name ‘server1,server2’ and set the following
> (it seems when I started the second node all the conf was already
> there – so I took that as a good thing):
> 
> qmgr -c "set server acl_hosts += c-test.ansto.gov.au"
> qmgr -c "set server acl_hosts += f-test.ansto.gov.au"
> qmgr -c "set server lock_file = /data1/var/run/pbs_server.lock"
> qmgr -c "set server lock_file_check_time = 5"
> 
> I have also mounted (and plan to mount on my test compute nodes) a
> common /var/spool/torque/checkpoint NFS mountpoint. I could not work
> out if I required a common /var/spool/torque/server_priv on each
> ‘headnode’ ... Is that the case?
> 

Both of the nodes running pbs_server (I assume that's what you mean by 'headnode') need to be checking the same lock file, or else HA won't work at all. Also, be sure to start pbs_server with the --ha option.
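
As a rough sketch of one way to confirm that both servers really see the same lock file location: write a marker on one head node and read it back on the other. The function names and the .ha_marker file name below are hypothetical, made up for illustration; the directory is the one from the lock_file path in the original post and is assumed to be on shared storage.

```shell
#!/bin/sh
# Sketch only: helper functions to check that a directory is shared
# between two head nodes. Nothing here is part of Torque itself.

# write_marker DIR TOKEN
# Run on the first head node: store TOKEN in DIR/.ha_marker.
write_marker() {
    printf '%s\n' "$2" > "$1/.ha_marker"
}

# check_marker DIR TOKEN
# Run on the second head node: succeed only if DIR/.ha_marker
# exists and contains exactly TOKEN.
check_marker() {
    [ -f "$1/.ha_marker" ] && [ "$(cat "$1/.ha_marker")" = "$2" ]
}

# Example use (assumed paths, adjust to your site):
#   on server1:  write_marker /data1/var/run ha-test-1234
#   on server2:  check_marker /data1/var/run ha-test-1234 && echo "shared"
```

If the check fails on the second node, the two pbs_server instances would not be contending for the same lock file. If it passes, start pbs_server with --ha on each node.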

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1656 S. East Bay Blvd. Suite #300
     Provo, UT 84606


