[torqueusers] Torque HA w/Maui + BLCR [SEC=UNCLASSIFIED]

Hi All,

I know I am shooting high here, but has anyone got Torque (v2.5.3) running in
HA mode with both Maui (3.3) and BLCR? (I think BLCR may be my show stopper
for HA.) I am currently evaluating HA with Torque, and what I wish to achieve
is the ability to submit jobs from two nominated 'head nodes' for my cluster,
in case someone is special enough to run a compile or a job that goes 'wild'
and crashes my head node (it has happened). The machine comes back and jobs
continue to run, but the node status gets into a bit of a mess when the box
crashes, as I would expect (Torque does handle it quite well).

I am currently testing the following on two Scientific Linux 5.5 x86_64
VMs to see how far I should go in my production environment. I would love to
get this all working, and to help the Torque devs if it does not work.

I have built each VM with BLCR and installed RPMs of 2.5.3 with
'--enable-blcr --enable-high-availability'. I then built Maui, started both
after setting server_name 'server1,server2' and set the following (it seems
when I started the second node all the conf was already there, so I took
that as a good thing):

qmgr -c "set server acl_hosts += c-test.ansto.gov.au"
qmgr -c "set server acl_hosts += f-test.ansto.gov.au"
qmgr -c "set server lock_file = /data1/var/run/pbs_server.lock"
qmgr -c "set server lock_file_check_time = 5"
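
To confirm that both head nodes see the same settings, something like the
following could be run on each. This is only a sketch: ha_settings is a
hypothetical helper name, and it assumes the lock_file attributes show up in
the "print server" output once they are set.

```shell
# Hypothetical helper: keep only the HA-related lines from
# `qmgr -c "print server"` output read on stdin.
ha_settings() {
    grep -E 'acl_hosts|lock_file'
}

# Usage on a head node with pbs_server running:
#   qmgr -c "print server" | ha_settings
```

If the two head nodes print different lines, they are presumably not sharing
the same server configuration.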

I have also mounted (and plan to mount on my test compute nodes) a common
/var/spool/torque/checkpoint NFS mountpoint. I could not work out if I
required a common /var/spool/torque/server_priv on each 'headnode' ... Is
that the case?

I have found a few bugs with building RPMs and installing using
'--enable-blcr' - I will log a bug report
* Can't start torque until I manually create /var/spool/torque/checkpoint
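
The manual step amounts to something like the sketch below.
ensure_checkpoint_dir is a hypothetical helper name, the default path is the
spool directory mentioned above, and no particular ownership or permissions
are assumed.

```shell
# Hypothetical helper: create the checkpoint directory under a Torque
# spool directory if it does not exist yet (safe to run repeatedly).
# $1 is the spool directory; defaults to /var/spool/torque.
ensure_checkpoint_dir() {
    mkdir -p "${1:-/var/spool/torque}/checkpoint"
}

# Usage (as root, before starting torque):
#   ensure_checkpoint_dir
```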

Please let me know if I have done anything dumb, or anything that could be
done better / more correctly.

Ta and Thanks for such an awesome free PBS system,
