[torqueusers] Torque HA w/Maui + BLCR [SEC=UNCLASSIFIED]

LEES, Cooper crl at ansto.gov.au
Tue Dec 14 15:36:34 MST 2010


Hi All,

I know I am shooting high here, but has anyone got Torque running in HA
mode (v. 2.5.3) using both Maui (3.3) and BLCR? (I think BLCR may be my
show stopper for HA.) I am currently evaluating HA with Torque, and what I
wish to achieve is the ability to submit jobs from two nominated 'head
nodes' for my cluster, in case someone is special enough to run a compile
or a job that goes 'wild' and crashes my head node (this has happened). The
machine comes back and jobs continue to run, but the node status gets into
a bit of a mess when the box crashes, as I would expect (Torque does handle
it quite well).

I am currently testing the following on two Scientific Linux 5.5 x86_64
VMs to see how far I should go in my production environment. I would love
to get this all working, and to help the Torque devs if it does not work.

I have built each VM with BLCR and installed RPMs of 2.5.3 built with
'--enable-blcr --enable-high-availability'. I then built Maui, started both
after setting server_name to 'server1,server2', and set the following (it
seems that when I started the second node all the conf was already there,
so I took that as a good thing):

qmgr -c "set server acl_hosts += c-test.ansto.gov.au"
qmgr -c "set server acl_hosts += f-test.ansto.gov.au"
qmgr -c "set server lock_file = /data1/var/run/pbs_server.lock"
qmgr -c "set server lock_file_check_time = 5"
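
For reference, the server_name step mentioned above could be sketched as
follows. This assumes the default TORQUE_HOME of /var/spool/torque, and
write_server_name is a hypothetical helper name, not part of Torque:

```shell
# Hypothetical helper (not part of Torque): write a comma-separated
# HA server list to <torque_home>/server_name.
write_server_name() {
    torque_home="$1"   # e.g. /var/spool/torque (assumed default location)
    primary="$2"       # first head node, e.g. server1
    secondary="$3"     # second head node, e.g. server2
    printf '%s,%s\n' "$primary" "$secondary" > "$torque_home/server_name"
}

# Example (would need to run as root on each head node):
#   write_server_name /var/spool/torque server1 server2
```

The same file would presumably need to be present on every host that talks
to the servers, which is why it is kept as a small reusable function here.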

I have also mounted (and plan to mount on my test compute nodes) a common
/var/spool/torque/checkpoint NFS mountpoint. I could not work out whether I
need a common /var/spool/torque/server_priv on each 'headnode'. Is that
the case?

P.S. 
I have found a few bugs with building RPMs and installing using
'--enable-blcr' - I will log a bug report
* Can't start torque until I manually create /var/spool/torque/checkpoint
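
The manual workaround amounts to something like the sketch below. The
function name ensure_checkpoint_dir is hypothetical, the path defaults to
the one given above, and ownership and permissions are left at whatever
mkdir applies, since the required mode is not something I have confirmed:

```shell
# Hypothetical helper: create the checkpoint directory if it is missing.
# The argument is the Torque home; it defaults to /var/spool/torque.
ensure_checkpoint_dir() {
    torque_home="${1:-/var/spool/torque}"
    # -p makes this safe to re-run: no error if the directory exists.
    mkdir -p "$torque_home/checkpoint"
}

# Example (as root, before starting torque):
#   ensure_checkpoint_dir
```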

Please let me know if I have done anything dumb, or anything that could be
done better or more correctly.

Ta and Thanks for such an awesome free PBS system,
--
Cooper Ry Lees
HPC / UNIX Systems Administrator - Information Management Services (IMS)
Australian Nuclear Science and Technology Organisation
T  +61 2 9717 3853
F  +61 2 9717 9273
M  +61 403 739 446
E  cooper.lees at ansto.gov.au
www.ansto.gov.au <http://www.ansto.gov.au>

