[torqueusers] Torque HA w/Maui + BLCR [SEC=UNCLASSIFIED]
knielson at adaptivecomputing.com
Tue Dec 14 15:50:59 MST 2010
Please let us know which version of TORQUE you are using.
----- Original Message -----
From: "Cooper LEES" <crl at ansto.gov.au>
To: "Torque Users Mailing List" <torqueusers at supercluster.org>
Cc: "Greg DOHERTY" <gdz at ansto.gov.au>, "Ramzi KUTTEH" <rku at ansto.gov.au>
Sent: Tuesday, December 14, 2010 3:36:34 PM
Subject: [torqueusers] Torque HA w/Maui + BLCR [SEC=UNCLASSIFIED]
Hi All,
I know I am shooting high here, but has anyone got torque (v. 2.5.3) running in Torque HA mode with both Maui (3.3) and BLCR? (I think BLCR may be my show stopper for HA.) I am currently evaluating HA with torque, and what I wish to achieve is the ability to submit jobs from two nominated ‘head nodes’ for my cluster, in case someone is special enough to start a compile or even run a job that goes ‘wild’ and crashes my head node (this has happened). The machine comes back and jobs continue to run, but the node status gets into a bit of a mess when the box crashes, as I would expect (torque does handle it quite well).
I am currently testing the following on two Scientific Linux 5.5 x86_64 VMs to see how far I should go in my production environment. I would love to get this all working, and to help the torque devs if it does not work.
I have built each VM with BLCR and installed RPMs of 2.5.3 built with ‘--enable-blcr --enable-high-availability’. I then built Maui, started both after setting server_name to ‘server1,server2’, and set the following (when I started the second node all the configuration seemed to be there already, so I took that as a good thing):
qmgr -c "set server acl_hosts += c-test.ansto.gov.au"
qmgr -c "set server acl_hosts += f-test.ansto.gov.au"
qmgr -c "set server lock_file = /data1/var/run/pbs_server.lock"
qmgr -c "set server lock_file_check_time = 5"
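For reference, the settings above can be collected into a small shell helper. This is only a sketch: the function name ha_qmgr_commands is hypothetical, it assumes lock_file and lock_file_check_time are set as server attributes, and it prints the qmgr commands rather than running them so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: print (do not run) the qmgr commands for an HA pair.
# ha_qmgr_commands is a hypothetical helper, not part of TORQUE.
# Usage: ha_qmgr_commands LOCK_FILE HOST [HOST ...]
ha_qmgr_commands() {
    lock_file=$1
    shift
    # One acl_hosts entry per head node.
    for host in "$@"; do
        printf 'qmgr -c "set server acl_hosts += %s"\n' "$host"
    done
    # Assumption: both lock settings are server attributes.
    printf 'qmgr -c "set server lock_file = %s"\n' "$lock_file"
    printf 'qmgr -c "set server lock_file_check_time = 5"\n'
}

# Values taken from the message above.
ha_qmgr_commands /data1/var/run/pbs_server.lock \
    c-test.ansto.gov.au f-test.ansto.gov.au
```

Piping the output to sh on one head node would apply the settings; printing them first makes it easier to check the lock file path before anything changes.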
I have also mounted (and plan to mount on my test compute nodes) a common /var/spool/torque/checkpoint NFS mountpoint. I could not work out whether I also need a common /var/spool/torque/server_priv shared between the ‘head nodes’. Is that the case?
I have found a few bugs when building RPMs and installing with ‘--enable-blcr’, and I will log a bug report:
• Can’t start torque until I manually create /var/spool/torque/checkpoint
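Until that is fixed, the manual workaround can be scripted. A minimal sketch, assuming the default spool directory /var/spool/torque used above; the function name ensure_checkpoint_dir is hypothetical, and the ownership and permissions torque expects on the directory are not addressed here.

```shell
#!/bin/sh
# Sketch only: create the checkpoint directory if it is missing.
# ensure_checkpoint_dir is a hypothetical helper, not part of TORQUE.
# Usage: ensure_checkpoint_dir TORQUE_HOME
ensure_checkpoint_dir() {
    dir="$1/checkpoint"
    # mkdir -p succeeds if the directory already exists, so this is safe to re-run.
    [ -d "$dir" ] || mkdir -p "$dir"
}

# Example call (as root, before starting torque):
#   ensure_checkpoint_dir /var/spool/torque
```

If the checkpoint directory is an NFS mountpoint, the directory has to exist before the mount anyway, so this would run ahead of both the mount and the daemon start.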
Please let me know if I have done anything dumb or that could be done better / correctly.
Ta and Thanks for such an awesome free PBS system,
Cooper Ry Lees
HPC / UNIX Systems Administrator - Information Management Services (IMS)
Australian Nuclear Science and Technology Organisation
T +61 2 9717 3853
F +61 2 9717 9273
M +61 403 739 446
E cooper.lees at ansto.gov.au
www.ansto.gov.au < http://www.ansto.gov.au >