[torqueusers] Torque HA w/Maui + BLCR [SEC=UNCLASSIFIED]

LEES, Cooper crl at ansto.gov.au
Wed Dec 15 18:34:58 MST 2010


Hi All,

Update: I now have a shared /var/spool/torque, and torque running on each
test 'headnode' (let's call them node1 and node2) started with
'/usr/sbin/pbs_server -d /var/spool/torque --ha'. Logging seems to work fine
(i.e. both servers happily log to the one file) and it all seems to be good.

I have added a test node (borrowed from the cluster, with its mom updated to
2.5.3) and set its server_name file to
c-test.ansto.gov.au,f-test.ansto.gov.au (I also had to edit the $pbsserver
line of mom_priv/config, as it still pointed at my main production PBS
server) to get it online.
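For reference, this is roughly what I have ended up with on the mom
(hostnames are just my two test head nodes - I am assuming both servers
should be listed):

/var/spool/torque/server_name:
c-test.ansto.gov.au,f-test.ansto.gov.au

/var/spool/torque/mom_priv/config:
$pbsserver c-test.ansto.gov.au
$pbsserver f-test.ansto.gov.au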

But I have a few problems,

1) If the default node is up, I cannot submit a job from the second head
node:
[crl at node2 test-code]$ qsub test.pbs
qsub: Job rejected by all possible destinations

How can I make jobs submitted from node2 acceptable? This is an ACL issue, no
doubt - my guess at the fix is sketched after the second error below.

And if node1 is down:
[crl at node2 test-code]$ qsub test.pbs
Cannot connect to default server host 'c-test.ansto.gov.au' - check
pbs_server daemon.
qsub: cannot connect to server c-test.ansto.gov.au (errno=111) Connection
refused

How can I make the 'redundant' node be able to submit jobs when node1 is
down?
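
My best guess (untested) at what is missing: node2 needs to be allowed to
submit/manage on the server, and the submit hosts need both servers in their
server_name file so qsub can fall back when c-test is unreachable. Assuming
node2 is f-test.ansto.gov.au, something like:

qmgr -c 'set server acl_hosts += f-test.ansto.gov.au'
qmgr -c 'set server submit_hosts += f-test.ansto.gov.au'
qmgr -c 'set server managers += root@f-test.ansto.gov.au'

/var/spool/torque/server_name on both head nodes (same as on the mom):
c-test.ansto.gov.au,f-test.ansto.gov.au

Is that roughly the intended setup, or am I off track?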

2) If the main pbs node (node1) goes down, I can no longer talk to the server
with qstat, pbsnodes etc. from node2.
- How does torque do the 'failover'? I was hoping to be able to continue to
submit, qdel jobs etc. from the other node if one went down (my guess at a
workaround is below).
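
I assume that while node1 is down I can at least point the clients at the
second server explicitly, e.g.:

qstat @f-test.ansto.gov.au
qstat -B f-test.ansto.gov.au
pbsnodes -s f-test.ansto.gov.au -a

but that is hardly transparent failover, and I have not confirmed it works
here.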

Am I expecting too much here?
- I guess I could use a heartbeat cluster to share a common IP between the
two head nodes and get all the compute nodes to connect to that shared IP if
this is a torque limitation ... i.e. run an active/passive cluster - but I
thought --ha was active/active.
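
If I did go the shared-IP route, I assume the whole thing collapses to every
mom and submit host just naming the virtual host (made-up name below):

/var/spool/torque/server_name:
pbs-vip.ansto.gov.au

mom_priv/config:
$pbsserver pbs-vip.ansto.gov.au

I would rather avoid that if --ha is supposed to cover it.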

I tried double checking the doco but it does not exactly spell out what I
get from starting the daemon with --ha (reference:
http://www.clusterresources.com/torquedocs21/4.2high-availability.shtml)

The man page for pbs_server does not even mention --ha ... ( pbs_server [-a
active] [-d config_path] [-p port] [-A acctfile] [-L logfile] [-M mom_port]
[-R momRPP_port] [-S scheduler_port] [-H hostname] [-t type])

Ta,
--
Cooper Ry Lees
HPC / UNIX Systems Administrator - Information Technology Services (ITS)
Australian Nuclear Science and Technology Organisation
T  +61 2 9717 3853
F  +61 2 9717 9273
M  +61 403 739 446
E  cooper.lees at ansto.gov.au
www.ansto.gov.au <http://www.ansto.gov.au>

Important: This transmission is intended only for the use of the addressee.
It is confidential and may contain privileged information or copyright
material. If you are not the intended recipient, any use or further
disclosure of this communication is strictly forbidden. If you have received
this transmission in error, please notify me immediately by telephone and
delete all copies of this transmission as well as any attachments.


On 15/12/10 11:06 AM, "Ken Nielson" <knielson at adaptivecomputing.com> wrote:

> On 12/14/2010 04:59 PM, LEES, Cooper wrote:
>> Hi Guys,
>> 
>> Thanks for your quick responses.
>> 
>> David:
>> - /data1 is an NFS volume shared across all my nodes
>> - I have also created /var/spool/torque/checkpoint and mounted an NFS volume
>> there so it's accessible from all nodes and headnodes ...
>> - It seems that the HA is working fine - conf is syncing etc.
>> 
>> I was just confused as to whether Torque versions >2.4 still need the same
>> /var/spool/torque/server_priv file system, or whether they can write to
>> local /var's on each machine ...
>> 
>>    
> For now TORQUE HA must share a common file system. But something to
> think about. How to do HA without a common file system.  Hmm.
> 
> Ken
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


