Hi All,

Update: I now have a shared /var/spool/torque, and Torque is running happily
on each test 'headnode' (let's call them node1 and node2), started with
'/usr/sbin/pbs_server -d /var/spool/torque --ha'. Logging seems to work fine
(i.e. both servers log to the one file) and it all seems to be good.

I have added a test node (borrowed from the cluster, with its mom updated to
2.5.3) and set its server_name file to
c-test.ansto.gov.au,f-test.ansto.gov.au. To get it online I also had to edit
the $pbsserver line of mom_priv/config, as it had my main production PBS
server in there.
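
For anyone following along, the node-side change described above can be sketched roughly as below. This is only a sketch: the function name write_ha_client_config and the scratch directory are made up for illustration, and it assumes pbs_mom accepts one $pbsserver line per head node. It also truncates mom_priv/config, so point it at a scratch copy and merge the result by hand rather than running it against the live /var/spool/torque.

```shell
#!/bin/sh
# Sketch only: write the two node-side files that name the head nodes.
# write_ha_client_config DIR SERVER [SERVER...]
#   DIR is a scratch copy of the TORQUE home, not the live /var/spool/torque.
write_ha_client_config() {
    dir=$1; shift
    mkdir -p "$dir/mom_priv" || return 1
    # server_name: every head node on one line, separated by commas.
    (IFS=,; printf '%s\n' "$*") > "$dir/server_name"
    # mom_priv/config: one $pbsserver line per head node (assumption).
    # Note: this truncates the file, so any other settings are lost.
    : > "$dir/mom_priv/config"
    for server in "$@"; do
        printf '$pbsserver %s\n' "$server" >> "$dir/mom_priv/config"
    done
}

# Example (hypothetical scratch path):
#   write_ha_client_config /tmp/torque-test \
#       c-test.ansto.gov.au f-test.ansto.gov.au
```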

But I have a few problems:

1) If the default node is up, I cannot submit a job from the second head node:
[crl@node2 test-code]$ qsub test.pbs
qsub: Job rejected by all possible destinations

How can I get the jobs accepted? I would not doubt this is an ACL issue.

And if node1 is down:
[crl@node2 test-code]$ qsub test.pbs
Cannot connect to default server host 'c-test.ansto.gov.au' - check
pbs_server daemon.
qsub: cannot connect to server c-test.ansto.gov.au (errno=111) Connection

How can I make the 'redundant' node able to submit jobs when node1 is down?

2) If the main PBS node (node1) goes down, I can no longer communicate with
the server using qstat, pbsnodes etc. on node2.
- How does Torque do the 'failover'? I was hoping to be able to continue to
submit jobs, qdel jobs etc. from the other node if one went down.
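
To make the behaviour I was hoping for concrete, here is a rough client-side sketch: walk the comma-separated head node list and use the first one that answers. It is a workaround idea, not how Torque itself does failover. The names first_live_server and probe_pbs are made up for illustration, and using 'qstat -B <host>' as the liveness probe is an assumption on my part.

```shell
#!/bin/sh
# Sketch only: pick the first head node that answers a probe.
# first_live_server PROBE SERVER_LIST
#   PROBE        command run as "PROBE <host>"; exit status 0 means "up"
#   SERVER_LIST  comma-separated hosts, same format as the server_name file
first_live_server() {
    probe=$1
    # Split the comma-separated list into positional parameters.
    old_ifs=$IFS; IFS=,
    set -- $2
    IFS=$old_ifs
    for host in "$@"; do
        if "$probe" "$host" >/dev/null 2>&1; then
            printf '%s\n' "$host"
            return 0
        fi
    done
    # No head node answered.
    return 1
}

# Assumed probe: ask the pbs_server on that host for its status.
probe_pbs() { qstat -B "$1"; }

# Example:
#   server=$(first_live_server probe_pbs "$(cat /var/spool/torque/server_name)") \
#       && qsub -q "@$server" test.pbs
```

The probe is passed in as an argument so the selection logic can be tried out with a fake probe before pointing it at real head nodes.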

Am I expecting too much here?
- If this is a Torque limitation, I guess I could use a heartbeat cluster to
share a common IP between the two head nodes and get all the nodes to connect
to that shared IP, i.e. have an active/passive cluster. But I thought this
was active/active.

I tried double checking the doco but it does not exactly spell out what I
get from starting the daemon with --ha (reference:

The man page for pbs_server does not even mention --ha ... ( pbs_server [-a
active] [-d config_path] [-p port] [-A acctfile] [-L logfile] [-M mom_port]
[-R momRPP_port] [-S scheduler_port] [-H hostname] [-t type])

On 15/12/10 11:06 AM, "Ken Nielson" <knielson at adaptivecomputing.com> wrote:

> On 12/14/2010 04:59 PM, LEES, Cooper wrote:
>> Hi Guys,
>> Thanks for your quick responses.
>> David:
>> - /data1 is an NFS volume shared across all my nodes
>> - I have also created /var/spool/torque/checkpoint and mounted an NFS volume
>> there so it's accessible from all nodes and headnodes ...
>> - It seems that the HA is working fine - conf is syncing etc.
>> I was just confused about whether Torque versions > 2.4 still need the same
>> /var/spool/torque/server_priv file system or can write to a local /var
>> on each machine ...
> For now TORQUE HA must share a common file system. But it is something to
> think about: how to do HA without a common file system. Hmm.
> Ken
