[torqueusers] Torque HA w/Maui + BLCR [SEC=UNCLASSIFIED]

Al Taufer ataufer at adaptivecomputing.com
Mon Dec 20 16:33:05 MST 2010


Do both of your headnodes (pbs_server nodes) show up as a submit_host in qmgr? I think they both need to be listed there.
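
For example, with your hostnames - untested, so adjust as needed:

  qmgr -c 'set server submit_hosts = c-test.ansto.gov.au'
  qmgr -c 'set server submit_hosts += f-test.ansto.gov.au'

A plain "qmgr -c 'print server'" will show what is currently set.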

Al Taufer

----- Original Message -----
> Hi All,
> 
> Update: I now have a shared /var/spool/torque, and torque is running
> on each test 'headnode' (let's call them node1 and node2) via
> '/usr/sbin/pbs_server -d /var/spool/torque --ha'. Logging seems to
> work fine (i.e. both servers happily log to the one file), and it all
> seems to be good.
> 
> I have added a test node (borrowed from the cluster; I updated its
> mom to 2.5.3) and made its server_name file =
> c-test.ansto.gov.au,f-test.ansto.gov.au (I also had to edit the
> $pbsserver line of mom_priv/config, as it had my main production PBS
> Server in there) to get it online.
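> 
> For reference, the relevant files now look roughly like this:
> 
>   /var/spool/torque/server_name:
>     c-test.ansto.gov.au,f-test.ansto.gov.au
> 
>   /var/spool/torque/mom_priv/config:
>     $pbsserver c-test.ansto.gov.au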
> 
> But I have a few problems:
> 
> 1) If the default node is up, I cannot submit a job from the second
> headnode:
> [crl at node2 test-code]$ qsub test.pbs
> qsub: Job rejected by all possible destinations
> 
> How can I make jobs acceptable from node2? No doubt this is an ACL
> issue?
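> 
> (My guess at where to look - untested:
>   qmgr -c 'print server' | grep -i acl
> and perhaps turning off acl_host_enable if it is set?)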
> 
> And if node1 is down:
> [crl at node2 test-code]$ qsub test.pbs
> Cannot connect to default server host 'c-test.ansto.gov.au' - check
> pbs_server daemon.
> qsub: cannot connect to server c-test.ansto.gov.au (errno=111)
> Connection refused
> 
> How can I make the 'redundant' node able to submit jobs when node1
> is down?
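> 
> (I assume I could point the client at the surviving server
> explicitly, e.g. something like:
>   PBS_DEFAULT=f-test.ansto.gov.au qsub test.pbs
> but I was hoping --ha would make the clients fail over by
> themselves.)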
> 
> 2) If the main pbs node (node1) goes down, I can no longer
> communicate via qstat, pbsnodes etc. on node2.
> - How does torque do the 'failover'? I was hoping to be able to
> continue to submit and qdel jobs etc. from the other node if one went
> down.
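> 
> (Again, I assume I could talk to the other server directly, e.g.
>   qstat @f-test.ansto.gov.au
> but I expected the failover to be transparent.)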
> 
> Am I expecting too much here?
> - I guess I could use a heartbeat cluster to share a common IP
> between the two headnodes and get all the nodes to connect to that
> shared IP if this is a torque limitation ... i.e. have an
> active/passive cluster - but I thought this was active/active.
> 
> I tried double-checking the doco, but it does not exactly spell out
> what you get from starting the daemon with --ha (reference:
> http://www.clusterresources.com/torquedocs21/4.2high-availability.shtml)
> 
> The man page for pbs_server does not even mention --ha ...
> 
>   pbs_server [-a active] [-d config_path] [-p port] [-A acctfile]
>              [-L logfile] [-M mom_port] [-R momRPP_port]
>              [-S scheduler_port] [-H hostname] [-t type]
> 
> Ta,
> --
> Cooper Ry Lees
> HPC / UNIX Systems Administrator - Information Technology Services
> (ITS)
> Australian Nuclear Science and Technology Organisation
> T +61 2 9717 3853
> F +61 2 9717 9273
> M +61 403 739 446
> E cooper.lees at ansto.gov.au
> www.ansto.gov.au
> 
> 
> On 15/12/10 11:06 AM, "Ken Nielson" <knielson at adaptivecomputing.com>
> wrote:
> 
> > On 12/14/2010 04:59 PM, LEES, Cooper wrote:
> >> Hi Guys,
> >>
> >> Thanks for your quick responses.
> >>
> >> David:
> >> - /data1 is an NFS volume shared across all my nodes
> >> - I have also created /var/spool/torque/checkpoint and mounted an
> >> NFS volume there so it's accessible from all nodes and headnodes ...
> >> - It seems that the HA is working fine - conf is syncing etc.
> >>
> >> I was just confused about whether Torque versions > 2.4 still
> >> need the same /var/spool/torque/server_priv file system, or
> >> whether they can write to local /var's on each machine ...
> >>
> >>
> > For now TORQUE HA must share a common file system. But it's
> > something to think about - how to do HA without a common file
> > system. Hmm.
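> >
> > (By a common file system I mean both pbs_server hosts mounting the
> > same TORQUE_HOME, e.g. an /etc/fstab entry like
> >   nfshost:/export/torque  /var/spool/torque  nfs  defaults  0 0
> > on each headnode - the hostname and export path here are just
> > placeholders.)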
> >
> > Ken
> >
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

