[torqueusers] multi server config

Guillaume ALLEON guillaume.alleon at laposte.net
Wed Aug 10 03:15:13 MDT 2005


Stewart,

Thanks,
I had a look at the thread you mentioned and found your mail describing 
the different steps quite useful.

Concerning the first issue, I confirm that on machine1 I do not have 
any $PBS_HOME/server_priv/nodes file, so I think it is not necessary to 
put machine2 there. But I do have the names in my hosts.equiv file. I 
have not tried the torque.cfg file yet to avoid this. By the way, do you 
know where it should be located? Is it in $PBS_HOME/ ?


Second issue: I am stuck on this IPTABLES problem since I am not 
familiar with the tool. Posting in Linux configuration forums did not 
help much either, since I only want to allow scp from machine3 to 
machine1 through machine2, and not from machine1 to machine3.
I came up with a rule like:
  iptables -t nat -A PREROUTING -p tcp -m tcp -i eth1 --dport 22 -j DNAT \
      --to-destination machine1:22
Do you think it can solve my problem?
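
To make the question concrete, here is a slightly fuller sketch of what I 
imagine would be needed on machine2. It is untested. The interface names 
(eth1 facing machine3, eth0 facing machine1) and the address 192.0.2.10 
standing in for machine1 are my assumptions. As far as I understand, 
--to-destination wants a numeric address rather than a host name. The 
script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set (applying needs root).
DRY_RUN="${DRY_RUN:-1}"
IN_IF="eth1"               # assumed: NIC on machine2 facing machine3
OUT_IF="eth0"              # assumed: NIC on machine2 facing machine1
MACHINE1_IP="192.0.2.10"   # placeholder address standing in for machine1

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

setup_scp_tunnel() {
    # Let the kernel forward packets between the two NICs.
    run sysctl -w net.ipv4.ip_forward=1
    # Redirect SSH connections arriving on the cluster side to machine1.
    # Caveat: this also captures SSH aimed at machine2 itself on that NIC.
    run iptables -t nat -A PREROUTING -i "$IN_IF" -p tcp --dport 22 \
        -j DNAT --to-destination "$MACHINE1_IP:22"
    # Rewrite the source address so machine1 sends its replies via machine2.
    run iptables -t nat -A POSTROUTING -o "$OUT_IF" -p tcp \
        -d "$MACHINE1_IP" --dport 22 -j MASQUERADE
    # Accept new connections only in the machine3 -> machine1 direction.
    run iptables -A FORWARD -i "$IN_IF" -o "$OUT_IF" -p tcp \
        -d "$MACHINE1_IP" --dport 22 -m state --state NEW,ESTABLISHED -j ACCEPT
    # In the other direction, accept only packets of established connections.
    run iptables -A FORWARD -i "$OUT_IF" -o "$IN_IF" -p tcp \
        -s "$MACHINE1_IP" --sport 22 -m state --state ESTABLISHED -j ACCEPT
}

setup_scp_tunnel
```

With this, machine3 would presumably also have to resolve the name 
machine1 to machine2's internal address (for example in /etc/hosts) so 
that its scp actually arrives on eth1. Since only the machine3 side is 
allowed to open new connections, machine1 should not be able to reach 
machine3.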

Guillaume

Stewart.Samuels at sanofi-aventis.com wrote:

>Guillaume,
>
>See the threads posted to the torqueusers group under the subject line "problem with: set queue cfq route_destinations=cfq at other.host".  Note that routing from machine1 to machine2 should now be doable without placing machine2 in machine1's $PBS_HOME/server_priv/nodes file if you are running torque-1.2.0p4 or later.
>
>Regarding your second issue, machine3 routing jobs back to machine1: an alternative is to provide tunneling through machine2 by using IPTABLES.  That is, allow machine2 to perform NAT (Network Address Translation) for packets from machine3 destined for machine1.
>
>	Stewart
>
>-----Original Message-----
>From: torqueusers-bounces at supercluster.org
>[mailto:torqueusers-bounces at supercluster.org]On Behalf Of Guillaume
>ALLEON
>Sent: Thursday, August 04, 2005 3:03 PM
>To: torqueusers at supercluster.org
>Subject: [torqueusers] multi server config
>
>
>Hi,
>
>I have three machines running:
>machine1: pbs_server
>machine2: pbs_server & pbs_sched
>machine3: pbs_mom
>
>The machines do not share any filesystem.
>machine2 has two NICs, set up so that machine3 does not see machine1.
>machine1 and machine2 are separated by a firewall.
>I use torque with scp.
>The setup between machine2 & machine3 is working fine. I just want to 
>submit from another server over a WAN.
>
>machine1 hosts a default queue that routes to another default routing 
>queue on machine2.
>
>From machine1, the command:
>  echo hostname | qsub -q default@machine2
>works fine, except that the stderr & stdout files are not sent back to 
>machine1 because machine3 cannot scp to machine1 (they don't see each 
>other).
>
>Would it be possible to copy the files back from the mom node to the 
>server and tell the server to push them back to machine1? Something like:
>     scp -r /var/local/torque/spool/xx.machine2.OU uid@machine2:/tempspace/
>     ssh uid@machine2 "scp -r /tempspace/xx.machine2.OU uid@machine1:/home/uid/"
>but handled directly by Torque.
>I don't want the internal nodes to see anything outside the cluster except 
>the front-end node. Perhaps there is another way to solve the problem ... 
>I cannot be the first to run into this issue ...
>
>The other problem is that, from machine1, the command:
>  echo hostname | qsub
>does not reach the pbs_server on machine2. It is rejected by all 
>destinations. Usually I get this when /etc/hosts.equiv is not properly 
>set up, but this is not the case here.
>
>08/04/2005 16:26:45;0100;PBS_Server;Req;;Type AuthenticateUser request received from alleon@v810-su, sock=10
>08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request AuthenticateUser on sd=10
>08/04/2005 16:26:45;0100;PBS_Server;Req;;Type QueueJob request received from alleon@v810-su, sock=9
>08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request QueueJob on sd=9
>08/04/2005 16:26:45;0100;PBS_Server;Req;;Type JobScript request received from alleon@v810-su, sock=9
>08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request JobScript on sd=9
>08/04/2005 16:26:45;0100;PBS_Server;Req;;Type ReadyToCommit request received from alleon@v810-su, sock=9
>08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request ReadyToCommit on sd=9
>08/04/2005 16:26:45;0100;PBS_Server;Req;;Type Commit request received from alleon@v810-su, sock=9
>08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request Commit on sd=9
>08/04/2005 16:26:45;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from TRANSIT to QUEUED-QUEUED (1-10)
>08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;enqueuing into default, state 1 hop 1
>08/04/2005 16:26:45;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from QUEUED to TRANSIT-TRNOUT (0-2)
>08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;dequeuing from default, state TRANSIT
>08/04/2005 16:26:45;0008;PBS_Server;Job;53.v810-su;Job Queued at request of alleon@v810-su, owner = alleon@v810-su, job name= STDIN, queue = default
>08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from TRANSIT to QUEUED-QUEUED (1-10)
>08/04/2005 16:26:47;0008;PBS_Server;Job;53.v810-su;Job rejected by all possible destinations
>08/04/2005 16:26:47;000d;PBS_Server;Job;53.v810-su;sending 'a' mail for job 53.v810-su to alleon@v810-su (Job rejected by all possible destinations)
>08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from QUEUED to EXITING-SUBSTATE55 (5-54)
>08/04/2005 16:26:47;0100;PBS_Server;Job;53.v810-su;dequeuing from default, state EXITING
>08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - p
>
>My machine1 server configuration is the following:
>
>#
># Create queues and set their attributes.
>#
>#
># Create and define queue default
>#
>create queue default
>set queue default queue_type = Route
>set queue default max_running = 45
>set queue default route_destinations = default@hal
>set queue default enabled = True
>set queue default started = True
>#
># Set server attributes.
>#
>set server scheduling = True
>set server max_user_run = 5
>set server default_queue = default
>set server log_events = 511
>set server mail_from = adm
>set server query_other_jobs = True
>set server scheduler_iteration = 600
>set server node_ping_rate = 300
>set server node_check_rate = 600
>set server tcp_timeout = 6
>set server job_stat_rate = 30
>set server log_level = 7
>
>Any idea why the job is rejected in that case?
>
>Guillaume
>
>
>
>  
>


