[torqueusers] multi server config

Guillaume ALLEON guillaume.alleon at laposte.net
Mon Aug 8 10:30:08 MDT 2005


Etienne,

Submission now works fine. What still fails is the return of the ER & OU
files.
Can you confirm that scp works from your mom on machine 3 to your
machine 1? Do you have specific iptables rules on your machine 2?
Guillaume

PS: Maybe we can discuss this by phone or see each other in tlse.

etienne gondet wrote:

> Guillaume,
>
>    It works fine on my setup, either by default or with these directives added:
> #PBS -eo
> #PBS -o machine1.domain.country:/home/login/tomachine3.output
>
> My configuration differs from yours in two ways:
>
> torque is configured with rsh on machine1
> but with scp on machines 2 and 3,
>
> and I don't have a firewall between machine1 and machine2.
>
>            Etienne.              GIP MERCATOR-Ocean.
>
>
> Guillaume ALLEON wrote:
>
>> Everything works now. My only remaining problem is how to properly
>> handle the return of $JOBID.ER & $JOBID.OU from a mom node with
>> a non-routable address to a public host other than the master.
>>
>> Do you have an idea of how to solve this in a *clever* manner?
>>
>> Guillaume
>>
>>
>> etienne gondet wrote:
>>
>>>
>>> I recently tried the following to submit from a front node of cluster1
>>> to cluster2.
>>>
>>> From cluster1, in qmgr:
>>> create queue cluster2
>>> set queue cluster2 queue_type = Route
>>> set queue cluster2 route_destinations = feed@cluster2.mercator-ocean.fr
>>> set queue cluster2 enabled = True
>>> set queue cluster2 started = True
>>>
>>> It doesn't work if cluster1 still runs torque1.2.0pX with X < 4,
>>> because the following patch was only included in torque1.2.0p4:
>>>
>>> http://www.supercluster.org/pipermail/torqueusers/2005-April/001567.html
>>>
>>> And I guess the supercluster staff forgot to mention in the changelog
>>> that this patch was included.
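
For reference, the routing-queue setup quoted above can be scripted rather than typed into qmgr by hand. This is only a sketch: the function name route_queue_directives is made up for illustration, and it simply prints the same five directives for a queue name and a queue@server destination that you supply.

```shell
#!/bin/sh
# Print the qmgr directives for a routing queue (hypothetical helper).
#   $1 = local queue name
#   $2 = destination as queue@server
route_queue_directives() {
    queue=$1
    dest=$2
    printf 'create queue %s\n' "$queue"
    printf 'set queue %s queue_type = Route\n' "$queue"
    printf 'set queue %s route_destinations = %s\n' "$queue" "$dest"
    printf 'set queue %s enabled = True\n' "$queue"
    printf 'set queue %s started = True\n' "$queue"
}

# Print the directives for the example from this thread.
route_queue_directives cluster2 feed@cluster2.mercator-ocean.fr
```

After checking the output, it can be piped into qmgr on the submitting server, since qmgr reads directives from standard input.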
>>>
>>>            Etienne Gondet
>>>            GIP MERCATOR-Ocean.
>>>
>>>
>>> Guillaume ALLEON wrote:
>>>
>>>> Hi,
>>>>
>>>> I have three machines running:
>>>> machine1: pbs_server
>>>> machine2: pbs_server & pbs_sched
>>>> machine3: pbs_mom
>>>>
>>>> The machines do not share any filesystem.
>>>> machine2 has two NICs, set up so that machine3 does not see
>>>> machine1.
>>>> machine1 and machine2 are separated by a firewall.
>>>> I use torque with scp.
>>>> The setup between machine2 & machine3 works fine. I just want to
>>>> submit from another server over a WAN.
>>>>
>>>> machine1 hosts a default queue that routes to another routing
>>>> default queue on machine2.
>>>>
>>>> From machine1, the command:
>>>>  echo hostname | qsub default@machine2
>>>> works fine, except that the stderr & stdout files are not sent back
>>>> to machine1 because machine3 cannot scp to machine1 (they don't see
>>>> each other).
>>>>
>>>> Would it be possible to copy the files back from the mom node to the
>>>> server & tell the server to push them on to machine1? Something like:
>>>>     scp -r /var/local/torque/spool/xx.machine2.OU uid@machine2:/tempspace/
>>>>     ssh uid@machine2 "scp -r /tempspace/xx.machine2.OU uid@machine1:/home/uid/"
>>>> but handled directly by Torque.
>>>> I don't want the internal nodes to see anything outside the cluster
>>>> except the front node. Perhaps there is another way to solve the
>>>> problem ... I can't be the first to run into this issue ...
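
Until Torque handles this itself, the two-hop copy described above could be wrapped in a small helper. The sketch below is an assumption about how one might do it, not a Torque feature: relay_commands is a hypothetical name, and it only prints the two commands (dry run) so they can be reviewed before being run on the compute node.

```shell
#!/bin/sh
# Print the two copy commands of the relay:
# compute node -> front node -> submit host. Nothing is executed here.
#   $1 = spooled output file on the compute node
#   $2 = user@front-node (the only host the compute node can reach)
#   $3 = staging directory on the front node
#   $4 = final destination as user@host:/dir/
relay_commands() {
    file=$1
    front=$2
    staging=$3
    final=$4
    name=$(basename "$file")
    # Hop 1: from the compute node to the front node
    # (-r is not needed for a single file).
    printf 'scp %s %s:%s/\n' "$file" "$front" "$staging"
    # Hop 2: executed on the front node, which can see the submit host.
    printf 'ssh %s "scp %s/%s %s"\n' "$front" "$staging" "$name" "$final"
}

# The example from the message above.
relay_commands /var/local/torque/spool/xx.machine2.OU uid@machine2 /tempspace uid@machine1:/home/uid/
```

The printed commands are not quoted for paths containing spaces, and both hops assume passwordless ssh keys are already in place between the hosts.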
>>>>
>>>> The other problem is that, from machine1, the command:
>>>>  echo hostname | qsub
>>>> does not reach the pbs_server on machine2. The job is rejected by all
>>>> destinations. I usually get this when /etc/hosts.equiv is not set up
>>>> properly, but that is not the case here.
>>>>
>>>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type AuthenticateUser request received from alleon@v810-su, sock=10
>>>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request AuthenticateUser on sd=10
>>>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type QueueJob request received from alleon@v810-su, sock=9
>>>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request QueueJob on sd=9
>>>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type JobScript request received from alleon@v810-su, sock=9
>>>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request JobScript on sd=9
>>>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type ReadyToCommit request received from alleon@v810-su, sock=9
>>>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request ReadyToCommit on sd=9
>>>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type Commit request received from alleon@v810-su, sock=9
>>>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request Commit on sd=9
>>>> 08/04/2005 16:26:45;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from TRANSIT to QUEUED-QUEUED (1-10)
>>>> 08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;enqueuing into default, state 1 hop 1
>>>> 08/04/2005 16:26:45;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from QUEUED to TRANSIT-TRNOUT (0-2)
>>>> 08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;dequeuing from default, state TRANSIT
>>>> 08/04/2005 16:26:45;0008;PBS_Server;Job;53.v810-su;Job Queued at request of alleon@v810-su, owner = alleon@v810-su, job name= STDIN, queue = default
>>>> 08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from TRANSIT to QUEUED-QUEUED (1-10)
>>>> 08/04/2005 16:26:47;0008;PBS_Server;Job;53.v810-su;Job rejected by all possible destinations
>>>> 08/04/2005 16:26:47;000d;PBS_Server;Job;53.v810-su;sending 'a' mail for job 53.v810-su to alleon@v810-su (Job rejected by all possible destinations)
>>>> 08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from QUEUED to EXITING-SUBSTATE55 (5-54)
>>>> 08/04/2005 16:26:47;0100;PBS_Server;Job;53.v810-su;dequeuing from default, state EXITING
>>>> 08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - p
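
With log_level = 7 the per-request lines drown out the few that concern the job itself. A rough way to pull out one job's history is sketched below; job_history is a hypothetical helper, and it assumes the semicolon-separated layout seen in the log above (timestamp; event code; daemon; object class; object name; message), with one record per line, i.e. the real server log rather than the mail-wrapped copy.

```shell
#!/bin/sh
# Print timestamp and message for every log record about one job.
#   $1 = job id, e.g. 53.v810-su; the log is read from standard input.
job_history() {
    awk -F';' -v job="$1" '
        # Keep records logged against the job itself, plus server records
        # (such as svr_setjobstate) whose message names the job.
        ($4 == "Job" && $5 == job) || index($6, "job " job " ") > 0 {
            msg = $6
            # Re-join the message if it contained semicolons itself.
            for (i = 7; i <= NF; i++) msg = msg ";" $i
            print $1 "  " msg
        }'
}

# Three records taken from the log above; the Req line is filtered out.
job_history 53.v810-su <<'EOF'
08/04/2005 16:26:45;0100;PBS_Server;Req;;Type Commit request received from alleon@v810-su, sock=9
08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;enqueuing into default, state 1 hop 1
08/04/2005 16:26:47;0008;PBS_Server;Job;53.v810-su;Job rejected by all possible destinations
EOF
```

On a live server the same function would be fed the current server log file instead of the here-document.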
>>>>
>>>> My machine1 server configuration is the following:
>>>>
>>>> #
>>>> # Create queues and set their attributes.
>>>> #
>>>> #
>>>> # Create and define queue default
>>>> #
>>>> create queue default
>>>> set queue default queue_type = Route
>>>> set queue default max_running = 45
>>>> set queue default route_destinations = default@hal
>>>> set queue default enabled = True
>>>> set queue default started = True
>>>> #
>>>> # Set server attributes.
>>>> #
>>>> set server scheduling = True
>>>> set server max_user_run = 5
>>>> set server default_queue = default
>>>> set server log_events = 511
>>>> set server mail_from = adm
>>>> set server query_other_jobs = True
>>>> set server scheduler_iteration = 600
>>>> set server node_ping_rate = 300
>>>> set server node_check_rate = 600
>>>> set server tcp_timeout = 6
>>>> set server job_stat_rate = 30
>>>> set server log_level = 7
>>>>
>>>> Any idea why the job is rejected in that case?
>>>>
>>>> Guillaume
>>>>
>>>>
>>>> _______________________________________________
>>>> torqueusers mailing list
>>>> torqueusers at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torqueusers


