[torqueusers] multi server config

etienne gondet etienne.gondet at mercator-ocean.fr
Mon Aug 8 03:09:03 MDT 2005


Guillaume,

    It works fine on mine by default, and also when adding:
#PBS -eo
#PBS -o machine1.domain.country:/home/login/tomachine3.output

But my configuration differs from yours in two ways:

torque is configured with rsh on machine1
but with scp on machines 2 and 3,

and I don't have a firewall between machine1 and machine2.

            Etienne.   
            GIP MERCATOR-Ocean.


Guillaume ALLEON wrote:

> Everything is fine now. My only remaining problem is to properly handle
> the return of $JOBID.ER & $JOBID.OU from a mom node with
> a non-routable address to a public host different from the master.
>
> Do you have an idea how to solve this in a *clever* manner?
>
> Guillaume
>
>
> etienne gondet wrote:
>
>>
>> I recently tried the following to submit from a front node of cluster1
>> to cluster2.
>>
>> On cluster1, in qmgr:
>> create queue cluster2
>> set queue cluster2 queue_type = Route
>> set queue cluster2 route_destinations = feed@cluster2.mercator-ocean.fr
>> set queue cluster2 enabled = True
>> set queue cluster2 started = True
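The same five directives can be generated for any queue and destination. The sketch below is only an illustration: the function name `route_queue_cmds` is made up here, and the destination is written with the `@` that the list archive renders as " at ". It only prints the directives, so they can be reviewed before being fed to qmgr.

```shell
# route_queue_cmds QUEUE DEST
# Print the qmgr directives that define QUEUE as a routing queue
# forwarding to DEST (written as queue@server). Nothing is executed.
route_queue_cmds() {
    queue=$1
    dest=$2
    cat <<EOF
create queue $queue
set queue $queue queue_type = Route
set queue $queue route_destinations = $dest
set queue $queue enabled = True
set queue $queue started = True
EOF
}

# To apply after review, on the submitting server, with manager rights:
#   route_queue_cmds cluster2 feed@cluster2.mercator-ocean.fr | qmgr
```

qmgr reads directives from standard input, so piping the output applies them in order.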
>>
>> It doesn't work if cluster1 still has torque1.2.0pX with X < 4,
>> because the following patch was only included in torque1.2.0p4:
>>
>> http://www.supercluster.org/pipermail/torqueusers/2005-April/001567.html
>>
>> And I guess the supercluster staff forgot to mention in the changelog
>> that this patch was included.
>>
>>            Etienne Gondet
>>            GIP MERCATOR-Ocean.
>>
>>
>> Guillaume ALLEON wrote:
>>
>>> Hi,
>>>
>>> I have three machines running:
>>> machine1: pbs_server
>>> machine2: pbs_server & pbs_sched
>>> machine3: pbs_mom
>>>
>>> The machines do not share any filesystem.
>>> machine2 has two NICs, set up so that machine3 does not see machine1.
>>> machine1 and machine2 are separated by a firewall.
>>> I use torque with scp.
>>> machine2 and machine3 already work fine together. I just want to
>>> submit from another server over a WAN.
>>>
>>> machine1 hosts a default queue that routes to another routing
>>> default queue on machine2.
>>>
>>> From machine1, the command:
>>>  echo hostname | qsub -q default@machine2
>>> works fine, except that the stderr & stdout files are not sent back
>>> to machine1, because machine3 cannot scp to machine1 (they don't see
>>> each other).
>>>
>>> Would it be possible to copy the files back from the mom node to the
>>> server and tell the server to push them on to machine1? Something like:
>>>     scp -r /var/local/torque/spool/xx.machine2.OU uid@machine2:/tempspace/
>>>     ssh uid@machine2 "scp -r /tempspace/xx.machine2.OU uid@machine1:/home/uid/"
>>> but handled directly by Torque.
>>> I don't want the internal nodes to see anything outside the cluster
>>> except the front-end node. Perhaps there is another way to solve the
>>> problem ... I can't be the first to run into this issue ...
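Until Torque handles this itself, the two-hop copy described above can at least be scripted. This is only a sketch of the manual relay, not a Torque feature: the function name `relay_cmds` and its argument order are invented for illustration, and the paths and hosts in the usage line are the placeholders from the message. It prints the two commands rather than running them, so nothing is copied until the output is passed to a shell.

```shell
# relay_cmds FILE USER FRONT STAGE_DIR DEST_HOST DEST_DIR
# Print the two copy commands for a relay through the front-end node:
#   1. compute node -> staging directory on FRONT
#   2. FRONT        -> DEST_DIR on DEST_HOST
# Assumes paths and host names contain no spaces or shell metacharacters.
relay_cmds() {
    file=$1
    user=$2
    front=$3
    stage=$4
    dest_host=$5
    dest_dir=$6
    name=${file##*/}    # file name without its directory part
    printf 'scp %s %s@%s:%s/\n' "$file" "$user" "$front" "$stage"
    printf 'ssh %s@%s "scp %s/%s %s@%s:%s/"\n' \
        "$user" "$front" "$stage" "$name" "$user" "$dest_host" "$dest_dir"
}

# Example, run on the compute node (append "| sh" to execute the commands):
#   relay_cmds /var/local/torque/spool/xx.machine2.OU uid machine2 /tempspace machine1 /home/uid
```

The second hop runs on the front-end node, so only that node needs to reach the outside host, which matches the constraint stated above.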
>>>
>>> The other problem is that, from machine1, the command:
>>>  echo hostname | qsub
>>> does not reach the pbs_server on machine2. It is rejected by all
>>> destinations. Usually I get this when /etc/hosts.equiv is not
>>> properly set up, but that is not the case here.
>>>
>>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type AuthenticateUser request received from alleon@v810-su, sock=10
>>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request AuthenticateUser on sd=10
>>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type QueueJob request received from alleon@v810-su, sock=9
>>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request QueueJob on sd=9
>>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type JobScript request received from alleon@v810-su, sock=9
>>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request JobScript on sd=9
>>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type ReadyToCommit request received from alleon@v810-su, sock=9
>>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request ReadyToCommit on sd=9
>>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type Commit request received from alleon@v810-su, sock=9
>>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request Commit on sd=9
>>> 08/04/2005 16:26:45;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from TRANSIT to QUEUED-QUEUED (1-10)
>>> 08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;enqueuing into default, state 1 hop 1
>>> 08/04/2005 16:26:45;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from QUEUED to TRANSIT-TRNOUT (0-2)
>>> 08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;dequeuing from default, state TRANSIT
>>> 08/04/2005 16:26:45;0008;PBS_Server;Job;53.v810-su;Job Queued at request of alleon@v810-su, owner = alleon@v810-su, job name= STDIN, queue = default
>>> 08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from TRANSIT to QUEUED-QUEUED (1-10)
>>> 08/04/2005 16:26:47;0008;PBS_Server;Job;53.v810-su;Job rejected by all possible destinations
>>> 08/04/2005 16:26:47;000d;PBS_Server;Job;53.v810-su;sending 'a' mail for job 53.v810-su to alleon@v810-su (Job rejected by all possible destinations)
>>> 08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from QUEUED to EXITING-SUBSTATE55 (5-54)
>>> 08/04/2005 16:26:47;0100;PBS_Server;Job;53.v810-su;dequeuing from default, state EXITING
>>> 08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - p
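Most of these records are request bookkeeping; the ones that matter for a rejected job are those naming the job itself. A small filter over the server log file, where each record sits on a single line with fields separated by `;`, can pull them out. This is a sketch: the function name `job_trace` is made up, and it assumes the six-field layout seen in the log above (timestamp, event code, server, object type, object name, message).

```shell
# job_trace JOBID
# Read pbs_server log records on stdin and print "timestamp | message"
# for every record that mentions JOBID anywhere on the line.
# Note: this is a plain substring match, so "53.host" also matches "153.host".
job_trace() {
    awk -F';' -v id="$1" '
        index($0, id) {
            msg = $6
            # Re-join the message in case it contained ";" itself.
            for (i = 7; i <= NF; i++) msg = msg ";" $i
            print $1 " | " msg
        }'
}

# Example (LOGFILE is a placeholder for the server log of that day):
#   job_trace 53.v810-su < LOGFILE
```

Run on the log above, this would keep the state changes, the enqueue and dequeue records and the "Job rejected by all possible destinations" record, and drop the request dispatch noise.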
>>>
>>> My machine1 server configuration is the following:
>>>
>>> #
>>> # Create queues and set their attributes.
>>> #
>>> #
>>> # Create and define queue default
>>> #
>>> create queue default
>>> set queue default queue_type = Route
>>> set queue default max_running = 45
>>> set queue default route_destinations = default@hal
>>> set queue default enabled = True
>>> set queue default started = True
>>> #
>>> # Set server attributes.
>>> #
>>> set server scheduling = True
>>> set server max_user_run = 5
>>> set server default_queue = default
>>> set server log_events = 511
>>> set server mail_from = adm
>>> set server query_other_jobs = True
>>> set server scheduler_iteration = 600
>>> set server node_ping_rate = 300
>>> set server node_check_rate = 600
>>> set server tcp_timeout = 6
>>> set server job_stat_rate = 30
>>> set server log_level = 7
>>>
>>> Any idea why the job is rejected in that case?
>>>
>>> Guillaume
>>>
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers