[torqueusers] multi server config

Guillaume ALLEON guillaume.alleon at laposte.net
Fri Aug 5 13:13:45 MDT 2005


Everything works now. My only remaining problem is how to properly handle
the return of $JOBID.ER and $JOBID.OU from a mom node with
a non-routable address to a public host other than the master.
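
For reference, the manual two-hop copy from my first message (quoted below)
can be sketched as a small dry-run shell function. This is only a sketch under
assumptions: the function name relay_output is hypothetical, /tempspace and the
uid account come from my earlier example, and it only prints the two commands
instead of running them. It is not something Torque does by itself.

```shell
#!/bin/sh
# Dry-run sketch: print the two copy hops instead of executing them.
# Usage: relay_output FILE GATEWAY DEST_HOST DEST_DIR [USER]
# (relay_output is a hypothetical helper, not part of Torque.)
relay_output() {
    file=$1
    gateway=$2
    dest_host=$3
    dest_dir=$4
    user=${5:-uid}            # placeholder account name from the example
    stage=/tempspace          # staging directory on the gateway (assumed)
    base=$(basename "$file")

    # Hop 1: mom node -> gateway (the only host the mom node can reach)
    echo "scp $file $user@$gateway:$stage/"
    # Hop 2: gateway -> submission host, run from the gateway
    echo "ssh $user@$gateway scp $stage/$base $user@$dest_host:$dest_dir/"
}
```

Called as relay_output /var/local/torque/spool/xx.machine2.OU machine2 machine1 /home/uid,
it prints one scp line and one ssh line that could then be reviewed before being run by hand.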

Do you have an idea how to solve this in a *clever* manner?

Guillaume


etienne gondet wrote:

>
> I recently tried the following to submit from a front node of cluster1 to
> cluster2.
>
> From cluster1, in qmgr:
> create queue cluster2
> set queue cluster2 queue_type = Route
> set queue cluster2 route_destinations = feed at cluster2.mercator-ocean.fr
> set queue cluster2 enabled = True
> set queue cluster2 started = True
>
> It doesn't work if cluster1 still has torque1.2.0pX with X < 4,
> because the following patch was only included in torque1.2.0p4:
>
> http://www.supercluster.org/pipermail/torqueusers/2005-April/001567.html
>
> And I guess the supercluster staff forgot to mention in the changelog
> that this patch was included.
>
>            Etienne Gondet
>            GIP MERCATOR-Ocean.
>
>
> Guillaume ALLEON wrote:
>
>> Hi,
>>
>> I have three machines running:
>> machine1: pbs_server
>> machine2: pbs_server & pbs_sched
>> machine3: pbs_mom
>>
>> The machines do not share any filesystem.
>> machine2 has two NICs, set up so that machine3 does not see machine1.
>> machine1 and machine2 are separated by a firewall.
>> I use torque with scp.
>> The setup is done in such a way that machine2 & machine3 work fine
>> together. I just want to submit from another server over a WAN.
>>
>> machine1 hosts a default queue that routes to another default routing
>> queue on machine2.
>>
>> From machine1, the command:
>>  echo hostname | qsub default at machine2
>> works fine, except that the stderr & stdout files are not sent back
>> to machine1 because machine3 cannot scp to machine1 (they don't see
>> each other).
>>
>> Would it be possible to copy the files back from the mom node to the
>> server and tell the server to push them on to machine1? Something like:
>>     scp -r /var/local/torque/spool/xx.machine2.OU uid at machine2:/tempspace/
>>     ssh uid at machine2 "scp -r /tempspace/xx.machine2.OU uid at machine1:/home/uid/"
>> but handled directly by Torque.
>> I don't want the internal nodes to see anything outside the cluster
>> except the front-end node. Perhaps there is another way to solve the
>> problem... I can't be the first to run into this issue.
>>
>> The other problem is that from machine1, the command:
>>  echo hostname | qsub
>> does not reach the pbs_server on machine2. It is rejected by all
>> destinations. Usually I get this when /etc/hosts.equiv is not properly
>> set up, but that is not the case here.
>>
>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type AuthenticateUser request received from alleon at v810-su, sock=10
>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request AuthenticateUser on sd=10
>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type QueueJob request received from alleon at v810-su, sock=9
>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request QueueJob on sd=9
>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type JobScript request received from alleon at v810-su, sock=9
>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request JobScript on sd=9
>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type ReadyToCommit request received from alleon at v810-su, sock=9
>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request ReadyToCommit on sd=9
>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type Commit request received from alleon at v810-su, sock=9
>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request Commit on sd=9
>> 08/04/2005 16:26:45;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from TRANSIT to QUEUED-QUEUED (1-10)
>> 08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;enqueuing into default, state 1 hop 1
>> 08/04/2005 16:26:45;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from QUEUED to TRANSIT-TRNOUT (0-2)
>> 08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;dequeuing from default, state TRANSIT
>> 08/04/2005 16:26:45;0008;PBS_Server;Job;53.v810-su;Job Queued at request of alleon at v810-su, owner = alleon at v810-su, job name= STDIN, queue = default
>> 08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from TRANSIT to QUEUED-QUEUED (1-10)
>> 08/04/2005 16:26:47;0008;PBS_Server;Job;53.v810-su;Job rejected by all possible destinations
>> 08/04/2005 16:26:47;000d;PBS_Server;Job;53.v810-su;sending 'a' mail for job 53.v810-su to alleon at v810-su (Job rejected by all possible destinations)
>> 08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from QUEUED to EXITING-SUBSTATE55 (5-54)
>> 08/04/2005 16:26:47;0100;PBS_Server;Job;53.v810-su;dequeuing from default, state EXITING
>> 08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - p
>>
>> My machine1 server configuration is the following:
>>
>> #
>> # Create queues and set their attributes.
>> #
>> #
>> # Create and define queue default
>> #
>> create queue default
>> set queue default queue_type = Route
>> set queue default max_running = 45
>> set queue default route_destinations = default at hal
>> set queue default enabled = True
>> set queue default started = True
>> #
>> # Set server attributes.
>> #
>> set server scheduling = True
>> set server max_user_run = 5
>> set server default_queue = default
>> set server log_events = 511
>> set server mail_from = adm
>> set server query_other_jobs = True
>> set server scheduler_iteration = 600
>> set server node_ping_rate = 300
>> set server node_check_rate = 600
>> set server tcp_timeout = 6
>> set server job_stat_rate = 30
>> set server log_level = 7
>>
>> Any idea why the job is rejected in that case?
>>
>> Guillaume
>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers


