[torqueusers] multi server config
Guillaume ALLEON
guillaume.alleon at laposte.net
Fri Aug 5 13:13:45 MDT 2005
Everything works now. My only remaining problem is how to properly handle
the return of $JOBID.ER and $JOBID.OU from a mom node with
a non-routable address to a public host other than the master.
Do you have any idea how to solve this in a *clever* manner?
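
The relay described in the quoted message below (mom node to front node,
then front node to the submit host) can be sketched as a small shell
function. This is only an illustration of the two hops done by hand, not
something Torque provides: the function name relay_output, the RUN dry-run
variable and the argument order are all assumptions made up for this sketch.

```shell
#!/bin/sh
# Hypothetical helper, not part of Torque: relay one job output file
# mom node -> front node -> submit host, using plain scp and ssh.
#
# Arguments (all names are illustrative):
#   $1 file on the mom node     $2 remote user
#   $3 front node               $4 submit host
#   $5 staging dir on the front node
#   $6 destination dir on the submit host
# Set RUN=echo to print the commands instead of running them.
relay_output() {
    file=$1 user=$2 front=$3 submit=$4 stage=$5 dest=$6
    base=$(basename "$file")

    # Hop 1: the mom node can reach the front node directly.
    ${RUN:-} scp "$file" "${user}@${front}:${stage}/" || return 1

    # Hop 2: the front node pushes the staged copy to the submit host.
    ${RUN:-} ssh "${user}@${front}" "scp '${stage}/${base}' '${user}@${submit}:${dest}/'"
}
```

Running it with RUN=echo and the hosts and paths from the quoted message
prints the two commands without contacting any machine. The sketch assumes
key-based ssh between the hosts and paths without single quotes in them.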
Guillaume
etienne gondet wrote:
>
> I recently tried the following to submit from a front node of cluster1 to
> cluster2:
>
> from cluster1 : qmgr
> create queue cluster2
> set queue cluster2 queue_type = Route
> set queue cluster2 route_destinations = feed@cluster2.mercator-ocean.fr
> set queue cluster2 enabled = True
> set queue cluster2 started = True
>
> It does not work if cluster1 still runs torque1.2.0pX with X < 4,
> because the following patch was included in torque1.2.0p4:
>
> http://www.supercluster.org/pipermail/torqueusers/2005-April/001567.html
>
> And I guess the supercluster staff forgot to mention in the changelog that
> this patch was included.
>
> Etienne Gondet
> GIP MERCATOR-Ocean.
>
>
> Guillaume ALLEON wrote:
>
>> Hi,
>>
>> I have three machines running:
>> machine1: pbs_server
>> machine2: pbs_server & pbs_sched
>> machine3: pbs_mom
>>
>> The machines do not share any filesystem.
>> machine2 has two NICs, set up in such a way that machine3 does not see
>> machine1.
>> machine1 and machine2 are separated by a firewall.
>> I use torque with scp.
>> The setup is done in such a way that machine2 & machine3 work
>> fine together. I just want to
>> submit from another server over a WAN.
>>
>> machine1 hosts a default queue that routes to another routing default
>> queue on machine2.
>>
>> From machine1, the command:
>> echo hostname | qsub default@machine2
>> works fine, except that the stderr & stdout files are not sent back
>> to machine1 because
>> machine3 cannot scp to machine1 (they don't see each other).
>>
>> Would it be possible to copy the files back from the mom node to the
>> server and tell the server to push
>> them back to machine1? Something like:
>> scp -r /var/local/torque/spool/xx.machine2.OU uid@machine2:/tempspace/
>> ssh uid@machine2 "scp -r /tempspace/xx.machine2.OU uid@machine1:/home/uid/"
>> but handled directly by Torque.
>> I don't want the internal nodes to see anything outside the cluster
>> except the front node. Perhaps there
>> is another way to solve the problem ... I cannot be the first
>> to run into this issue ...
>>
>> The other problem is that, from machine1, the command:
>> echo hostname | qsub
>> does not reach the pbs_server on machine2. It is rejected by all
>> destinations. Usually I get this when
>> /etc/hosts.equiv is not properly set up, but this is not the case
>> here.
>>
>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type AuthenticateUser request received from alleon@v810-su, sock=10
>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request AuthenticateUser on sd=10
>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type QueueJob request received from alleon@v810-su, sock=9
>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request QueueJob on sd=9
>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type JobScript request received from alleon@v810-su, sock=9
>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request JobScript on sd=9
>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type ReadyToCommit request received from alleon@v810-su, sock=9
>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request ReadyToCommit on sd=9
>> 08/04/2005 16:26:45;0100;PBS_Server;Req;;Type Commit request received from alleon@v810-su, sock=9
>> 08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request Commit on sd=9
>> 08/04/2005 16:26:45;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from TRANSIT to QUEUED-QUEUED (1-10)
>> 08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;enqueuing into default, state 1 hop 1
>> 08/04/2005 16:26:45;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from QUEUED to TRANSIT-TRNOUT (0-2)
>> 08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;dequeuing from default, state TRANSIT
>> 08/04/2005 16:26:45;0008;PBS_Server;Job;53.v810-su;Job Queued at request of alleon@v810-su, owner = alleon@v810-su, job name= STDIN, queue = default
>> 08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from TRANSIT to QUEUED-QUEUED (1-10)
>> 08/04/2005 16:26:47;0008;PBS_Server;Job;53.v810-su;Job rejected by all possible destinations
>> 08/04/2005 16:26:47;000d;PBS_Server;Job;53.v810-su;sending 'a' mail for job 53.v810-su to alleon@v810-su (Job rejected by all possible destinations)
>> 08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from QUEUED to EXITING-SUBSTATE55 (5-54)
>> 08/04/2005 16:26:47;0100;PBS_Server;Job;53.v810-su;dequeuing from default, state EXITING
>> 08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - p
>>
>> My machine1 server configuration is the following:
>>
>> #
>> # Create queues and set their attributes.
>> #
>> #
>> # Create and define queue default
>> #
>> create queue default
>> set queue default queue_type = Route
>> set queue default max_running = 45
>> set queue default route_destinations = default@hal
>> set queue default enabled = True
>> set queue default started = True
>> #
>> # Set server attributes.
>> #
>> set server scheduling = True
>> set server max_user_run = 5
>> set server default_queue = default
>> set server log_events = 511
>> set server mail_from = adm
>> set server query_other_jobs = True
>> set server scheduler_iteration = 600
>> set server node_ping_rate = 300
>> set server node_check_rate = 600
>> set server tcp_timeout = 6
>> set server job_stat_rate = 30
>> set server log_level = 7
>>
>> Any idea why the job is rejected in that case?
>>
>> Guillaume
>>
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers