[torqueusers] multi server config

Guillaume ALLEON guillaume.alleon at laposte.net
Thu Aug 4 13:02:39 MDT 2005


Hi,

I have three machines running:
machine1: pbs_server
machine2: pbs_server & pbs_sched
machine3: pbs_mom

The machines do not share any filesystem.
machine2 has two NICs, set up so that machine3 does not see machine1.
machine1 and machine2 are separated by a firewall.
I use Torque with scp.
The setup between machine2 and machine3 is working fine. I just want to
submit from another server over a WAN.

machine1 hosts a default queue that routes to another routing queue,
also named default, on machine2.

From machine1, the command:
  echo hostname | qsub -q default@machine2
works fine, except that the stderr and stdout files are not sent back to
machine1 because machine3 cannot scp to machine1 (they don't see each
other).

Would it be possible to copy the files back from the mom node to the
server and tell the server to push them back to machine1? Something like:
     scp -r /var/local/torque/spool/xx.machine2.OU uid@machine2:/tempspace/
     ssh uid@machine2 "scp -r /tempspace/xx.machine2.OU uid@machine1:/home/uid/"
but handled directly by Torque.
I don't want the internal nodes to see anything outside the cluster
except the front-end node. Perhaps there is another way to solve the
problem; I can't be the first to run into this issue.
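
To make the two-hop copy above concrete, here is a minimal sketch that only
prints the commands for a given spool file; it does not copy anything.
relay_cmds is a hypothetical helper name, not something Torque provides, and
the /tempspace staging directory and /home/uid target are simply the paths
from the example above.

```shell
#!/bin/sh
# Sketch only: print the two-hop copy that would relay a job output file
# from the mom node through the front end back to the submission host.
# relay_cmds is a hypothetical helper, not part of Torque.
relay_cmds() {
    file=$1     # spool file on the mom node
    user=$2     # account used for scp/ssh
    front=$3    # front-end host reachable from both sides
    origin=$4   # host the job was submitted from
    base=$(basename "$file")
    # Hop 1: mom node -> staging area on the front end
    printf 'scp -r %s %s@%s:/tempspace/\n' "$file" "$user" "$front"
    # Hop 2: executed on the front end, pushes the file to the origin host
    printf 'ssh %s@%s "scp -r /tempspace/%s %s@%s:/home/%s/"\n' \
        "$user" "$front" "$base" "$user" "$origin" "$user"
}

# Example with the names used in this message
relay_cmds /var/local/torque/spool/xx.machine2.OU uid machine2 machine1
```

Printing the commands rather than running them keeps the sketch safe to try
on any of the three machines.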

The other problem is that from machine1, the command:
  echo hostname | qsub
does not reach the pbs_server on machine2. The job is rejected by all
destinations. Usually I get this when /etc/hosts.equiv is not properly
set up, but that is not the case here.

08/04/2005 16:26:45;0100;PBS_Server;Req;;Type AuthenticateUser request received from alleon@v810-su, sock=10
08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request AuthenticateUser on sd=10
08/04/2005 16:26:45;0100;PBS_Server;Req;;Type QueueJob request received from alleon@v810-su, sock=9
08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request QueueJob on sd=9
08/04/2005 16:26:45;0100;PBS_Server;Req;;Type JobScript request received from alleon@v810-su, sock=9
08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request JobScript on sd=9
08/04/2005 16:26:45;0100;PBS_Server;Req;;Type ReadyToCommit request received from alleon@v810-su, sock=9
08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request ReadyToCommit on sd=9
08/04/2005 16:26:45;0100;PBS_Server;Req;;Type Commit request received from alleon@v810-su, sock=9
08/04/2005 16:26:45;0008;PBS_Server;Job;dispatch_request;dispatching request Commit on sd=9
08/04/2005 16:26:45;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from TRANSIT to QUEUED-QUEUED (1-10)
08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;enqueuing into default, state 1 hop 1
08/04/2005 16:26:45;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from QUEUED to TRANSIT-TRNOUT (0-2)
08/04/2005 16:26:45;0100;PBS_Server;Job;53.v810-su;dequeuing from default, state TRANSIT
08/04/2005 16:26:45;0008;PBS_Server;Job;53.v810-su;Job Queued at request of alleon@v810-su, owner = alleon@v810-su, job name= STDIN, queue = default
08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from TRANSIT to QUEUED-QUEUED (1-10)
08/04/2005 16:26:47;0008;PBS_Server;Job;53.v810-su;Job rejected by all possible destinations
08/04/2005 16:26:47;000d;PBS_Server;Job;53.v810-su;sending 'a' mail for job 53.v810-su to alleon@v810-su (Job rejected by all possible destinations)
08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 53.v810-su state from QUEUED to EXITING-SUBSTATE55 (5-54)
08/04/2005 16:26:47;0100;PBS_Server;Job;53.v810-su;dequeuing from default, state EXITING
08/04/2005 16:26:47;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - p
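
To follow a single job through a log like the one above, a small filter on
the job-id field can help. This is only a sketch: job_events is a hypothetical
helper name, and it assumes each entry sits on one line in the real log file,
with ';' separating timestamp, event code, server name, object type, object id
and message, as in the entries shown here.

```shell
#!/bin/sh
# Sketch only: print "timestamp | message" for the log entries whose
# object-id field (5th ';'-separated field) equals the given job id.
# job_events is a hypothetical helper, not a Torque command.
# Entries logged against the server itself (such as the svr_setjobstate
# lines) carry PBS_Server in field 5 and are therefore not printed.
job_events() {
    jobid=$1     # e.g. 53.v810-su
    logfile=$2   # path to a pbs_server log file
    awk -F';' -v id="$jobid" '$5 == id { print $1 " | " $6 }' "$logfile"
}
```

Called as job_events 53.v810-su followed by the path of the server log, it
would list only the enqueue, dequeue, "Job Queued" and "Job rejected" entries
for that job. A message that itself contains ';' would be cut at that point.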

My machine1 server configuration is the following:

#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default max_running = 45
set queue default route_destinations = default@hal
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server max_user_run = 5
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server job_stat_rate = 30
set server log_level = 7
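
One thing that can be ruled out quickly with a configuration like this is
whether the server name in route_destinations resolves on the routing host.
The sketch below is only that, a sketch: dest_host and check_dest are
hypothetical helper names, and it assumes getent is available on the machine
where it runs.

```shell
#!/bin/sh
# Sketch only: pull the server part out of a queue@server destination
# and check that the name resolves locally.
# dest_host and check_dest are hypothetical helpers, not Torque commands.
dest_host() {
    case $1 in
        *@*) printf '%s\n' "${1#*@}" ;;   # text after the first '@'
        *)   printf '\n' ;;               # local queue, no server part
    esac
}

check_dest() {
    host=$(dest_host "$1")
    if [ -z "$host" ]; then
        echo "$1: local queue, nothing to resolve"
        return 0
    fi
    # Assumption: getent is available (glibc-based systems)
    if getent hosts "$host" >/dev/null 2>&1; then
        echo "$host resolves"
    else
        echo "$host does not resolve"
        return 1
    fi
}
```

Running check_dest default@hal on machine1 would show whether the name used
in the route destination is known there; it says nothing about whether the
remote server accepts the job.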

Any idea why the job is rejected in that case?

Guillaume



