[torqueusers] problem with: set queue cfq route_destinations
Stewart.Samuels at sanofi-aventis.com
Mon Aug 8 13:02:28 MDT 2005
I've tracked this problem down and actually have it working. I can submit jobs between different clusters on our network using this method. What I found was the code actually works, but it is only partially documented. I will describe how the code works and perhaps the folks at Supercluster.org can place it in the documentation appropriately. I will also provide the process here that I used to get things working.
Okay, first, there is a difference between the two commands you specify:
case 1: echo hostname | qsub -q cfq@other.host
case 2: echo hostname | qsub -q cfq
In case 1 the qsub command connects directly to the "other.host" and performs any routing necessary because both the queue and server are known to be remote by qsub by syntax definition. Hence, it does not require the use of any queues serviced by the local pbs_server.
Not so in case 2. Here, because no remote host is specified, "cfq" is assumed to be a queue known to the default pbs_server. Here, qsub actually queues the job and lets the default pbs_server do the routing.
When digging through the code, I found that pbs_server (when routing jobs) actually uses the "pbsnodes" command to get a list of valid destinations to which it can route. In other words, pbs_server validates the "host" portion (specified in the form "@hostname") of the names listed in the "route_destinations" attribute against the names of the nodes listed in its own $PBS_HOME/server_priv/nodes file.
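To make that check concrete, here is a minimal shell sketch of the idea. It is not taken from the TORQUE source. The function name dest_host_in_nodes is made up for illustration, and it assumes the usual nodes file layout in which the first field on each line is the node name.

```shell
# Illustration only, not TORQUE code.
# Succeeds if the host part of a "queue@host" destination appears as a
# node name in the given nodes file.
# Usage: dest_host_in_nodes "queue@host" /path/to/nodes
dest_host_in_nodes() {
    dest=$1
    nodes_file=$2
    case $dest in
        *@*) host=${dest#*@} ;;   # keep everything after the first "@"
        *)   return 0 ;;          # no host part: local queue, nothing to check
    esac
    # Assumption: the first whitespace-separated field on a line is the node name.
    awk -v h="$host" '$1 == h { found = 1 } END { exit !found }' "$nodes_file"
}
```

For example, on the sending server one might run:
dest_host_in_nodes cfq@other.host "$PBS_HOME/server_priv/nodes" || echo "other.host is not in the nodes file"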
So, in your case, simply place "other.host" in the nodes file on the system you are trying to transfer from. The pbsnodes command must show the remote node listed in your $PBS_HOME/server_priv/nodes file as "free". For this to occur, the $PBS_HOME/mom_priv/config file on the remote system must contain a $clienthost entry for the sending node. If the remote server is a master node for a cluster, run a mom on the master node whose $PBS_HOME/mom_priv/config file contains the sending server. The compute nodes for the cluster do not require this entry.
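A quick way to confirm the "free" state is to filter the pbsnodes output. The helper below is only a sketch: node_is_free is a hypothetical name, and it assumes the "pbsnodes -a" layout in which each node name starts in column one and its attributes follow on indented "key = value" lines.

```shell
# Hypothetical helper: succeeds if the named node reports "state = free".
# Reads "pbsnodes -a" style output on stdin.
# Usage: pbsnodes -a | node_is_free other.host
node_is_free() {
    awk -v n="$1" '
        /^[^ \t]/ { current = $1 }     # unindented line: start of a node entry
        current == n && $1 == "state" && $2 == "=" && $3 == "free" { ok = 1 }
        END { exit !ok }
    '
}
```

On the sending server this could be used as:
pbsnodes -a | node_is_free other.host && echo "other.host is free"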
This process will allow the pbs_server to transfer jobs to the remote queue.
If you are running basically a peer-to-peer cluster (where all nodes are on the same network; that is, no private networks separate masters and compute nodes on the remote system to which you are trying to route), the jobs will be returned to the sending server, assuming you have the correct rsh/scp access (remember that pbs_mom uses rcp/scp to return jobs and their output files).
In the case that the remote system to which you are trying to route is a Beowulf Cluster with a private network separating the master node and the compute nodes, then an additional step is required to get jobs returned to the sending host. That is, you must either make your master node a router (Network Admins generally frown on this) or set the IPTABLES on your master node to allow the compute nodes to tunnel back to the sending node (not their master node).
You have to remember that PBS is quite old, and when it was originally created, Beowulf Clusters supporting public and private networks didn't exist. Back then, it was a good assumption that because networks were very expensive, as were servers, a company would only have one network and all servers would connect to it. So, the simple (although very insecure) method of transferring files through rcp was acceptable.
And that's it.
Okay, here's the general process I use for 2-way cluster-to-cluster (grids) routing:
1. Ensure master nodes of sending and receiving clusters can communicate as trusted hosts (no passwords) via rsh or ssh, depending on which you use.
2. Set up a queue on the sending master node with "route_destinations" having a value containing a queue on the receiving master node.
3. Enable this queue and allow jobs to be routed by setting both the "enabled" and "started" attribute values to "True".
4. Set up the $PBS_HOME/mom_priv/config file on the receiving master node to contain $clienthost with the sending master node hostname.
5. Set up the $PBS_HOME/mom_priv/config file on the receiving master node to contain $restricted with the sending master node hostname.
6. Ensure that the compute/slave nodes of the receiving cluster can resolve the name of the sending master node. This is required because the compute/slave nodes return their output directly to the sending server and it uses gethostbyname to do this.
7. Modify the IPTABLES on the receiving master node to allow the compute/slave nodes to tunnel out to the sending master node.
8. Modify the $PBS_HOME/server_priv/nodes file on the sending master node to contain the hostname of the receiving master node.
9. Restart pbs_mom, pbs_server, and the maui scheduler on the sending master node.
10. Restart the pbs_mom on the receiving master node.
11. Perform a "pbsnodes -a" command. The receiving master node should be listed as a "free" node. This is required for pbs_server to route queued jobs to remote systems.
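As a rough sketch of steps 2 through 5, the two shell helpers below only print the settings involved, so they can be reviewed before anything is changed. The helper names are made up for illustration. The qmgr directives are the ones from the original question quoted below, and the mom config lines use the $clienthost and $restricted entries named in the steps.

```shell
# Hypothetical helper: prints qmgr directives for a routing queue (steps 2-3).
# $1 = local routing queue name, $2 = destination in queue@host form.
route_queue_directives() {
    queue=$1
    dest=$2
    cat <<EOF
create queue $queue
set queue $queue queue_type = Route
set queue $queue route_destinations = $dest
set queue $queue enabled = True
set queue $queue started = True
EOF
}

# Hypothetical helper: prints the pbs_mom config lines for the receiving
# master node (steps 4-5). $1 = hostname of the sending master node.
mom_config_lines() {
    printf '$clienthost %s\n$restricted %s\n' "$1" "$1"
}
```

On the sending master node the first helper's output could be fed to qmgr on standard input, for example "route_queue_directives cfq cfq@other.host | qmgr". The output of mom_config_lines would be appended to $PBS_HOME/mom_priv/config on the receiving master node, followed by the restarts in steps 9 and 10.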
Hope this helps everyone in need.
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org]On Behalf Of
L.S.Lowe at bham.ac.uk
Sent: Tuesday, March 15, 2005 10:38 AM
To: torqueusers at supercluster.org
Subject: [torqueusers] problem with: set queue cfq route_destinations
=cfq@other.host
Hello, I'm having a problem reproducing OpenPBS routing behaviour on
Torque (torque version 1.2.0p1, if it matters).
This may be similar to other emails, in January
(subject PBS and Globus gatekeeper node),
which I didn't see a resolution for.
If I define a queue on my PC (which runs a pbs_server, scheduler, and
mom), in a way which works on OpenPBS:
create queue cfq
set queue cfq queue_type = Route
set queue cfq route_destinations = cfq@other.host
set queue cfq enabled = True
set queue cfq started = True
and then submit jobs as follows:
echo hostname | qsub -q cfq
then the jobs never get out of my machine: no attempt occurs to
contact other.host (as shown by tcpdump, and with firewalls off).
I receive an email:
PBS Job Id: 20.mypc.etc
Job Name: STDIN
Aborted by PBS Server
Job rejected by all possible destinations
I have my PC hostname in /etc/hosts.equiv on the other.host
(but, as I say, the request actually never leaves my PC).
On the other hand, if I submit:
echo hostname | qsub -q cfq@other.host
then it works.
Is this a feature of OpenPBS that's been removed from Torque?
Or is there another reason why it's not working?
Thanks, Lawrence Lowe.
Tel: 0121 414 4621 Fax: 0121 414 6709 Email: L.S.Lowe at bham.ac.uk