[torqueusers] Directly linked nodes via crossover cable

Aaron J. Greenwood agreenwo at uci.edu
Fri Feb 24 07:46:36 MST 2006


I followed your suggestions but am still having a problem.  I am using
the "hello" program for testing purposes.  I submitted the job on the
command line with qsub (qsub < hello.qsub).  You will note that even
though I create the machine_file and use it as a parameter for lamboot,
the nodes LAM actually boots are the ones named in the PBS_NODEFILE,
not the machine_file.
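
As a side check (a minimal sketch; this assumes the p*.p1 names are
meant to resolve to the crossover-link addresses via /etc/hosts), the
rewritten and original names can be compared directly on a node:

  getent hosts p29.p1 compute29.oscarcluster

If p29.p1 comes back with the same address as compute29.oscarcluster,
LAM will end up on the switched interface no matter what the boot
schema says.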

I have included the script hello.qsub and the output file d_test1.o1310.

===============
FILE: hello.qsub
===============

#!/bin/csh

#PBS -N "d_test1"
#PBS -l nodes=2:ppn=2:p1
#PBS -j oe
#PBS -q highmem

cd $HOME/ptest/hello

echo "PBS_NODEFILE ALLOCATED = $PBS_NODEFILE"
echo "% cat $PBS_NODEFILE"
cat $PBS_NODEFILE

# Rewrite the scheduler's host names to the directly linked interface
# names (compute29.oscarcluster -> p29.p1, etc.)
sed 's/compute/p/' < $PBS_NODEFILE | sed 's/oscarcluster/p1/' \
    > $HOME/ptest/hello/machine_file

echo "% cat $HOME/ptest/hello/machine_file"
cat $HOME/ptest/hello/machine_file


lamboot -s $HOME/ptest/hello/machine_file
lamnodes

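# Run hello on all CPUs in the booted LAM universe ("C" means every
# available CPU); the here-document below is fed to hello's stdin.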
mpirun C hello << EOF > log.hello_mpirun
32000
EOF

wipe -v $HOME/ptest/hello/machine_file

echo "DONE"


======================
OUTPUT of d_test1.o1310
======================

echo "DONE"

PBS_NODEFILE ALLOCATED = /var/spool/pbs/aux/1310.headnode.oscarcluster
% cat /var/spool/pbs/aux/1310.headnode.oscarcluster
compute29.oscarcluster
compute29.oscarcluster
compute28.oscarcluster
compute28.oscarcluster

% cat /home/agreenwo/ptest/hello/machine_file
p29.p1
p29.p1
p28.p1
p28.p1

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

n0      compute29.oscarcluster:2:origin,this_node
n1      compute28.oscarcluster:2:
n-1<4691> ssi:boot:base:linear: booting n0 (compute29.oscarcluster)
n-1<4691> ssi:boot:base:linear: booting n1 (compute28.oscarcluster)
n-1<4691> ssi:boot:base:linear: finished

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

DONE
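
Incidentally, whether the traffic really takes the crossover link can
be confirmed from one of the nodes while the job runs (eth1 is the
direct interface in the hardware description quoted below):

  tcpdump -i eth1 -n tcp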



Garrick Staples wrote:
> On Mon, Feb 13, 2006 at 08:26:20AM -0800, Aaron Greenwood alleged:
>   
>> Consider the following hardware configuration:
>>
>> NODE 1 (2 CPUS)
>> eth0 - Connected to cluster Ethernet switch.
>> eth1 - Directly linked via crossover cable to NODE 2
>>
>> NODE 2 (2 CPUS)
>> eth0 - Connected to cluster Ethernet switch.
>> eth1 - Directly linked via crossover cable to NODE 1
>>
>> Is it possible to configure PBS in such a way that a parallel job
>> submitted from the head node will use all CPUS on NODE 1 and NODE 2
>> running over the Ethernet cards that are directly linked?
>>     
>
> Not directly, no.
>
>  
>   
>> The directly linked cards are on a private network listed in the local
>> hosts file.
>>
>> I talked with a guy who does this. He said that in the script that he
>> submits his jobs he modifies the machine_file as in lamboot -s
>> machine_file. When I do that the jobs run using the Ethernet cards
>> connected to the cluster switch. I checked this by logging on to both of
>> the nodes and checking traffic with tcpdump and running lamnodes.
>>     
>
> Exactly what that guy said.  PBS passes the list of node names to a
> job by putting them in a file whose path is in $PBS_NODEFILE.  Your
> job would simply make a local copy of $PBS_NODEFILE, transforming the
> hostnames to match those of the directly linked interfaces.
>
> For example, if $PBS_NODEFILE had "node01" and "node02", which refer to
> the switched interfaces, and "node01-direct" and "node02-direct" refer
> to the direct interfaces, your job could do something simple like:
>   sed 's/$/-direct/' < $PBS_NODEFILE > /tmp/machine_file
> And then use /tmp/machine_file with lamboot.
>
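
For reference, a complete job script following that recipe might look
like the sketch below.  The -direct suffix and the hello binary are
placeholders taken from the examples in this thread; the -direct names
must resolve to the crossover-link addresses in /etc/hosts.

  #!/bin/csh
  #PBS -l nodes=2:ppn=2

  cd $PBS_O_WORKDIR

  # Map the scheduler's host names onto the direct-link names.
  sed 's/$/-direct/' < $PBS_NODEFILE > /tmp/machine_file

  lamboot -s /tmp/machine_file
  lamnodes
  mpirun C hello
  wipe -v /tmp/machine_file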