[torqueusers] MPI job submitted with TORQUE does not use InfiniBand if running and start nodes overlap [2]

Guilherme Menegon Arantes garantes at iq.usp.br
Fri Jun 21 07:34:18 MDT 2013


Hi there,

Just a quick message to let you know that the problem disappears after
upgrading to TORQUE 3.0.6.

Cheers,

Guilherme



> Date: Wed, 19 Jun 2013 14:55:50 -0300
> From: Guilherme Menegon Arantes <garantes at iq.usp.br>
> Subject: [torqueusers] MPI job submitted with TORQUE does not use
> 	InfiniBand if running and start nodes overlap [2]
> To: torqueusers at supercluster.org
> Message-ID: <20130619175550.GA9844 at iq.usp.br>
> Content-Type: text/plain; charset=iso-8859-1
> 
> 
> Hi there,
> 
> I am using Intel MPI (4.1.0.024 from ICS 2013.0.028) to run my parallel
> application (Gromacs 4.6.1 molecular dynamics) on an SGI cluster with
> CentOS 6.2 and Torque 2.5.12.
> 
> When I submit an MPI job with Torque to start and run on 2 nodes, MPI
> startup fails to negotiate with InfiniBand (IB) and internode
> communication falls back to Ethernet. This is my job script:
> 
> #PBS -l nodes=n001:ppn=32+n002:ppn=32   # request the two nodes explicitly
> #PBS -q normal
> source /opt/intel/impi/4.1.0.024/bin64/mpivars.sh   # Intel MPI environment
> source /opt/progs/gromacs/bin/GMXRC.bash            # Gromacs environment
> cd $PBS_O_WORKDIR/                                  # submission directory
> export I_MPI_DEBUG=2                                # show fabric/provider selection
> mpiexec.hydra -machinefile macs -np 64 mdrun_mpi >& md.out
> 
> Of course the machinefile macs could be generated from $PBS_NODEFILE
> instead of being hard-coded as in this example; a minimal sketch of
> that (not from the original run) is below.
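> 
> # Sketch, assuming the usual Torque/Hydra conventions: $PBS_NODEFILE
> # lists one line per allocated core, so collapse it to one line per
> # host and let Hydra place the ranks itself.
> sort -u $PBS_NODEFILE > macs
> mpiexec.hydra -machinefile macs -np 64 mdrun_mpi >& md.out
> 
> With the hard-coded machinefile, the output is: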
> 
> [54] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
> ...
> [45] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
> ...
> [33] MPI startup(): DAPL provider <NULLstring> on rank 0:n001 differs from ofa-v2-mlx4_0-1(v2.0) on rank 33:n002
> ...
> [0] MPI startup(): shm and tcp data transfer modes
> ...
> 
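> [Aside, not part of the original report: with Intel MPI 4.x one can
> pin the fabric list so a failed DAPL negotiation aborts the job
> instead of silently falling back to tcp. A hedged sketch, using the
> provider name shown in the log above:]
> 
> export I_MPI_FABRICS=shm:dapl                 # Intel MPI 4.x fabric list
> export I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1    # provider from the log above
> 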
> However, MPI negotiates IB fine if I run the same mpiexec.hydra line
> from the console, either logged in to n001 (one of the running nodes)
> or logged in to another node, say the admin node. It also works fine
> if I submit the TORQUE job with a start node different from the
> running nodes (-machinefile macs still points to n001 and n002), e.g.
> using #PBS -l nodes=n003 with the rest identical to the above. This is
> a successful output:
> 
> [55] DAPL startup(): trying to open default DAPL provider from dat registry: ofa-v2-mlx4_0-1
> ...
> [29] MPI startup(): DAPL provider ofa-v2-mlx4_0-1
> ...
> [0] MPI startup(): shm and dapl data transfer modes
> ...
> 
> Any tips on what is going wrong? Please let me know if you need more
> info. This has also been posted to the Intel MPI forum, but your help
> is appreciated too.
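> 
> [Aside, not part of the original report: since the behaviour depends
> on where Hydra is launched from, a quick way to compare what each node
> actually sees is to run a small probe under the same launcher; a
> hedged sketch:]
> 
> # 'env', 'grep', and 'hostname' are plain system tools; -ppn is a
> # standard Hydra flag (one process per node here)
> mpiexec.hydra -machinefile macs -np 2 -ppn 1 sh -c 'hostname; env | grep -i -e dapl -e dat'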
> 
--

Prof. Dr. Guilherme Menegon Arantes

Instituto de Química
Universidade de São Paulo
Av. Prof. Lineu Prestes, 748
São Paulo          05508-000
Brazil
Phone: 55-11-30913848
http://gaznevada.iq.usp.br/