[torqueusers] torque is working with openmpi?

Gus Correa gus at ldeo.columbia.edu
Tue Apr 17 11:38:52 MDT 2012


Hi Sergio

A) Your Open MPI seems to have been built with InfiniBand
support. However, as the error message says, you don't seem
to have InfiniBand interfaces [or the openib kernel modules
are not loaded].

To prevent Open MPI from using InfiniBand,
add '-mca btl ^openib'
to your mpirun command line.
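
For example, reusing the 'hello' program from your test
[just a sketch of the command line]:

mpirun -mca btl ^openib -np 2 ./hello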

A cleaner solution is to build Open MPI with support only
for the hardware that you actually have in your machines.
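
For example, something along these lines at configure time
builds the Torque (tm) support and leaves InfiniBand out
[just a sketch: the install prefix and the Torque location
are assumptions, and option names can differ between
Open MPI versions, so check './configure --help']:

./configure --prefix=/opt/openmpi-1.5.4 \
    --with-tm=/usr \
    --without-openib
make all install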

**

B) Also, to use the Open MPI-Torque integration you must
submit the job with *qsub*, not by running mpirun directly!
Torque will assign a list of nodes that is then used by
the mpirun *inside* the script that
you submitted via qsub.
This way you don't need to add a nodefile
to the mpirun command line.

For instance:

Write a script like this [say my_script]:
#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -q batch
#PBS -N hello
cd $PBS_O_WORKDIR
mpirun -np 2 ./hello

Then do:
qsub my_script
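
Since an Open MPI built with tm support gets the node list
straight from Torque, you can also drop '-np' and let
mpirun start one process per allocated slot
[a sketch, assuming the same 'hello' executable]:

#!/bin/bash
#PBS -l nodes=1:ppn=2
#PBS -q batch
#PBS -N hello
cd $PBS_O_WORKDIR
mpirun ./hello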

**

I hope this helps,
Gus Correa

On 04/17/2012 12:23 PM, Sergio Belkin wrote:
> Hi,
>
> I'm testing Torque on Fedora 16. The problem is that jobs are not sent to the compute node.
> Data:
>
>
>
> torque server: mpimaster.mycluster
> torque client: mpinode02.mycluster
>
> [sergio at mpimaster cluster]$ ompi_info | grep tm
>                   MCA ras: tm (MCA v2.0, API v2.0, Component v1.5.4)
>                   MCA plm: tm (MCA v2.0, API v2.0, Component v1.5.4)
>                   MCA ess: tm (MCA v2.0, API v2.0, Component v1.5.4)
>
>
> torque configuration:
>
> [root at mpimaster sergio]# cat /etc/torque/pbs_environment
> PATH=/bin:/usr/bin
> LANG=C
>
> cat /etc/torque/server_name
> mpimaster.mycluster
>
> [root at mpimaster sergio]# cat /etc/hosts
> 127.0.0.1               localhost.localdomain localhost
> ::1             localhost6.localdomain6 localhost6
> 192.168.122.1   mpinode02.mycluster mpinode02
> 192.168.122.2   mpimaster.mycluster mpimaster mpinode0
>
> cat /var/lib/torque/server_priv/nodes
> mpimaster np=1
> mpinode02 np=2
>
> [sergio at mpimaster ~]$ qmgr -c 'p s'
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch acl_user_enable = True
> set queue batch acl_users = sergio
> set queue batch resources_default.nodes = 2
> set queue batch resources_default.walltime = 01:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = mpimaster.mycluster
> set server acl_hosts += mpimaster
> set server acl_hosts += localhost
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server next_job_number = 402
> set server authorized_users = sergio at mpimaster
> set server authorized_users += sergio at mpinode02
>
>
> Client configuration:
>
> [sergio at mpimaster ~]$ cat /etc/hosts
> 127.0.0.1               localhost.localdomain localhost
> ::1             localhost6.localdomain6 localhost6
> 192.168.122.1   mpinode02.mycluster mpinode02
> 192.168.122.2   mpimaster.mycluster mpimaster mpinode01
> You have new mail in /var/spool/mail/sergio
> [sergio at mpimaster ~]$ cat /etc/torque/
> mom/             pbs_environment  sched/           server_name
> [sergio at mpimaster ~]$ cat /etc/torque/server_name
> mpimaster.mycluster
> [sergio at mpimaster ~]$ cat /etc/torque/pbs_environment
> PATH=/bin:/usr/bin
> LANG=C
> [sergio at mpimaster ~]$ cat /etc/torque/mom/config
> # Configuration for pbs_mom.
> $pbsserver mpimaster.mycluster
>
>
> Then I submit a job via mpirun:
>
> [sergio at mpimaster cluster]$ mpirun  hello
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> CMA: unable to get RDMA device list
> --------------------------------------------------------------------------
> [[54064,1],0]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
>
> Module: OpenFabrics (openib)
>    Host: mpimaster.mycluster
>
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
>
>
>
> If I use a hostfile, it works:
>
> [sergio at mpimaster cluster]$ mpirun --hostfile myhostfile hello
>
> KeyChain 2.6.8; http://www.gentoo.org/proj/en/keychain/
> Copyright 2002-2004 Gentoo Foundation; Distributed under the GPL
>
>   * Found existing ssh-agent (1607)
>   * Found existing gpg-agent (1690)
>   * Known ssh key: /home/sergio/.ssh/id_rsa
>
> librdmacm: couldn't read ABI version.
> librdmacm: assuming: 4
> CMA: unable to get RDMA device list
> --------------------------------------------------------------------------
> [[54073,1],0]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
>
> Module: OpenFabrics (openib)
>    Host: mpimaster.mycluster
>
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> Hello World! from process 2 out of 3 on mpinode02.mycluster
> Hello World! from process 1 out of 3 on mpinode02.mycluster
> Hello World! from process 0 out of 3 on mpimaster.mycluster
>
> Am I doing something wrong?
>
> Thanks in advance!
>
