[torqueusers] torque is working with openmpi?
Sergio Belkin
sebelk at gmail.com
Tue Apr 17 15:12:15 MDT 2012
2012/4/17 Gus Correa <gus at ldeo.columbia.edu>:
> Hi Sergio
>
> A) Your OpenMPI seems to have been built with InfiniBand support.
> However, as the error message says, you don't seem to have
> InfiniBand interfaces [or the openib kernel modules are not
> loaded].
>
> To prevent OpenMPI from using InfiniBand,
> add '-mca btl ^openib'
> to your mpirun command line.
>
> A cleaner solution is to build OpenMPI with support only
> for the hardware that you have in your machines.
Thanks for the hint!
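
A minimal sketch of what that command line could look like, assuming the same `hello` binary from my original message. The helper name `mpirun_no_ib` is made up for illustration, and it only prints the command rather than launching anything:

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI or Torque): print an mpirun
# command line that excludes the openib BTL, so Open MPI does not try
# to use InfiniBand and picks another available transport instead.
mpirun_no_ib() {
    # $1 = number of processes, $2 = program to launch
    printf 'mpirun --mca btl ^openib -np %s %s\n' "$1" "$2"
}

# Print the command for inspection.
mpirun_no_ib 2 ./hello
```

The `^` prefix tells Open MPI to exclude the listed component instead of selecting it.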
>
> **
>
> B) Also, to use the OpenMPI-Torque integration you must
> submit a job with *qsub*, not directly mpirun!
> Torque will assign a list of nodes that will be
> subsequently used by the mpirun *inside* the script that
> you submitted via qsub.
> This way you don't need to add a nodefile
> to the mpirun command line.
>
> For instance:
>
> Write a script like this [say my_script]:
> #PBS -l nodes=1:ppn=1
> #PBS -q batch
> #PBS -N hello
> cd $PBS_O_WORKDIR
> mpirun -np 2 ./hello
>
> Then do:
> qsub my_script
Thanks for your help. I've got the idea, and that worked!
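
For the archive, here is a slightly extended sketch of such a job script. It is only an illustration under a few assumptions: the resource request `nodes=2:ppn=1` is my guess at something that fits the two nodes listed in my nodes file, `count_slots` is a hypothetical helper, and the script relies on Torque setting `$PBS_NODEFILE` (one line per assigned slot) and `$PBS_O_WORKDIR` inside the job. Outside a Torque job it does nothing:

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=1
#PBS -q batch
#PBS -N hello

# Hypothetical helper: count the slots listed in a Torque node file,
# which contains one line per assigned processor slot.
count_slots() {
    # $1 = path to a node file
    wc -l < "$1" | tr -d ' '
}

# Only act inside a Torque job, where both variables are set.
if [ -n "${PBS_NODEFILE:-}" ] && [ -n "${PBS_O_WORKDIR:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    NP=$(count_slots "$PBS_NODEFILE")
    echo "Torque assigned $NP slot(s):"
    sort "$PBS_NODEFILE" | uniq -c
    # No hostfile needed: mpirun gets the node list from Torque.
    mpirun -np "$NP" ./hello
fi
```

Submitted with `qsub my_script`, the slot count comes from what Torque actually assigned rather than from a hard-coded `-np` value.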
>
> **
>
> I hope this helps,
> Gus Correa
>
> On 04/17/2012 12:23 PM, Sergio Belkin wrote:
>> Hi,
>>
>> I'm testing torque on Fedora 16. The problem is that jobs are not sent to
>> Data:
>>
>>
>>
>> torque server: mpimaster.mycluster
>> torque client: mpinode02.mycluster
>>
>> [sergio@mpimaster cluster]$ ompi_info | grep tm
>> MCA ras: tm (MCA v2.0, API v2.0, Component v1.5.4)
>> MCA plm: tm (MCA v2.0, API v2.0, Component v1.5.4)
>> MCA ess: tm (MCA v2.0, API v2.0, Component v1.5.4)
>>
>>
>> torque configuration:
>>
>> [root@mpimaster sergio]# cat /etc/torque/pbs_environment
>> PATH=/bin:/usr/bin
>> LANG=C
>>
>> cat /etc/torque/server_name
>> mpimaster.mycluster
>>
>> [root@mpimaster sergio]# cat /etc/hosts
>> 127.0.0.1 localhost.localdomain localhost
>> ::1 localhost6.localdomain6 localhost6
>> 192.168.122.1 mpinode02.mycluster mpinode02
>> 192.168.122.2 mpimaster.mycluster mpimaster mpinode0
>>
>> cat /var/lib/torque/server_priv/nodes
>> mpimaster np=1
>> mpinode02 np=2
>>
>> [sergio@mpimaster ~]$ qmgr -c 'p s'
>> #
>> # Create queues and set their attributes.
>> #
>> #
>> # Create and define queue batch
>> #
>> create queue batch
>> set queue batch queue_type = Execution
>> set queue batch acl_user_enable = True
>> set queue batch acl_users = sergio
>> set queue batch resources_default.nodes = 2
>> set queue batch resources_default.walltime = 01:00:00
>> set queue batch enabled = True
>> set queue batch started = True
>> #
>> # Set server attributes.
>> #
>> set server scheduling = True
>> set server acl_hosts = mpimaster.mycluster
>> set server acl_hosts += mpimaster
>> set server acl_hosts += localhost
>> set server default_queue = batch
>> set server log_events = 511
>> set server mail_from = adm
>> set server scheduler_iteration = 600
>> set server node_check_rate = 150
>> set server tcp_timeout = 6
>> set server next_job_number = 402
>> set server authorized_users = sergio@mpimaster
>> set server authorized_users += sergio@mpinode02
>>
>>
>> Client configuration:
>>
>> [sergio@mpimaster ~]$ cat /etc/hosts
>> 127.0.0.1 localhost.localdomain localhost
>> ::1 localhost6.localdomain6 localhost6
>> 192.168.122.1 mpinode02.mycluster mpinode02
>> 192.168.122.2 mpimaster.mycluster mpimaster mpinode01
>> You have new mail in /var/spool/mail/sergio
>> [sergio@mpimaster ~]$ cat /etc/torque/
>> mom/ pbs_environment sched/ server_name
>> [sergio@mpimaster ~]$ cat /etc/torque/server_name
>> mpimaster.mycluster
>> [sergio@mpimaster ~]$ cat /etc/torque/pbs_environment
>> PATH=/bin:/usr/bin
>> LANG=C
>> [sergio@mpimaster ~]$ cat /etc/torque/mom/config
>> # Configuration for pbs_mom.
>> $pbsserver mpimaster.mycluster
>>
>>
>> Then I submit the job via mpirun:
>>
>> [sergio@mpimaster cluster]$ mpirun hello
>> librdmacm: couldn't read ABI version.
>> librdmacm: assuming: 4
>> CMA: unable to get RDMA device list
>> --------------------------------------------------------------------------
>> [[54064,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: OpenFabrics (openib)
>> Host: mpimaster.mycluster
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>>
>>
>>
>> If I use a hostfile, it works:
>>
>> [sergio@mpimaster cluster]$ mpirun --hostfile myhostfile hello
>>
>> KeyChain 2.6.8; http://www.gentoo.org/proj/en/keychain/
>> Copyright 2002-2004 Gentoo Foundation; Distributed under the GPL
>>
>> * Found existing ssh-agent (1607)
>> * Found existing gpg-agent (1690)
>> * Known ssh key: /home/sergio/.ssh/id_rsa
>>
>> librdmacm: couldn't read ABI version.
>> librdmacm: assuming: 4
>> CMA: unable to get RDMA device list
>> --------------------------------------------------------------------------
>> [[54073,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: OpenFabrics (openib)
>> Host: mpimaster.mycluster
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> Hello World! from process 2 out of 3 on mpinode02.mycluster
>> Hello World! from process 1 out of 3 on mpinode02.mycluster
>> Hello World! from process 0 out of 3 on mpimaster.mycluster
>>
>> Am I doing something wrong?
>>
>> Thanks in advance!
>>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
--
Sergio Belkin http://www.sergiobelkin.com
Watch More TV http://sebelk.blogspot.com
LPIC-2 Certified - http://www.lpi.org