[torqueusers] OpenMPI mpirun problem with TORQUE

Gus Correa gus at ldeo.columbia.edu
Mon Jan 25 13:43:33 MST 2010


Hi bugslayer

It sounds more like a problem with your Torque setup than with
OpenMPI mpirun.
To check this, try submitting a Torque job that only runs "hostname".
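
For example, a minimal test script could look something like the
sketch below (adjust nodes/ppn to your cluster; the last line assumes
the Torque pbsdsh utility is installed):

   #!/bin/sh
   #PBS -l nodes=2:ppn=2
   #PBS -j oe
   # show which hosts Torque allocated to this job
   cat $PBS_NODEFILE
   # run hostname on every allocated slot through the TM interface
   pbsdsh hostname

If that fails or hangs across two nodes, the problem is on the Torque
side rather than in OpenMPI.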

What is the content of your Torque "nodes" file?
It typically lives on your head node (where pbs_server runs)
at $TORQUEHOME/server_priv/nodes, and has contents like this:

simulation01 np=2
simulation02 np=2
...

assuming each node has 2 CPUs/cores, and guessing that "simulation01",
"simulation02", ... are your nodes' host names.
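
You can also ask the server how it sees the nodes, for instance by
running this on the head node:

   pbsnodes -a

Nodes reported with "state = down" usually point to a pbs_mom or
network problem.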

Also, are the pbs_mom daemons running on all nodes?
Check on each node with "service pbs_mom status"
(the init script may also be called just "pbs"),
or with "ps aux | grep pbs_mom".
If they are not running, you can use chkconfig to activate them at boot time.
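
From the head node, assuming passwordless ssh and the host names
above, a quick check over all nodes could be (just a sketch):

   for node in simulation01 simulation02; do
       echo "== $node =="
       ssh $node 'ps aux | grep [p]bs_mom'
   done

(the [p] keeps grep from matching its own process). If the init script
is installed as pbs_mom, "chkconfig pbs_mom on" should enable it at boot.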

I hope this helps,
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

bugslayer wrote:
> Yes, I did.
> 
>  
> 
> TORQUE
> 
> (server) ./configure --prefix=/usr/local/ --with-rcp=scp
> 
> (clients) use shell script (generated by make packages in server)
> 
>  
> 
> OpenMPI
> 
> ./configure --prefix=/usr/local/openmpi --with-tm=/usr/local/
> 
>  
> 
>  
> 
> From: Si Hammond [mailto:simon.hammond at gmail.com]
> Sent: Monday, January 25, 2010 1:25 AM
> To: 
> Cc: Si Hammond; torqueusers at supercluster.org
> Subject: Re: [torqueusers] OpenMPI mpirun problem with TORQUE
> 
>  
> 
> Out of interest when you specified the --with-tm did you give the 
> configure a directory to find the PBS installation?
> 
> On 24 Jan 2010, at 11:18,  wrote:
> 
> 
> 
> Sure.
> 
> As I mentioned, mpirun works correctly on its own. The problem occurs only when the job is submitted via Torque.
> 
>  
> 
> From: Si Hammond [mailto:simon.hammond at gmail.com]
> Sent: Sunday, January 24, 2010 8:12 PM
> To: 
> Cc: Si Hammond; torqueusers at supercluster.org
> Subject: Re: [torqueusers] OpenMPI mpirun problem with TORQUE
> 
>  
> 
> Can you SSH from one node to the next without passwords etc?
> 
> 
> On 23 Jan 2010, at 23:03, wrote:
> 
> Hi all.
> 
>  
> 
> I have a small (but serious) problem when submitting a job using mpirun.
> 
>  
> 
> There is no problem with just one node (multiple processors), as below.
> 
>  
> 
> (job script)
> 
> #!/bin/sh
> #PBS -l nodes=1:ppn=2
> #PBS -j oe
> 
> echo "HOSTNAME : $HOSTNAME"
> echo "PBS_NODEFILE = $PBS_NODEFILE"
> cat $PBS_NODEFILE
> mpirun /home/jhlee/test_program
> echo "finish : $(date)"
> 
> (result) - test_program just prints a message saying whether it was
> launched by mpirun.
> 
> start  : Sun Jan 24 07:46:27 KST 2010
> HOSTNAME : simulation01
> PBS_NODEFILE = /var/spool/torque/aux//31.simulation00
> simulation01
> simulation01
> Detected OpenMPI Runtime Environment
> Detected OpenMPI Runtime Environment
> finish : Sun Jan 24 07:46:29 KST 2010
> 
>  
> 
> But with more than one node, as below, mpirun does not start test_program at all.
> 
>  
> 
> #PBS -l nodes=2:ppn=2    (everything else is the same)
> 
>  
> 
> I can’t find any process. There’s only mpirun, no ‘test_program’. Please 
> check the ‘ps’ result below.
> 
>  
> 
> 21680 ?        S      0:00 mpirun /home/jhlee/test_program
> 21684 ?        Ss     0:00 bash -c ps ax | grep test
> 21712 ?        R      0:00 grep test
> 
>  
> 
> 1. mpirun (not via TORQUE) works correctly.
> 2. OpenMPI was built with the --with-tm option.
> 3. iptables and selinux are already disabled, and no password is
> required to ssh to the other nodes.
> 4. OpenMPI 1.4.1, TORQUE 2.4.4
> 
>  
> 
> What can I check to solve this?
> 
>  
> 
> Thanks.
> 
>  
> 
> -------------------------------------------------------------------------------------------
> 
>  
> 
> Jeong-hyun Lee
> 
>  
> 
> Visual Simulation Laboratory
> 
> Department of Computer Science and Engineering
> 
> Dongguk University, Seoul, Korea
> 
>  
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 
>  
> 
> 
> ---------------------------------------------------------------------------------------
> 
> Si Hammond
> 
>  
> 
> Research & Knowledge Transfer Associate
> 
> Performance Modelling, Analysis and Optimisation Team
> 
> High Performance Systems Group
> 
> Department of Computer Science
> 
> University of Warwick, CV4 7AL, UK
> 
> http://go.warwick.ac.uk/hpsg
> 
> ----------------------------------------------------------------------------------------
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers



More information about the torqueusers mailing list