[torqueusers] Torque environment problem

Gustavo Correa gus at ldeo.columbia.edu
Sun Mar 20 18:06:17 MDT 2011


Hi Randall

I don't think there is a Torque/OpenMPI compatibility problem.
We have Torque 2.4.11 and OpenMPI 1.4.3 on Linux x86_64.
They work hand in hand, no problem at all.
OpenMPI picks exactly the nodes and cores provided by Torque.
There is no need to concoct a hostfile.
In the past I used several combinations of previous (and newer) versions
of OpenMPI compiled with support for different versions of Torque with no
problem either (Torque 2.3.6 and 2.5.4; OpenMPI 1.2.8, 1.3.2, 1.3.3, 1.4.2).

A bunch of guesses below.

1) Passwordless ssh

For OpenMPI to work right you must set up passwordless ssh across the nodes.
Is this working?
Can you ssh without password across all pairs of nodes?
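
A quick way to test this from the head node is something like the sketch
below (node163 and node164 are just stand-ins for your own node names;
BatchMode=yes makes ssh fail instead of prompting for a password):

for host in node163 node164; do
  ssh -o BatchMode=yes -o ConnectTimeout=5 $host true \
    && echo "$host: passwordless ssh OK" \
    || echo "$host: passwordless ssh FAILED"
done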

2) ssh 

Note 'ssh', not 'rsh'.
I don't know if Torque/OpenMPI integration works if you switch from ssh to rsh to connect
across nodes.

3) various MPI versions.

Also, make sure you are not inadvertently using another version of mpiexec,
perhaps one that doesn't have Torque support.
Compilers and some Linux distributions may come with these extras.
You can either use full path names to mpicc when you compile and to mpiexec
when you run the program, or make sure the OpenMPI 'bin' and 'lib' directories
come first in your PATH and LD_LIBRARY_PATH.
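
For instance, something like this shows which mpiexec you actually pick up
and whether it was built with tm support (the install prefix below is just
the one from your own job script; adjust it if yours differs):

which mpiexec
/usr/mpi/intel/openmpi-1.4.3/bin/ompi_info | grep tm

The first command should point into the OpenMPI installation that was
configured with --with-tm.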

4) same OpenMPI on all nodes

In addition, the very same OpenMPI installation must be available on all nodes:
either installed exactly the same way in local directories on every node,
or as a single centralized installation shared by the nodes via NFS or equivalent.
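
A rough spot check (again, just a sketch; substitute your own node names):

for host in node163 node164; do
  echo "== $host =="
  ssh $host 'which mpiexec; ompi_info | grep "Open MPI:"'
done

Note that this runs in the non-interactive shell environment on each node,
which is also what remote launches see; if mpiexec or ompi_info is not found
there, that is a clue in itself.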

5) same environment on all nodes

Furthermore, your environment must be consistent (preferably the same) on all nodes.
If you have separate home directories on each node, you need to set your .bashrc/.cshrc
on all nodes to point to the OpenMPI bin and lib directories (local directories).
If home is exported by the head node and mounted on all nodes via NFS,
then a single .bashrc/.cshrc on the head node will suffice. 
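
For example, in the .bashrc that the compute nodes see, something along
these lines (a sketch; the prefix is the one from your job script, not
necessarily your actual layout):

export PATH=/usr/mpi/intel/openmpi-1.4.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/mpi/intel/openmpi-1.4.3/lib:$LD_LIBRARY_PATH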

6) are the pbs_mom daemons working on all nodes?

Another check is to look at the output of 'pbsnodes -a'
and see whether all nodes report back.
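
For example:

pbsnodes -a | grep -c "state = free"   # how many nodes report themselves free
pbsnodes -l                            # nodes that are down, offline, or unknown

If 'pbsnodes -l' prints nothing and the count matches your node total,
the pbs_mom daemons are talking to the server.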

7) minimalist test

A minimalist test of OpenMPI with Torque is a job script like this
(adjust the queue name and the node/core counts to your cluster):

#PBS -q myqueue@myhost.mydomain
#PBS -l nodes=2:ppn=12
mpiexec -np 24 hostname

You don't need the '-mca plm tm' flag; when OpenMPI is built with tm support
it selects it automatically inside a Torque job.

It should list each host name as many times as the number of cores per node
(here, each of the two host names 12 times).
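
For instance, if the script above is saved as tm_test.sh (just a hypothetical
name), submit it and check the output file that Torque writes back:

qsub tm_test.sh
# once the job finishes, e.g. for job id 123:
cat tm_test.sh.o123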

8) Torque and Slurm?

Finally, are you running Slurm along with Torque?
If you are, I wonder if this may cause a conflict.
Any two resource managers most likely won't play nicely with each other.
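
One quick (if crude) way to check, assuming node163 is one of your compute nodes:

ssh node163 'ps -e | grep slurmd'   # is a Slurm daemon running on the compute node?
env | grep -i slurm                 # inside a Torque job: any leftover SLURM_* variables?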

I hope this helps,
Gus Correa


On Mar 20, 2011, at 2:32 PM, Svancara, Randall wrote:

> Hi,
> 
> I have recompiled openmpi-1.4.3 with tm support. 
> 
> I have confirmed that it is available via:
> 
> [rsvancara at node1 ~]$ ompi_info |grep plm
>                  MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.3)
>                  MCA plm: slurm (MCA v2.0, API v2.0, Component v1.4.3)
>                  MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)
> 
> [rsvancara at node164 ~]$ ompi_info |grep plm
>                  MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.3)
>                  MCA plm: slurm (MCA v2.0, API v2.0, Component v1.4.3)
>                  MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)
> 
> 
> When I launch jobs using openmpi, I have to use:
> 
> -mca plm rsh
> 
> If I set this to
> 
> -mca plm tm
> 
> Then no remote processes are launched.  I do not mind using rsh; however, I would prefer to have torque "do the right thing" and just work.  I am using torque version 2.4.7.
> 
> Is this a torque/openmpi compatibility issue?  Or is this how torque is supposed to work with openmpi?  I thought torque would launch the remote processes and clean them up after.
> 
> I would appreciate any suggestions. 
> 
> 
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org on behalf of Gustavo Correa
> Sent: Sat 3/19/2011 5:48 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] Torque environment problem
> 
> Hi Randall
> 
> If you build OpenMPI with Torque support, mpiexec will use the
> nodes and processors provided by Torque, and you don't
> need to provide any hostfile whatsoever.
> We've been using OpenMPI with Torque support for quite a while.
> 
> To do so, you need to configure OpenMPI this way:
> 
> ./configure --prefix=/directory/to/install/openmpi --with-tm=/directory/where/you/installed/torque
> 
> See the OpenMPI FAQ about this:
> http://www.open-mpi.org/faq/?category=building#build-rte-tm
> 
> Still, although your script to restore the "np=$NPROC" syntax is very clever,
> I guess you could use directly the $PBS_NODEFILE as your hostfile,
> when OpenMPI is not built with Torque support.
> 
> The issue with LD_LIBRARY_PATH may be in addition to the nodefile mismatch
> problem you had.
> OpenMPI requires both PATH and LD_LIBRARY_PATH to be set on all hosts
> where the parallel program runs:
> 
> http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
> 
> If your home directory is NFS mounted on your cluster,
> the easy way to do it is to set both in your .bashrc/.cshrc file.
> 
> I hope this helps.
> Gus Correa
> 
> 
> On Mar 19, 2011, at 8:15 PM, Svancara, Randall wrote:
> 
> > Hi
> >
> > I did figure out the issue, or at least I am on the path to a solution.
> >
> > I was assuming that when I submit a job via torque with the PBS parameter: #PBS -l nodes=12:ppn=12 that the PBS_NODEFILE parameter would have the correctly formatted hosts file for openmpi.
> >
> > What I am seeing is that torque will generate a hosts file that looks like this:
> >
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > ....
> >
> >
> > But from what I can see, openmpi expects a hostfile list like this:
> >
> > node164 slots=12
> > node163 slots=12
> >
> > So what I had to do in my script is add the following code:
> >
> > np=$(cat $PBS_NODEFILE | wc -l)
> >
> > for i in `cat ${PBS_NODEFILE}|sort -u`; do
> >   echo $i slots=12 >> /home/admins/rsvancara/nodes
> > done
> >
> > /usr/mpi/intel/openmpi-1.4.3/bin/mpiexec $RUN $MCA -np $np -hostfile /home/admins/rsvancara/nodes /home/admins/rsvancara/TEST/mpitest
> >
> > I guess I was expecting openmpi to do the right thing but apparently torque and openmpi are not on the same page in terms of formatting for a hosts file.  I am using version 2.4.7 of torque.  Would newer versions of torque correctly generate a hosts file?
> >
> > The strange thing is why openmpi would just tell me it may be an LD_LIBRARY_PATH problem; that seems rather vague.  A better response would be "What the .... am I supposed to do with this hosts file, you idiot? Please format it correctly".
> >
> > Best regards
> >
> > Randall
> >
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org on behalf of Shenglong Wang
> > Sent: Fri 3/18/2011 8:45 PM
> > To: Torque Users Mailing List
> > Subject: Re: [torqueusers] Torque environment problem
> >
> > Have you set LD_LIBRARY_PATH in your ~/.bashrc file? Did you try to include LD_LIBRARY_PATH to mpirun or mpiexec?
> >
> > np=$(cat $PBS_NODEFILE | wc -l)
> >
> > mpiexec -np $np -hostfile $PBS_NODEFILE env LD_LIBRARY_PATH=$LD_LIBRARY_PATH XXXX
> >
> > Best,
> >
> > Shenglong
> >
> >
> >
> >
> > On Mar 18, 2011, at 11:36 PM, Svancara, Randall wrote:
> >
> > > I just wanted to add that if I launch a job on one node, everything works fine.  For example in my job script if I specify
> > >
> > >
> > > #PBS -l nodes=1:ppn=12
> > >
> > > Then everything runs fine.
> > >
> > >
> > > However, if I specify two nodes, then everything fails.
> > >
> > >
> > > #PBS -l nodes=2:ppn=12
> > >
> > > This also fails
> > >
> > >
> > > #PBS -l nodes=13
> > >
> > > But this does not:
> > >
> > >
> > > #PBS -l nodes=12
> > >
> > > Thanks,
> > >
> > > Randall
> > >
> > > -----Original Message-----
> > > From: torqueusers-bounces at supercluster.org on behalf of Svancara, Randall
> > > Sent: Fri 3/18/2011 7:48 PM
> > > To: torqueusers at supercluster.org
> > > Subject: [torqueusers] Torque environment problem
> > >
> > >
> > > Hi,
> > >
> > > We are in the process of setting up a new cluster.   One issue I am experiencing is with openmpi jobs launched through torque.
> > >
> > > When I launch a simple job using a very basic mpi "Hello World" script I am seeing the following errors from openmpi:
> > >
> > > **************************
> > >
> > > [node164:06689] plm:tm: failed to poll for a spawned daemon, return status = 17002
> > > --------------------------------------------------------------------------
> > > A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> > > launch so we are aborting.
> > >
> > > There may be more information reported by the environment (see above).
> > >
> > > This may be because the daemon was unable to find all the needed shared
> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > > location of the shared libraries on the remote nodes and this will
> > > automatically be forwarded to the remote nodes.
> > > --------------------------------------------------------------------------
> > > --------------------------------------------------------------------------
> > > mpirun noticed that the job aborted, but has no info as to the process
> > > that caused that situation.
> > > --------------------------------------------------------------------------
> > > --------------------------------------------------------------------------
> > > mpirun was unable to cleanly terminate the daemons on the nodes shown
> > > below. Additional manual cleanup may be required - please refer to
> > > the "orte-clean" tool for assistance.
> > > --------------------------------------------------------------------------
> > >         node163 - daemon did not report back when launched
> > > Completed executing:
> > >
> > > *************************
> > >
> > > However, when I launch a job with mpirun and an explicit hostfile, everything seems to work fine using the following command:
> > >
> > > /usr/mpi/intel/openmpi-1.4.3/bin/mpirun -hostfile /home/admins/rsvancara/hosts -n 24 /home/admins/rsvancara/TEST/mpitest
> > >
> > > The job runs with 24 processes, 12 processes per node.
> > >
> > > I have verified that my .bashrc is working.  I have tried to launch from an interactive job using qsub -I -l nodes=12:ppn=12 without any success.  I am assuming this is an environment problem; however, I am unsure, as the openmpi error only says it "may" be one.
> > >
> > > My question is:
> > >
> > > 1.  Has anyone had this problem before? (I am sure they have.)
> > > 2.  How would I go about troubleshooting this problem?
> > >
> > >
> > > I am using torque version 2.4.7.
> > >
> > > Thanks for any assistance anyone can provide.
> > >
> > >
> >
> >
> >
> 
> 
> 


