[torqueusers] Torque environment problem

Svancara, Randall rsvancara at wsu.edu
Sun Mar 20 12:32:58 MDT 2011


Hi,

I have recompiled openmpi-1.4.3 with tm support.  

I have confirmed that it is available via:

[rsvancara at node1 ~]$ ompi_info |grep plm
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA plm: slurm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)

[rsvancara at node164 ~]$ ompi_info |grep plm
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA plm: slurm (MCA v2.0, API v2.0, Component v1.4.3)
                 MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)


When I launch jobs using openmpi, I have to use:

-mca plm rsh

If I set this to

-mca plm tm

then no remote processes are launched at all.  I do not mind using rsh; however, I would prefer that Torque "do the right thing" and just work.  I am using Torque version 2.4.7.

Is this a torque/openmpi compatibility issue?  Or is this how torque is supposed to work with openmpi?  I thought torque would launch the remote processes and clean them up after.
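
For reference, a sketch of the two invocations (other arguments as in my job script); adding plm verbosity to the tm case might show where the launch stalls:

# this works:
mpirun -mca plm rsh -np $np /home/admins/rsvancara/TEST/mpitest

# this launches no remote processes; verbosity may show what the tm launcher is doing:
mpirun -mca plm tm -mca plm_base_verbose 10 -np $np /home/admins/rsvancara/TEST/mpitest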

I would appreciate any suggestions.  








-----Original Message-----
From: torqueusers-bounces at supercluster.org on behalf of Gustavo Correa
Sent: Sat 3/19/2011 5:48 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Torque environment problem
 
Hi Randall

If you build OpenMPI with Torque support, mpiexec will use the
nodes and processors provided by Torque, and you don't 
need to provide any hostfile whatsoever.
We've been using OpenMPI with Torque support for quite a while.

To do so, you need to configure OpenMPI this way:

./configure --prefix=/directory/to/install/openmpi --with-tm=/directory/where/you/installed/torque

See the OpenMPI FAQ about this:
http://www.open-mpi.org/faq/?category=building#build-rte-tm
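
With tm support built in, a job script can then be as simple as this sketch (the program name is only a placeholder):

#!/bin/bash
#PBS -l nodes=2:ppn=12
cd $PBS_O_WORKDIR
# no -np and no -hostfile needed: mpiexec takes the node/slot list from Torque via tm
mpiexec ./my_mpi_program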

Still, although your script to restore the "np=$NPROC" syntax is very clever,
I guess you could use $PBS_NODEFILE directly as your hostfile
when OpenMPI is not built with Torque support.

The issue with LD_LIBRARY_PATH may be in addition to the nodefile mismatch
problem you had.
OpenMPI requires both PATH and LD_LIBRARY_PATH to be set on all hosts
where the parallel program runs:

http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path

If your home directory is NFS mounted on your cluster,
the easy way to do it is to set both in your .bashrc/.cshrc file.
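
Something along these lines in your .bashrc, assuming the install prefix from your mpiexec path, should do it:

export PATH=/usr/mpi/intel/openmpi-1.4.3/bin:$PATH
export LD_LIBRARY_PATH=/usr/mpi/intel/openmpi-1.4.3/lib:$LD_LIBRARY_PATH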

I hope this helps.
Gus Correa


On Mar 19, 2011, at 8:15 PM, Svancara, Randall wrote:

> Hi
> 
> I did figure out the issue, or at least I am on the path to a solution.
> 
> I was assuming that when I submit a job via Torque with the PBS directive #PBS -l nodes=12:ppn=12, the file pointed to by $PBS_NODEFILE would be a correctly formatted hosts file for OpenMPI.
> 
> What I am seeing is that torque will generate a hosts file that looks like this:
> 
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node164
> node163
> node163
> node163
> node163
> node163
> node163
> node163
> node163
> node163
> node163
> node163
> node163
> ....
> 
> 
> But from what I can see, OpenMPI expects a hostfile that looks like this:
> 
> node164 slots=12
> node163 slots=12
> 
> So what I had to do in my script was to add the following code:
> 
> np=$(cat $PBS_NODEFILE | wc -l)
> 
> # rebuild the hostfile in the "host slots=N" form, one line per unique node
> : > /home/admins/rsvancara/nodes
> for i in `cat ${PBS_NODEFILE} | sort -u`; do
>   echo $i slots=12 >> /home/admins/rsvancara/nodes
> done
> 
> /usr/mpi/intel/openmpi-1.4.3/bin/mpiexec $RUN $MCA -np $np -hostfile /home/admins/rsvancara/nodes /home/admins/rsvancara/TEST/mpitest
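> 
> A variant of that loop, which derives the slot count from the nodefile itself instead of hard-coding 12, might be:
> 
> sort $PBS_NODEFILE | uniq -c | awk '{print $2 " slots=" $1}' > /home/admins/rsvancara/nodes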
> 
> I guess I was expecting OpenMPI to do the right thing, but apparently Torque and OpenMPI are not on the same page about the hostfile format.  I am using version 2.4.7 of Torque.  Would newer versions of Torque generate a correctly formatted hosts file?
> 
> The strange thing is why OpenMPI would only tell me that it may be an LD_LIBRARY_PATH problem; that seems rather vague.  A better response would be "What the .... am I supposed to do with this hosts file, please format it correctly".
> 
> Best regards
> 
> Randall
> 
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org on behalf of Shenglong Wang
> Sent: Fri 3/18/2011 8:45 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] Torque environment problem
> 
> Have you set LD_LIBRARY_PATH in your ~/.bashrc file?  Did you try passing LD_LIBRARY_PATH to mpirun or mpiexec, like this?
> 
> np=$(cat $PBS_NODEFILE | wc -l)
> 
> mpiexec -np $np -hostfile $PBS_NODEFILE env LD_LIBRARY_PATH=$LD_LIBRARY_PATH XXXX
> 
> Best,
> 
> Shenglong
> 
> 
> 
> 
> On Mar 18, 2011, at 11:36 PM, Svancara, Randall wrote:
> 
> > I just wanted to add that if I launch a job on one node, everything works fine.  For example, in my job script, if I specify
> >
> >
> > #PBS -l nodes=1:ppn=12
> >
> > Then everything runs fine.
> >
> >
> > However, if I specify two nodes, then everything fails.
> >
> >
> > #PBS -l nodes=2:ppn=12
> >
> > This also fails:
> >
> >
> > #PBS -l nodes=13
> >
> > But this does not:
> >
> >
> > #PBS -l nodes=12
> >
> > Thanks,
> >
> > Randall
> >
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org on behalf of Svancara, Randall
> > Sent: Fri 3/18/2011 7:48 PM
> > To: torqueusers at supercluster.org
> > Subject: [torqueusers] Torque environment problem
> >
> >
> > Hi,
> >
> > We are in the process of setting up a new cluster.   One issue I am experiencing is with openmpi jobs launched through torque.
> >
> > When I launch a simple job using a very basic mpi "Hello World" script I am seeing the following errors from openmpi:
> >
> > **************************
> >
> > [node164:06689] plm:tm: failed to poll for a spawned daemon, return status = 17002
> > --------------------------------------------------------------------------
> > A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> > launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun was unable to cleanly terminate the daemons on the nodes shown
> > below. Additional manual cleanup may be required - please refer to
> > the "orte-clean" tool for assistance.
> > --------------------------------------------------------------------------
> >         node163 - daemon did not report back when launched
> > Completed executing:
> >
> > *************************
> >
> > However, when I launch the job with mpirun directly, using the following command, everything seems to work fine:
> >
> > /usr/mpi/intel/openmpi-1.4.3/bin/mpirun -hostfile /home/admins/rsvancara/hosts -n 24 /home/admins/rsvancara/TEST/mpitest
> >
> > The job runs 24 processes, with 12 processes per node.
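> >
> > The hosts file passed there just lists each node with its slot count, along these lines:
> >
> > node163 slots=12
> > node164 slots=12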
> >
> > I have verified that my .bashrc is working.  I have tried to launch from an interactive job using qsub -I -l nodes=12:ppn=12, without any success.  I am assuming this is an environment problem; however, I am unsure, since the OpenMPI error only says that it "may" be the cause.
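> >
> > One check I have in mind (just a sketch, assuming the same install prefix on every node) is whether the remote node can resolve the OpenMPI daemon's shared libraries at all:
> >
> > ssh node163 'ldd /usr/mpi/intel/openmpi-1.4.3/bin/orted | grep "not found"'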
> >
> > My questions are:
> >
> > 1.  Has anyone had this problem before?  (I am sure someone has.)
> > 2.  How would I go about troubleshooting it?
> >
> >
> > I am using torque version 2.4.7.
> >
> > Thanks for any assistance anyone can provide.
> >
> >

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
