[torqueusers] Torque and OpenMPI

Gijsbert Wiesenekker gijsbert.wiesenekker at gmail.com
Sat Jan 17 16:41:36 MST 2009


Gijsbert Wiesenekker wrote:
> Brett Lee wrote:
>> Gijsbert, I believe Glen's suggestion answers your question. I'm
>> still learning myself, so at the risk of being wrong I'll direct you 
>> to some MPI/OpenMP examples I've pieced together:
>>
>> http://www.etpenguin.com/pub/Clustering/HPC/Development/UserHome/pbs/scripts/ 
>>
>> -Brett
>>
>> Glen Beane wrote:
>>> what does your torque script look like?  You should be specifying the
>>> number of nodes.
>>>
>>> e.g. something like this:
>>>
>>> #!/bin/bash
>>>
>>> #PBS -l nodes=2:ppn=4
>>>
>>> cd $PBS_O_WORKDIR
>>> mpiexec -n 8 ./a.out
>>>
>>> On Tue, Jan 13, 2009 at 6:30 AM, Gijsbert Wiesenekker
>>> <gijsbert.wiesenekker at gmail.com> wrote:
>>>> I have built a two-node Linux cluster, each node with a quad-core CPU,
>>>> running Fedora Core 10, Torque, and OpenMPI.
>>>> I have Torque and OpenMPI working on one node, such that when I start
>>>> a job with
>>>> mpiexec -n 4 a.out
>>>> it runs 4 copies of a.out on one node.
>>>> (BTW, I got the error:
>>>> [hostname:01936] [0,0,0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 173
>>>> [hostname:01936] pls:tm: failed to poll for a spawned proc, return status = 17002
>>>> [hostname:01936] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
>>>> [hostname:01936] mpiexec: spawn failed with errno=-11
>>>> After trying all kinds of combinations of server parameters, queue
>>>> parameters, and qsub parameters, it turned out that I had to add the
>>>> following line to my queue definition:
>>>> set queue long resources_default.nodes = 1)
>>>>
>>>> My question is how I can configure Torque such that when I start my
>>>> program with
>>>> mpiexec -n 8 a.out
>>>> it runs 4 copies of a.out on each node.
>>>>
>>>> Regards,
>>>> Gijsbert
>>>>
>>
> Hi,
>
> I have followed your suggestions, but I am now stuck with another 
> error. First a couple of other questions:
>
> It looks like I have to configure my batch queue with 
> resources_default.nodes = 2, otherwise my jobs will stay in the Q 
> state. Is that correct?
>
> Running interactively
> mpiexec -n 4 -host first-node a.out : -n 4 -host second-node a.out
> works fine after disabling iptables. Which ports does mpiexec require?
>
> When I submit the following job
>
> #PBS -l nodes=2:ppn=4
> mpiexec -n 8 a.out
>
> Nothing happens, and after terminating the batch job the error file 
> contains:
> PBS: exec of shell '/usr/sbin/pbs_demux' failed.
>
> A Google search suggested using -nostdin -nostdout, but the Fedora 
> Core mpiexec does not seem to support those options.
> Any ideas?
>
> Regards,
> Gijsbert
>
>
I found the cause: /usr/sbin/pbs_demux was not installed. I had installed 
only the torque-mom rpm on the other node, but /usr/sbin/pbs_demux is 
part of the torque-client rpm. It works now, which leads to my next 
question, although it would perhaps be more appropriate for an MPI 
forum. My two nodes are not symmetric: the first node has a high-speed 
SSD disk that is used during a non-parallel synchronization step. My 
code defines the process with rank 0 as the master and assumes it has 
access to the SSD disk. Is there a way to force MPI to start the process 
with rank 0 on a particular node, or do I have to query the hostname of 
each process at program startup and choose one of the processes running 
on the first node as the master?

Regards,
Gijsbert


