[torqueusers] Torque and OpenMPI

Glen Beane glen.beane at gmail.com
Tue Jan 13 14:08:08 MST 2009


On Tue, Jan 13, 2009 at 4:00 PM, Gijsbert Wiesenekker
<gijsbert.wiesenekker at gmail.com> wrote:
> Brett Lee wrote:
>>
>> Gijsbert,  I believe Glen's suggestion answers your question.  I'm still
>> learning myself, so at the risk of being wrong I'll direct you to some
>> MPI/OpenMP examples I've pieced together:
>>
>>
>> http://www.etpenguin.com/pub/Clustering/HPC/Development/UserHome/pbs/scripts/
>> -Brett
>>
>> Glen Beane wrote:
>>>
>>> What does your Torque script look like?  You should be specifying the
>>> number of nodes.
>>>
>>> e.g. something like this:
>>>
>>> #!/bin/bash
>>>
>>> #PBS -l nodes=2:ppn=4
>>>
>>> cd $PBS_O_WORKDIR
>>> mpiexec -n 8 ./a.out
>>>
>>> On Tue, Jan 13, 2009 at 6:30 AM, Gijsbert Wiesenekker
>>> <gijsbert.wiesenekker at gmail.com> wrote:
>>>>
>>>> I have built a two-node Linux cluster, each node with a quad-core CPU,
>>>> running Fedora Core 10, Torque and OpenMPI.
>>>> I have Torque and OpenMPI working on one node, such that when I start
>>>> a job with
>>>> mpiexec -n 4 a.out
>>>> it runs 4 copies of a.out on that node.
>>>> (BTW, I initially got the error:
>>>> [hostname:01936] [0,0,0] ORTE_ERROR_LOG: File open failure in file
>>>> ras_tm_module.c at line 173
>>>> [hostname:01936] pls:tm: failed to poll for a spawned proc, return
>>>> status = 17002
>>>> [hostname:01936] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c
>>>> at line 462
>>>> [hostname:01936] mpiexec: spawn failed with errno=-11
>>>> After trying all kinds of combinations of server, queue and qsub
>>>> parameters, it turned out that I had to add the following line to my
>>>> queue definition: set queue long resources_default.nodes = 1)
>>>>
>>>> My question is: how can I configure Torque such that when I start my
>>>> program with
>>>> mpiexec -n 8 a.out
>>>> it starts the job on both nodes, running 4 copies of a.out on each?
>>>>
>>>> Regards,
>>>> Gijsbert
>>>>
>>
> OK. So I don't have to change the Torque queue definitions? I was thinking
> that I had to change a queue definition in some way so that it was aware of
> the two nodes.
> Does anyone know how this works? I submit a request to the queue on the
> first node, then Torque checks $PBS_NODEFILE to start the request on the
> second node? In which queue is the request on the second node placed?


The queue does not have to know anything about the nodes, and the MOMs
(the pbs_mom daemons) don't need to know about the queue.  The queue
"belongs" to pbs_server: you submit the job to a queue on a pbs_server,
and pbs_server sends the job to the pbs_moms on the allocated nodes.
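
For what it's worth, a minimal qmgr definition for the "long" queue you
mentioned might look something like this (a sketch, not a requirement;
only the resources_default.nodes line came up in this thread):

# run as root on the pbs_server host
qmgr -c "create queue long queue_type=execution"
qmgr -c "set queue long resources_default.nodes = 1"
qmgr -c "set queue long enabled = true"
qmgr -c "set queue long started = true"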

If both nodes show up when you run pbsnodes -a, then you are all set.
You request two quad-core nodes for your job with -l nodes=2:ppn=4.
If you want one quad-core node it would be -l nodes=1:ppn=4, or you can
request a "partial" node if you like: qsub -l nodes=1:ppn=2, and so on.
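
As a sketch, a complete submission script for this cluster might look
like the following (the queue name is from your earlier message; the cat
of $PBS_NODEFILE is just there so you can see what TORQUE allocated):

#!/bin/bash
#PBS -l nodes=2:ppn=4
#PBS -q long

cd $PBS_O_WORKDIR
# $PBS_NODEFILE lists one line per allocated processor; with
# nodes=2:ppn=4 each hostname should appear four times
cat $PBS_NODEFILE
mpiexec -n 8 ./a.out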

And as someone mentioned, you should build OpenMPI with TM support and
then use its mpirun to launch the job.  Since it is TM-aware, you will
not have to pass -np to mpirun; it will get that info from TORQUE.
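
A rough sketch of the TM-enabled build, assuming TORQUE is installed
under /usr/local (point --with-tm at whatever prefix actually holds tm.h
and the TORQUE libraries on your machines):

# from the top of the OpenMPI source tree
./configure --with-tm=/usr/local
make
make install   # as root, or to a prefix you can write to

With that build, the job script above can call mpirun ./a.out with no
-np at all; OpenMPI takes the node list and process count from TORQUE.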

