[torqueusers] Torque and OpenMPI

Gijsbert Wiesenekker gijsbert.wiesenekker at gmail.com
Sat Jan 17 12:58:34 MST 2009


Brett Lee wrote:
> Gijsbert,  I believe Glen's suggestion answers your question.  Am 
> still learning myself, so at the risk of being wrong I'll direct you 
> to some MPI/OpenMP examples I've pieced together:
>
> http://www.etpenguin.com/pub/Clustering/HPC/Development/UserHome/pbs/scripts/ 
>
> -Brett
>
> Glen Beane wrote:
>> what does your torque script look like?  You should be specifying the
>> number of nodes.
>>
>> e.g. something like this:
>>
>> #!/bin/bash
>>
>> #PBS -l nodes=2:ppn=4
>>
>> cd $PBS_O_WORKDIR
>> mpiexec -n 8 ./a.out
>>
>> On Tue, Jan 13, 2009 at 6:30 AM, Gijsbert Wiesenekker
>> <gijsbert.wiesenekker at gmail.com> wrote:
>>> I have built a two-node Linux cluster with a Quad Core CPU each, running
>>> Fedora Core 10, Torque and OpenMPI.
>>> I have Torque and OpenMPI working on one node such that when I start a job
>>> with
>>> mpiexec -n 4 a.out
>>> it runs 4 copies of a.out on one node.
>>> (BTW, I got the error:
>>> [hostname:01936] [0,0,0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 173
>>> [hostname:01936] pls:tm: failed to poll for a spawned proc, return status = 17002
>>> [hostname:01936] [0,0,0] ORTE_ERROR_LOG: In errno in file rmgr_urm.c at line 462
>>> [hostname:01936] mpiexec: spawn failed with errno=-11
>>> After trying all kinds of combinations of server parameters, queue
>>> parameters and qsub parameters it turned out that I had to add the following
>>> line to my queue definition: set queue long resources_default.nodes = 1)
>>>
>>> My question is how I can configure Torque such that when I start my program
>>> with
>>> mpiexec -n 8 a.out
>>> it starts the job on each node, running 4 copies of a.out each.
>>>
>>> Regards,
>>> Gijsbert
>>>
>>> _______________________________________________
>>> torqueusers mailing list
>>> torqueusers at supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>>
>
Hi,

I have followed your suggestions, but I am now stuck with another error. 
First a couple of other questions:

It looks like I have to configure my batch queue with 
resources_default.nodes = 2; otherwise my jobs stay in the Q state. 
Is that correct?

Running interactively,
mpiexec -n 4 -host first-node a.out : -n 4 -host second-node a.out
works fine after disabling iptables. Which ports does mpiexec require?
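
As far as I understand, Open MPI picks its TCP ports dynamically by default, so instead of opening individual ports one option may be to accept all traffic from the other cluster node. A minimal sketch, assuming that approach is acceptable on a private cluster network: it only prints the iptables commands and does not run them. The function name print_peer_rules and the addresses 192.168.1.10 and 192.168.1.11 are hypothetical placeholders, not values from my setup.

```shell
#!/bin/bash
# Print (do not execute) iptables rules that accept all traffic from peer nodes.
# The addresses used below are hypothetical placeholders.
print_peer_rules() {
    local peer
    for peer in "$@"; do
        # -I INPUT inserts the rule at the top of the INPUT chain;
        # -s matches packets by source address.
        echo "iptables -I INPUT -s ${peer} -j ACCEPT"
    done
}

print_peer_rules 192.168.1.10 192.168.1.11
```

The printed lines could be reviewed and then run as root on each node, with the other node's address as the source.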

When I submit the following job

#PBS -l nodes=2:ppn=4
mpiexec -n 8 a.out

Nothing happens, and after terminating the batch job the error file 
contains:
PBS: exec of shell '/usr/sbin/pbs_demux' failed.
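
To see what Torque actually allocated, the job script above can be extended into a small diagnostic. This is a sketch, assuming Torque sets $PBS_NODEFILE to a file with one line per allocated processor slot (so nodes=2:ppn=4 should give 8 lines). The helper names count_slots and build_launch_cmd are my own, and the script only prints the mpiexec command rather than running it.

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=4

# Count the processor slots Torque allocated.
# Assumption: the nodefile holds one line per slot.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Build the launch command from the allocation instead of hard-coding -n 8.
build_launch_cmd() {
    local nodefile="$1" program="$2"
    echo "mpiexec -n $(count_slots "$nodefile") ${program}"
}

# Only act when running inside a Torque job (PBS_NODEFILE is set and exists).
if [ -n "${PBS_NODEFILE:-}" ] && [ -f "$PBS_NODEFILE" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    echo "Allocated slots per host:"
    sort "$PBS_NODEFILE" | uniq -c
    # Prints the command that would be launched; it is not executed here.
    build_launch_cmd "$PBS_NODEFILE" ./a.out
fi
```

If the output file lists only one host, the problem would be in the allocation rather than in mpiexec.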

A Google search suggested using -nostdin -nostdout, but the Fedora Core 
mpiexec does not seem to support those options.
Any ideas?

Regards,
Gijsbert


