[torqueusers] Torque with OpenMPI
quickparser at gmail.com
Thu Feb 21 08:51:35 MST 2008
The code is a standard example I found somewhere. Here it is:
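#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
    /* The declarations, braces and loops are missing from the archived
       message; they are filled in here on the assumption that the program
       follows the usual MPI "hello world" layout. */
    char idstr[BUFSIZE], buff[BUFSIZE], processor_name[MPI_MAX_PROCESSOR_NAME];
    int numprocs, myid, i, namelen;
    MPI_Status stat;

    MPI_Init(&argc, &argv); /* all MPI programs start with MPI_Init; all 'N'
                               processes exist thereafter */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs); /* find out how big the SPMD
                                                 world is */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid); /* and what this process's rank is */
    MPI_Get_processor_name(processor_name, &namelen);

    /* At this point, all the programs are running equivalently; the rank is
       used to distinguish the roles of the programs in the SPMD model, with
       rank 0 handing out the greetings and collecting the replies. */
    if (myid == 0) {
        printf("%d(%s): We have %d processors\n", myid, processor_name, numprocs);
        for (i = 1; i < numprocs; i++) {
            sprintf(buff, "Hello %d! ", i);
            MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
        }
        for (i = 1; i < numprocs; i++) {
            MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
            printf("%d(%s): %s\n", myid, processor_name, buff);
        }
    } else {
        /* receive from rank 0: */
        MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
        snprintf(idstr, sizeof idstr, "Processor %d (%s) reporting for duty\n",
                 myid, processor_name);
        strncat(buff, idstr, BUFSIZE - strlen(buff) - 1);
        /* send to rank 0: */
        MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
    }
    MPI_Finalize(); /* MPI programs end with MPI_Finalize; this is a weak
                       synchronization point */
    return 0;
}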
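
As an aside, the reply string built by the non-zero ranks is the one place where this example could overrun a fixed-size buffer if a host name is long. Below is a minimal sketch of a bounded version, with no MPI calls so that it compiles on its own. The name compose_reply is a hypothetical helper, not part of MPI or of the program above, and it assumes buff already holds a NUL-terminated greeting shorter than BUFSIZE.

```c
#include <stdio.h>
#include <string.h>

#define BUFSIZE 128

/* Hypothetical helper (not an MPI call): appends the "reporting for duty"
   text to the greeting already stored in buff, never writing past BUFSIZE
   bytes. Assumes buff is NUL-terminated and strlen(buff) < BUFSIZE.
   Returns 0 on success, -1 if the reply had to be truncated. */
int compose_reply(char buff[BUFSIZE], int rank, const char *host)
{
    size_t used = strlen(buff);
    int n = snprintf(buff + used, BUFSIZE - used,
                     "Processor %d (%s) reporting for duty\n", rank, host);
    return (n < 0 || (size_t)n >= BUFSIZE - used) ? -1 : 0;
}
```

A non-zero rank would call compose_reply(buff, myid, processor_name) after its MPI_Recv and before its MPI_Send. A return value of -1 would mean the host name did not fit and the message was cut short.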
It's really strange, and I don't get it. The thing is that my colleague
managed to get it to work, and I followed the same manual (
Open MPI was configured with these options:
--prefix=/usr/local --enable-shared --with-system-zlib
--enable-threads=posix --enable-nls --enable-__cxa_atexit
--enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr
--enable-checking=release x86_64-linux-gnu --with-tm=/usr/local
Torque worked fine without Open MPI. I use Maui as the scheduler, and I
have no problems with it either.
Maybe I should turn to the Open MPI guys for help.
If anybody knows of anything that could help me, I'm listening.
On Thu, Feb 21, 2008 at 3:03 PM, Craig West <cwest at astro.umass.edu> wrote:
> Hi Jozef,
> Not sure that I can really help any more. It's not something I've seen.
> At a guess it could be related to your network, it could be the code you
> are running, perhaps faulty RAM?
> The code is the first thing I am suspicious of. Does this code run
> correctly if you manually run it (without the queue) on the same nodes?
> I noticed that the last snippet you sent was for 8 processors, and again
> processors 1-7 were listed, and it appears that the first node in the
> list was the one that crashed. It looks like processor 0 could be the
> problem, hence my concern with the code. If you want to send me the
> source code I'll try it here and see what happens.
> The other thing I noticed is that your computers are not time
> synchronized. I would suggest setting up ntp; it's not a must, but it can
> make life easier, especially when tracking faults between computers and
> building code.