[torqueusers] Torque with OpenMPI
Jozef Káčer
quickparser at gmail.com
Thu Feb 21 08:51:35 MST 2008
The code is a standard example I found somewhere. Here it is:
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
    /* sized for the fixed reply text plus the longest possible host name */
    char idstr[MPI_MAX_PROCESSOR_NAME + 64];
    char buff[BUFSIZE];
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int namelen;
    int numprocs;
    int myid;
    int i;
    MPI_Status stat;

    /* all MPI programs start with MPI_Init; all 'N' processes exist thereafter */
    MPI_Init(&argc, &argv);
    /* find out how big the SPMD world is */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    /* and what this process's rank is */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    /* At this point, all the programs are running equivalently; the rank is
       used to distinguish the roles of the programs in the SPMD model, with
       rank 0 often used specially... */
    if (myid == 0)
    {
        printf("%d(%s): We have %d processors\n", myid, processor_name, numprocs);
        for (i = 1; i < numprocs; i++)
        {
            /* the format has a single %d, so only the rank is passed */
            snprintf(buff, sizeof buff, "Hello %d! ", i);
            MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
        }
        for (i = 1; i < numprocs; i++)
        {
            MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
            printf("%d(%s): %s\n", myid, processor_name, buff);
        }
    }
    else
    {
        /* receive from rank 0: */
        MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
        snprintf(idstr, sizeof idstr, "Processor %d (%s) reporting for duty\n",
                 myid, processor_name);
        /* append without writing past the end of buff */
        strncat(buff, idstr, sizeof buff - strlen(buff) - 1);
        /* send to rank 0: */
        MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
    }

    /* MPI programs end with MPI_Finalize; this is a weak synchronization point */
    MPI_Finalize();
    return 0;
}
It's really strange. I don't get it. The issue is that my colleague
managed to get it to work, and I followed the same manual (
http://debianclusters.cs.uni.edu/index.php/Using_a_Scheduler_and_Queue).
The openmpi was configured with these options:
./configure --enable-languages=c,c++,fortran,objc,obj-c++,treelang
--prefix=/usr/local --enable-shared --with-system-zlib
--libexecdir=/usr/local/lib --without-included-gettext
--enable-threads=posix --enable-nls --enable-__cxa_atexit
--enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr
--enable-checking=release x86_64-linux-gnu --with-tm=/usr/local
The thing is that Torque worked fine without Open MPI. I use Maui as a
scheduler, and I have no problems with it either.
Maybe I should turn to the Open MPI guys for help.
If anybody knows of anything that could help me, I'm listening.
Thank you.
On Thu, Feb 21, 2008 at 3:03 PM, Craig West <cwest at astro.umass.edu> wrote:
>
> Hi Jozef,
>
> Not sure that I can really help any more. It's not something I've seen.
> At a guess it could be related to your network, it could be the code you
> are running, perhaps faulty RAM?
>
> The code is the first thing I am suspicious of. Does this code run
> correctly if you manually run it (without the queue) on the same nodes?
> I noticed that the last snippet you sent was for 8 processors, and again
> processors 1-7 were listed, and it appears that the first node in the
> list was the one that crashed. It looks like processor 0 could be the
> problem, hence my concern with the code. If you want to send me the
> source code I'll try it here and see what happens.
>
>
> The other thing I noticed is that your computers are not time
> synchronized. I would suggest setting up ntp; it's not a must, but it can
> make life easier, especially when tracking faults between computers and
> building code.
>
> Cheers,
> Craig.
>