[torqueusers] Torque with OpenMPI

Jozef Káčer quickparser at gmail.com
Thu Feb 21 08:51:35 MST 2008


The code is a standard example I found somewhere. Here it is:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
  char idstr[MPI_MAX_PROCESSOR_NAME + 64];
  char buff[BUFSIZE];
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int namelen;
  int numprocs;
  int myid;
  int i;
  MPI_Status stat;

  /* all MPI programs start with MPI_Init; all 'N' processes exist thereafter */
  MPI_Init(&argc, &argv);
  /* find out how big the SPMD world is */
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  /* ...and this process's rank within it */
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Get_processor_name(processor_name, &namelen);

  /* At this point all the processes are running equivalently; the rank is
     used to distinguish the roles of the programs in the SPMD model, with
     rank 0 often used specially... */
  if (myid == 0)
  {
    printf("%d(%s): We have %d processors\n", myid, processor_name, numprocs);
    for (i = 1; i < numprocs; i++)
    {
      sprintf(buff, "Hello %d! ", i);
      MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
    }
    for (i = 1; i < numprocs; i++)
    {
      MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
      printf("%d(%s): %s\n", myid, processor_name, buff);
    }
  }
  else
  {
    /* receive the greeting from rank 0: */
    MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
    sprintf(idstr, "Processor %d (%s) reporting for duty\n", myid,
            processor_name);
    /* append the reply without overrunning buff: */
    strncat(buff, idstr, BUFSIZE - strlen(buff) - 1);
    /* send it back to rank 0: */
    MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
  }

  /* MPI programs end with MPI_Finalize; this is a weak synchronization point */
  MPI_Finalize();
  return 0;
}
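
For what it's worth, one thing I still want to try is switching MPI_COMM_WORLD
to MPI_ERRORS_RETURN and checking return codes, so a failing call prints an
error string instead of just taking the whole job down. A rough sketch of that
idea (the check() helper is just something I made up for illustration, it is
not part of the example above):

#include <mpi.h>
#include <stdio.h>

/* Print a readable error message and abort if an MPI call failed. */
static void check(int rc, const char *what)
{
  if (rc != MPI_SUCCESS) {
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(rc, msg, &len);
    fprintf(stderr, "%s failed: %s\n", what, msg);
    MPI_Abort(MPI_COMM_WORLD, rc);
  }
}

int main(int argc, char *argv[])
{
  int myid = -1;

  MPI_Init(&argc, &argv);

  /* Have failed calls return an error code instead of killing the job,
     so check() gets a chance to say which call broke and why. */
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

  check(MPI_Comm_rank(MPI_COMM_WORLD, &myid), "MPI_Comm_rank");
  printf("rank %d is alive\n", myid);
  check(MPI_Barrier(MPI_COMM_WORLD), "MPI_Barrier");

  MPI_Finalize();
  return 0;
}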

It's really strange; I don't get it. The thing is, my colleague managed to get
it to work, and I followed the same manual (
http://debianclusters.cs.uni.edu/index.php/Using_a_Scheduler_and_Queue).
OpenMPI was configured with these options:

./configure --enable-languages=c,c++,fortran,objc,obj-c++,treelang
--prefix=/usr/local --enable-shared --with-system-zlib
--libexecdir=/usr/local/lib --without-included-gettext
--enable-threads=posix --enable-nls --enable-__cxa_atexit
--enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr
--enable-checking=release x86_64-linux-gnu --with-tm=/usr/local
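
Since --with-tm should make mpirun start the ranks through Torque's TM
interface, I was also thinking of having each rank print where it actually
runs, so I can compare that against the node list Torque hands out in
$PBS_NODEFILE. Just a quick sketch of the idea, nothing official:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  char name[MPI_MAX_PROCESSOR_NAME];
  int rank = -1, namelen = 0;
  const char *jobid;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(name, &namelen);

  /* PBS_JOBID is placed in the job environment by Torque; printing it
     together with the host each rank reports makes it easy to compare
     against the nodes Torque actually allocated. */
  jobid = getenv("PBS_JOBID");
  printf("rank %d running on %s (PBS_JOBID=%s)\n",
         rank, name, jobid ? jobid : "not set");

  MPI_Finalize();
  return 0;
}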

The thing is that Torque worked fine without OpenMPI. I use Maui as the
scheduler, and I have no problems with it either.
Maybe I should turn to the OpenMPI people for help.

If anybody knows of anything that could help, I'm listening.
Thank you.



On Thu, Feb 21, 2008 at 3:03 PM, Craig West <cwest at astro.umass.edu> wrote:

>
> Hi Jozef,
>
> Not sure that I can really help any more. It's not something I've seen.
> At a guess it could be related to your network, it could be the code you
> are running, or perhaps faulty RAM.
>
> The code is the first thing I am suspicious of. Does this code run
> correctly if you run it manually (without the queue) on the same nodes?
> I noticed that the last snippet you sent was for 8 processors, and again
> processors 1-7 were listed, and it appears that the first node in the
> list was the one that crashed. It looks like processor 0 could be the
> problem, hence my concern with the code. If you want to send me the
> source code I'll try it here and see what happens.
>
>
> The other thing I noticed is that your computers are not time
> synchronized. I would suggest setting up NTP; it's not a must, but it can
> make life easier, especially when tracking faults between computers and
> building code.
>
> Cheers,
> Craig.
>

