[torquedev] problems on running mpiexec with Torque-2.1.1

chenyong cy163 at hotmail.com
Sat May 10 20:14:19 MDT 2008

Hello All,

This post concerns my questions on Torque; I am new to it. I have been writing parallel programs with MPI + C++ on an IBM BladeCenter cluster (running Linux) where Torque-2.1.1 is installed. The cluster consists of an admin node and 14 compute nodes.

To acquaint myself with Torque, I made a simple MPI + C++ program (it just creates files) that looks like the following:

    // #####  Test_PBS_Pgm.cpp  ######
    #include <mpi.h>
    #include <cstdio>
    #include <string>

    int main(int argc, char *argv[])
    {
        int myid, numprocs, namelen;
        char processor_name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Get_processor_name(processor_name, &namelen);

        // Create one file per process; the file name carries the rank.
        std::string InPut_FileName = "/home/yongchen/Temp___";
        char strMyID[10];
        sprintf(strMyID, "%d", myid);
        InPut_FileName += strMyID;
        FILE *fileSegedDoc = fopen(InPut_FileName.c_str(), "w");
        fclose(fileSegedDoc);

        MPI_Finalize();
        return 0;
    }
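To illustrate what the program leaves behind, here is a non-MPI sketch of its effect for three ranks (an assumption for illustration: files go in the current directory rather than /home/yongchen): rank N creates an empty file named Temp___N.

```shell
# Sketch only: mimic what three MPI ranks would do, without MPI.
for myid in 0 1 2; do
    touch "Temp___${myid}"   # rank N creates Temp___N
done
ls Temp___*                  # lists Temp___0 Temp___1 Temp___2
```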
The program runs successfully with the following command:

    mpiexec -n NUMBER_NODES ./Test_PBS_Pgm

Later, I tried to submit the job using Torque and, unfortunately, ran into some problems. I bring up the 14 compute nodes with the command:

    mpdboot -n 14

Then I use the command:

    mpdtrace

to check whether all 14 compute nodes are ready; 'mpdtrace' produces a list of the names of the running compute nodes. The list is shown below. Actually, the list is the same as the contents of the file mpd.hosts (mpd.hosts will be used in a script later):

    hpc01
    hpc02
    hpc03
    hpc04
    hpc05
    hpc06
    hpc07
    hpc08
    hpc09
    hpc10
    hpc11
    hpc13
    hpc14

The script for submitting the job is shown below.
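One quick way to see which hosts have dropped out of the mpd ring is to compare mpd.hosts against the live mpdtrace output. This is only a sketch: the live list is faked with sample data here (assuming mpd.hosts names hpc01 through hpc14); in practice it would come from `mpdtrace`.

```shell
# Hosts expected (mpd.hosts) but missing from the live ring.
printf 'hpc%02d\n' $(seq 1 14) > mpd.hosts    # full expected host list
printf 'hpc%02d\n' $(seq 1 11) > ring.live    # stand-in for `mpdtrace` output
grep -v -x -f ring.live mpd.hosts             # prints hpc12 hpc13 hpc14
```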
    // ######  PBS_Test.sh  ######
    #!/bin/bash
    #PBS -l nodes=3:ppn=1
    #PBS -V
    NCPUS=3
    PBS_O_WORKDIR=/home/yong/temp
    cd $PBS_O_WORKDIR
    myPROG='/home/yong/Test_PBS_Pgm'
    MPIRUN='/home/yong/mpich2-1.0.5p3/bin/mpiexec'
    $MPIRUN -np $NCPUS -machinefile ../mpd.hosts $myPROG >& out2

I submit the job with:

    qsub PBS_Test.sh

After submitting, I get a job number like YY.xcat1 (xcat1 is the name of the admin node). The three files Temp___0, Temp___1, and Temp___2 are created successfully. However, when I run 'mpdtrace' again, the node names hpc12, hpc13, and hpc14 are absent from the list compared to the previous one. If I submit the job again, i.e., execute 'qsub PBS_Test.sh' again, the job fails. This means that after executing the job, hpc12, hpc13, and hpc14 leave the compute-node community for an unknown reason. The following observation confirms this: in the file out2, I can see this error message:

    mpiexec-hpc14: cannot connect to local mpd (/tmp/mpd2.console.yong)
    possible causes:
      1. no mpd is running on this host
      2. an mpd is running but was started without a "console"

However, the command 'pbsnodes' shows all 14 nodes as 'free'. In addition, in the output file Test_PBS_Pgm.oYY, I can see that hpc12, hpc13, and hpc14 were used for this job, yet the names of the created files were Temp___0, Temp___1, and Temp___2.

Another problem concerns execution speed. The job runs very quickly when launched directly with 'mpiexec -n NUMBER_NODES ./Test_PBS_Pgm', but it takes about 10 seconds when I submit it using 'qsub'. I executed 'qstat' right after 'qsub PBS_Test.sh' and observed that the job stayed in the 'Q' state (i.e., waiting in the queue for execution) for several seconds, even though there were no other jobs in the queue at all.

These problems have had me stuck for a long time. Please help me; I would appreciate any help very much.

Felix
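For reference, Torque exports the hosts it actually allocated to a job in the file named by $PBS_NODEFILE. A common pattern is to derive the machinefile and process count from it instead of hard-coding NCPUS and pointing at ../mpd.hosts. This is only a sketch (the program path is carried over from the script above, and $PBS_NODEFILE is faked with a stand-in file so the snippet runs outside a Torque job):

```shell
# Sketch: build the machinefile and CPU count from the allocated nodes.
PBS_NODEFILE=nodes.demo                          # stand-in; Torque sets this
printf '%s\n' hpc01 hpc02 hpc03 > "$PBS_NODEFILE"  # stand-in allocation

NCPUS=$(grep -c '' "$PBS_NODEFILE")              # process count = allocated slots
echo "mpiexec -np $NCPUS -machinefile $PBS_NODEFILE ./Test_PBS_Pgm"
```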
