[torqueusers] Torque environment problem

Svancara, Randall rsvancara at wsu.edu
Sat Mar 19 00:59:17 MDT 2011


Hi,

Thanks for the quick reply.  

Here is my LD_LIBRARY_PATH:

LD_LIBRARY_PATH=/usr/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib

I am using modules, so I am not sure if that is causing me any issues:

. /home/software/Modules/default/init/bash
. /home/software/modulefiles/.defaultmodules
module add null intel/11.1.075 openmpi/1.4.3_intel
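
On the login node I can check the result of the module loads with, for example:

module list
which mpirun
echo $LD_LIBRARY_PATH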

I tried putting this into my .bashrc as well:

LD_LIBRARY_PATH=/usr/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
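
Since a plain assignment in .bashrc is only visible to child processes once it is exported, I believe an explicit export is also needed (unless the module files already export it), i.e.:

export LD_LIBRARY_PATH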


It seems like .bashrc is read when I launch jobs via qsub, but when an MPI job spans more than one node, it fails to find the correct environment variables.  When I run my mpitest without qsub, it runs fine across more than one node, so I am not understanding what the difference is between running MPI through torque/qsub and from the standard command line.
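
For context, the job script I am submitting is essentially of this form (a trimmed-down sketch; mpitest is the hello-world binary and the module names are the ones above):

#!/bin/bash
#PBS -l nodes=2:ppn=12
#PBS -j oe

. /home/software/Modules/default/init/bash
module add null intel/11.1.075 openmpi/1.4.3_intel

cd $PBS_O_WORKDIR
np=$(wc -l < $PBS_NODEFILE)
/usr/mpi/intel/openmpi-1.4.3/bin/mpirun -np $np -hostfile $PBS_NODEFILE ./mpitest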

In addition, I attempted Shenglong's suggestions without any luck.
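
Also, is something like this a reasonable way to see what environment the second node actually gets from inside an interactive job (qsub -I)?  My understanding is that pbsdsh launches through the same TM interface that Open MPI's tm launcher uses, but I may be wrong about that:

pbsdsh -u /usr/bin/env | grep LD_LIBRARY_PATH
pbsdsh -u /usr/bin/ldd /usr/mpi/intel/openmpi-1.4.3/bin/orted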

Thanks again





-----Original Message-----
From: torqueusers-bounces at supercluster.org on behalf of Shenglong Wang
Sent: Fri 3/18/2011 8:45 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Torque environment problem
 
Have you set LD_LIBRARY_PATH in your ~/.bashrc file? Did you try passing LD_LIBRARY_PATH to mpirun or mpiexec?

np=$(cat $PBS_NODEFILE | wc -l)

mpiexec -np $np -hostfile $PBS_NODEFILE env LD_LIBRARY_PATH=$LD_LIBRARY_PATH XXXX
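
Open MPI's mpirun/mpiexec also has a -x option that exports a named environment variable from the local environment to the remote daemons, so this form may work as well:

mpiexec -np $np -hostfile $PBS_NODEFILE -x LD_LIBRARY_PATH XXXX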

Best,

Shenglong




On Mar 18, 2011, at 11:36 PM, Svancara, Randall wrote:

> I just wanted to add that if I launch a job on one node, everything works fine.  For example, in my job script, if I specify
> 
> 
> #PBS -l nodes=1:ppn=12
> 
> Then everything runs fine.
> 
> 
> However, if I specify two nodes, then everything fails:
> 
> 
> #PBS -l nodes=2:ppn=12
> 
> This also fails:
> 
> 
> #PBS -l nodes=13
> 
> But this does not:
> 
> 
> #PBS -l nodes=12
> 
> Thanks,
> 
> Randall
> 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org on behalf of Svancara, Randall
> Sent: Fri 3/18/2011 7:48 PM
> To: torqueusers at supercluster.org
> Subject: [torqueusers] Torque environment problem
> 
> 
> Hi,
> 
> We are in the process of setting up a new cluster.   One issue I am experiencing is with openmpi jobs launched through torque. 
> 
> When I launch a simple job using a very basic MPI "Hello World" program, I am seeing the following errors from OpenMPI:
> 
> **************************
> 
> [node164:06689] plm:tm: failed to poll for a spawned daemon, return status = 17002
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
>         node163 - daemon did not report back when launched
> Completed executing:
> 
> *************************
> 
> However, when I launch the job by hand with mpirun, everything seems to work fine using the following command:
> 
> /usr/mpi/intel/openmpi-1.4.3/bin/mpirun -hostfile /home/admins/rsvancara/hosts -n 24 /home/admins/rsvancara/TEST/mpitest
> 
> The job runs 24 processes across two nodes, 12 processes per node.
> 
> I have verified that my .bashrc is working.  I have tried launching from an interactive job using qsub -I -l nodes=12:ppn=12 without any success.  I am assuming this is an environment problem; however, I am unsure, since the OpenMPI error only says the missing libraries "may" be the cause.
> 
> My questions are:
> 
> 1.  Has anyone had this problem before? (I am sure they have.)
> 2.  How would I go about troubleshooting it?
> 
> 
> I am using torque version 2.4.7.
> 
> Thanks for any assistance anyone can provide.
> 
> 
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

