[torqueusers] Torque environment problem
Svancara, Randall
rsvancara at wsu.edu
Sat Mar 19 00:59:17 MDT 2011
Hi,
Thanks for the quick reply.
Here is my LD_LIBRARY_PATH:
LD_LIBRARY_PATH=/usr/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
I am using modules, so I am not sure if that is causing me any issues:
. /home/software/Modules/default/init/bash
. /home/software/modulefiles/.defaultmodules
module add null intel/11.1.075 openmpi/1.4.3_intel
I tried putting this into my .bashrc as well:
LD_LIBRARY_PATH=/usr/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
It seems that when I launch jobs via qsub, .bashrc is read. But when an MPI job spans more than one node, it fails to find the correct environment variables. When I run my mpitest without qsub, I can run on more than one node. So I do not understand what the difference is between running MPI through torque/qsub and running it from the standard command line.
In addition, I did attempt Shenglong's suggestions, without any luck.
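One possible explanation for the difference, offered as an assumption rather than a diagnosis: the "plm:tm" prefix in the quoted error suggests that inside a job Open MPI starts its remote daemons through Torque's TM interface, whereas outside a job it typically starts them over ssh or rsh, so shell startup files such as .bashrc may not be read on the second node. A minimal sketch for checking what each node actually sees is below. The function names missing_lib_dirs and report_host are hypothetical helpers, not part of Torque or Open MPI.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting the library search path on one host.

# Print every directory in a colon-separated list ($1) that does not
# exist on the host where the function runs. Empty entries are skipped.
missing_lib_dirs() {
    # A subshell keeps the IFS and globbing changes from leaking out.
    (
        IFS=':'
        set -f
        for dir in $1; do
            [ -n "$dir" ] || continue
            [ -d "$dir" ] || printf '%s\n' "$dir"
        done
    )
}

# Print the host name, the LD_LIBRARY_PATH seen by this process, and any
# directories from that path that are missing on this host.
report_host() {
    printf '%s: LD_LIBRARY_PATH=%s\n' "$(uname -n)" "${LD_LIBRARY_PATH:-<unset>}"
    missing_lib_dirs "${LD_LIBRARY_PATH:-}" | sed 's/^/  missing: /'
}
```

Saved as a script with a final report_host line, this could be run once per allocated node from inside a job with Torque's "pbsdsh -u", which starts the command through the TM interface rather than a login shell. If the variable shows up as unset or with missing directories on the second node, that would point to the environment not being propagated.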
Thanks again
-----Original Message-----
From: torqueusers-bounces at supercluster.org on behalf of Shenglong Wang
Sent: Fri 3/18/2011 8:45 PM
To: Torque Users Mailing List
Subject: Re: [torqueusers] Torque environment problem
Have you set LD_LIBRARY_PATH in your ~/.bashrc file? Did you try passing LD_LIBRARY_PATH to mpirun or mpiexec?
# The node file has one line per allocated slot
np=$(wc -l < "$PBS_NODEFILE")
mpiexec -np "$np" -hostfile "$PBS_NODEFILE" env LD_LIBRARY_PATH="$LD_LIBRARY_PATH" XXXX
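A variant of the same idea, sketched with Open MPI's own mpirun options instead of "env": "-x LD_LIBRARY_PATH" exports the variable to the launched processes, and "--prefix" asks Open MPI to set PATH and LD_LIBRARY_PATH for its installation on the remote nodes. Whether either option resolves this particular failure is untested here. The helper names and the MPI_PREFIX_DIR variable are hypothetical; the prefix value is taken from the mpirun path quoted later in this thread. The sketch only prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Assumption: the Open MPI installation prefix, inferred from the
# /usr/mpi/intel/openmpi-1.4.3/bin/mpirun path used elsewhere in the thread.
MPI_PREFIX_DIR=/usr/mpi/intel/openmpi-1.4.3

# Number of slots: Torque writes one line per allocated slot.
count_slots() {
    grep -c . "$1"
}

# Number of distinct hosts in the node file.
count_hosts() {
    sort -u "$1" | grep -c .
}

# Print (do not run) an mpiexec command line for the given node file.
# $1 is the node file; the remaining arguments are the program and its args.
build_mpiexec_cmd() {
    nodefile=$1
    shift
    printf 'mpiexec -np %s -hostfile %s --prefix %s -x LD_LIBRARY_PATH %s\n' \
        "$(count_slots "$nodefile")" "$nodefile" "$MPI_PREFIX_DIR" "$*"
}
```

Inside a job script, build_mpiexec_cmd "$PBS_NODEFILE" followed by the program name would print the full command, and count_hosts "$PBS_NODEFILE" shows whether the job really spans more than one node.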
Best,
Shenglong
On Mar 18, 2011, at 11:36 PM, Svancara, Randall wrote:
> I just wanted to add that if I launch a job on one node, everything works fine. For example, if I specify the following in my job script:
>
>
> #PBS -l nodes=1:ppn=12
>
> Then everything runs fine.
>
>
> However, if I specify two nodes, then everything fails.
>
>
> #PBS -l nodes=2:ppn=12
>
> This also fails
>
>
> #PBS -l nodes=13
>
> But this does not:
>
>
> #PBS -l nodes=12
>
> Thanks,
>
> Randall
>
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org on behalf of Svancara, Randall
> Sent: Fri 3/18/2011 7:48 PM
> To: torqueusers at supercluster.org
> Subject: [torqueusers] Torque environment problem
>
>
> Hi,
>
> We are in the process of setting up a new cluster. One issue I am experiencing is with openmpi jobs launched through torque.
>
> When I launch a simple job using a very basic mpi "Hello World" script I am seeing the following errors from openmpi:
>
> **************************
>
> [node164:06689] plm:tm: failed to poll for a spawned daemon, return status = 17002
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> node163 - daemon did not report back when launched
> Completed executing:
>
> *************************
>
> However, when I launch a job by running mpiexec directly, everything seems to work fine using the following script:
>
> /usr/mpi/intel/openmpi-1.4.3/bin/mpirun -hostfile /home/admins/rsvancara/hosts -n 24 /home/admins/rsvancara/TEST/mpitest
>
> The job runs on 24 nodes with 12 processes per node.
>
> I have verified that my .bashrc is working. I have tried to launch from an interactive job using qsub -I -lnodes=12:ppn=12 without any success. I am assuming this is an environment problem; however, I am unsure, as the openmpi error only says that this "may" be the cause.
>
> My questions are:
>
> 1. Has anyone had this problem before? (I am sure they have.)
> 2. How would I go about troubleshooting this problem?
>
>
> I am using torque version 2.4.7.
>
> Thanks for any assistance anyone can provide.
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
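On the troubleshooting question above, one hedged starting point is to test the hint in the Open MPI message directly: check whether the daemon binary can resolve all of its shared libraries on each node. The sketch below filters "ldd" output for entries marked "not found". The function names are hypothetical, and the orted path in the usage note is an assumption based on the mpirun path in the original message.

```shell
#!/bin/sh
# Read ldd-style output on stdin and print the name of every shared
# library that the dynamic loader could not resolve.
unresolved_libs() {
    awk '/=> not found/ { print $1 }'
}

# Run ldd on a binary ($1) and list its unresolved libraries.
# Prints nothing when every library is found.
check_binary() {
    ldd "$1" | unresolved_libs
}
```

For example, running check_binary /usr/mpi/intel/openmpi-1.4.3/bin/orted on node163 and node164, both from a login shell and from inside a job, would show whether the Intel runtime libraries are visible in each case. An empty result in the login shell but a non-empty one inside the job would support the environment theory.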