[torqueusers] Torque environment problem

Gustavo Correa gus at ldeo.columbia.edu
Mon Mar 21 00:28:38 MDT 2011


On Mar 21, 2011, at 1:55 AM, Svancara, Randall wrote:

> Thanks for the help.  Now that you mention the temporary directory stuff, I can see where that would be a problem.  There are prologue and epilogue scripts I could set up for the /tmp directories.  But I will do some more research. 
> 
> Thanks for all your help
> 
> 
Hi Randall

You're welcome.
A single word, 'stateless', in your long email made the whole difference.
These "Aha!" moments are the beauty of these community mailing lists.  :)

Good luck,
Gus Correa

> 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org on behalf of Gustavo Correa
> Sent: Sun 3/20/2011 10:45 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] Torque environment problem
> 
> Hi Randall
> 
> Our nodes are stateful, with physical disks.
> Hence, I don't have hands on experience running Torque and OpenMPI
> on stateless hosts.
> However, my recollection is that other people on this and other mailing lists
> (including the OpenMPI list) run both on stateless clusters.
> If you search the list archives you will find some postings about it.
> 
> In any case, please see these OpenMPI FAQ entries about the role of /tmp
> and how it comes into play on stateless clusters:
> 
> http://www.open-mpi.org/faq/?category=all#poor-sm-btl-performance
> http://www.open-mpi.org/faq/?category=all#network-vs-local
> 
> That may perhaps be the problem, in case the 'tm' plm has a larger footprint on
> /tmp than the ssh/rsh plm.
> 
> Maybe a Torque developer has a better insight?
> Does Torque also use /tmp under the hood?
> 
> Also, you may want to ask about this in the OpenMPI list, and make sure you
> highlight the fact that your cluster is stateless/diskless.
> The list is very active.  Worth trying:
> 
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> http://www.open-mpi.org/community/lists/users/
> 
> I hope this helps,
> Gus Correa
> 
> PS - Regarding Slurm, I thought you had it *running* along with Torque.
> 
> 
> On Mar 20, 2011, at 10:41 PM, Svancara, Randall wrote:
> 
> > Hi thanks for the reply.
> >
> > 1.  Yes we are using passwordless ssh.
> >
> > [rsvancara at node1 ~]$ ssh node1
> > Last login: Sun Mar 20 11:02:49 2011 from login1
> > [rsvancara at node1 ~]$ ssh node2
> > Last login: Sun Mar 20 00:42:06 2011 from mgt2.wsuhpc.edu
> > [rsvancara at node2 ~]$ ssh node3
> > Last login: Thu Mar 17 17:35:59 2011 from login1
> > [rsvancara at node3 ~]$
> >
> > 2.  I was thinking that torque launched the processes for openmpi.
> >
> > I am rebuilding openmpi with the following flags
> >
> > ./configure --prefix=/home/software/mpi/intel/openmpi-1.4.3 CC=icc CXX=icpc F77=ifort FC=ifort --without-slurm --with-tm=/usr/local --with-openib 2>&1 | tee configtee.log
> >
> > I have confirmed that slurm is not included.  However, it is a default option and that is why it shows up.  In terms of rsh, my understanding of openmpi is that it will try ssh first, then rsh, so the name 'rsh' is a bit misleading.
> >
> > Also I just wanted to point out that it says this in the OpenMPI FAQ:
> >
> > v1.3 series: The orte_rsh_agent MCA parameter accepts a colon-delimited list of programs to search for in your path to use as the remote startup agent (the MCA parameter name plm_rsh_agent also works, but it is deprecated). The default value is "ssh : rsh", meaning that it will look for ssh first, and if it doesn't find it, use rsh. You can change the value of this parameter as relevant to your environment, such as simply changing it to rsh or rsh : ssh if you have a mixture.
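As a hedged illustration of that FAQ passage, the snippet below only assembles and prints the command line so the flag syntax can be seen; nothing is launched, and the `./mpitest` binary and process count are placeholders:

```shell
# Build an mpirun invocation that pins the v1.3-series remote-startup
# agent; the colon-delimited list is searched left to right.
AGENT_LIST="ssh : rsh"
CMD="mpirun --mca orte_rsh_agent \"$AGENT_LIST\" -np 24 ./mpitest"
echo "$CMD"
```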
> >
> >
> > 3.  Here is the output of env
> >
> > [rsvancara at login1 ~]$ env
> > MODULE_VERSION_STACK=3.2.8
> > MANPATH=/usr/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
> > HOSTNAME=login1
> > TERM=xterm
> > SHELL=/bin/bash
> > HISTSIZE=1000
> > SSH_CLIENT=134.121.12.6 46870 22
> > SSH_TTY=/dev/pts/1
> > USER=rsvancara
> > LD_LIBRARY_PATH=/usr/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
> > LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
> > CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include
> > NLSPATH=/home/software/intel/Compiler/11.1/075/lib/intel64/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/ipp/em64t/lib/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/idb/intel64/locale/%l_%t/%N
> > MODULE_VERSION=3.2.8
> > MAIL=/var/spool/mail/rsvancara
> > PATH=/usr/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
> > INPUTRC=/etc/inputrc
> > PWD=/home/admins/rsvancara
> > _LMFILES_=/home/software/Modules/3.2.8/modulefiles/modules:/home/software/Modules/3.2.8/modulefiles/null:/home/software/modulefiles/intel/11.1.075:/home/software/modulefiles/openmpi/1.4.3_intel
> > LANG=en_US.UTF-8
> > MODULEPATH=/home/software/Modules/versions:/home/software/Modules/$MODULE_VERSION/modulefiles:/home/software/modulefiles
> > LOADEDMODULES=modules:null:intel/11.1.075:openmpi/1.4.3_intel
> > SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
> > SHLVL=1
> > HOME=/home/admins/rsvancara
> > INTEL_LICENSES=/home/software/intel/Compiler/11.1/075/licenses:/opt/intel/licenses
> > DYLD_LIBRARY_PATH=/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib
> > LOGNAME=rsvancara
> > SSH_CONNECTION=134.121.12.6 46870 134.121.141.14 22
> > MODULESHOME=/usr/mpi/intel/openmpi-1.4.3
> > LESSOPEN=|/usr/bin/lesspipe.sh %s
> > G_BROKEN_FILENAMES=1
> > module=() {  eval `/home/software/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
> > }
> > _=/bin/env
> >
> > 4.  All the nodes are stateless.  They use the same stateless image, so they are identical.  However, the MPI installation is located on the stateless image.  I could put it on a shared filesystem, which would probably be better anyway since it would keep the stateless images smaller.
> >
> > 5.  Home directories are provided via shared filesystem.  I have set up modules and .bashrc file like this
> >
> > . /home/software/Modules/default/init/bash
> > . /home/software/modulefiles/.defaultmodules
> > module add null intel/11.1.075 openmpi/1.4.3_intel
> >
> > 6. Yes, they are all working.  Just did a:
> >
> > pbsnodes -a |less
> >
> > (yes, it looks ugly)
> >
> > 7.  The minimal test fails.
> >
> > [node164:11193] plm:tm: failed to poll for a spawned daemon, return status = 17002
> > --------------------------------------------------------------------------
> > A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> > launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpiexec noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpiexec was unable to cleanly terminate the daemons on the nodes shown
> > below. Additional manual cleanup may be required - please refer to
> > the "orte-clean" tool for assistance.
> > --------------------------------------------------------------------------
> >         node163 - daemon did not report back when launched
> >         node159 - daemon did not report back when launched
> >         node158 - daemon did not report back when launched
> >         node157 - daemon did not report back when launched
> >         node156 - daemon did not report back when launched
> >         node155 - daemon did not report back when launched
> >         node154 - daemon did not report back when launched
> >         node152 - daemon did not report back when launched
> >         node151 - daemon did not report back when launched
> >         node150 - daemon did not report back when launched
> >         node149 - daemon did not report back when launched
> >
> >
> > But if I include -mca plm rsh then it runs just fine.
> >
> > 8.  I compiled openmpi with the option of "--without-slurm --with-tm=/usr/local" and ompi_info shows:
> >
> > [rsvancara at login1 ~]$ ompi_info |grep plm
> >                  MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.3)
> >                  MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)
> >
> > Is there any additional configuration I need to do with torque?  I tried running an interactive job like this:
> >
> > [rsvancara at login1 ~]$ qsub -I -lnodes=2:ppn=12
> > qsub: waiting for job 1663.mgt1.wsuhpc.edu to start
> > qsub: job 1663.mgt1.wsuhpc.edu ready
> >
> > [rsvancara at node164 ~]$ mpirun -mca plm tm TEST/mpitest
> > [node164:12783] plm:tm: failed to poll for a spawned daemon, return status = 17002
> > --------------------------------------------------------------------------
> > A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> > launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun was unable to cleanly terminate the daemons on the nodes shown
> > below. Additional manual cleanup may be required - please refer to
> > the "orte-clean" tool for assistance.
> > --------------------------------------------------------------------------
> >         node163 - daemon did not report back when launched
> >
> >
> > But when I do:
> >
> > [rsvancara at node164 ~]$ mpirun -mca plm rsh TEST/mpitest
> > Greetings from process 1!
> > Greetings from process 2!
> > Greetings from process 3!
> > Greetings from process 4!
> > Greetings from process 5!
> > Greetings from process 6!
> > Greetings from process 7!
> > Greetings from process 8!
> > Greetings from process 9!
> > Greetings from process 10!
> > Greetings from process 11!
> > Greetings from process 12!
> > Greetings from process 13!
> > Greetings from process 14!
> > Greetings from process 15!
> > Greetings from process 16!
> > Greetings from process 17!
> > Greetings from process 18!
> > Greetings from process 19!
> > Greetings from process 20!
> > Greetings from process 21!
> > Greetings from process 22!
> > Greetings from process 23!
> >
> > [rsvancara at node164 ~]$ mpirun -mca plm rsh hostname   
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node164
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> > node163
> >
> >
> > So would it be safe to assume that something is broken in openmpi 1.4.3?  The TM module is not able to figure out how to launch jobs remotely using TM?
> >
> > Thanks,
> >
> > Randall
> >
> >
> >
> >
> > -----Original Message-----
> > From: torqueusers-bounces at supercluster.org on behalf of Gustavo Correa
> > Sent: Sun 3/20/2011 5:06 PM
> > To: Torque Users Mailing List
> > Subject: Re: [torqueusers] Torque environment problem
> >
> > Hi Randall
> >
> > I don't think there is a Torque/OpenMPI compatibility problem.
> > We have Torque 2.4.11 and OpenMPI 1.4.3 on Linux x86_64.
> > They work hand in hand, no problem at all.
> > OpenMPI picks exactly the nodes and cores provided by Torque.
> > There is no need to concoct a hostfile.
> > In the past I used several combinations of previous (and newer) versions
> > of OpenMPI compiled with support for different versions of Torque with no
> > problem either (Torque 2.3.6 and 2.5.4; OpenMPI 1.2.8, 1.3.2, 1.3.3, 1.4.2).
> >
> > A bunch of guesses below.
> >
> > 1) Passwordless ssh
> >
> > For OpenMPI to work right you must set up passwordless ssh across the nodes.
> > Is this working?
> > Can you ssh without password across all pairs of nodes?
> >
> > 2) ssh
> >
> > Note 'ssh', not 'rsh'.
> > I don't know if Torque/OpenMPI integration works if you switch from ssh to rsh to connect
> > across nodes.
> >
> > 3) various MPI versions.
> >
> > Also, make sure you are not using another version of mpiexec inadvertently.
> > Perhaps one that doesn't have Torque support.
> > Compilers and some Linux distributions may come with these extras.
> > You could either use full path names to mpicc when you compile and to mpiexec
> > when you run the program, or make sure the OpenMPI 'bin' and 'lib' directories come
> > first in your PATH and LD_LIBRARY_PATH.
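A minimal sketch of that ordering, assuming a hypothetical install prefix (adjust it to your actual OpenMPI location):

```shell
# Prepend a specific OpenMPI install so its mpiexec and libraries win
# over any compiler- or distro-bundled MPI later in the search path.
OMPI_PREFIX=/usr/mpi/intel/openmpi-1.4.3
export PATH="$OMPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# If that install exists, 'command -v mpiexec' should now report
# $OMPI_PREFIX/bin/mpiexec.
```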
> >
> > 4) same OpenMPI on all nodes
> >
> > In addition, the very same OpenMPI installation must be available on all nodes.
> > Either installed exactly the same way on all nodes in local directories,
> > or with a single centralized installation shared by the nodes via NFS or equivalent.
> >
> > 5) same environment on all nodes
> >
> > Furthermore, your environment must be consistent (preferably the same) on all nodes.
> > If you have separate home directories on each node, you need to set your .bashrc/.cshrc
> > on all nodes to point to the OpenMPI bin and lib directories (local directories).
> > If home is exported by the head node and mounted on all nodes via NFS,
> > then a single .bashrc/.cshrc on the head node will suffice.
> >
> > 6) are the pbs_mom daemons working on all nodes?
> >
> > Another check is to look for the output of 'pbsnodes -a',
> > and see if all nodes report back.
> >
> > 7) minimalist test
> >
> > A minimalist test of OpenMPI with Torque is
> >
> > #PBS -q myqueue at myhost.mydomain
> > #PBS -l nodes=$NODES:ppn=$CORES
> > mpiexec -np $TOTAL_CORES hostname
> >
> > You don't need the '-mca plm tm' flag.
> >
> > It should list the names of all hosts as many times as the number of cores per node.
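The minimalist test above can be spelled out as a self-contained fragment. Queue and host names are omitted, and outside a Torque job a synthetic nodefile stands in for $PBS_NODEFILE so the core-counting logic can still be checked:

```shell
# Derive the total core count from the nodefile Torque provides
# (one line per allocated core), then launch 'hostname' on every slot.
NODEFILE="${PBS_NODEFILE:-/tmp/fake_nodefile_np}"
[ -f "$NODEFILE" ] || printf 'node164\nnode164\nnode163\nnode163\n' > "$NODEFILE"
TOTAL_CORES=$(grep -c . "$NODEFILE")
echo "total cores: $TOTAL_CORES"
# Inside a real job script you would follow with:
#   mpiexec -np "$TOTAL_CORES" hostname
```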
> >
> > 8) Torque and Slurm?
> >
> > Finally, are you running slurm along with Torque?
> > If you are, I wonder if this may cause a conflict.
> > Any two resource managers will most likely conflict with each other.
> >
> > I hope this helps,
> > Gus Correa
> >
> >
> > On Mar 20, 2011, at 2:32 PM, Svancara, Randall wrote:
> >
> > > Hi,
> > >
> > > I have recompiled openmpi-1.4.3 with tm support.
> > >
> > > I have confirmed that it is available via:
> > >
> > > [rsvancara at node1 ~]$ ompi_info |grep plm
> > >                  MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.3)
> > >                  MCA plm: slurm (MCA v2.0, API v2.0, Component v1.4.3)
> > >                  MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)
> > >
> > > [rsvancara at node164 ~]$ ompi_info |grep plm
> > >                  MCA plm: rsh (MCA v2.0, API v2.0, Component v1.4.3)
> > >                  MCA plm: slurm (MCA v2.0, API v2.0, Component v1.4.3)
> > >                  MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.3)
> > >
> > >
> > > When I launch jobs using openmpi, I have to use:
> > >
> > > -mca plm rsh
> > >
> > > If I set this to
> > >
> > > -mca plm tm
> > >
> > > Then no remote processes are launched.  I do not mind using rsh; however, I would prefer to have torque "do the right thing" and just work.  I am using torque version 2.4.7.
> > >
> > > Is this a torque/openmpi compatibility issue?  Or is this how torque is supposed to work with openmpi?  I thought torque would launch the remote processes and clean them up after.
> > >
> > > I would appreciate any suggestions.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: torqueusers-bounces at supercluster.org on behalf of Gustavo Correa
> > > Sent: Sat 3/19/2011 5:48 PM
> > > To: Torque Users Mailing List
> > > Subject: Re: [torqueusers] Torque environment problem
> > >
> > > Hi Randall
> > >
> > > If you build OpenMPI with Torque support, mpiexec will use the
> > > nodes and processors provided by Torque, and you don't
> > > need to provide any hostfile whatsoever.
> > > We've been using OpenMPI with Torque support for quite a while.
> > >
> > > To do so, you need to configure OpenMPI this way:
> > >
> > > ./configure --prefix=/directory/to/install/openmpi --with-tm=/directory/where/you/installed/torque
> > >
> > > See the OpenMPI FAQ about this:
> > > http://www.open-mpi.org/faq/?category=building#build-rte-tm
> > >
> > > Still, although your script to restore the "np=$NPROC" syntax is very clever,
> > > I guess you could use the $PBS_NODEFILE directly as your hostfile
> > > when OpenMPI is not built with Torque support.
> > >
> > > The issue with LD_LIBRARY_PATH may be in addition to the nodefile mismatch
> > > problem you had.
> > > OpenMPI requires both PATH and LD_LIBRARY_PATH to be set on all hosts
> > > where the parallel program runs:
> > >
> > > http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path
> > >
> > > If your home directory is NFS mounted on your cluster,
> > > the easy way to do it is to set both in your .bashrc/.cshrc file.
> > >
> > > I hope this helps.
> > > Gus Correa
> > >
> > >
> > > On Mar 19, 2011, at 8:15 PM, Svancara, Randall wrote:
> > >
> > > > Hi
> > > >
> > > > I did figure out the issue, or at least I am on the path to a solution.
> > > >
> > > > I was assuming that when I submit a job via torque with the PBS directive #PBS -l nodes=12:ppn=12, the PBS_NODEFILE variable would point to a correctly formatted hosts file for openmpi.
> > > >
> > > > What I am seeing is that torque will generate a hosts file that looks like this:
> > > >
> > > > node164
> > > > node164
> > > > node164
> > > > node164
> > > > node164
> > > > node164
> > > > node164
> > > > node164
> > > > node164
> > > > node164
> > > > node164
> > > > node164
> > > > node164
> > > > node163
> > > > node163
> > > > node163
> > > > node163
> > > > node163
> > > > node163
> > > > node163
> > > > node163
> > > > node163
> > > > node163
> > > > node163
> > > > node163
> > > > ....
> > > >
> > > >
> > > > But from what I can see, openmpi expects a hostfile list like this:
> > > >
> > > > node164 slots=12
> > > > node163 slots=12
> > > >
> > > > So what I had to do in my script is add the following code:
> > > >
> > > > np=$(cat $PBS_NODEFILE | wc -l)
> > > >
> > > > : > /home/admins/rsvancara/nodes
> > > > for i in `cat ${PBS_NODEFILE} | sort -u`; do
> > > >   echo $i slots=12 >> /home/admins/rsvancara/nodes
> > > > done
> > > >
> > > > /usr/mpi/intel/openmpi-1.4.3/bin/mpiexec $RUN $MCA -np $np -hostfile /home/admins/rsvancara/nodes /home/admins/rsvancara/TEST/mpitest
> > > >
> > > > I guess I was expecting openmpi to do the right thing, but apparently torque and openmpi are not on the same page in terms of hosts file formatting.  I am using torque version 2.4.7.  Would newer versions of torque generate a correctly formatted hosts file?
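For what it's worth, the per-host slot counts in a conversion like the one above can also be derived with `uniq -c` instead of hard-coding slots=12. This is a sketch with placeholder paths; outside a Torque job, a synthetic nodefile stands in for $PBS_NODEFILE:

```shell
# Collapse a one-line-per-core nodefile into OpenMPI's "host slots=N" form.
NODEFILE="${PBS_NODEFILE:-/tmp/fake_nodefile_slots}"
[ -f "$NODEFILE" ] || printf 'node164\nnode164\nnode164\nnode163\nnode163\n' > "$NODEFILE"
HOSTFILE=/tmp/ompi_hostfile
# uniq -c prints "count host"; awk swaps them into "host slots=count".
sort "$NODEFILE" | uniq -c | awk '{print $2 " slots=" $1}' > "$HOSTFILE"
cat "$HOSTFILE"
```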
> > > >
> > > > The strange thing is that openmpi just tells me it may be an LD_LIBRARY_PATH problem, which seems rather vague.  A better response would be "What the .... am I supposed to do with this hosts file you idiot, please format it correctly".
> > > >
> > > > Best regards
> > > >
> > > > Randall
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: torqueusers-bounces at supercluster.org on behalf of Shenglong Wang
> > > > Sent: Fri 3/18/2011 8:45 PM
> > > > To: Torque Users Mailing List
> > > > Subject: Re: [torqueusers] Torque environment problem
> > > >
> > > > Have you set LD_LIBRARY_PATH in your ~/.bashrc file? Did you try passing LD_LIBRARY_PATH to mpirun or mpiexec?
> > > >
> > > > np=$(cat $PBS_NODEFILE | wc -l)
> > > >
> > > > mpiexec -np $np -hostfile $PBS_NODEFILE env LD_LIBRARY_PATH=$LD_LIBRARY_PATH XXXX
> > > >
> > > > Best,
> > > >
> > > > Shenglong
> > > >
> > > >
> > > >
> > > >
> > > > On Mar 18, 2011, at 11:36 PM, Svancara, Randall wrote:
> > > >
> > > > > I just wanted to add that if I launch a job on one node, everything works fine.  For example in my job script if I specify
> > > > >
> > > > >
> > > > > #PBS -l nodes=1:ppn=12
> > > > >
> > > > > Then everything runs fine.
> > > > >
> > > > >
> > > > > However, if I specify two nodes, then everything fails.
> > > > >
> > > > >
> > > > > #PBS -l nodes=2:ppn=12
> > > > >
> > > > > This also fails
> > > > >
> > > > >
> > > > > #PBS -l nodes=13
> > > > >
> > > > > But this does not:
> > > > >
> > > > >
> > > > > #PBS -l nodes=12
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Randall
> > > > >
> > > > > -----Original Message-----
> > > > > From: torqueusers-bounces at supercluster.org on behalf of Svancara, Randall
> > > > > Sent: Fri 3/18/2011 7:48 PM
> > > > > To: torqueusers at supercluster.org
> > > > > Subject: [torqueusers] Torque environment problem
> > > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > We are in the process of setting up a new cluster.   One issue I am experiencing is with openmpi jobs launched through torque.
> > > > >
> > > > > When I launch a simple job using a very basic mpi "Hello World" script I am seeing the following errors from openmpi:
> > > > >
> > > > > **************************
> > > > >
> > > > > [node164:06689] plm:tm: failed to poll for a spawned daemon, return status = 17002
> > > > > --------------------------------------------------------------------------
> > > > > A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> > > > > launch so we are aborting.
> > > > >
> > > > > There may be more information reported by the environment (see above).
> > > > >
> > > > > This may be because the daemon was unable to find all the needed shared
> > > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > > > > location of the shared libraries on the remote nodes and this will
> > > > > automatically be forwarded to the remote nodes.
> > > > > --------------------------------------------------------------------------
> > > > > --------------------------------------------------------------------------
> > > > > mpirun noticed that the job aborted, but has no info as to the process
> > > > > that caused that situation.
> > > > > --------------------------------------------------------------------------
> > > > > --------------------------------------------------------------------------
> > > > > mpirun was unable to cleanly terminate the daemons on the nodes shown
> > > > > below. Additional manual cleanup may be required - please refer to
> > > > > the "orte-clean" tool for assistance.
> > > > > --------------------------------------------------------------------------
> > > > >         node163 - daemon did not report back when launched
> > > > > Completed executing:
> > > > >
> > > > > *************************
> > > > >
> > > > > However, when I launch a job with mpirun directly, everything seems to work fine using the following command:
> > > > >
> > > > > /usr/mpi/intel/openmpi-1.4.3/bin/mpirun -hostfile /home/admins/rsvancara/hosts -n 24 /home/admins/rsvancara/TEST/mpitest
> > > > >
> > > > > The job runs with 24 processes, 12 per node.
> > > > >
> > > > > I have verified that my .bashrc is working.  I have tried to launch from an interactive job using qsub -I -lnodes=12:ppn=12 without any success.  I am assuming this is an environment problem; however, I am unsure, since the openmpi error only says it "may" be one.
> > > > >
> > > > > My question is:
> > > > >
> > > > > 1.  Has anyone had this problem before?  (I am sure they have.)
> > > > > 2.  How would I go about troubleshooting this problem?
> > > > >
> > > > >
> > > > > I am using torque version 2.4.7.
> > > > >
> > > > > Thanks for any assistance anyone can provide.
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > torqueusers mailing list
> > > > > torqueusers at supercluster.org
> > > > > http://www.supercluster.org/mailman/listinfo/torqueusers
> > > >
> > > >
> > > >
> > >
> >
> 


