[torqueusers] problem with shared libraries
Jan
jand at uvic.ca
Mon Feb 11 10:31:46 MST 2008
Hi,
> You are using evil $LD_LIBRARY_PATH that shows up in your interactive shell but
> not in the non-interactive batch shell?
Yes, thanks! I can't believe I did not realize that. Anyway, that works
now.
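
The thread does not record exactly how the fix was applied. One common approach is to set the library path inside the submit script itself, so the non-interactive batch shell sees it too. The following is only a sketch under stated assumptions: the directory in INTEL_LIB_DIR is a hypothetical placeholder for wherever libimf.so actually lives, and prepend_ld_path is a helper name made up for this example. The -x option asks Open MPI's mpirun to export the variable to the launched processes.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=1
#PBS -j oe

# ASSUMPTION: hypothetical location of the Intel runtime (libimf.so).
# Replace with the real install directory, which must exist on every node.
INTEL_LIB_DIR="${INTEL_LIB_DIR:-/opt/intel/lib}"

# Prepend a directory to LD_LIBRARY_PATH without leaving a stray ':'
# when the variable is empty (as it may be in a batch shell).
prepend_ld_path() {
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        LD_LIBRARY_PATH="$1:$LD_LIBRARY_PATH"
    else
        LD_LIBRARY_PATH="$1"
    fi
    export LD_LIBRARY_PATH
}

prepend_ld_path "$INTEL_LIB_DIR"

# Only launch when running inside a Torque job, where PBS_O_WORKDIR is set.
if [ -n "${PBS_O_WORKDIR:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    # -x exports the named variable to the processes mpirun starts.
    /usr/local/bin/mpirun -x LD_LIBRARY_PATH --prefix /usr/local ./fgs > ./test.log
fi
```

Adding "#PBS -V" to the script is another option: it passes the whole submission environment to the job, at the cost of making the job depend on whatever the interactive shell happened to contain.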
> Which MPI implementation? All MPI implementations need to know which machines
> to use and how many processes to spawn. Some, like openmpi, have PBS support
> and can get that information directly. Others, like MPICH, need to be told
> with '-machinefile $PBS_NODEFILE -np X'. OSC has an mpiexec that talks to PBS
> and can launch MPICH jobs.
I am using openmpi 1.2.3 and torque 2.2.1. Now, a submit script containing
this:
#PBS -l nodes=2:ppn=8
/usr/local/bin/mpirun --prefix /usr/local ./fgs > ./test.log
starts the job correctly on both nodes, using a total of 16 CPUs.
However,
#PBS -l nodes=1:ppn=8
/usr/local/bin/mpirun --prefix /usr/local ./fgs > ./test.log
starts the job on the first node on one CPU. Any ideas?
Thank you very much, Jan
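
For the single-node case above, one way to narrow things down is to look at what Torque actually allocated and pass the count to mpirun explicitly, which the earlier message reports as working with "-np 8". This is a diagnostic sketch, not a confirmed fix: it relies on $PBS_NODEFILE listing one hostname per allocated processor, and count_slots is a helper name invented for this example.

```shell
#!/bin/sh
#PBS -l nodes=1:ppn=8
#PBS -j oe

# Count the slots Torque granted. $PBS_NODEFILE holds one hostname per
# allocated processor, so its line count is the number of processes to start.
count_slots() {
    awk 'END { print NR }' "$1"
}

# Only act inside a Torque job; outside one, PBS_NODEFILE is unset.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    NP=$(count_slots "$PBS_NODEFILE")
    # Show the allocation per host, e.g. to confirm 8 slots on one node.
    echo "Torque allocated $NP slots:"
    sort "$PBS_NODEFILE" | uniq -c
    /usr/local/bin/mpirun -np "$NP" --prefix /usr/local ./fgs > ./test.log
fi
```

If the node file shows 8 entries but mpirun without -np still starts a single process, that suggests mpirun is not picking up the allocation from Torque. For MPI implementations without PBS support, the same count would be combined with '-machinefile $PBS_NODEFILE' as described in the quoted reply.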
Garrick Staples wrote:
> On Sun, Feb 10, 2008 at 03:30:06PM -0800, Jan alleged:
>> Hi,
>>
>> I am in the (slow) process of setting up my first cluster. So far, I
>> have 2 machines with 8 cpus each (running ubuntu 7.10). One machine is a
>> server and a node (node1) at once, the other one is a node (node2).
>> pbsnodes -a reports both nodes as working. node1 /home is mounted via
>> nfs onto node2 . When I look over the log files in
>> /var/spool/torque/*_logs/ I cannot find anything obviously wrong.
>>
>> I compiled pbs, and installed it. I configured everything (setting the
>> server name etc. on both machines) following the online documentation.
>>
>> Now I seem to have two problems:
>> 1) if I submit a script such as:
>> #PBS -l nodes=1:ppn=8
>> #PBS -l walltime=96:00:00
>> #PBS -j oe
>>
>> # change the current working directory to the directory the job
>> # was submitted from, where the executable 'fgs' can be found
>> cd $PBS_O_WORKDIR
>> echo $PBS_O_WORKDIR
>>
>> # run the executable 'fgs' on 8 processes using mpirun
>> /usr/local/bin/mpirun -np 8 --prefix /usr/local ./fgs > ./test.log
>>
>> everything works. The code runs on 8 CPUs and I get the expected results
>> from my code.
>>
>> If I omit the "-np 8" the code only runs on one cpu. I did not expect
>> that behaviour since I specified ppn=8 above.
>> Any suggestions as to why ppn=8 does not work?
>
> Which MPI implementation? All MPI implementations need to know which machines
> to use and how many processes to spawn. Some, like openmpi, have PBS support
> and can get that information directly. Others, like MPICH, need to be told
> with '-machinefile $PBS_NODEFILE -np X'. OSC has an mpiexec that talks to PBS
> and can launch MPICH jobs.
>
>
>> 2) if I submit
>> #PBS -l nodes=2:ppn=1
>> #PBS -l walltime=96:00:00
>> #PBS -j oe
>>
>> # change the current working directory to the directory the job
>> # was submitted from, where the executable 'fgs' can be found
>> cd $PBS_O_WORKDIR
>> echo $PBS_O_WORKDIR
>>
>> # run the executable 'fgs' on 8 processes using mpirun
>> /usr/local/bin/mpirun -np 8 --prefix /usr/local ./fgs > ./test.log
>>
>> qstat indicates that the job is running but the code is not being
>> executed. If I qdel the job, the error file indicates that
>> a shared lib is missing:
>> fgs: error while loading shared libraries: libimf.so: cannot open shared
>> object file: No such file or directory
>>
>> I assume that this happens on node2. However, if I log into the node and
>> execute the job directly with mpirun, it runs as expected.
>
> You are using evil $LD_LIBRARY_PATH that shows up in your interactive shell but
> not in the non-interactive batch shell?
--
Jan Dettmer, Postdoctoral Fellow
School of Earth and Ocean Sciences, University of Victoria
Victoria, BC V8W 3P6
office: (250) 472-4342 email: jand at uvic.ca
http://web.uvic.ca/~jand/