[torqueusers] problem with shared libraries

Jan jand at uvic.ca
Mon Feb 11 10:31:46 MST 2008


Hi,

 > You are using evil $LD_LIBRARY_PATH that shows up in your interactive shell but
 > not in the non-interactive batch shell?

Yes, thanks! I can't believe I did not realize that. Anyway, that works 
now.
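
One possible way to make the library path visible to the batch shell, as a
sketch only (the thread does not say which fix was actually applied): set
LD_LIBRARY_PATH inside the submit script and ask Open MPI's mpirun to forward
it to the remote ranks with -x. The helper name prepend_lib_dir and the
INTEL_LIB_DIR placeholder are assumptions for illustration; the real directory
holding libimf.so is not given here and has to be filled in.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=1
#PBS -j oe

# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH once,
# skipping it if it is already listed.
prepend_lib_dir() {
    case ":${LD_LIBRARY_PATH}:" in
        *":$1:"*) ;;   # already present, nothing to do
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# ASSUMPTION: placeholder path; replace with the directory that
# actually contains libimf.so on the compute nodes.
INTEL_LIB_DIR="${INTEL_LIB_DIR:-/path/to/intel/lib}"
prepend_lib_dir "$INTEL_LIB_DIR"

# Only launch when running under Torque, so the script can be
# sourced or dry-run outside a job without side effects.
if [ -n "$PBS_NODEFILE" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    # -x forwards the named environment variable to the launched processes
    /usr/local/bin/mpirun -np 2 -x LD_LIBRARY_PATH \
        --prefix /usr/local ./fgs > ./test.log
fi
```

Setting the variable in the script rather than in an interactive shell startup
file keeps the job independent of which startup files the batch shell reads.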

 > Which MPI implementation?  All MPI implementations need to know which machines
 > to use and how many processes to spawn.  Some, like openmpi, have PBS support
 > and can get that information directly.  Others, like MPICH, need to be told
 > with '-machinefile $PBS_NODEFILE -np X'.  OSC has an mpiexec that talks to PBS
 > and can launch MPICH jobs.

I am using openmpi 1.2.3 and torque 2.2.1. Now, a submit script containing
this:
#PBS -l nodes=2:ppn=8
/usr/local/bin/mpirun --prefix /usr/local ./fgs > ./test.log

starts the job correctly on both nodes, using a total of 16 CPUs.

However,
#PBS -l nodes=1:ppn=8
/usr/local/bin/mpirun --prefix /usr/local ./fgs > ./test.log

starts the job on the first node on only one CPU. Any ideas?
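
A possible way to narrow this down, sketched under assumptions: print what
Torque actually assigned by inspecting $PBS_NODEFILE (one line per allocated
slot), then pass that count to mpirun explicitly, since an explicit "-np 8"
was already reported to work. The helper name count_slots is hypothetical and
not part of Torque or Open MPI.

```shell
#!/bin/sh
#PBS -l nodes=1:ppn=8
#PBS -j oe

# Hypothetical helper: count the non-empty lines of a node file.
# Torque writes one hostname per allocated processor slot.
count_slots() {
    grep -c . "$1"
}

# Only act when running inside a Torque job with a readable node file.
if [ -n "$PBS_NODEFILE" ] && [ -r "$PBS_NODEFILE" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    NP=$(count_slots "$PBS_NODEFILE")
    echo "Torque assigned $NP slot(s):"
    sort "$PBS_NODEFILE" | uniq -c    # slots per host

    # Pass the slot count explicitly instead of relying on mpirun
    # to discover it from the batch system.
    /usr/local/bin/mpirun -np "$NP" --prefix /usr/local ./fgs > ./test.log
fi
```

If the printed count is 8 but mpirun without -np still starts a single
process, the allocation is fine and the question is how mpirun picks up the
slot count from Torque in the single-node case.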

Thank you very much, Jan



Garrick Staples wrote:
> On Sun, Feb 10, 2008 at 03:30:06PM -0800, Jan alleged:
>> Hi,
>>
>>  I am in the (slow) process of setting up my first cluster. So far, I 
>> have 2 machines with 8 cpus each (running ubuntu 7.10). One machine is a 
>> server and a node (node1) at once, the other one is a node (node2). 
>> pbsnodes -a reports both nodes as working. node1 /home is mounted via 
>> nfs onto node2 . When I look over the log files in 
>> /var/spool/torque/*_logs/ I cannot find anything obviously wrong.
>>
>> I compiled pbs, and installed it. I configured everything (setting the 
>> server name etc. on both machines) following the online documentation.
>>
>> Now I seem to have two problems:
>> 1) if I submit a script such as:
>> #PBS -l nodes=1:ppn=8
>> #PBS -l walltime=96:00:00
>> #PBS -j oe
>>
>> # change the current working directory to the directory where
>> # the executable file 'fgs' can be found
>> cd $PBS_O_WORKDIR
>> echo $PBS_O_WORKDIR
>>
>> # run the executable file 'fgs' with mpirun
>> /usr/local/bin/mpirun -np 8 --prefix /usr/local ./fgs > ./test.log
>>
>> everything works. The code runs on 8 CPUs and I get the expected results 
>> from my code.
>>
>> If I omit the "-np 8" the code only runs on one cpu. I did not expect 
>> that behaviour  since I specified ppn=8 above.
>> Any suggestions as to why ppn=8 does not work?
> 
> Which MPI implementation?  All MPI implementations need to know which machines
> to use and how many processes to spawn.  Some, like openmpi, have PBS support
> and can get that information directly.  Others, like MPICH, need to be told
> with '-machinefile $PBS_NODEFILE -np X'.  OSC has an mpiexec that talks to PBS
> and can launch MPICH jobs.
> 
> 
>> 2) if I submit
>> #PBS -l nodes=2:ppn=1
>> #PBS -l walltime=96:00:00
>> #PBS -j oe
>>
>> # change the current working directory to the directory where
>> # the executable file 'fgs' can be found
>> cd $PBS_O_WORKDIR
>> echo $PBS_O_WORKDIR
>>
>> # run the executable file 'fgs' with mpirun
>> /usr/local/bin/mpirun -np 8 --prefix /usr/local ./fgs > ./test.log
>>
>> qstat indicates that the job is running but the code is not being 
>> executed. If I qdel the job, the error file indicates that
>> a shared lib is missing:
>> fgs: error while loading shared libraries: libimf.so: cannot open shared 
>> object file: No such file or directory
>>
>> I assume that this happens on node2. However, if I log into the node and 
>> execute the job directly with mpirun, it runs as expected.
> 
> You are using evil $LD_LIBRARY_PATH that shows up in your interactive shell but
> not in the non-interactive batch shell?
> 
> 
> 

-- 
Jan Dettmer, Postdoctoral Fellow
School of Earth and Ocean Sciences, University of Victoria	
Victoria, BC V8W 3P6
office: (250) 472-4342	email: jand at uvic.ca
http://web.uvic.ca/~jand/

