[torqueusers] problem with shared libraries

Garrick Staples garrick at usc.edu
Sun Feb 10 22:17:32 MST 2008


On Sun, Feb 10, 2008 at 03:30:06PM -0800, Jan alleged:
> Hi,
> 
>  I am in the (slow) process of setting up my first cluster. So far, I 
> have 2 machines with 8 CPUs each (running Ubuntu 7.10). One machine is 
> both the server and a node (node1); the other is a node only (node2). 
> pbsnodes -a reports both nodes as working. /home on node1 is mounted via 
> NFS on node2. When I look through the log files in 
> /var/spool/torque/*_logs/ I cannot find anything obviously wrong.
> 
> I compiled pbs, and installed it. I configured everything (setting the 
> server name etc. on both machines) following the online documentation.
> 
> Now I seem to have two problems:
> 1) if I submit a script such as:
> #PBS -l nodes=1:ppn=8
> #PBS -l walltime=96:00:00
> #PBS -j oe
> 
> # change the current working directory to the directory where
> # the executable file 'hello' can be found
> cd $PBS_O_WORKDIR
> echo $PBS_O_WORKDIR
> 
> # run the executable file 'hello' using the qmpirun script
> /usr/local/bin/mpirun -np 8 --prefix /usr/local ./fgs > ./test.log
> 
> everything works. The code runs on 8 CPUs and I get the expected results 
> from my code.
> 
> If I omit the "-np 8", the code only runs on one CPU. I did not expect 
> that behaviour, since I specified ppn=8 above.
> Any suggestions as to why ppn=8 does not work?

Which MPI implementation are you using?  ppn=8 only reserves the slots; it
does not launch anything.  Every MPI implementation still needs to know which
machines to use and how many processes to spawn.  Some, like Open MPI, have PBS
support (when built with it) and can get that information directly.  Others,
like MPICH, need to be told with '-machinefile $PBS_NODEFILE -np X'.  OSC has
an mpiexec that talks to PBS and can launch MPICH jobs.
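
A minimal sketch of the explicit form, assuming an mpirun that accepts
-machinefile (MPICH does; Open MPI treats it as a synonym for --hostfile).
$PBS_NODEFILE holds one line per allocated slot, so counting its lines gives a
process count that matches the nodes/ppn request.  count_slots is a
hypothetical helper name; the mpirun path and ./fgs are taken from the quoted
script.

```shell
#!/bin/sh
#PBS -l nodes=1:ppn=8
#PBS -l walltime=96:00:00
#PBS -j oe

# Count the slots Torque allocated: the node file has one line per slot.
# (count_slots is an illustrative helper, not part of Torque.)
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# PBS_NODEFILE is only set inside a batch job, so do nothing outside one.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    NP=$(count_slots "$PBS_NODEFILE")
    cd "$PBS_O_WORKDIR" || exit 1
    # Tell mpirun explicitly which hosts to use and how many ranks to start.
    /usr/local/bin/mpirun -machinefile "$PBS_NODEFILE" -np "$NP" ./fgs > ./test.log
fi
```

Deriving -np from the node file keeps the script in step with the #PBS -l line,
so changing ppn does not require editing the mpirun command as well.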


> 2) if I submit
> #PBS -l nodes=2:ppn=1
> #PBS -l walltime=96:00:00
> #PBS -j oe
> 
> # change the current working directory to the directory where
> # the executable file 'hello' can be found
> cd $PBS_O_WORKDIR
> echo $PBS_O_WORKDIR
> 
> # run the executable file 'hello' using the qmpirun script
> /usr/local/bin/mpirun -np 8 --prefix /usr/local ./fgs > ./test.log
> 
> qstat indicates that the job is running but the code is not being 
> executed. If I qdel the job, the error file indicates that
> a shared lib is missing:
> fgs: error while loading shared libraries: libimf.so: cannot open shared 
> object file: No such file or directory
> 
> I assume that this happens on node2. However, if I log into the node and 
> execute the job directly with mpirun, it runs as expected.

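Are you relying on the evil $LD_LIBRARY_PATH, set in your interactive shell but
not in the non-interactive batch shell?  libimf.so belongs to the Intel
compiler runtime, and the processes on node2 are started from a non-interactive
shell, so anything set only in your interactive startup files is missing there.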
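
One way to sidestep this, sketched under assumptions: set the library path in
the job script itself and ask mpirun to forward it.  The --prefix option in the
quoted script suggests Open MPI, whose mpirun has -x for exporting a variable
to the remote ranks; other MPIs handle this differently.  INTEL_LIB_DIR and its
default of /opt/intel/lib are hypothetical placeholders for wherever libimf.so
actually lives, and prepend_lib_dir is an illustrative helper.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=1
#PBS -l walltime=96:00:00
#PBS -j oe

# Hypothetical location; point this at the directory that holds libimf.so.
INTEL_LIB_DIR="${INTEL_LIB_DIR:-/opt/intel/lib}"

# Prepend a directory to LD_LIBRARY_PATH unless it is already listed.
prepend_lib_dir() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

prepend_lib_dir "$INTEL_LIB_DIR"

# PBS_O_WORKDIR is only set inside a batch job, so do nothing outside one.
if [ -n "${PBS_O_WORKDIR:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    # -np 2 matches nodes=2:ppn=1; -x forwards the variable (Open MPI only).
    /usr/local/bin/mpirun -np 2 --prefix /usr/local -x LD_LIBRARY_PATH ./fgs > ./test.log
fi
```

To confirm the diagnosis first, compare what a non-interactive shell sees on the
node, for example with: ssh node2 'echo $LD_LIBRARY_PATH'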
