[torqueusers] mpi libraries not being loaded with torque

Adams, Samuel D Contr AFRL/HEDR Samuel.Adams at BROOKS.AF.MIL
Mon Sep 10 15:25:07 MDT 2007


I am trying to make my new cluster flexible so that it can run with
more than one configuration at the same time.  For example, you can
choose gcc, pg, or Intel compilers using OpenMPI.  To start out with, I
am just using the gcc 4.1 that comes with RHEL5, plus OpenMPI.  For some
reason, the libraries are found or not found depending on how I run the
job.  Basically, it would seem that LD_LIBRARY_PATH is not set properly
under torque, although everything works interactively.

I have this set in the .bashrc file in the root of my home directory:

if [ `hostname | grep "prod"` ]; then
        PATH=/usr/local/profiles/gcc-openmpi/bin/:$PATH
        LD_LIBRARY_PATH=/usr/local/profiles/gcc-openmpi/lib/:$LD_LIBRARY_PATH
fi

So, theoretically, this should set PATH and LD_LIBRARY_PATH properly
whenever I open a shell.

First I tried to submit a job with torque using a script something like
this:
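
One thing worth ruling out is that the snippet above assigns the
variables without exporting them, so a child process only sees them if
they were already exported elsewhere.  A minimal sketch of an exporting
variant follows, assuming bash; the function name setup_mpi_profile is
made up for illustration, and whether a Torque batch shell reads .bashrc
at all depends on the local setup.

```shell
# Hypothetical helper for ~/.bashrc: prepend the gcc-openmpi profile to the
# search paths when the host name contains "prod".
setup_mpi_profile() {
    # Use the first argument as the host name, or ask hostname if none is given.
    local host="${1:-$(hostname)}"
    local prefix=/usr/local/profiles/gcc-openmpi
    case "$host" in
        *prod*)
            PATH="$prefix/bin:$PATH"
            # Append the old value only if it is non-empty, which avoids a
            # trailing ":" (an empty entry in the search path).
            LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
            # Export so that child processes inherit both variables.
            export PATH LD_LIBRARY_PATH
            ;;
    esac
}

setup_mpi_profile
```

Calling setup_mpi_profile with an explicit name, for example
"setup_mpi_profile prodnode1", applies the profile regardless of the
current host, which is handy for checking the logic on another machine.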

#!/bin/bash
#PBS -l nodes=2:ppn=8

`which mpirun` --prefix /usr/local/profiles/gcc-openmpi/ program_to_run
exit 0

As you can see, I tried everything I could think of to get around it not
finding the libraries, but it was to no avail.  This is the error I
invariably get:

[sam at prodnode1 fdtd_0.3]$ cat script.sh.e223
/home/sam/code/fdtd/fdtd_0.3/fdtd: error while loading shared libraries:
libmpi.so.0: cannot open shared object file: No such file or directory
[... the same two-line error is repeated 15 more times, 16 in total ...]
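
For comparison, here is a sketch of a job script that does not rely on
the login scripts at all.  It sets the paths inside the script, uses the
qsub directive -V to pass the submitting shell's environment to the job,
and uses the -x option of Open MPI's mpirun to forward LD_LIBRARY_PATH
to the remote processes.  The function name launch and the PBS_JOBID
guard are illustrative choices, and program_to_run is the same
placeholder as in the script above; this has not been tried on this
cluster.

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -V
# The -V directive above asks Torque to export the submitting shell's
# environment variables to the job.

PREFIX=/usr/local/profiles/gcc-openmpi

launch() {
    # Set the paths explicitly instead of relying on login scripts.
    export PATH="$PREFIX/bin:$PATH"
    export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    # -x asks mpirun to pass the named variable on to the processes it starts.
    mpirun --prefix "$PREFIX" -x LD_LIBRARY_PATH "$@"
}

# Only launch inside a Torque job, where PBS_JOBID is set.
if [ -n "${PBS_JOBID:-}" ]; then
    # PBS_O_WORKDIR is the directory qsub was run from.
    cd "$PBS_O_WORKDIR" || exit 1
    launch program_to_run
fi
```

Keeping the launch line in a function makes it possible to check the
assembled command line outside of a batch job by substituting a stub
for mpirun.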

As a test, I ran:

[sam at prodnode1 fdtd_0.3]$ echo "echo $LD_LIBRARY_PATH" | qsub

And I got the following, which seems to be what I would expect:

[sam at prodnode1 fdtd_0.3]$ cat STDIN.o226
/usr/local/profiles/gcc-openmpi/lib/:
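
A caveat about this test: inside double quotes, $LD_LIBRARY_PATH is
expanded by the submitting shell before qsub ever reads the text, so the
output may reflect the interactive environment rather than the job's.
The sketch below shows the difference between the two quoting styles
without involving qsub; the value /submit/side/lib is made up for
illustration.

```shell
# Illustrative value standing in for the interactive shell's setting.
LD_LIBRARY_PATH=/submit/side/lib

# Double quotes: the variable is expanded here, at submission time.
double_quoted="echo $LD_LIBRARY_PATH"
# Single quotes: the text stays literal and is expanded by the job's shell.
single_quoted='echo $LD_LIBRARY_PATH'

printf '%s\n' "$double_quoted"   # prints: echo /submit/side/lib
printf '%s\n' "$single_quoted"   # prints: echo $LD_LIBRARY_PATH

# On the cluster, the second form would be submitted like this:
#   echo 'echo $LD_LIBRARY_PATH' | qsub
```

With the single-quoted form, the job output shows what the batch shell
itself has in LD_LIBRARY_PATH.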


If I run it by hand (interactively), it seems to work OK.  Any ideas as
to what I can do to make these login scripts set up the environment so
that jobs run seamlessly?

[sam at prodnode1 fdtd_0.3]$ `which mpirun` --host prodnode2,prodnode3 -np 16 \
    --prefix /usr/local/profiles/gcc-openmpi/ \
    /home/sam/code/fdtd/fdtd_0.3/fdtd \
    -t /home/sam/code/fdtd/fdtd_0.3/test_files/tissue.txt \
    -r /home/sam/code/fdtd/fdtd_0.3/test_files/sphere_brain_10_pad_x0120y0120z0120.raw \
    -v -f 500 --pw 90,0,1,0 -l test_log.out -a 10 --prefix job_8
Beowulf Computer Cluster (BCC)
AFRL/HED

This is a Department of Defense Computer System. This computer system, including
all related equipment, networks, and network devices (specifically including
Internet access) are provided only for authorized U.S. Government use.

DoD computer systems may be monitored for all lawful purposes, including to
ensure that their use is authorized, for management of the system, to facilitate
protection against unauthorized access, and to verify security procedures,
survivability, and operational security. Monitoring includes active attacks by
authorized DoD entities to test or verify the security of this system. During
monitoring, information may be examined, recorded, copied and used for
authorized purposes. All information, including personal information,
placed or sent over this system may be monitored.
[... the same login banner is printed a second time ...]

 * Initializing FDTD            [ OK ]
 * Allocating memory            [ OK ]
 * Initializing PML             [ OK ]
 * Starting updates
 * halfcycle 1   ratio 0.0000   time 52.72s
 * halfcycle 2   ratio 5.4387   time 51.87s
...

Sam Adams
General Dynamics Information Technology
Phone: 210.536.5945


