[torqueusers] LAM/MPI + Torque

Justin Finnerty justin.finnerty at uni-oldenburg.de
Fri Jun 29 03:42:45 MDT 2007


On Thu, 2007-06-28 at 17:41 -0700, SCIPIONI Roberto wrote:
> As far as I understand you need to tell Torque to boot the LAM properly
> inside the script

This is not necessary if you use mpiexec.  At our site I configured
LAM/MPI (our current version is lam-7.1.3) to use the resource manager
interface.  This is an option to the LAM/MPI configure script when you
compile _LAM_.  LAM then talks to Torque directly to get the nodes,
boots LAM and starts the processes on the nodes itself (no need for
rsh/ssh).

OLD command (note that you need to specify the number of processes by hand):

mpiexec -machinefile $PBS_NODEFILE -n ?? [your script]
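
For the OLD command, the value for -n does not have to be typed in by
hand.  A minimal sketch, assuming one MPI process per line of
$PBS_NODEFILE (Torque writes one line per allocated processor); the
helper names count_slots and old_command and the program name
./my_mpi_program are made up for illustration:

```shell
#!/bin/sh
# Count the processor slots Torque assigned to the job.
# Assumption: one MPI process per non-empty line of the node file.
count_slots() {
    grep -c . "$1"
}

# Print the OLD-style command line for a given node file and program.
# It only prints the command; it does not run it.
old_command() {
    echo "mpiexec -machinefile $1 -n $(count_slots "$1") $2"
}

# Inside a Torque job PBS_NODEFILE is set; outside one this does nothing.
if [ -n "${PBS_NODEFILE:-}" ]; then
    old_command "$PBS_NODEFILE" ./my_mpi_program
fi
```

Printing the command first makes it easy to check the process count
before running it for real.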

NEW command:

mpiexec -boot [your script]
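
As a rough illustration of how the NEW command sits inside a job, here
is a small helper that writes a Torque job script.  The function name
write_job_script, the job name, the resource request nodes=2:ppn=2 and
the program name are all placeholders I made up, not something from our
setup; adjust them to your site.

```shell
#!/bin/sh
# Write a minimal Torque job script that uses the NEW command.
# $1 = file to write, $2 = MPI program to run (both chosen by the caller).
write_job_script() {
    out=$1
    prog=$2
    cat > "$out" <<EOF
#!/bin/sh
#PBS -N lam_mpi_job
#PBS -l nodes=2:ppn=2
# Start in the directory the job was submitted from.
cd "\$PBS_O_WORKDIR"
# LAM's own mpiexec boots LAM on the allocated nodes and runs the program.
mpiexec -boot $prog
EOF
}
```

You would then call something like
"write_job_script job.pbs ./my_mpi_program" and submit the result with
"qsub job.pbs".  Note there is no machinefile and no process count in
the generated script.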


Apart from being simpler, it has two other big advantages.

* You get meaningful usage information from qstat etc.  Without this, a
running MPI job will appear to use no CPU.

* All processes stay under control of the queue system.  Manually
running LAM (OLD command) with rsh/ssh tends to leave orphaned LAM
daemons on nodes.  These have to be killed by hand: the system
administrator logs into each node, checks that the daemon is unused and
then kills it.  I used to do this about once a week.  Fortunately the
pestat command is useful for detecting orphaned daemons: it lists the
number of job processes on each node, and if a node has more processes
than jobs currently running on it then you usually have an orphan.
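
The "more processes than jobs" rule of thumb for spotting orphans can
be sketched as a small filter.  This is only an illustration: the input
format (one line per node: name, process count, job count) is invented
here and is NOT the real pestat output, and find_suspect_nodes is a
hypothetical name.

```shell
#!/bin/sh
# Print the nodes that have more processes than running jobs.
# Input file format (an assumption, not pestat's actual layout):
#   <node-name> <process-count> <job-count>
find_suspect_nodes() {
    # "+0" forces a numeric comparison of the two count fields.
    awk 'NF >= 3 && ($2 + 0) > ($3 + 0) { print $1 }' "$1"
}
```

A node reported by this filter is only a candidate; you still have to
check by hand that the daemon is unused before killing it.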



Currently we allow users to use either method, but I am only teaching
the new command to new users.

One point is that you need to use the mpiexec program that comes with
LAM.  There is another mpiexec program
(http://www.osc.edu/~pw/mpiexec/index.php) which provides similar
functionality for other MPI implementations but doesn't work with
LAM/MPI.

Cheers
Justin

-- 
Dr Justin Finnerty
Rm W3-1-218         Ph 49 (441) 798 3726
Carl von Ossietzky Universität Oldenburg


