[torqueusers] TM interface - MOM daemon on the node dies when tm_init is called

David Golden dgolden at cp.dias.ie
Mon Apr 10 08:17:12 MDT 2006


On 2006-04-08 15:13:50 -0400, Prakash Velayutham wrote:
> Hi,
> 
> I am using Torque 2.0.0p8 and Open MPI 1.0.1. Note that I am trying
> to call tm_init() from the node that is assigned rank 1 by Open MPI
> (the mother superior generally gets rank 0). What I am noticing is
> that the MOM daemon on the rank-1 node actually dies when the code
> reaches the tm_init() call. That was totally unexpected for me.

Well, it's not good that the mom dies! No doubt you've unearthed a bug.
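
For reference, a TM client session is about as minimal as the sketch
below, written against the tm.h API that ships with Torque. The build
line is from memory, and the exact tm_roots field types differ between
PBS flavours, so take it as a sketch rather than gospel:

  /* tmtest.c: minimal TM client -- attaches to the local pbs_mom,
   * prints the root info, and detaches.  Build with something like:
   *   gcc tmtest.c -I$TORQUE/include -L$TORQUE/lib -ltorque
   */
  #include <stdio.h>
  #include <tm.h>

  int main(void)
  {
      struct tm_roots roots;
      int rc;

      rc = tm_init(NULL, &roots);   /* attach to this node's mom */
      if (rc != TM_SUCCESS) {
          fprintf(stderr, "tm_init failed, rc=%d\n", rc);
          return 1;
      }
      printf("nnodes=%d\n", roots.tm_nnodes);
      tm_finalize();                /* detach cleanly */
      return 0;
  }

A failed tm_init should show up as a bad return code in the client, not
as a dead mom.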

Note, however, that, at least in Open MPI 1.0.2, the layer beneath it
(OpenRTE) spawns a transient orted manager process on _every node_
in the job. [1]

The MPI processes are then children of that orted. This is rather
different from how OSC mpiexec for PBS does things, and it means that
if there IS a one-TM-client-per-mom limit, it is likely "already used"
on every node in the job by the orted manager daemon. Still, the mom
shouldn't crash :-)
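
So even once the crash itself is fixed, a second TM client on a node
presumably still has to cope with the session already being taken. A
defensive sketch (try_tm_attach is a hypothetical helper of mine, and
which error code the mom actually returns in this case is an
assumption -- tm.h defines several candidates):

  #include <tm.h>

  /* Attempt to become this mom's TM client, tolerating the case
   * where another client (e.g. orted) already holds the connection. */
  int try_tm_attach(struct tm_roots *roots)
  {
      int rc = tm_init(NULL, roots);

      if (rc == TM_SUCCESS)
          return 0;    /* we own the TM session on this mom */
      if (rc == TM_BADINIT || rc == TM_ESYSTEM)
          return -1;   /* plausibly "session already taken": back off */
      return -2;       /* some other failure */
  }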

BUT: aren't you really mixing nonconsecutive layers here? Surely the
thing to do is to extend the OpenRTE abstraction layer that hides TM
(i.e. you should be patching the relevant MCA PLS and RAS components
for TM that "orted" presumably uses*), keeping Open MPI targeting the
OpenRTE API only?

* [N.B. I'm only pretending to know what I'm talking about here; the
first I really saw of OpenRTE was at a talk last Friday.]

David Golden


[1]
[from the *second* node in an Open MPI job]
root     21812  0.0  0.0  6580 1268 ?        Ss   Mar31   3:59 /usr/local/torque/sbin/pbs_mom -p
dgolden  21570  0.0  0.0 20288 2408 ?        Ss   15:01   0:00  \_ orted --no-daemonize --bootproxy 1 --name 0.0.2 --num_pro
dgolden  21571 99.6  0.3 566888 14160 ?      R    15:02   0:32      \_ IMB-MPI1.openmpi-icc9
dgolden  21572 95.8  0.3 567052 14384 ?      S    15:02   0:31      \_ IMB-MPI1.openmpi-icc9


