[torqueusers] TM improvements

Jeff Squyres jsquyres at open-mpi.org
Wed Nov 23 04:59:15 MST 2005

On Nov 23, 2005, at 1:06 AM, Garrick Staples wrote:

> I think I need more specifics.  It's obvious to me that you know a lot
> more about using TM than I do.

Heh.  I'm not so sure about that!  :-)

>> --> Mark this one as only partially solved.  Yes, we can keep a TM
>> connection open for the duration of the MPI job, but you still can't
>> have that mpirun disconnect from a running job and still retain the
>> ability to tm_kill() any of the tm_spawned() processes later -- 
>> perhaps
>> even from something other than mpirun.  This would be *extremely*
>> useful (think "screen" for MPI jobs).
> I'm lost on this one.  If you have two different MPI processes 
> launching
> different sets of tasks, and they both exit and reattach, even if we
> supported retrieving the list of running tasks, how do the two mpiruns
> know which tasks belong to which MPI job?  How do the two mpiruns know
> which is itself?
> With 'screen', I can specify a pid, tty, or a named session.  But I
> can't think of an equivalent for mpirun.

Just like with screen, some kind of unique identifier will obviously be 
needed here.  Internally to mpirun, we have data representing all the 
jobs that are running in our universe (even across successive runs of 
mpirun -- there is a persistent store of the jobs that are currently 
running).

Sidenote for some definitions:

- universe: the set of resources that we are currently allowed to 
launch processes on
- job: a set of 1 or more processes constituting a cohesive parallel 
application (e.g., all the processes in a single MPI_COMM_WORLD)

So to make screen-like capabilities work, mpirun and friends need 
persistent storage somewhere (a la screen's $TMPDIR directories).  In 
Open MPI, we have this (Open MPI has a sophisticated run-time layer, 
and is in an excellent position to use new, advanced TM capabilities).  
So mpirun (and friends) already have internal knowledge of all the jobs 
in the universe, and can generate and/or maintain unique identifiers 
for each.

Hence, if a) a job is launched with mpirun, b) mpirun detaches, and c) 
some other tool re-attaches later, the unique identifier(s) associated 
with any given job can be re-acquired and therefore used to lookup the 
information necessary to tm_kill() the set of processes in a job.

Sorry, I didn't explain this in my first mail.
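To make the re-attach idea concrete, here is a small sketch (in Python, not Open MPI's actual code) of the bookkeeping described above: the launcher persists each job's unique identifier and the tm_task_ids that tm_spawn() returned, so that a later, separate tool can look the job up and tm_kill() its tasks.  The registry path and function names are made up for illustration; a real tool would call tm_kill(tid, SIGTERM, &event) where the stub prints.

```python
import json
import os

# Hypothetical location for the persistent job registry.
REGISTRY = os.path.expanduser("~/.universe-jobs.json")

def save_job(job_id, task_ids, registry=REGISTRY):
    """Record the tm_task_ids that tm_spawn() returned for this job."""
    jobs = {}
    if os.path.exists(registry):
        with open(registry) as f:
            jobs = json.load(f)
    jobs[job_id] = task_ids
    with open(registry, "w") as f:
        json.dump(jobs, f)

def kill_job(job_id, registry=REGISTRY, kill=None):
    """Re-attach by job id and signal each task.  A real tool would
    issue tm_kill(tid, SIGTERM, &event) where the stub prints."""
    with open(registry) as f:
        jobs = json.load(f)
    if kill is None:
        kill = lambda tid: print("tm_kill(%s, SIGTERM)" % tid)
    for tid in jobs[job_id]:
        kill(tid)
```

The point is only that the identifier survives mpirun exiting: any tool that can read the registry can re-acquire the task ids.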

>> 3. If you tm_spawn() something that fails to launch properly (e.g.,
>> file not found on the remote node), there is no error notification 
>> sent
>> back to the process that invoked tm_spawn().
> Hrm?
> $ pbsdsh -c 1 lkjahdsfljahdf
> pbsdsh: task 0 exit status 254
> $ pbsdsh -c 2 lkjahdsfljahdf
> pbsdsh: task 0 exit status 254
> pbsdsh: task 1 exit status 254

I was quite definitely running into this problem with our TM launcher 
in Open MPI.  I don't have time at the moment (am catching a flight in 
a few hours), but I'll try to replicate it next week and send more 

Given the example you just showed, let's not exclude the possibility 
that I was doing something wrong.  :-)

>> 4. It would also be nice to have a "group" spawn -- where mpirun can
>> issue a single tm_spawn() and have it launch multiple processes at
>> once.  Even something simple to handle the common SPMD case (e.g.,
>> "mpirun -np 1024 a.out") would be nice.  This pushes the scalability
>> issues down into TM.  True, you might simply do a simple linear loop,
>> but it at least allows for the *possibility* of a scalable launch
>> (where scalable = something better than linear).  With
>> uni-tm_spawn()'s, there is no possibility of anything better than
>> linear.
> Pete Wycoff and I were talking about this at SC05 last week.  We never
> came up with a decent interface that lets us specify different args/env
> for each task.

How about something similar to MPI_COMM_SPAWN_MULTIPLE?

You don't have to go that complicated, however.  Perhaps even just 
offering the capability of launching the *same* task multiple times 
would be sufficient.  I.e., don't try to handle the MPMD case -- just 
the simple SPMD case (with everyone having the same environment).  This 
is the vast majority of MPI jobs, anyway.
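As a sketch of what such a group spawn could look like (the function name, the spawn stub, and the TM_SPAWN_RANK variable are all hypothetical, not part of TM today): a single call launches N copies of the same argv, and because the loop is hidden behind one call, a TM implementation is free to replace it with something more scalable later without changing callers.

```python
_next_tid = [100]

def fake_tm_spawn(argv, env, node):
    """Stand-in for the real tm_spawn(); returns a fresh task id."""
    _next_tid[0] += 1
    return _next_tid[0]

def tm_spawn_spmd(argv, env, nodes, count, spawn=fake_tm_spawn):
    """Hypothetical SPMD group spawn: launch `count` copies of one argv.
    Linear here, but the loop is an implementation detail of TM, so it
    could become tree-based without any change to the caller."""
    task_ids = []
    for rank in range(count):
        # Tag each copy with its rank (hypothetical variable name).
        task_env = dict(env, TM_SPAWN_RANK=str(rank))
        task_ids.append(spawn(argv, task_env, nodes[rank % len(nodes)]))
    return task_ids
```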

That being said, it would also be necessary for the launched processes 
to get some kind of environment variable indicating which task number 
they are (relative to the tm_spawn()).  For example, if I launch 10 
copies of a.out, each launched process should get some kind of 
environment variable indicating their identity in the 10 (i.e., 0 
through 9).

Specifically, my launched processes need to be able to discover their 
identity in some way.  For TM environments, we currently do this with a 
different argv command line for each executable (we launch a proxy that 
launches MPI processes -- so we can do whatever we want with argv).  
But in the SPMD case, we'll only be able to launch a single set of argv 
across all nodes, so the proxy needs to be able to determine its 
identity through a different mechanism -- e.g., an environment variable 
that only the launcher itself can provide.

Make sense?
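For illustration, the rank-discovery step on the proxy side could be as simple as this (again, the environment variable names are invented for this example; only the launcher could set them authoritatively):

```python
import os

def proxy_identity(environ=os.environ):
    """What a launched proxy could do to discover its identity if the
    launcher exported a per-task rank and a task count.  Both variable
    names are hypothetical."""
    rank = int(environ.get("TM_SPAWN_RANK", "0"))
    size = int(environ.get("TM_SPAWN_COUNT", "1"))
    return rank, size
```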

>> Ok.  Can you initiate discussions with them?  Consider the TM
>> consumers' perspectives (including mine): we absolutely do not want N
>> different TM implementations out there that are different and have
>> different tests to establish how they're different.  That becomes a
>> maintenance nightmare.
> I'll talk to Dave about this.  He hangs out with commercial peeps.

Many thanks.

> Can you get all the MPI vendors to use the same launch protocol? :)

Well, it depends on what you mean by "launch protocol."

If you mean the underlying launch mechanism, then all MPI vendors who 
use TM *are* using the same API to start executables.  I certainly 
can't speak for other MPI vendors who do not use the TM API (if you 
want your MPI to use TM, then I suggest that you speak with your MPI 
vendor :-) ).

If you mean the data that MPIs send across the wire to start up their 
processes, I'm not sure what would be gained by standardizing that.

But also consider that TM is a lower-layer service than MPI.  Hence, 
standardization is important and relevant to allow multiple different 
upper layers (e.g., different MPI implementations) to be able to use 
the lower layers (e.g., different TM implementations) consistently.

I know you're making a joke, but I thought I'd clarify.  :-)

-- 
Jeff Squyres
The Open MPI Project
http://www.open-mpi.org/
