[torqueusers] TM improvements

Jeff Squyres jsquyres at open-mpi.org
Tue Nov 22 20:50:33 MST 2005


On Nov 22, 2005, at 7:29 PM, Garrick Staples wrote:

> Did you test the patch yet? :)

I'm afraid not.  :-)

I might be able to, but not for another week or two (especially with 
the short holiday week -- I'll actually be offline until next week).

>> We actually have a few more issues with the TM interface that I have
>> passed on to Altair that would significantly help us support TM-based
>> systems better; is there any interest here to see our list posted 
>> here?
>>  (Altair has done some improvements to the TM interface in PBS Pro,
>> which is why I passed our list to them)
>
> Absolutely.  We definitely want to help the TM users out there.

Here's what I sent to Altair:

1. When you tm_init(), it opens a connection to the local MOM.  
However, the MOM only allows one TM connection at a time.  This means 
that clients essentially have to tm_init(), do what they're going to 
do, and then tm_finalize().  To be more specific, you have to *try* to 
tm_init() -- if it fails, loop a few times on the chance that some 
other process on your node was hogging the MOM connection at the time.  
It would be tremendously better if the MOM could accept multiple 
simultaneous TM connections.

--> Mark this one as solved; I only included it here for completeness 
because I sent it to Altair.

2. #1 is a major problem because it means that mpirun can't hold a TM 
connection for the duration of the MPI job.  I.e., mpirun has to 
tm_init(), do a bunch of tm_spawn()s, and then tm_finalize().  However, 
this is a problem because it means mpirun can't tm_kill() anything if 
there's a problem.  Hence, mpirun has to do something like this:

	loop_until_tm_init();
	for (all processes)
		tm_spawn()
	tm_finalize();

(doing this specifically because of #1, because who knows if another 
mpirun will need to run on the same node at the same time)

So how can mpirun tell if processes die?  Since we can't keep a TM 
connection the whole time, one possibility is to do something like:

	while (job not done)
		sleep(1)
		loop_until_tm_init()
		...check for process deaths...
		tm_finalize()

And also install a ctrl-c catcher such that if the user hits ctrl-c, 
mpirun will use tm_kill() to kill the processes.

But neither of these is possible, because at the first tm_finalize(), 
the local TM library has forgotten everything about all processes that 
it has launched.

--> Mark this one as only partially solved.  Yes, we can keep a TM 
connection open for the duration of the MPI job, but you still can't 
have that mpirun disconnect from a running job and retain the ability 
to tm_kill() any of the tm_spawn()ed processes later -- perhaps even 
from something other than mpirun.  This would be *extremely* useful 
(think "screen" for MPI jobs).

3. If you tm_spawn() something that fails to launch properly (e.g., 
file not found on the remote node), there is no error notification sent 
back to the process that invoked tm_spawn().

4. It would also be nice to have a "group" spawn -- where mpirun can 
issue a single tm_spawn() and have it launch multiple processes at 
once.  Even something simple to handle the common SPMD case (e.g., 
"mpirun -np 1024 a.out") would be nice.  This pushes the scalability 
issues down into TM.  True, you might simply do a simple linear loop, 
but it at least allows for the *possibility* of a scalable launch 
(where scalable = something better than linear).  With 
uni-tm_spawn()'s, there is no possibility of anything better than 
linear.

>> I see the TM_MULTIPLE_CONNS #define in the patch from yesterday; I
>> assume that this is exactly for this purpose (so that my configure
>> script can figure out that a given version of Torque supports the
>> multiple TM connection behavior).  That's actually quite perfect,
>
> The idea was certainly for compile-time feature inspection.  I can't 
> say
> I gave that aspect of the patch enough thought, but I figured it was
> enough to let people get started with testing the patch.

It *is* good.  But it is *best* if all the TM implementations agree on 
it.

>> except for cross-compiling situations (which I don't see as a problem
>> -- I'm not aware of anywhere that we cross-compile for TM support).
>
> *shrug*  I'm open to solutions.

I'm not worried about cross-compilation.  I mentioned it for 
completeness.

>> As a consumer of the TM interface, it would be *really great* if there
>> was only *one* set of these things to check against.  If we have to
>> splinter our configure script to check for different vendors and
>> different variants, it will be a complete and total nightmare (well,
>> more than the nightmare that our configure script already is! ;-) ).
>
> I really can't comment on what other PBS implementations are doing.  I
> don't have access to their commercial software, nor would I want to
> cause any misunderstandings.  To be honest, I have no idea what kind of
> feature-parity we have with PBSpro, SGE, etc.  I'm really just 
> focusing on
> TORQUE at this time.

Ok.  Can you initiate discussions with them?  Consider the TM 
consumers' perspectives (including mine): we absolutely do not want N 
different TM implementations out there that are different and require 
different tests to establish how they differ.  That becomes a 
nightmare for us, the result of which is that we might simply end up 
supporting the least common denominator (i.e., what exists today -- 
fairly sub-optimal implementations that don't take advantage of newer 
features because it becomes too logistically difficult to be truly 
portable).

I absolutely do not want this to happen; I would much rather be able to 
provide a full-featured mpirun (etc.) in TM-based environments.  But 
that does depend on having a more-or-less uniform interface to TM and 
more-or-less uniform capabilities (or at least uniform ways of testing 
for those capabilities).  Otherwise, they fragment into different 
systems that happen to have the same name (and we all become confused).

I hope this doesn't come across as whining -- I don't intend it that 
way at all.  I just want a nice, [mostly] uniform interface where I can 
have one code base that supports all TM vendors.

> But I'm certainly open to maintaining compatibility if someone
> contributes the knowledge or patches.
>
> TM has a POSIX specification that I don't want to _break_, but I don't
> mind extending.

I was unaware of that -- can you provide the specific citation?  I've 
only ever read the PSCHED API document, the tm man page in PBS/Torque, 
and the implementation code in PBS/Torque.

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/


