[torqueusers] TM improvements

Jeff Squyres jsquyres at open-mpi.org
Thu Dec 1 17:49:25 MST 2005


On Nov 23, 2005, at 6:59 AM, Jeff Squyres wrote:

>>> 3. If you tm_spawn() something that fails to launch properly (e.g.,
>>> file not found on the remote node), there is no error notification 
>>> sent
>>> back to the process that invoked tm_spawn().
>>
>> $ pbsdsh -c 1 lkjahdsfljahdf
>> pbsdsh: task 0 exit status 254
>>
>> $ pbsdsh -c 2 lkjahdsfljahdf
>> pbsdsh: task 0 exit status 254
>> pbsdsh: task 1 exit status 254
>
> I was quite definitely running into this problem with our TM launcher 
> in Open MPI.  I don't have time at the moment (am catching a flight in 
> a few hours), but I'll try to replicate it next week and send more 
> details.

SC and the US Thanksgiving holiday are over, and I've [somewhat] 
managed to dig out from the mound of work that accumulated during 
both, so I can now return to trying to replicate this error.

I think I see the disconnect between what I am saying and what you are 
reporting.  Specifically, note that in your example, pbsdsh did *not* 
report "executable not found" -- it just said that the task exited with 
status 254.

With that in mind, here's a trivial program that shows what I'm saying 
(you'll notice something wrong with it, but bear with me):

-----
#include <stdio.h>
#include <stdlib.h>
#include <tm.h>

extern char **environ;

/* abort if a TM call did not return TM_SUCCESS */
void do_check(int val, char *msg) {
    if (TM_SUCCESS != val) {
        printf("ret is %d instead of %d: %s\n", val, TM_SUCCESS, msg);
        exit(1);
    }
}

int main(int argc, char* argv[])
{
    int ret, local_err;
    struct tm_roots tm_root;
    tm_event_t event;
    tm_node_id node_id;
    tm_task_id task_id;

    /* attach to the MOM */
    ret = tm_init(NULL, &tm_root);
    do_check(ret, "tm_init failed");

    /* spawn whatever was given on our command line onto node 0 */
    node_id = 0;
    task_id = 0xdeadbeef;
    event = 0xdeadbeef;
    ret = tm_spawn(argc - 1, argv + 1, environ, node_id, &task_id, &event);
    do_check(ret, "tm_spawn failed");
    printf("Got task id: %x, event %x\n", task_id, event);

    /* poll (blocking) for the spawn event to complete */
    event = 0xdeadbeef;
    local_err = 0xdeadbeef;
    ret = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
    do_check(ret, "tm_poll failed");
    printf("event is now: %x, local_err is %x\n", event, local_err);

    /* detach; note that we never wait for the task to die */
    tm_finalize();

    return 0;
}
-----

Here's a sample compile and run:

-----
[18:17] eddie:~/tmp % gcc tm_test.c -I /opt/torque/include -L /opt/torque/lib -lpbs -o tm_test && ./tm_test /bin/echo hello there
Got task id: deadbeef, event 2
hello there
event is now: 2, local_err is 0
[18:17] eddie:~/tmp %
-----

That's all fine and good, because I spawned "/bin/echo hello there".  
Indeed, you can see its output interspersed with my text.  But what 
if we try something stupid?

-------
[18:17] eddie:~/tmp % ./tm_test totally_bogus_name
Got task id: deadbeef, event 2
event is now: 2, local_err is 0
[18:17] eddie:~/tmp %
-------

Nothing told me that the spawn failed.  tm_spawn() returned TM_SUCCESS, 
which makes perfect sense: I assume that tm_spawn() is simply sending a 
message to a MOM telling it to spawn -- kind of an asynchronous version 
of fork() and exec() combined into one, right? -- and the sending of 
the message was successful.  tm_poll() then tells me that event 2 has 
finished (i.e., the spawn event) and returned TM_SUCCESS as well, which 
also makes sense because the poll itself was successful.  *But 
local_err was returned as 0.*  To me, that indicates that the spawn was 
successful.

Indeed, looking carefully at the PSCHED API document, it makes the 
following statements:

- For tm_poll: (p5, starting line 21) "If a delayed error has occurred 
associated with the returned event, tm_errno is set to the error code."

- For tm_spawn(): (p7, starting line 5) "Following the tm_poll() call 
that returns the event, tid will contain the task Id of the spawned 
child.  In case of a delayed error, tid will be set to TM_NULL_TASK, 
and tm_errno will contain a non-zero error code."

This is why I assumed that when tm_poll() indicated that the event 
associated with tm_spawn() finished and I got local_err == 0 (referred 
to as tm_errno in the PSCHED API document), then there were no delayed 
errors and the process must have been launched successfully.  
Unfortunately, this is apparently not what PBS/Torque is indicating.  I 
guess it comes down to what, exactly, Torque's definition of a 
"delayed error" is...?
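
To make my reading concrete, here is roughly the check that I believed 
those two passages were promising would work.  This is a sketch only -- 
the names match my test program above:

-----
#include <tm.h>

/* Sketch: detect a delayed spawn error, per my reading of PSCHED.
   Call after tm_spawn() has returned TM_SUCCESS; tid is the same
   pointer that was handed to tm_spawn().  Returns 0 on success,
   -1 on any failure. */
int check_spawn_result(tm_task_id *tid)
{
    tm_event_t event;
    int local_err = 0;

    /* block until the spawn event completes */
    if (TM_SUCCESS != tm_poll(TM_NULL_EVENT, &event, 1, &local_err)) {
        return -1;          /* the poll itself failed */
    }
    if (0 != local_err || TM_NULL_TASK == *tid) {
        return -1;          /* delayed error: the spawn never happened */
    }
    return 0;               /* no delayed error -- or so I concluded */
}
-----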

In my simple example above, I am not waiting for the death of the 
spawned task -- I am firing and forgetting.  This is because of the 
other problems that I mentioned -- specifically, until the patch a 
little while ago, we could not keep a tm connection open.  Hence, we 
would launch the process, poll once to check for error with the spawn, 
and then tm_finalize() so that other processes could talk to the MOM.

Because of your example, I just went and examined the source of pbsdsh.  
I see that spawn errors are only reported when the task "dies" -- i.e., 
pbsdsh is notified through the tm_obit() / tm_poll() mechanism.  Hence, 
we get the status 254 from the task death.
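
For the archives, here is the shape of that mechanism as I understand 
it (a sketch of my reading of pbsdsh, not its actual code):

-----
#include <stdio.h>
#include <tm.h>

/* Sketch: the obit-based detection that pbsdsh relies on.  Register
   an obituary for the task, block until the event comes back, and
   inspect the exit status.  A failed launch shows up here only
   indirectly, as a "death" with status 254. */
int wait_for_obit(tm_task_id tid)
{
    tm_event_t obit_event, polled;
    int exit_status = 0, local_err = 0;

    if (TM_SUCCESS != tm_obit(tid, &exit_status, &obit_event)) {
        return -1;
    }
    /* a real consumer would match polled against obit_event */
    if (TM_SUCCESS != tm_poll(TM_NULL_EVENT, &polled, 1, &local_err)) {
        return -1;
    }
    printf("task %d exited with status %d\n", (int) tid, exit_status);
    return exit_status;
}
-----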

This is not what I expected -- indeed, it seems somewhat weird to me.  
Here's why:

1. tm_obit() is supposed to notify you when a task dies.  But if it was 
never successfully launched in the first place, how could it die?

2. MPI makes the distinction in MPI_COMM_SPAWN between failing to 
launch the desired application and dying after the launch.  
Specifically, if the desired application fails to launch, 
MPI_COMM_SPAWN will return the error, not something else later.

3. Unix fork()/exec() semantics are similar to #2 (indeed, the 
COMM_SPAWN semantics were at least partially inspired by fork()/exec() 
semantics).  If fork() fails, you find out from the return of fork() -- 
not by calling wait() to see what happened to the child.  And if exec() 
fails, you find out from the return of exec(), not by launching a bogus 
process that immediately returns a status of 254.  Granted, fork() and 
exec() are synchronous, but if you extrapolate and make their 
completions subject to some kind of polling mechanism, I would expect 
them to report their failures directly (e.g., when I poll for the 
completion of fork() and/or exec()); see the sketch after this list.
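
Here's a trivial illustration of #3 (and my guess -- only a guess -- is 
that the MOM's child does something like the _exit(254) below when its 
exec() fails, which would explain the 254 in your pbsdsh output):

-----
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* fork() reports launch failure via its return value; exec() reports
   it by returning at all (a successful exec never returns).  Neither
   requires wait()ing on the child to learn that the launch failed. */
int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");     /* launch failure reported here... */
        return 1;
    } else if (0 == pid) {
        execlp("totally_bogus_name", "totally_bogus_name", (char *) NULL);
        perror("exec");     /* ...or here: exec only returns on error */
        _exit(254);         /* 254 chosen to echo the pbsdsh output */
    }
    waitpid(pid, NULL, 0);
    return 0;
}
-----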

#2 and #3 are not directly applicable to TM, of course, but they 
contributed to my thinking process / bias.  You may not agree with my 
rationale :-), but I just want to explain my mindset and why I am 
surprised by Torque's TM behavior.

Also, calling tm_obit() and then looping over tm_poll() to check 
whether Open MPI was able to correctly spawn the process is not 
feasible for us.  How long should we loop over tm_poll()?  It's always 
a race condition -- consider if we were tm_spawn()ing on a slow node; 
it could take a long time for tm_poll() to indicate that the task has 
"died."  How long, exactly, should Open MPI poll before calling 
tm_finalize() to allow some other process to talk to the MOM?  The 
obituary could always come in after our polling has completed and we 
have finalized.  Remember that in our model, the desirable / "correct" 
case (i.e., when TM is able to successfully spawn a process) will have 
no obituaries arrive.
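
To make the race concrete, here is the only kind of loop available to 
us.  A sketch: it assumes that tm_poll() with wait == 0 hands back 
TM_NULL_EVENT when nothing is ready, and MAX_TRIES is a number I made 
up -- which is exactly the problem:

-----
#include <tm.h>

#define MAX_TRIES 1000      /* arbitrary -- no value is safe */

/* Sketch: poll non-blockingly for an obituary before giving up and
   calling tm_finalize().  Returns 1 if an event arrived in time, 0
   otherwise.  On a slow node the obituary can always arrive after
   we return 0 and finalize. */
int obit_arrived_in_time(void)
{
    tm_event_t event;
    int i, local_err;

    for (i = 0; i < MAX_TRIES; ++i) {
        if (TM_SUCCESS != tm_poll(TM_NULL_EVENT, &event, 0, &local_err)) {
            return 0;
        }
        if (TM_NULL_EVENT != event) {
            return 1;       /* the obituary arrived */
        }
    }
    return 0;               /* lost the race?  we can never know */
}
-----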

More specifically, I always assumed that spawning was a finite/bounded 
event.  Although spawning could be "slow," we will *always* be 
eventually told of the success or failure of a spawn.  Hence, there is 
a distinct difference between when a spawn fails and when a 
successfully-spawned process dies.

I'm hammering on this point because it's one thing to tell a user "your 
processes failed to launch"; it's quite a different thing to tell a 
user "your process died after it was launched."  In the former case, 
I'd start looking at the spelling of the executable name, the location 
of executables, PATH and LD_LIBRARY_PATH, etc.  In the latter case, I'd 
look at the code to see what happened after main() started, look for 
core files, etc.  Right now, it's impossible for an MPI implementation 
to tell the difference, and that's bad for users (and we, the MPI 
implementors, get blamed for making the user search their code for 
hours when, in reality, there was a typo on the mpirun command line 
and the executable name was misspelled :-) ).

So -- after all my explanations -- I'd like to ask if Torque's TM 
implementation can be amended to include "unable to launch" in the 
definition of "delayed error" for tm_spawn().  This would mean that 
tm_poll() would return an error when the tm_spawn() task completed, and 
consumers of TM (like MPI) would be able to differentiate between 
failing to launch and failing after launch.  Indeed, it would be really 
great if Torque could provide a somewhat-fine-grained set of error 
codes to differentiate between different reasons as to *why* something 
failed to launch (couldn't find the executable, exec() failed, out of 
resources, etc.).
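
From the consumer side, the ideal would let an MPI implementation 
write something like the following.  This is entirely hypothetical -- 
none of the TM_ESPAWN_* codes below exist in today's tm.h; they are 
exactly the kind of thing I am asking for:

-----
#include <stdio.h>
#include <tm.h>

/* Hypothetical delayed-error codes for tm_spawn() -- these do NOT
   exist; they are invented here only to show how a consumer would
   use them. */
enum {
    TM_ESPAWN_NOEXEC = 17000,   /* executable not found on the node */
    TM_ESPAWN_EXECFAIL,         /* exec() itself failed */
    TM_ESPAWN_RESOURCES         /* node out of resources */
};

/* translate a delayed spawn error into a user-meaningful message */
void report_spawn_error(int local_err)
{
    switch (local_err) {
    case 0:
        printf("spawn succeeded\n");
        break;
    case TM_ESPAWN_NOEXEC:
        printf("failed to launch: executable not found\n");
        break;
    case TM_ESPAWN_EXECFAIL:
        printf("failed to launch: exec() failed\n");
        break;
    case TM_ESPAWN_RESOURCES:
        printf("failed to launch: out of resources\n");
        break;
    default:
        printf("failed to launch: error %d\n", local_err);
    }
}
-----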

If you're still reading this extremely lengthy mail, thanks for your 
time.  :-)

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/


