[torqueusers] TM improvements
Jeff Squyres
jsquyres at open-mpi.org
Thu Dec 1 17:49:25 MST 2005
On Nov 23, 2005, at 6:59 AM, Jeff Squyres wrote:
>>> 3. If you tm_spawn() something that fails to launch properly (e.g.,
>>> file not found on the remote node), there is no error notification
>>> sent
>>> back to the process that invoked tm_spawn().
>>
>> $ pbsdsh -c 1 lkjahdsfljahdf
>> pbsdsh: task 0 exit status 254
>>
>> $ pbsdsh -c 2 lkjahdsfljahdf
>> pbsdsh: task 0 exit status 254
>> pbsdsh: task 1 exit status 254
>
> I was quite definitely running into this problem with our TM launcher
> in Open MPI. I don't have time at the moment (am catching a flight in
> a few hours), but I'll try to replicate it next week and send more
> details.
SC and the US Thanksgiving holiday are over, and I've [somewhat]
managed to dig out from the mound of work that accumulated over the
two, so I am now able to return to attempting to replicate this error.
I think I see the disconnect between what I am saying and what you are
reporting. Specifically note that in your example, pbsdsh did *not*
report "executable not found" -- it just said that the task exited with
status 254.
With that in mind, here's a trivial program that shows what I'm saying
(you'll notice something wrong with it, but bear with me):
-----
#include <stdio.h>
#include <stdlib.h>   /* for exit() */
#include <tm.h>

extern char **environ;

/* Abort with a message if a TM call did not return TM_SUCCESS. */
void do_check(int val, char *msg) {
    if (TM_SUCCESS != val) {
        printf("ret is %d instead of %d: %s\n", val, TM_SUCCESS, msg);
        exit(1);
    }
}

int main(int argc, char* argv[])
{
    int ret, local_err;
    struct tm_roots tm_root;
    tm_event_t event;
    tm_node_id node_id;
    tm_task_id task_id;

    /* attach */
    ret = tm_init(NULL, &tm_root);
    do_check(ret, "tm_init failed");

    /* do a spawn */
    node_id = 0;
    task_id = 0xdeadbeef;
    event = 0xdeadbeef;
    ret = tm_spawn(argc - 1, argv + 1, environ, node_id, &task_id,
                   &event);
    do_check(ret, "tm_spawn failed");
    printf("Got task id: %x, event %x\n", task_id, event);

    /* poll for it */
    event = 0xdeadbeef;
    local_err = 0xdeadbeef;
    ret = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
    do_check(ret, "tm_poll failed");
    printf("event is now: %x, local_err is %x\n", event, local_err);

    /* detach */
    tm_finalize();
    return 0;
}
-----
Here's a sample compile and run:
-----
[18:17] eddie:~/tmp % gcc tm_test.c -I /opt/torque/include -L /opt/torque/lib -lpbs -o tm_test && ./tm_test /bin/echo hello there
Got task id: deadbeef, event 2
hello there
event is now: 2, local_err is 0
[18:17] eddie:~/tmp %
-----
That's all fine and good, because I spawned "/bin/echo hello there".
Indeed, you can see the output of it interspersed in my text. But what
if we try something stupid?
-------
[18:17] eddie:~/tmp % ./tm_test totally_bogus_name
Got task id: deadbeef, event 2
event is now: 2, local_err is 0
[18:17] eddie:~/tmp %
-------
Nothing told me that the spawn failed. tm_spawn() returned TM_SUCCESS,
which makes perfect sense: I assume that tm_spawn() is simply sending a
message to a MOM telling it to spawn -- kind of an asynchronous version
of fork() and exec() combined into one, right? -- and the sending of
the message was successful. tm_poll() then tells me that event 2 has
finished (i.e., the spawn event) and returned TM_SUCCESS as well, which
also makes sense because the poll itself was successful. *But
local_err was returned as 0.* To me, that indicates that the spawn was
successful.
Indeed, looking carefully at the PSCHED API document, it makes the
following statements:
- For tm_poll: (p5, starting line 21) "If a delayed error has occurred
associated with the returned event, tm_errno is set to the error code."
- For tm_spawn(): (p7, starting line 5) "Following the tm_poll() call
that returns the event, tid will contain the task Id of the spawned
child. In case of a delayed error, tid will be set to TM_NULL_TASK,
and tm_errno will contain a non-zero error code."
This is why I assumed that when tm_poll() indicated that the event
associated with tm_spawn() finished and I got local_err == 0 (referred
to as tm_errno in the PSCHED API document), then there were no delayed
errors, and the process must have been successfully launched.
Unfortunately, this is apparently not what PBS/Torque is indicating. I
guess it comes down to what, exactly, is Torque's definition of a
delayed error...?
In my simple example above, I am not waiting for the death of the
spawned task -- I am firing and forgetting. This is because of the
other problems that I mentioned -- specifically, until the patch a
little while ago, we could not keep a tm connection open. Hence, we
would launch the process, poll once to check for error with the spawn,
and then tm_finalize() so that other processes could talk to the MOM.
Because of your example, I just went and examined the source of pbsdsh.
I see that spawn errors are only reported when the task "dies" -- i.e.,
pbsdsh is notified through the tm_obit() / tm_poll() mechanism. Hence,
we get the status 254 from the task death.
This is not what I expected -- indeed, it seems somewhat weird to me.
Here's why:
1. tm_obit() is supposed to notify you when a task dies. But if it was
never successfully launched in the first place, how could it die?
2. MPI makes the distinction in MPI_COMM_SPAWN between failing to
launch the desired application and dying after the launch.
Specifically, if the desired application fails to launch,
MPI_COMM_SPAWN will return the error, not something else later.
3. Unix fork()/exec() semantics are similar to #2 (indeed, the
COMM_SPAWN semantics were at least partially inspired by fork()/exec()
semantics). If fork() fails, you find out from the return of fork() --
not by calling wait() to see what happened to the child. And if exec()
fails, you find out from the return of exec(), not by launching a bogus
process that immediately returns a status of 254. Granted, fork() and
exec() are synchronous, but if you extrapolate and make their
terminations subject to some kind of polling mechanism, I would expect
them to report their failures directly (e.g., when I poll for
completion of fork() and/or exec()).
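To make the fork()/exec() comparison in point 3 concrete, here is a minimal POSIX sketch (not TM code, and not taken from Open MPI or Torque) in which the parent learns about an exec() failure directly rather than from an exit status. The function name spawn_checked() is made up for illustration. It relies on a common idiom: a pipe whose write end is marked close-on-exec, so a successful exec() closes it silently and a failed exec() lets the child send back its errno.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: start argv[0] in a child process.
 * Returns 0 if exec() succeeded (the child's pid is stored in *pid_out),
 * otherwise the errno value describing why the launch failed.
 */
int spawn_checked(char *const argv[], pid_t *pid_out)
{
    int fds[2];
    int child_errno = 0;
    int saved;
    ssize_t n;
    pid_t pid;

    if (pipe(fds) != 0)
        return errno;

    /* Close-on-exec: a successful exec() closes the write end for us. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) != 0) {
        saved = errno;
        close(fds[0]);
        close(fds[1]);
        return saved;
    }

    pid = fork();
    if (pid < 0) {                 /* fork() itself failed */
        saved = errno;
        close(fds[0]);
        close(fds[1]);
        return saved;
    }

    if (pid == 0) {                /* child */
        close(fds[0]);
        execvp(argv[0], argv);
        /* Only reached when exec() failed: report why, then exit. */
        child_errno = errno;
        if (write(fds[1], &child_errno, sizeof child_errno) < 0) {
            /* nothing more the child can do */
        }
        _exit(127);
    }

    /* parent: EOF means exec() succeeded, data means it failed */
    close(fds[1]);
    do {
        n = read(fds[0], &child_errno, sizeof child_errno);
    } while (n < 0 && errno == EINTR);
    close(fds[0]);

    if (n == (ssize_t)sizeof child_errno) {
        waitpid(pid, NULL, 0);     /* reap the child that never launched */
        return child_errno;        /* launch failure, reported directly */
    }

    *pid_out = pid;                /* launched; its later death is a separate event */
    return 0;
}
```

With this shape, calling spawn_checked() on a nonexistent path returns ENOENT immediately, while a program that starts and then exits non-zero returns 0 here and reports its exit status later through waitpid(). That is the same split between "failed to launch" and "died after launch" that the list above describes.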
#2 and #3 are not directly applicable to TM, of course, but they
contributed to my thinking process / bias. You may not agree with my
rationale :-), but I just want to explain my mindset and why I am
surprised by Torque's TM behavior.
Also, calling tm_obit() and then looping over tm_poll() to check for
whether Open MPI was able to correctly spawn the process is not
feasible for us. How long should we loop over tm_poll()? It's always
a race condition -- consider if we were tm_spawn()ing on a slow node;
it could take a long time for tm_poll() to indicate that the task has
"died." How long, exactly, should Open MPI poll before calling
tm_finalize() to allow some other process to talk to the MOM? The
obituary could always come in after our polling has completed and we
have finalized. Remember that in our model, the desirable / "correct"
case (i.e., when TM is able to successfully spawn a process) will have
no obituaries arrive.
More specifically, I always assumed that spawning was a finite/bounded
event. Although spawning could be "slow," we will *always* be
eventually told of the success or failure of a spawn. Hence, there is
a distinct difference between when a spawn fails and when a
successfully-spawned process dies.
I'm hammering on this point because it's one thing to tell a user "your
processes failed to launch"; it's quite a different thing to tell a
user "your process died after it was launched." In the former case,
I'd start looking at spelling of the executable name, location of
executables, PATH and LD_LIBRARY_PATH, etc. In the latter case, I'd
look at the code to see what happened after I hit main(), look for core
files, etc. Right now, it's impossible for an MPI implementation to
tell the difference, and that's bad for users (and we, the MPI
implementors, get blamed for making the user search their code for
hours when, in reality, there was a typo on the mpirun command line
such that the executable name was misspelled :-) ).
So -- after all my explanations -- I'd like to ask if Torque's TM
implementation can be amended to include "unable to launch" in the
definition of "delayed error" for tm_spawn(). This would mean that
tm_poll() would return an error when the tm_spawn() task completed, and
consumers of TM (like MPI) would be able to differentiate between
failing to launch and failing after launch. Indeed, it would be really
great if Torque could provide a somewhat-fine-grained set of error
codes to differentiate between different reasons as to *why* something
failed to launch (couldn't find the executable, exec() failed, out of
resources, etc.).
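As a rough illustration of what such a set of codes might look like, here is a small C sketch that groups exec()/fork() errno values into the categories named above. Everything in it is hypothetical: the launch_error enum, its constant names, and classify_launch_errno() do not exist in TM or Torque, and the grouping of errno values is only one plausible choice.

```c
#include <errno.h>

/* Hypothetical launch-failure categories; none of these names exist in TM. */
enum launch_error {
    LAUNCH_OK = 0,
    LAUNCH_NOT_FOUND,        /* couldn't find the executable */
    LAUNCH_NOT_EXECUTABLE,   /* found it, but not allowed to run it */
    LAUNCH_NO_RESOURCES,     /* out of memory, processes, or descriptors */
    LAUNCH_EXEC_FAILED       /* exec() failed for some other reason */
};

/* Map an errno value from fork()/exec() onto a launch-failure category. */
enum launch_error classify_launch_errno(int err)
{
    switch (err) {
    case 0:
        return LAUNCH_OK;
    case ENOENT:
    case ENOTDIR:
        return LAUNCH_NOT_FOUND;
    case EACCES:
        return LAUNCH_NOT_EXECUTABLE;
    case ENOMEM:
    case EAGAIN:
    case ENFILE:
    case EMFILE:
        return LAUNCH_NO_RESOURCES;
    default:
        return LAUNCH_EXEC_FAILED;
    }
}

/* Human-readable text a launcher could show to the user. */
const char *launch_error_string(enum launch_error e)
{
    switch (e) {
    case LAUNCH_OK:             return "launched";
    case LAUNCH_NOT_FOUND:      return "executable not found";
    case LAUNCH_NOT_EXECUTABLE: return "executable not runnable (permissions)";
    case LAUNCH_NO_RESOURCES:   return "out of resources";
    case LAUNCH_EXEC_FAILED:    return "exec() failed";
    }
    return "unknown";
}
```

A launcher receiving such a code alongside the completed spawn event could then print "executable not found" for a misspelled name instead of an unexplained exit status.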
If you're still reading this extremely lengthy mail, thanks for your
time. :-)
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/