[torqueusers] TM improvements
Jeff Squyres
jsquyres at open-mpi.org
Tue Dec 6 08:12:24 MST 2005
On Dec 2, 2005, at 9:21 PM, Garrick Staples wrote:
>> I think I see the disconnect between what I am saying and what you
>> are
>> reporting. Specifically note that in your example, pbsdsh did *not*
>> report "executable not found" -- it just said that the task exited
>> with
>> status 254.
>
> But that is precisely what 254 means. It means the final exec()
> failed.
> You can poll for that.
Gotcha. What exactly is 254? Is it an Exxxx errno code that I can
compare to? If not, is there a documented list of the codes that I
can compare against?
>> 1. tm_obit() is supposed to notify you when a task dies. But if
>> it was
>> never successfully launched in the first place, how could it die?
>
> But the task was successfully launched!
>
> The "task" isn't the process you are executing. It is all of the
> moving
> parts in MOM required to manage remote processes.
Ok, I can see your point here -- I did not make that distinction when
I read the documentation (and the point is not explicitly made
anywhere I can see).
>> 2. MPI makes the distinction in MPI_COMM_SPAWN between failing to
>> launch the desired application and dying after the launch.
>> Specifically, if the desired application fails to launch,
>> MPI_COMM_SPAWN will return the error, not something else later.
>
> This is because MPI has the convenience of having a feedback
> mechanism.
> When you have child processes telling you they've started, it makes
> it easy to know they were exec()'d.
Although I think you're referring to the MPI_INIT rendezvous, a) note
that not all environments use a rendezvous during MPI_INIT, and b) TM
has a feedback mechanism, too -- your MOM communications (it's not
the same as the MPI_INIT rendezvous, but it *is* a feedback system).
Ok, I'm splitting hairs here :-) -- my main point is that not all
environments require a rendezvous during MPI_INIT. So we can't use
that as a guarantee to know that processes have launched.
As such, in Open MPI, we do not rely on the rendezvous during
MPI_INIT to determine whether a process has launched or not. We rely
on the underlying launcher system to tell us. Also keep in mind that
we distinguish between failure to launch a process and failure after
the process has started (to include failing to rendezvous during
MPI_INIT).
My point here is that a process-launching system should be able to
tell you in a finite/bounded time whether something successfully
launched or not (i.e., not be subject to arbitrary timeouts). See
below.
>> 3. Unix fork()/exec() semantics are similar to #2 (indeed, the
>> COMM_SPAWN semantics were at least partially inspired by fork()/
>> exec()
>> semantics). If fork() fails, you find out from the return of
>> fork() --
>> not by calling wait() to see what happened to the child. And if
>> exec()
>> fails, you find out from the return of exec(), not by launching a
>> bogus
>> process that immediately returns a status of 254. Granted, fork()
>> and
>> exec() are synchronous, but if you extrapolate and make their
>> terminations subject to some kind of polling mechanism, I would
>> expect
>> them to report their failures directly (e.g., when I poll for
>> completion of fork() and/or exec()).
>
> Not really. How does the parent process ever know the child passed
> through the exec()?
I guess this is where our disconnect is -- there are ways to do this.
For example -- use a close-on-exec pipe. The parent can block on a
pipe after the fork() -- if it closes, the exec() succeeded. If the
child's exec() fails, it can send a message back up the pipe saying
"help, I failed!" This is not 100% foolproof, because at some point
during exec(), the pipe will close but exec() could still fail, but
it usually covers many common cases of failure (e.g., file not found,
access denied, etc.).
A notable case that this trick does *not* handle, though, is missing
shared libraries (because the exec() succeeds and the process starts
-- but then quickly aborts). But:
a) Per text later in your reply, the stderr that the linker outputs
will be displayed, and all is good (i.e., the user is properly
notified of the correct error).
b) Also, from my point of view, this is perfectly acceptable -- all I
want to know is whether the process *launched*. If something
happened *after* that, that's a different class of error as far as
I'm concerned.
Here's a trivial program that I just hacked up to show this trick
(pardon the lack of error checking -- it's a proof-of-concept
intended for example only):
-----
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

char *to_exec[] = { "/bin/echo", "hello there", NULL };

int main(int argc, char* argv[]) {
    pid_t pid;
    int fd[2];
    int ret;
    int msg = 0;   /* value passed over the pipe */

    /* avoiding most error checking for the sake of this example */
    pipe(fd);
    pid = fork();
    if (pid == 0) {
        /* child */
        /* close the reading end */
        close(fd[0]);
        /* Set the writing end to be close-on-exec */
        fcntl(fd[1], F_SETFD, FD_CLOEXEC);
        /* sync with parent -- indicating "ready!" */
        write(fd[1], &msg, sizeof(int));
        /* do the exec */
        execv(to_exec[0], to_exec);
        /* doh! -- an error occurred; send errno so the parent sees why */
        msg = errno;
        write(fd[1], &msg, sizeof(int));
        exit(1);
    } else {
        /* parent */
        /* close the writing end */
        close(fd[1]);
        /* wait for child to sync */
        read(fd[0], &msg, sizeof(int));
        /* Now wait for pipe to either close or indicate an error */
        while (1) {
            /* keep the byte count (ret) separate from the payload (msg) */
            ret = read(fd[0], &msg, sizeof(int));
            if (ret < 0) {
                if (errno == EINTR) {
                    continue;
                }
                perror("error in parent read");
                exit(1);
            } else if (0 == ret) {
                printf("Got eof from child -- must be a successful launch\n");
                exit(0);
            } else {
                printf("Got non-eof from child -- error %d\n", msg);
                exit(1);
            }
        }
    }
    /* never gets here */
    return 0;
}
-----
Try playing with different values in to_exec[] and you'll see what I
mean.
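For what it's worth, the same trick folds into a function that a launcher could call once per process and get a bounded answer from. What follows is only a sketch under my own assumptions: launch_checked() is a made-up name (it is not part of TM, Torque, or Open MPI), and it reports an exec() failure by passing the child's errno back up the pipe.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * launch_checked() is a hypothetical helper, named here for illustration only.
 * Returns 0 if the child got through execv(), the child's errno value if
 * execv() failed, or -1 if pipe()/fork()/read() failed in the parent.
 * On a 0 return, *pid_out holds the pid of the running child.
 */
int launch_checked(char *const argv[], pid_t *pid_out)
{
    int fd[2];
    int child_errno = 0;
    ssize_t n;
    pid_t pid;

    if (pipe(fd) < 0) {
        return -1;
    }
    pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (pid == 0) {
        /* child: keep only the write end, closed automatically on exec */
        close(fd[0]);
        fcntl(fd[1], F_SETFD, FD_CLOEXEC);
        execv(argv[0], argv);
        /* only reached if execv() failed: tell the parent why */
        child_errno = errno;
        n = write(fd[1], &child_errno, sizeof(child_errno));
        (void) n;
        _exit(127);
    }

    /* parent: EOF means the exec went through, data means it failed */
    close(fd[1]);
    do {
        n = read(fd[0], &child_errno, sizeof(child_errno));
    } while (n < 0 && errno == EINTR);
    close(fd[0]);

    if (n == 0) {
        *pid_out = pid;
        return 0;
    }
    /* launch failed (or the read did): reap the child, leave no zombie */
    waitpid(pid, NULL, 0);
    return (n == (ssize_t) sizeof(child_errno)) ? child_errno : -1;
}
```

With this, a nonexistent executable comes back as ENOENT from the launch call itself, while a program that starts and then exits non-zero comes back as a successful launch whose exit status shows up later through waitpid(). That is the "failed to launch" vs. "failed after launch" distinction I'm after, and the parent never needs a timeout to get it.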
>> Also, calling tm_obit() and then looping over tm_poll() to check for
>> whether Open MPI was able to correctly spawn the process is not
>> feasible for us. How long should we loop over tm_poll()? It's
>> always
>> a race condition -- consider if we were tm_spawn()ing on a slow node;
>> it could take a long time for tm_poll() to indicate that the task has
>> "died." How long, exactly, should Open MPI poll before calling
>> tm_finalize() to allow some other process to talk to the MOM? The
>
> How long does mpirun wait? Clearly there must be a reasonable
> timeout.
No, that's not what I'm saying. mpirun (at least in Open MPI) is a
sequential set of clearly-defined phases. First we start processes
(and get a success/fail for each of them), then the rendezvous during
MPI_INIT occurs (but only if necessary, and mpirun itself may or may
not have visibility on it -- the rendezvous may be handled by an
independent agent of which mpirun is not a part), then mpirun waits
for processes to MPI_FINALIZE and/or be notified by the launcher that
the processes have died.
My point is that the first phase needs to be bounded -- but not by an
arbitrary timeout (because arbitrary timeouts can always fail for
unforeseen "slow" scenarios). If the first phase is not bounded,
then I'll have to mix phases and make a generalized progression
engine, which I *really* do not want to do (it would have to be quite
complex).
> Have you tried this on a 2.0.0 TORQUE? Problems after the exec() are
> reported to the job's stderr:
>
> (on a 4 proc job)
> $ pbsdsh ljkhadf
> PBS: ljkhadf: No such file or directory
This is a big improvement -- thanks!
What does PBS Pro do here? I.e., can we rely on the same behavior
from them? (back to my "TM consumers need a more-or-less single set
of behaviors to code to/rely on" mantra)
>> So -- after all my explanations -- I'd like to ask if Torque's TM
>> implementation can be amended to include "unable to launch" in the
>> definition of "delayed error" for tm_spawn(). This would mean that
>> tm_poll() would return an error when the tm_spawn() task
>> completed, and
>> consumers of TM (like MPI) would be able to differentiate between
>> failing to launch and failing after launch. Indeed, it would be
>> really
>> great if Torque could provide a somewhat-fine-grained set of error
>> codes to differentiate between different reasons as to *why*
>> something
>> failed to launch (couldn't find the executable, exec() failed, out of
>> resources, etc.).
>
> I think we have that now. You just need to call tm_obit().
Well, that's not exactly what I was asking for (it still has the
timeout problem). :-)
Specifically, in accordance with the PSCHED API document, I was
asking for the delayed error to be reported in the return from
tm_poll() from the spawn. With a trick like the close-on-exec pipe,
this is certainly possible.
Per your definitions, I guess I'm asking for "delayed error" to
include "task immediately failed due to process failing to start."
> Yes, but have you tested my patch? :)
Our development cluster is still running Torque 1.2.something... :-(
I could swear that I had filed a ticket to get it upgraded, but
apparently I didn't. So I filed one over this past weekend with our
sysadmins to upgrade it to 2.0 + your patch.
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/