[torqueusers] TM improvements

Jeff Squyres jsquyres at open-mpi.org
Tue Dec 6 08:12:24 MST 2005

On Dec 2, 2005, at 9:21 PM, Garrick Staples wrote:

>> I think I see the disconnect between what I am saying and what you  
>> are
>> reporting.  Specifically note that in your example, pbsdsh did *not*
>> report "executable not found" -- it just said that the task exited  
>> with
>> status 254.
> But that is precisely what 254 means.  It means the final exec()  
> failed.
> You can poll for that.

Gotcha.  What exactly is 254?  Is it an Exxxx errno code that I can  
compare to?  If not, is there a documented list of the codes that I  
can compare against?

>> 1. tm_obit() is supposed to notify you when a task dies.  But if  
>> it was
>> never successfully launched in the first place, how could it die?
> But the task was successfully launched!
> The "task" isn't the process you are executing.  It is all of the  
> moving
> parts in MOM required to manage remote processes.

Ok, I can see your point here -- I did not make that distinction when  
I read the documentation (and the point is not explicitly made  
anywhere I can see).

>> 2. MPI makes the distinction in MPI_COMM_SPAWN between failing to
>> launch the desired application and dying after the launch.
>> Specifically, if the desired application fails to launch,
>> MPI_COMM_SPAWN will return the error, not something else later.
> This is because MPI has the convenience of having a feedback
> mechanism.
> When you have child processes telling you they've started, it makes
> it easy to know  they were exec()'d.

Although I think you're referring to the MPI_INIT rendezvous, a) note  
that not all environments use a rendezvous during MPI_INIT, and b) TM  
has a feedback mechanism, too -- your MOM communications (it's not  
the same as the MPI_INIT rendezvous, but it *is* a feedback system).   
Ok, I'm splitting hairs here :-) -- my main point is that not all  
environments require a rendezvous during MPI_INIT.  So we can't use
that as a guarantee that processes have launched.

As such, in Open MPI, we do not rely on the rendezvous during  
MPI_INIT to determine whether a process has launched or not.  We rely  
on the underlying launcher system to tell us.  Also keep in mind that  
we distinguish between failure to launch a process and failure after  
the process has started (to include failing to rendezvous during MPI_INIT).

My point here is that a process-launching system should be able to  
tell you in a finite/bounded time whether something successfully  
launched or not (i.e., not be subject to arbitrary timeouts).

>> 3. Unix fork()/exec() semantics are similar to #2 (indeed, the
>> COMM_SPAWN semantics were at least partially inspired by fork()/ 
>> exec()
>> semantics).  If fork() fails, you find out from the return of
>> fork() --
>> not by calling wait() to see what happened to the child.  And if  
>> exec()
>> fails, you find out from the return of exec(), not by launching a  
>> bogus
>> process that immediately returns a status of 254.  Granted, fork()  
>> and
>> exec() are synchronous, but if you extrapolate and make their
>> terminations subject to some kind of polling mechanism, I would  
>> expect
>> them to report their failures directly (e.g., when I poll for
>> completion of fork() and/or exec()).
> Not really.  How does the parent process ever know the child passed
> through the exec()?

I guess this is where our disconnect is -- there are ways to do this.

For example -- use a close-on-exec pipe.  The parent can block on a  
pipe after the fork() -- if it closes, the exec() succeeded.  If the  
child's exec() fails, it can send a message back up the pipe saying  
"help, I failed!"  This is not 100% foolproof, because at some point  
during exec(), the pipe will close but exec() could still fail, but  
it usually covers many common cases of failure (e.g., file not found,  
access denied, etc.).

A notable case that this trick does *not* handle, though, is missing  
shared libraries (because the exec() succeeds and the process starts  
-- but then quickly aborts).  But:

a) Per text later in your reply, the stderr that the linker outputs  
will be displayed, and all is good (i.e., the user is properly  
notified of the correct error).
b) Also, from my point of view, this is perfectly acceptable -- all I  
want to know is whether the process *launched*.  If something  
happened *after* that, that's a different class of error as far as  
I'm concerned.

Here's a trivial program that I just hacked up to show this trick  
(pardon the lack of error checking -- it's a proof-of-concept  
intended for example only):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

char *to_exec[] = { "/bin/echo", "hello there", NULL };

int main(int argc, char* argv[]) {
     pid_t pid;
     int fd[2];
     int ret, msg;

     /* avoiding most error checking for the sake of this example */
     pipe(fd);
     pid = fork();
     if (pid == 0) {
         /* child */
         /* close the reading end */
         close(fd[0]);
         /* Set the writing end to be close-on-exec */
         fcntl(fd[1], F_SETFD, FD_CLOEXEC);
         /* sync with parent -- indicating "ready!" */
         msg = 0;
         write(fd[1], &msg, sizeof(int));
         /* do the exec */
         execv(to_exec[0], to_exec);
         /* doh! -- an error occurred; tell the parent why */
         msg = errno;
         write(fd[1], &msg, sizeof(int));
         exit(1);
     } else {
         /* parent */
         /* close the writing end */
         close(fd[1]);
         /* wait for child to sync */
         read(fd[0], &msg, sizeof(int));
         /* Now wait for the pipe to either close (exec succeeded)
            or deliver an error code */
         while (1) {
             ret = read(fd[0], &msg, sizeof(int));
             if (ret < 0) {
                 if (errno == EINTR) {
                     continue;
                 }
                 perror("error in parent read");
                 break;
             } else if (0 == ret) {
                 printf("Got eof from child -- must be a successful exec\n");
                 break;
             } else {
                 printf("Got non-eof from child -- error %d\n", msg);
                 break;
             }
         }
     }
     return 0;
}
Try playing with different values in to_exec[] and you'll see what I
mean.
>> Also, calling tm_obit() and then looping over tm_poll() to check for
>> whether Open MPI was able to correctly spawn the process is not
>> feasible for us.  How long should we loop over tm_poll()?  It's  
>> always
>> a race condition -- consider if we were tm_spawn()ing on a slow node;
>> it could take a long time for tm_poll() to indicate that the task has
>> "died."  How long, exactly, should Open MPI poll before calling
>> tm_finalize() to allow some other process to talk to the MOM?  The
> How long does mpirun wait?  Clearly there must be a reasonable  
> timeout.

No, that's not what I'm saying.  mpirun (at least in Open MPI) is a  
sequential set of clearly-defined phases.  First we start processes  
(and get a success/fail for each of them), then the rendezvous during  
MPI_INIT occurs (but only if necessary, and mpirun itself may or may  
not have visibility on it -- the rendezvous may be handled by an  
independent agent of which mpirun is not a part), then mpirun waits  
for processes to MPI_FINALIZE and/or be notified by the launcher that  
the processes have died.

My point is that the first phase needs to be bounded -- but not by an  
arbitrary timeout (because arbitrary timeouts can always fail for  
unforeseen "slow" scenarios).  If the first phase is not bounded,  
then I'll have to mix phases and make a generalized progression
engine, which I *really* do not want to do (it would have to be quite
complex).

> Have you tried this a 2.0.0 TORQUE?  Problems after the exec() are
> reported to the job's stderr:
> (on a 4 proc job)
> $ pbsdsh ljkhadf
> PBS: ljkhadf: No such file or directory

This is a big improvement -- thanks!

What does PBS Pro do here?  I.e., can we rely on the same behavior  
from them?  (back to my "TM consumers need a more-or-less single set  
of behaviors to code to/rely on" mantra)

>> So -- after all my explanations -- I'd like to ask if Torque's TM
>> implementation can be amended to include "unable to launch" in the
>> definition of "delayed error" for tm_spawn().  This would mean that
>> tm_poll() would return an error when the tm_spawn() task  
>> completed, and
>> consumers of TM (like MPI) would be able to differentiate between
>> failing to launch and failing after launch.  Indeed, it would be  
>> really
>> great if Torque could provide a somewhat-fine-grained set of error
>> codes to differentiate between different reasons as to *why*  
>> something
>> failed to launch (couldn't find the executable, exec() failed, out of
>> resources, etc.).
> I think we have that now.  You just need to call tm_obit().

Well, that's not exactly what I was asking for (it still has the
timeout problem).  :-)

Specifically, in accordance with the PSCHED API document, I was  
asking for the delayed error to be reported in the return from
tm_poll() from the spawn.  With a trick like the close-on-exec pipe, this is
certainly possible.

Per your definitions, I guess I'm asking for "delayed error" to  
include "task immediately failed due to process failing to start."
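To make the desired semantics concrete, here's a pseudocode-style sketch of what a TM consumer would do.  It is not runnable; the tm_* calls follow my reading of the PSCHED-style interface in this thread, and the exact signatures and error constants may not match Torque's tm.h:

```
/* pseudocode sketch -- not runnable; signatures approximate the
   PSCHED-style TM interface discussed in this thread */
tm_event_t spawn_event, result;
tm_task_id tid;
int tm_errno;

tm_spawn(argc, argv, envp, where, &tid, &spawn_event);

/* later, in the consumer's normal event loop: */
tm_poll(TM_NULL_EVENT, &result, 1, &tm_errno);
if (result == spawn_event && tm_errno != TM_SUCCESS) {
    /* delayed error: the task never launched (e.g., exec() failed).
       Ideally tm_errno would say *why* -- not found, permission
       denied, out of resources, etc. */
}
```

The point is that the consumer learns of the launch failure from the spawn event itself, in bounded time, with no polling timeout involved.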

> Yes, but have you tested my patch? :)

Our development cluster is still running Torque 1.2.something... :-(
I could swear that I had filed a ticket to get it upgraded, but
apparently I didn't.  So I filed one over this past weekend with our  
sysadmins to upgrade it to 2.0 + your patch.

-- 
Jeff Squyres
The Open MPI Project
http://www.open-mpi.org/
