[torqueusers] TM improvements

Garrick Staples garrick at usc.edu
Fri Dec 2 19:21:24 MST 2005


On Thu, Dec 01, 2005 at 07:49:25PM -0500, Jeff Squyres alleged:
> On Nov 23, 2005, at 6:59 AM, Jeff Squyres wrote:
> 
> >>>3. If you tm_spawn() something that fails to launch properly (e.g.,
> >>>file not found on the remote node), there is no error notification 
> >>>sent
> >>>back to the process that invoked tm_spawn().
> >>
> >>$ pbsdsh -c 1 lkjahdsfljahdf
> >>pbsdsh: task 0 exit status 254
> >>
> >>$ pbsdsh -c 2 lkjahdsfljahdf
> >>pbsdsh: task 0 exit status 254
> >>pbsdsh: task 1 exit status 254
> >
> >I was quite definitely running into this problem with our TM launcher 
> >in Open MPI.  I don't have time at the moment (am catching a flight in 
> >a few hours), but I'll try to replicate it next week and send more 
> >details.
> 
> SC and the US Thanksgiving holiday are over, and I've [somewhat] 
> managed to dig out from the mound of work that accumulated over the 
> two, so I am now able to return to attempting to replicate this error.
> 
> I think I see the disconnect between what I am saying and what you are 
> reporting.  Specifically note that in your example, pbsdsh did *not* 
> report "executable not found" -- it just said that the task exited with 
> status 254.

But that is precisely what 254 means.  It means the final exec() failed.
You can poll for that.
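
For example, a minimal sketch of that polling against the stock <tm.h>
interface shipped with TORQUE/PBS (tm_init()/tm_finalize() setup and most
error handling trimmed).  This is only an illustration of the calls
involved, not code from pbsdsh or Open MPI:

/* Spawn one task, register an obit, and poll for the exit value.
 * MOM reports exit value 254 when the final exec() failed. */
#include <stdio.h>
#include <tm.h>

int spawn_and_check(int argc, char **argv, char **envp, tm_node_id where)
{
    tm_task_id tid;
    tm_event_t ev, result;
    int obitval = 0, tm_errno = 0;

    if (tm_spawn(argc, argv, envp, where, &tid, &ev) != TM_SUCCESS)
        return -1;
    tm_poll(TM_NULL_EVENT, &result, 1, &tm_errno);   /* spawn event done */

    if (tm_obit(tid, &obitval, &ev) != TM_SUCCESS)   /* ask for the obituary */
        return -1;
    tm_poll(TM_NULL_EVENT, &result, 1, &tm_errno);   /* blocks until the task dies */

    if (obitval == 254)
        fprintf(stderr, "remote exec() failed\n");
    return obitval;
}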


[...heavily snipped...]
> This is why I assumed that when tm_poll() indicated that the event 
> associated with tm_spawn() finished and I got local_err == 0 (referred 
> to as tm_errno in the PSCHED API document), then there were no delayed 
> errors, and the process must have been successfully launched.  
> Unfortunately, this is apparently not what PBS/Torque is indicating.  I 
> guess it comes down to what, exactly, is Torque's definition of a 
> delayed error...?
> 
> In my simple example above, I am not waiting for the death of the 
> spawned task -- I am firing and forgetting.  This is because of the 
> other problems that I mentioned -- specifically, until the patch a 
> little while ago, we could not keep a tm connection open.  Hence, we 
> would launch the process, poll once to check for error with the spawn, 
> and then tm_finalize() so that other processes could talk to the MOM.
> 
> Because of your example, I just went an examined the source of pbsdsh.  
> I see that spawn errors are only reported when the task "dies" -- i.e., 
> pbsdsh is notified through the tm_obit() / tm_poll() mechanism.  Hence, 
> we get the status 254 from the task death.
> 
> This is not what I expected -- indeed, it seems somewhat weird to me.   
>  Here's why:
> 
> 1. tm_obit() is supposed to notify you when a task dies.  But if it was 
> never successfully launched in the first place, how could it die?

But the task was successfully launched!

The "task" isn't the process you are executing.  It is all of the moving
parts in MOM required to manage remote processes.

Look at start_process() in resmom/start_exec.c.  The starter_return()
function near the bottom is the cut-off between a successful and failed
spawn.  It obviously can't happen after the exec(); the affirmative
message must be sent before the final exec().
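
As an illustration of that ordering -- a sketch of the pattern only, not
the actual start_process()/starter_return() code -- the acknowledgement
has to be written before exec(), and an exec() failure becomes the 254
exit status that later shows up in the obituary:

#include <unistd.h>

static void launch_task(int reply_fd, char **argv, char **envp)
{
    if (fork() == 0) {                 /* child becomes the user process */
        char ok = 0;
        write(reply_fd, &ok, 1);       /* "spawn succeeded", sent pre-exec() */
        execve(argv[0], argv, envp);
        _exit(254);                    /* only reached if exec() failed */
    }
    /* parent carries on; an exec() failure surfaces as exit status 254 */
}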

 
> 2. MPI makes the distinction in MPI_COMM_SPAWN between failing to 
> launch the desired application and dying after the launch.  
> Specifically, if the desired application fails to launch, 
> MPI_COMM_SPAWN will return the error, not something else later.

This is because MPI has the convenience of a feedback mechanism.
When you have child processes telling you they've started, it is
easy to know they were exec()'d.

 
> 3. Unix fork()/exec() semantics are similar to #2 (indeed, the 
> COMM_SPAWN semantics were at least partially inspired by fork()/exec() 
> semantics).  If fork() fails, you find out from the return of fork() -- 
> not by calling wait() to see what happened to the child.  And if exec() 
> fails, you find out from the return of exec(), not by launching a bogus 
> process that immediately returns a status of 254.  Granted, fork() and 
> exec() are synchronous, but if you extrapolate and make their 
> terminations subject to some kind of polling mechanism, I would expect 
> them to report their failures directly (e.g., when I poll for 
> completion of fork() and/or exec()).

Not really.  How does the parent process ever know the child passed
through the exec()?

 
> #2 and #3 are not directly applicable to TM, of course, but they 
> contributed to my thinking process / bias.  You may not agree with my 
> rationale :-), but I just want to explain my mindset and why I am 
> surprised by Torque's TM behavior.

tm_spawn() is exactly like fork()/exec(): you immediately know if
fork() failed, but you don't know about the exec().  For the latter you
need to use wait().  tm_obit()/tm_poll() is equivalent to wait().

 
> Also, calling tm_obit() and then looping over tm_poll() to check for 
> whether Open MPI was able to correctly spawn the process is not 
> feasible for us.  How long should we loop over tm_poll()?  It's always 
> a race condition -- consider if we were tm_spawn()ing on a slow node; 
> it could take a long time for tm_poll() to indicate that the task has 
> "died."  How long, exactly, should Open MPI poll before calling 
> tm_finalize() to allow some other process to talk to the MOM?  The 

How long does mpirun wait?  Clearly there must be a reasonable timeout.
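
One way to bound it -- a sketch only, assuming tm_poll()'s non-blocking
mode (wait == 0) leaves TM_NULL_EVENT in *result when nothing is pending,
and using an arbitrary 10-second limit:

#include <time.h>
#include <unistd.h>
#include <tm.h>

#define TIMEOUT_SEC 10   /* illustrative value, not a recommendation */

/* Poll for any TM event; give up after TIMEOUT_SEC seconds.
 * Returns 0 if an event arrived, 1 on timeout, -1 on a TM error. */
int poll_with_timeout(tm_event_t *result, int *tm_errno)
{
    time_t deadline = time(NULL) + TIMEOUT_SEC;

    while (time(NULL) < deadline) {
        if (tm_poll(TM_NULL_EVENT, result, 0, tm_errno) != TM_SUCCESS)
            return -1;
        if (*result != TM_NULL_EVENT)
            return 0;
        usleep(100000);                /* 100 ms between polls */
    }
    return 1;
}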


> obituary could always come in after our polling has completed and we 
> have finalized.  Remember that in our model, the desirable / "correct" 
> case (i.e., when TM is able to successfully spawn a process) will have 
> no obituaries arrive.
> 
> More specifically, I always assumed that spawning was a finite/bounded 
> event.  Although spawning could be "slow," we will *always* be 
> eventually told of the success or failure of a spawn.  Hence, there is 
> a distinct difference between when a spawn fails and when a 
> successfully-spawned process dies.

This is only true with a blocking wait() or an infinite loop over a
non-blocking wait().

The final "success" of the remote exec() must be reported out-of-band.  If
the exec() failed, then you get a 254 after tm_obit()/tm_poll().  But
you must have a feedback mechanism in your program to test for success.
This is exactly like fork()/exec()/wait().
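
For a local fork()/exec(), that feedback mechanism is typically a
close-on-exec pipe: the parent reads EOF if the exec() succeeded, or the
child's errno if it failed.  A sketch of that pattern -- just an
illustration of what out-of-band feedback looks like, not something TM
does for you:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 if the child reached exec(), else the child's errno. */
int spawn_with_feedback(char **argv)
{
    int fds[2], err = 0;

    if (pipe(fds) == -1)
        return errno;
    fcntl(fds[1], F_SETFD, FD_CLOEXEC);       /* write end closes on exec() */

    if (fork() == 0) {                        /* child */
        close(fds[0]);
        execvp(argv[0], argv);
        err = errno;                          /* exec() failed */
        write(fds[1], &err, sizeof err);
        _exit(127);
    }

    close(fds[1]);
    if (read(fds[0], &err, sizeof err) == 0)  /* EOF: exec() succeeded */
        err = 0;
    close(fds[0]);
    return err;
}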

 
> I'm hammering on this point because it's one thing to tell a user "your 
> processes failed to launch"; it's quite a different thing to tell a 
> user "your process died after it was launched."  In the former case, 
> I'd start looking at spelling of the executable name, location of 
> executables, PATH and LD_LIBRARY_PATH, etc.  In the latter case, I'd 
> look at the code to see what happened after I hit main(), look for core 
> files, etc.  Right now, it's impossible for an MPI implementation to 
> tell the difference, and that's bad for users (and we, the MPI 
> implementors, get blamed for making the user search their code for 
> hours when, in reality, there was a typo on the mpirun command line 
> such that the executable name was misspelled :-) ).

Have you tried this with a 2.0.0 TORQUE?  Problems after the exec() are
reported to the job's stderr:

(on a 4 proc job)
$ pbsdsh ljkhadf
PBS: ljkhadf: No such file or directory
PBS: ljkhadf: No such file or directory
PBS: ljkhadf: No such file or directory
pbsdsh: task 0 exit status 254
PBS: ljkhadf: No such file or directory
pbsdsh: task 1 exit status 254
pbsdsh: task 2 exit status 254
pbsdsh: task 3 exit status 254

$ pbsdsh ./src/c/tester
./src/c/tester: error while loading shared libraries: libmpich.so.1.0: cannot open shared object file: No such file or directory
./src/c/tester: error while loading shared libraries: libmpich.so.1.0: cannot open shared object file: No such file or directory
./src/c/tester: error while loading shared libraries: libmpich.so.1.0: cannot open shared object file: No such file or directory
pbsdsh: task 0 exit status 127
./src/c/tester: error while loading shared libraries: libmpich.so.1.0: cannot open shared object file: No such file or directory
pbsdsh: task 1 exit status 127
pbsdsh: task 2 exit status 127
pbsdsh: task 3 exit status 127

$ pbsdsh /bin/sh -c lkjhadsf
/bin/sh: line 1: lkjhadsf: command not found
/bin/sh: line 1: lkjhadsf: command not found
/bin/sh: line 1: lkjhadsf: command not found
pbsdsh: task 0 exit status 127
/bin/sh: line 1: lkjhadsf: command not found
pbsdsh: task 1 exit status 127
pbsdsh: task 2 exit status 127
pbsdsh: task 3 exit status 127


 
> So -- after all my explanations -- I'd like to ask if Torque's TM 
> implementation can be amended to include "unable to launch" in the 
> definition of "delayed error" for tm_spawn().  This would mean that 
> tm_poll() would return an error when the tm_spawn() task completed, and 
> consumers of TM (like MPI) would be able to differentiate between 
> failing to launch and failing after launch.  Indeed, it would be really 
> great if Torque could provide a somewhat-fine-grained set of error 
> codes to differentiate between different reasons as to *why* something 
> failed to launch (couldn't find the executable, exec() failed, out of 
> resources, etc.).

I think we have that now.  You just need to call tm_obit().

 
> If you're still reading this extremely lengthy mail, thanks for your 
> time.  :-)

Yes, but have you tested my patch? :)

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California

