[torqueusers] TM improvements
garrick at usc.edu
Tue Dec 6 11:02:01 MST 2005
On Tue, Dec 06, 2005 at 10:12:24AM -0500, Jeff Squyres alleged:
> On Dec 2, 2005, at 9:21 PM, Garrick Staples wrote:
> >>I think I see the disconnect between what I am saying and what you
> >>are reporting. Specifically note that in your example, pbsdsh did
> >>*not* report "executable not found" -- it just said that the task
> >>exited with status 254.
> >But that is precisely what 254 means. It means the final exec()
> >failed. You can poll for that.
> Gotcha. What exactly is 254? Is it an Exxxx errno code that I can
> compare to? If not, is there a documented list of the codes that I
> can compare against?
254>>1 = 127 = Bourne shell for "command not found"
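
For anyone who wants to check the 127 convention, here is a minimal sketch that runs a command through /bin/sh with system() and decodes the wait status. POSIX shells exit with 127 when the command cannot be found. The helper name shell_exit_status is made up for illustration; none of this is TORQUE or TM code, and it does not show how pbs_mom arrives at the status it reports.

```c
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (illustration only): run cmd through the shell and
 * return its exit status, or -1 if the shell could not be run or the
 * command did not exit normally (e.g. it was killed by a signal).
 */
int shell_exit_status(const char *cmd)
{
    int st = system(cmd);

    if (st == -1 || !WIFEXITED(st))
        return -1;

    /* WEXITSTATUS extracts the low 8 bits of the exit code. */
    return WEXITSTATUS(st);
}
```

Calling shell_exit_status("ljkhadf") should give 127 on a system where no such command exists.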
> >>3. Unix fork()/exec() semantics are similar to #2 (indeed, the
> >>COMM_SPAWN semantics were at least partially inspired by fork()/
> >>exec() semantics). If fork() fails, you find out from the return of
> >>fork() -- not by calling wait() to see what happened to the child.
> >>And if exec() fails, you find out from the return of exec(), not by
> >>launching a process that immediately returns a status of 254.
> >>Granted, fork() and exec() are synchronous, but if you extrapolate
> >>and make their terminations subject to some kind of polling
> >>mechanism, I would expect them to report their failures directly
> >>(e.g., when I poll for completion of fork() and/or exec()).
> >Not really. How does the parent process ever know the child passed
> >through the exec()?
> I guess this is where our disconnect is -- there are ways to do this.
> For example -- use a close-on-exec pipe. The parent can block on a
> pipe after the fork() -- if it closes, the exec() succeeded. If the
> child's exec() fails, it can send a message back up the pipe saying
> "help, I failed!" This is not 100% foolproof, because at some point
> during exec(), the pipe will close but exec() could still fail, but
> it usually covers many common cases of failure (e.g., file not found,
> access denied, etc.).
> /* Set the writing end to be close-on-exec */
> fcntl(fd, F_SETFD, FD_CLOEXEC);
That's a terrific method! I've never seen that before (that's why you
are a programmer and I'm a sysadmin)! Is this portable?
I can definitely roll that technique into task and job launching.
Hehe, you could have done this yourself and sent me a patch in less time
than it took to explain it to me. But I appreciate it.
> >Have you tried this on a 2.0.0 TORQUE? Problems after the exec() are
> >reported to the job's stderr:
> >(on a 4 proc job)
> >$ pbsdsh ljkhadf
> >PBS: ljkhadf: No such file or directory
> This is a big improvement -- thanks!
> What does PBS Pro do here? I.e., can we rely on the same behavior
> from them? (back to my "TM consumers need a more-or-less single set
> of behaviors to code to/rely on" mantra)
I don't know. I don't have PBS Pro.
Garrick Staples, Linux/HPCC Administrator
University of Southern California