[torqueusers] TM improvements

Garrick Staples garrick at usc.edu
Tue Dec 6 11:02:01 MST 2005


On Tue, Dec 06, 2005 at 10:12:24AM -0500, Jeff Squyres alleged:
> On Dec 2, 2005, at 9:21 PM, Garrick Staples wrote:
> 
> >>I think I see the disconnect between what I am saying and what you are
> >>reporting.  Specifically note that in your example, pbsdsh did *not*
> >>report "executable not found" -- it just said that the task exited with
> >>status 254.
> >
> >But that is precisely what 254 means.  It means the final exec() failed.
> >You can poll for that.
> 
> Gotcha.  What exactly is 254?  Is it an Exxxx errno code that I can  
> compare to?  If not, is there a documented list of the codes that I  
> can compare against?

254>>1 = 127 = Bourne shell for "command not found"
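
For reference, the shell side of that convention is easy to observe: a POSIX shell exits with status 127 when it cannot find the command, and 254 shifted right by one bit is 127. A minimal sketch in C, assuming a POSIX system; shell_exit_status() is a hypothetical helper name, and whether TORQUE actually derives 254 from 127 this way is taken from the line above, not checked against the source:

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper: run cmd through the shell with system() and
 * return the shell's exit status, or -1 if it did not exit normally. */
int shell_exit_status(const char *cmd)
{
    int st = system(cmd);

    if (st == -1 || !WIFEXITED(st))
        return -1;
    return WEXITSTATUS(st);
}
```

Calling shell_exit_status() with a command name that does not exist on the PATH should return 127.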

 
> >>3. Unix fork()/exec() semantics are similar to #2 (indeed, the
> >>COMM_SPAWN semantics were at least partially inspired by fork()/exec()
> >>semantics).  If fork() fails, you find out from the return of fork() --
> >>not by calling wait() to see what happened to the child.  And if exec()
> >>fails, you find out from the return of exec(), not by launching a bogus
> >>process that immediately returns a status of 254.  Granted, fork() and
> >>exec() are synchronous, but if you extrapolate and make their
> >>terminations subject to some kind of polling mechanism, I would expect
> >>them to report their failures directly (e.g., when I poll for
> >>completion of fork() and/or exec()).
> >
> >Not really.  How does the parent process ever know the child passed
> >through the exec()?
> 
> I guess this is where our disconnect is -- there are ways to do this.
> 
> For example -- use a close-on-exec pipe.  The parent can block on a  
> pipe after the fork() -- if it closes, the exec() succeeded.  If the  
> child's exec() fails, it can send a message back up the pipe saying  
> "help, I failed!"  This is not 100% foolproof, because at some point  
> during exec(), the pipe will close but exec() could still fail, but  
> it usually covers many common cases of failure (e.g., file not found,  
> access denied, etc.).

>         /* Set the writing end to be close-on-exec */
>         fcntl(fd[1], F_SETFD, FD_CLOEXEC);
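
A fuller sketch of the close-on-exec pipe technique described above, for anyone who wants to try it. This is an illustration only, not TORQUE code: spawn_checked() is a hypothetical name, and the choice to send the child's errno back up the pipe is an assumption about what the "help, I failed!" message could contain. fcntl() with F_SETFD and FD_CLOEXEC is specified by POSIX.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork and exec argv[0] (searched on PATH).
 * Returns the child's pid once the exec() is known to have succeeded.
 * Returns -1 on failure, with the errno of the failing call in *err. */
pid_t spawn_checked(char *const argv[], int *err)
{
    int fd[2];
    int child_errno = 0;
    ssize_t n;
    pid_t pid;

    if (pipe(fd) < 0) {
        *err = errno;
        return -1;
    }

    /* Set the writing end to be close-on-exec */
    if (fcntl(fd[1], F_SETFD, FD_CLOEXEC) < 0 || (pid = fork()) < 0) {
        *err = errno;
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: a successful exec closes fd[1] without writing anything. */
        close(fd[0]);
        execvp(argv[0], argv);

        /* Only reached if exec failed: report errno to the parent. */
        child_errno = errno;
        if (write(fd[1], &child_errno, sizeof child_errno) < 0) {
            /* nothing more the child can do */
        }
        _exit(127);
    }

    /* Parent: block until the pipe closes (exec ok) or errno arrives. */
    close(fd[1]);
    do {
        n = read(fd[0], &child_errno, sizeof child_errno);
    } while (n < 0 && errno == EINTR);
    close(fd[0]);

    if (n == 0) {               /* EOF: the exec went through */
        *err = 0;
        return pid;
    }

    /* The exec failed (or the read did): reap the child, report why. */
    *err = (n == (ssize_t)sizeof child_errno) ? child_errno : EIO;
    waitpid(pid, NULL, 0);
    return -1;
}
```

With this, calling spawn_checked() on a misspelled command name returns -1 with *err set to ENOENT, instead of leaving the caller to infer the failure from an exit status later. As noted above, it does not catch failures that happen after the descriptor has already been closed during exec().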

That's a terrific method!  I've never seen that before (that's why you
are a programmer and I'm a sysadmin)!  Is this portable?

I can definitely roll that technique into task and job launching.

Hehe, you could have done this yourself and sent me a patch in less time
than it took to explain it to me.  But I appreciate it.


> >Have you tried this on a 2.0.0 TORQUE?  Problems after the exec() are
> >reported to the job's stderr:
> >
> >(on a 4 proc job)
> >$ pbsdsh ljkhadf
> >PBS: ljkhadf: No such file or directory
> 
> This is a big improvement -- thanks!
> 
> What does PBS Pro do here?  I.e., can we rely on the same behavior  
> from them?  (back to my "TM consumers need a more-or-less single set  
> of behaviors to code to/rely on" mantra)

I don't know.  I don't have PBS Pro.

 
-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California