[torqueusers] TM improvements

Jeff Squyres jsquyres at open-mpi.org
Tue Dec 6 12:43:24 MST 2005


On Dec 6, 2005, at 1:02 PM, Garrick Staples wrote:

>> Gotcha.  What exactly is 254?  Is it an Exxxx errno code that I can
>> compare to?  If not, is there a documented list of the codes that I
>> can compare against?
>
> 254<<1 = 127 = bourne for "command not found"

Hmm -- I don't understand your math there: 254 << 1 == 508.

I'm also curious as to why you specified "bourne" -- are all error  
statuses reported per bourne shell semantics?  What if the user's  
default shell is something other than bourne?

According to the Bash man page, I see the following:

     A full search of the directories in PATH is performed only if the
     command is not found in the hash table.  If the search is
     unsuccessful, the shell prints an error message and returns an
     exit status of 127.

So I can see where you get 127, but I don't understand the  
transformation from 254.  Is that something that Torque does?
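
For what it's worth, a one-bit shift in the other direction does land on 127. The sketch below only checks the arithmetic; the helper name is made up for illustration, and whether Torque actually shifts the status this way is an assumption on my part, not something I have confirmed.

```c
/* Hypothetical helper, not part of Torque or the TM API:
 * undo a one-bit left shift of a reported status value. */
static int unshift_status(int reported)
{
    return reported >> 1;   /* 254 >> 1 == 127 */
}

/* The shift as originally written goes the other way:
 * 254 << 1 == 508, which is not a valid 8-bit exit status. */
static int shift_left_once(int reported)
{
    return reported << 1;
}
```

If that guess is right, I would compare unshift_status(254) against 127 rather than comparing 254 against an errno value.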

>> For example -- use a close-on-exec pipe.  The parent can block on a
>> pipe after the fork() -- if it closes, the exec() succeeded.  If the
>> child's exec() fails, it can send a message back up the pipe saying
>> "help, I failed!"  This is not 100% foolproof, because at some point
>> during exec(), the pipe will close but exec() could still fail, but
>> it usually covers many common cases of failure (e.g., file not found,
>> access denied, etc.).
>
>>         /* Set the writing end to be close-on-exec */
>>         fcntl(fd[1], F_SETFD, FD_CLOEXEC);
>
> That's a terrific method!  I've never seen that before (that's why you
> are a programmer and I'm a sysadmin)!  Is this portable?

Heh.  I'm of a firm belief that all programmers should be a sysadmin
for a year (and vice versa).  The world would be a better place.

Yes, close-on-exec is portable.  We use it in LAM/MPI; it's discussed  
in Stevens (I haven't read the new version yet, though).

How far exec() goes before closing fd's is something that would need  
to be tested on different OS's.
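
To make the idea concrete, here is a minimal sketch of the close-on-exec pipe technique described above. It is not the LAM/MPI or Torque code; the function name spawn_checked and its return convention are assumptions made up for this example. The child writes its errno into the pipe only if exec fails; the parent treats end-of-file on the pipe as "exec started".

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork and exec argv[0].
 * Returns 0 if exec() got far enough to close the pipe, otherwise
 * the errno from the failing call (pipe, fcntl, fork, read, or the
 * child's exec).  On success, *pid_out holds the child's pid. */
static int spawn_checked(char *const argv[], pid_t *pid_out)
{
    int fd[2];
    int child_errno = 0;
    int saved;
    ssize_t n;
    pid_t pid;

    if (pipe(fd) < 0)
        return errno;

    /* Set the writing end to be close-on-exec */
    if (fcntl(fd[1], F_SETFD, FD_CLOEXEC) < 0) {
        saved = errno;
        close(fd[0]);
        close(fd[1]);
        return saved;
    }

    pid = fork();
    if (pid < 0) {
        saved = errno;
        close(fd[0]);
        close(fd[1]);
        return saved;
    }

    if (pid == 0) {
        /* Child: the lines after execvp() run only if exec failed. */
        close(fd[0]);
        execvp(argv[0], argv);
        child_errno = errno;
        if (write(fd[1], &child_errno, sizeof(child_errno)) < 0) {
            /* Nothing more we can do; the parent will see EOF. */
        }
        _exit(127);
    }

    /* Parent: drop our copy of the write end, then block on the pipe. */
    close(fd[1]);
    do {
        n = read(fd[0], &child_errno, sizeof(child_errno));
    } while (n < 0 && errno == EINTR);
    saved = errno;
    close(fd[0]);

    if (n < 0)
        return saved;                 /* read itself failed */
    if (n == (ssize_t) sizeof(child_errno)) {
        waitpid(pid, NULL, 0);        /* reap the failed child */
        return child_errno;           /* e.g., ENOENT or EACCES */
    }

    /* n == 0: the pipe closed without a message, so exec() started. */
    if (pid_out != NULL)
        *pid_out = pid;
    return 0;
}
```

A caller would pass a NULL-terminated argv and check the result: 0 means the child is running (and still needs to be reaped later), while a value such as ENOENT means the command was not found. As noted above, this does not catch failures that happen after exec() has already closed the descriptor.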

> I can definitely roll that technique into task and job launching.

Does that mean the return from the poll for tm_spawn's event will  
show the error?

> Hehe, you could have done this yourself and sent me a patch in less
> time than it took to explain it to me.  But I appreciate it.

I'm at a university, what did you expect?  ;-)

Thanks!

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




