[torqueusers] TM improvements

Jeff Squyres jsquyres at open-mpi.org
Tue Dec 6 14:02:37 MST 2005


On Dec 6, 2005, at 3:42 PM, Garrick Staples wrote:

>>> 254<<1 = 127 = bourne for "command not found"
>>
>> Hmm -- I don't understand your math there: 254 << 1 == 508.
>
> My bad.  I mean 254>>1.

Ah, ok.  So the LSB is reserved for something...?
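For reference, the shift arithmetic above can be checked directly. This is only a minimal sketch: `unshift_status` is a hypothetical helper name, and nothing here is taken from TORQUE's source. It shows that 254 is even (its LSB is 0), so a one-bit right shift yields 127 without losing information.

```c
#include <assert.h>

/* Compile-time checks of the two shifts discussed in this thread. */
static_assert((254 << 1) == 508, "left shift by one doubles the value");
static_assert((254 >> 1) == 127, "right shift by one halves the value");
static_assert((254 & 1) == 0, "the least significant bit of 254 is clear");

/* Hypothetical helper: recover a status that was reported shifted left
 * by one bit.  Whether TORQUE actually does this is an open question. */
static int unshift_status(int reported)
{
    return reported >> 1;
}
```

With this, `unshift_status(254)` evaluates to 127, the Bourne shell "command not found" value.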

>> I'm also curious as to why you specified "bourne" -- are all error
>> statuses reported per bourne shell semantics?  What if the user's
>> default shell is something other than bourne?
>
> The user's shell isn't involved with spawning tasks.  The specified
> command is directly exec()'d.

Hmm.  This seems inconsistent:

Using tcsh:

-----
[15:49] vogon:~ % asdfasdfasdf
asdfasdfasdf: Command not found.
[15:49] vogon:~ % echo $status
1
-----

Using sh/bash:

-----
[15:49] vogon:~ % sh
bash-3.00 ~$ asdfsadf
bash: asdfsadf: command not found
bash-3.00 ~$ echo $?
127
bash-3.00 ~$ exit
exit
[15:51] vogon:~ % bash
bash-3.00 ~$ sdfgasdrgsdfg
bash: sdfgasdrgsdfg: command not found
bash-3.00 ~$ echo $?
127
bash-3.00 ~$
-----

But when a user with tcsh as their default shell (i.e., me) runs a
nonexistent command through pbsdsh:

-----
[15:52] vogon:~ % qsub -I -lnodes=1
qsub: waiting for job 51.vogon.osl.iu.edu to start
qsub: job 51.vogon.osl.iu.edu ready

Setting up for: i686-pc-linux-gnu
mesg: error: tty device is not owned by group `tty'
Already sent thought for the day
dikim has logged on pts/3 from dhcp-cs-244-141.cs.indiana.edu.
[15:52] eddie:~ % echo $shell
/bin/tcsh
[15:52] eddie:~ % pbsdsh sdfgsdfgads
pbsdsh: task 0 exit status 254
[15:52] eddie:~ %
-----

So that 254 is coming out even though I don't have bourne as my  
shell.  tcsh reports 1 when something is not found -- here's the tcsh  
man page's description of $status:

-----
        status  The status returned by the last command.  If it terminated
                abnormally, then 0200 is added to the status.  Builtin
                commands which fail return exit status `1', all other
                builtin commands return status `0'.
-----

(this is all in Linux)

>> So I can see where you get 127, but I don't understand the
>> transformation from 254.  Is that something that Torque does?
>
> I'm not sure.  This behaviour pre-dates TORQUE.  I recall reading
> something, though I can't find it right now, about exit values being
> bitshifted if it's not directly from the process.  I think it might
> have been in the ERS.  I think it is following POSIX semantics for
> something.

Any references you can find here would be helpful.  I note that  
WIFEXITED and WEXITSTATUS return 0 for the value of 254 on the same  
machine that I did the pbsdsh test on:

-----
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char* argv[])
{
    /* 254 is passed here as if it were a raw wait() status word */
    printf("wifexited: %d\n", WIFEXITED(254));
    printf("wexitstatus: %d\n", WEXITSTATUS(254));
    return 0;
}
-----

[15:56] odin003:~/tmp % gcc foo.c -o foo
[15:56] odin003:~/tmp % ./foo
wifexited: 0
wexitstatus: 0

So something else is going on here...
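One possible reading, offered as a sketch rather than a diagnosis: WIFEXITED and WEXITSTATUS are meant to be applied to the raw status word that wait()/waitpid() fills in, not to a plain exit code. On Linux the exit code lives in bits 8-15 of that word (the layout is implementation-specific), so a literal 254 does not decode as a normal exit. The helper names below are hypothetical, and this does not explain where pbsdsh's 254 comes from.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits immediately with `code` (no shell, no exec) and
 * return the raw status word stored by waitpid().  Returns -1 if fork()
 * or waitpid() fails. */
static int raw_wait_status(int code)
{
    int status = 0;
    pid_t pid;

    assert(code >= 0 && code <= 255);   /* only 8 bits reach the parent */

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                    /* child */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}

/* Decode a raw status word: the exit code if the child exited normally,
 * otherwise -1 (e.g. it was killed by a signal). */
static int decoded_exit_code(int status)
{
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `decoded_exit_code(raw_wait_status(254))` gives back 254, whereas feeding the bare number 254 to the macros, as in foo.c, does not.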

This may well be moot based on your later text:

>>> I can definitely roll that technique into task and job launching.
>>
>> Does that mean the return from the poll for tm_spawn's event will
>> show the error?
>
> Yes.  tm_spawn() can return an error of 1, 127, or 254, or
> TM_ENOTFOUND or something.

TM_ENOTFOUND would be awesome.  :-)  Is that consistent with the  
tm_error values returned in other cases?  (i.e., that they're TM_*  
constants)

My only concern is that if *you're* setting TM_ENOTFOUND based on the  
value 254, then we should understand what that 254 *means* (i.e.,  
will it really be 254 on all platforms?).

Also, can we have some other problems reported as well?  Based on  
output you sent previously in this thread, I think you already handle  
the "not found" case -- sending TM_ENOTFOUND back should be easy.   
Can we recognize permission denied cases as well?
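To make the request concrete, here is a minimal sketch of one way a launcher could tell "not found" from "permission denied" without depending on any magic exit value: the child reports the errno from a failed exec over a close-on-exec pipe. This is an assumption about how it could be done, not a description of what TORQUE does; the `SPAWN_*` names, `classify_exec_errno`, and `spawn_and_report` are all hypothetical, and `SPAWN_ENOTFOUND` merely stands in for the proposed TM_ENOTFOUND.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical error classes; names and values are illustrative only. */
enum spawn_error { SPAWN_OK = 0, SPAWN_ENOTFOUND, SPAWN_EPERM, SPAWN_EOTHER };

/* Map an errno value from a failed exec to one of the classes above. */
static enum spawn_error classify_exec_errno(int err)
{
    switch (err) {
    case 0:       return SPAWN_OK;
    case ENOENT:
    case ENOTDIR: return SPAWN_ENOTFOUND;
    case EACCES:
    case EPERM:   return SPAWN_EPERM;
    default:      return SPAWN_EOTHER;
    }
}

/* Run argv[0] directly (no shell) and wait for it.  Returns 0 if the exec
 * succeeded, otherwise the errno from pipe(), fcntl(), fork(), or the
 * failed execvp(). */
static int spawn_and_report(char *const argv[])
{
    int fds[2], err = 0;
    ssize_t n;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);

    if (pipe(fds) < 0)
        return errno;
    /* A successful exec closes the write end, so the parent sees EOF. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) < 0 || (pid = fork()) < 0) {
        err = errno;
        close(fds[0]);
        close(fds[1]);
        return err;
    }
    if (pid == 0) {                     /* child */
        close(fds[0]);
        execvp(argv[0], argv);
        err = errno;                    /* only reached if exec failed */
        if (write(fds[1], &err, sizeof err) < 0) {
            /* nothing more the child can do */
        }
        _exit(127);
    }
    close(fds[1]);                      /* parent */
    do {
        n = read(fds[0], &err, sizeof err);
    } while (n < 0 && errno == EINTR);
    close(fds[0]);
    waitpid(pid, NULL, 0);              /* reap the child */
    return (n == (ssize_t)sizeof err) ? err : 0;
}
```

With something like this, `classify_exec_errno(spawn_and_report(argv))` distinguishes a missing command from a non-executable one on any platform that sets errno per POSIX, regardless of what exit value the child ends up with.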

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




