[torqueusers] TM improvements
Jeff Squyres
jsquyres at open-mpi.org
Tue Dec 6 14:02:37 MST 2005
On Dec 6, 2005, at 3:42 PM, Garrick Staples wrote:
>>> 254<<1 = 127 = bourne for "command not found"
>>
>> Hmm -- I don't understand your math there: 254 << 1 == 508.
>
> My bad. I mean 254>>1.
Ah, ok. So the LSB is reserved for something...?
>> I'm also curious as to why you specified "bourne" -- are all error
>> statuses reported per bourne shell semantics? What if the user's
>> default shell is something other than bourne?
>
> The user's shell isn't involved with spawning tasks. The specified
> command is directly exec()'d.
Hmm. This seems inconsistent:
Using tcsh:
-----
[15:49] vogon:~ % asdfasdfasdf
asdfasdfasdf: Command not found.
[15:49] vogon:~ % echo $status
1
-----
Using sh/bash:
-----
[15:49] vogon:~ % sh
bash-3.00 ~$ asdfsadf
bash: asdfsadf: command not found
bash-3.00 ~$ echo $?
127
bash-3.00 ~$ exit
exit
[15:51] vogon:~ % bash
bash-3.00 ~$ sdfgasdrgsdfg
bash: sdfgasdrgsdfg: command not found
bash-3.00 ~$ echo $?
127
bash-3.00 ~$
-----
But here is what happens when a user with tcsh as their default shell
(i.e., me) runs pbsdsh:
-----
[15:52] vogon:~ % qsub -I -lnodes=1
qsub: waiting for job 51.vogon.osl.iu.edu to start
qsub: job 51.vogon.osl.iu.edu ready
Setting up for: i686-pc-linux-gnu
mesg: error: tty device is not owned by group `tty'
Already sent thought for the day
dikim has logged on pts/3 from dhcp-cs-244-141.cs.indiana.edu.
[15:52] eddie:~ % echo $shell
/bin/tcsh
[15:52] eddie:~ % pbsdsh sdfgsdfgads
pbsdsh: task 0 exit status 254
[15:52] eddie:~ %
-----
So that 254 is coming out even though I don't have bourne as my
shell. tcsh reports 1 when something is not found -- here's the tcsh
man page's description of $status:
-----
status  The status returned by the last command. If it terminated
        abnormally, then 0200 is added to the status. Builtin commands
        which fail return exit status `1', all other builtin commands
        return status `0'.
-----
(this is all in Linux)
>> So I can see where you get 127, but I don't understand the
>> transformation from 254. Is that something that Torque does?
>
> I'm not sure. This behaviour pre-dates TORQUE. I recall reading
> something, though I can't find it right now, about exit values being
> bitshifted if it's not directly from the process. I think it might
> have been in the ERS. I think it is following POSIX semantics for
> something.
Any references you can find here would be helpful. I note that
WIFEXITED and WEXITSTATUS return 0 for the value of 254 on the same
machine that I did the pbsdsh test on:
-----
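#include <stdio.h>      /* printf */
#include <sys/types.h>
#include <sys/wait.h>   /* WIFEXITED, WEXITSTATUS */

int main(int argc, char* argv[])
{
    /* Feed 254 to the macros as if it were a raw wait() status */
    printf("wifexited: %d\n", WIFEXITED(254));
    printf("wexitstatus: %d\n", WEXITSTATUS(254));
    return 0;
}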
-----
[15:56] odin003:~/tmp % gcc foo.c -o foo
[15:56] odin003:~/tmp % ./foo
wifexited: 0
wexitstatus: 0
So something else is going on here...
This may well be moot based on your later text:
>>> I can definitely roll that technique into task and job launching.
>>
>> Does that mean the return from the poll for tm_spawn's event will
>> show the error?
>
> Yes. tm_spawn() can return an error of 1, 127, or 254, or
> TM_ENOTFOUND or something.
TM_ENOTFOUND would be awesome. :-) Is that consistent with the
tm_error values returned in other cases? (i.e., that they're TM_*
constants)
My only concern is that if *you're* setting TM_ENOTFOUND based on the
value 254, then we should understand what that 254 *means* (i.e.,
will it really be 254 on all platforms?).
Also, can we have some other problems reported as well? Based on
output you sent previously in this thread, I think you already handle
the "not found" case -- sending TM_ENOTFOUND back should be easy.
Can we recognize permission denied cases as well?
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/