[torqueusers] Interpreting Exit_status in server accounting files

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Wed Jan 11 01:21:48 MST 2006


Hi Garrick,

OK, so negative values of "Exit_status" in the accounting logs
are well-defined.  What I'm missing are definitions of positive
values of Exit_status.  Jeroen's guess (see below) makes a lot of
sense, except it doesn't explain values between 32 and 127.
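
As a rough check, Jeroen's three expressions (quoted below) can be applied
to the values from the original question.  This is only a sketch of his
guess, not of what Torque actually does; the function name
decode_wait_style is made up for illustration.

```python
def decode_wait_style(status):
    """Split a status using Jeroen's three expressions (wait()-style layout)."""
    return {
        "exit_value": status >> 8,         # Exit_status >> 8
        "signal": status & 127,            # Exit_status & 127
        "core_dump": bool(status & 128),   # Exit_status & 128
    }


# Values reported in the original question.
for value in (1, 126, 127, 139, 143, 265, 271):
    print(value, decode_wait_style(value))
```

Under this reading 126 and 127 come out as "signal" 126 and 127, which lie
outside the usual 1-64 signal range on Linux, so the guess does not account
for them.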

Surely there must be a well-defined algorithm in Torque which
assigns a value to Exit_status, but I can't figure out how it's
done.  It would be a nice feature of Torque if it could inform
us about the fate of jobs for 1) detection of user errors,
2) accounting purposes, and 3) statistics which may point to
system-related problems.  Does anyone know if this can be done?

Thanks,
Ole

Garrick Staples wrote:
> Negative exit values are "special."  They are generated by MOM to
> indicate an error outside of the job.  The specific meaning of each is
> the JOB_EXEC_* defines in job.h.
> 
> Positive exit values are from the user's job.  It is just whatever the
> job returned and can't be reliably interpreted without looking at the
> job.
...
>>> On Tue, 10 Jan 2006, Jeroen van den Muyzenberg wrote:
>>>> >The exit status should be (haven't checked) the return from the exec'd
>>>> >job. We've had a look at them recently and they do seem to conform to;
>>>> >
>>>> >    Exit_status >> 8 # Actual exit value
>>>> >    Exit_status & 127 # Signal number if thus killed
>>>> >    Exit_status & 128 # True if a core dump happened
...
>>>>> >> I'm working on the "pbsacct" accounting package for Torque/PBS
>>>>> >> and would like to understand the meaning of the "Exit_status"
>>>>> >> numbers in the server accounting files.  Unfortunately, I
>>>>> >> haven't been able to find a list of exit status values in the
>>>>> >> Torque source tree.  Going through some of our accounting files,
>>>>> >> I find a number of jobs with non-zero "Exit_status" values
>>>>> >> such as: 1, 126, 127, 139, 143, 265, 271.
>>>>> >>
>>>>> >> Question: How do I assign a meaning to these "Exit_status" values
>>>>> >> so that I can decide whether or not to flag a job termination as OK
>>>>> >> (or just sort of OK) or as "failed" in the accounting output?
>>>>> >> It would also be nice to know if a job exited because of wall or
>>>>> >> cpu time exceeded.
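
One possible way to flag jobs in the accounting output is sketched here.
It rests on assumptions that this thread does not confirm: that values of
256 and above are 256 plus a signal number, that values between 129 and 255
follow the common shell convention of 128 plus a signal number, and that
126 and 127 are the shell's "cannot execute" and "command not found" codes.
The names classify_exit_status and signal_name are hypothetical.

```python
import signal


def signal_name(number):
    """Return a symbolic signal name if Python knows one, else a plain label."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "signal %d" % number


def classify_exit_status(status):
    """Return (category, note) for an Exit_status value. Thresholds are assumptions."""
    if status < 0:
        # Per Garrick: generated by MOM, meaning given by JOB_EXEC_* in job.h.
        return "mom_error", "error outside the job (see JOB_EXEC_* in job.h)"
    if status == 0:
        return "ok", "job returned 0"
    if status >= 256:
        # ASSUMPTION: 256 + signal number.
        return "signal", "possibly killed by " + signal_name(status - 256)
    if status > 128:
        # ASSUMPTION: shell convention, 128 + signal number.
        return "signal", "possibly killed by " + signal_name(status - 128)
    if status == 127:
        # ASSUMPTION: shell convention for "command not found".
        return "failed", "possibly command not found"
    if status == 126:
        # ASSUMPTION: shell convention for "found but not executable".
        return "failed", "possibly command not executable"
    # Anything else is whatever the job itself returned.
    return "failed", "job returned %d" % status


for value in (1, 126, 127, 139, 143, 265, 271):
    print(value, classify_exit_status(value))
```

As Garrick points out, a positive value is ultimately whatever the job
returned, so any such classification is a heuristic and should be checked
against the job itself before a job is marked as failed.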

-- 
Ole Holm Nielsen
Department of Physics, Technical University of Denmark

