[torqueusers] Interpreting Exit_status in server accounting files

David Singleton David.Singleton at anu.edu.au
Wed Jan 11 02:08:12 MST 2006


The exit status may only be meaningful to the job's owner:

lc0:~ > qsub -q express
exit 32
705669.lc0
lc0:~ > tracejob 705669

Job: 705669.lc0

....

01/11/2006 20:03:33  S    Obit received
01/11/2006 20:03:33  S    Exit_status=32 resources_used.cput=00:00:00 resources_used.jobfs=0kb
                           resources_used.mem=100kb resources_used.syst=00:00:00 resources_used.vmem=600kb
                           resources_used.walltime=00:00:01

The job can exit with any status between 1 and 255, and that value means
nothing to PBS; it is simply recorded as whatever the job returned.
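
As a rough illustration of the point above, here is a minimal sketch that
sorts an Exit_status value into coarse bins. The function name
classify_exit_status is hypothetical. The readings of 126, 127 and 128+N are
the usual POSIX shell conventions, not anything PBS defines, and a job can
return any of those values deliberately. This thread does not settle what
values above 255 (such as 265 or 271) mean, so the sketch only flags them.

```python
import signal


def classify_exit_status(status: int) -> str:
    """Hypothetical helper: coarse reading of an accounting Exit_status."""
    if status < 0:
        # Negative values are generated by MOM (the JOB_EXEC_* defines in job.h).
        return "mom-error"
    if status == 0:
        return "ok"
    if status > 255:
        # Not a plain process exit code; left unexplained on purpose.
        return "above-255: unexplained"
    if status == 126:
        return "shell convention: command not executable"
    if status == 127:
        return "shell convention: command not found"
    if status > 128:
        # Shell convention only: 128 + signal number.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"shell convention: killed by {name}"
    # Anything else is whatever the job chose to return.
    return "job-defined"


if __name__ == "__main__":
    for value in (-1, 0, 1, 32, 126, 127, 139, 143, 265, 271):
        print(value, classify_exit_status(value))
```

For the job shown above, classify_exit_status(32) falls into the
"job-defined" bin, which is all an accounting tool can say without asking
the job's owner.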

David


Ole Holm Nielsen wrote:
> Hi Garrick,
> 
> OK, so negative values of "Exit_status" in the accounting logs
> are well-defined.  What I'm missing are definitions of positive
> values of Exit_status.  Jeroen's guess (see below) makes a lot of
> sense, except it doesn't explain values between 32 and 127.
> 
> Surely there must be a well-defined algorithm in Torque which
> assigns a value to Exit_status, but I can't figure out how it's
> done.  It would be a nice feature of Torque if it could inform
> us about the fate of jobs for 1) detection of user errors,
> 2) accounting purposes, 3) statistics which may point to system
> related problems.  Does anyone know if this can be done ?
> 
> Thanks,
> Ole
> 
> Garrick Staples wrote:
> 
>> Negative exit values are "special."  They are generated by MOM to
>> indicate an error outside of the job.  The specific meaning of each is
>> the JOB_EXEC_* defines in job.h.
>>
>> Positive exit values are from the user's job.  It is just whatever the
>> job returned and can't be reliably interpreted without looking at the
>> job.
> 
> ...
> 
>>>> On Tue, 10 Jan 2006, Jeroen van den Muyzenberg wrote:
>>>>
>>>>> >The exit status should be (haven't checked) the return from the exec'd
>>>>> >job. We've had a look at them recently and they do seem to conform to;
>>>>> >
>>>>> >    Exit_status >> 8 # Actual exit value
>>>>> >    Exit_status & 127 # Signal number if thus killed
>>>>> >    Exit_status & 128 # True if a core dump happened
> 
> ...
> 
>>>>>> >> I'm working on the "pbsacct" accounting package for Torque/PBS
>>>>>> >> and would like to understand the meaning of the "Exit_status"
>>>>>> >> numbers in the server accounting files.  Unfortunately, I
>>>>>> >> haven't been able to find a list of exit status values in the
>>>>>> >> Torque source tree.  Going through some of our accounting files,
>>>>>> >> I find a number of jobs with non-zero "Exit_status" values
>>>>>> >> such as: 1, 126, 127, 139, 143, 265, 271.
>>>>>> >>
>>>>>> >> Question: How do I assign a meaning to these "Exit_status" values
>>>>>> >> so that I can decide whether or not to flag a job termination as OK
>>>>>> >> (or just sort of OK) or as "failed" in the accounting output ?
>>>>>> >> It would also be nice to know if a job exited because of wall or
>>>>>> >> cpu time exceeded.
> 
> 
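
Jeroen's guess quoted above can be written out as plain bit operations. This
is only a sketch of that guess: the function name decode_wait_status is
hypothetical, and nothing in this thread confirms that Torque stores a raw
wait()-style status word in Exit_status.

```python
def decode_wait_status(status: int) -> dict:
    """Decode a value the way Jeroen's guess describes (hypothetical helper)."""
    return {
        "exit_value": status >> 8,          # Exit_status >> 8
        "signal": status & 127,             # Exit_status & 127
        "core_dumped": bool(status & 128),  # Exit_status & 128
    }


if __name__ == "__main__":
    # Values Ole reported finding in the accounting files.
    for value in (1, 126, 127, 139, 143, 265, 271):
        print(value, decode_wait_status(value))
```

Under this reading 139 decodes to signal 11 with the core-dump bit set, and
265 to exit value 1 with signal 9. Values such as 32 or 126 would decode as
signal numbers, which is the gap Ole points out for the range between 32 and
127.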


-- 
--------------------------------------------------------------------------
    Dr David Singleton               ANU Supercomputer Facility
    HPC Systems Manager              and APAC National Facility
    David.Singleton at anu.edu.au       Leonard Huxley Bldg (No. 56)
    Phone: +61 2 6125 4389           Australian National University
    Fax:   +61 2 6125 8199           Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------

