[torqueusers] Interpreting Exit_status in server accounting files
David Singleton
David.Singleton at anu.edu.au
Wed Jan 11 02:08:12 MST 2006
The exit status may be meaningful only to the job's owner:
lc0:~ > qsub -q express
exit 32
705669.lc0
lc0:~ > tracejob 705669
Job: 705669.lc0
....
01/11/2006 20:03:33 S Obit received
01/11/2006 20:03:33 S Exit_status=32 resources_used.cput=00:00:00 resources_used.jobfs=0kb
resources_used.mem=100kb resources_used.syst=00:00:00 resources_used.vmem=600kb
resources_used.walltime=00:00:01
The user's job can return any exit status between 1 and 255, and the
value means nothing to PBS.
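
For an accounting script, the rule discussed in this thread could be
sketched roughly as below. This is only a minimal Python sketch, not
Torque code: the function names and category labels are made up for
illustration, treating 0 as success is the usual Unix convention rather
than anything Torque-specific, and the wait()-style decoding is just the
guess quoted further down. Under that guess the Exit_status=32 above
would decode to exit value 0 and signal 32, which does not match the
"exit 32" the job actually ran.

```python
def classify_exit_status(status: int) -> str:
    """Coarse classification of an Exit_status value (hypothetical labels)."""
    if status < 0:
        # Negative values are generated by MOM for errors outside the job;
        # their meanings are the JOB_EXEC_* defines in job.h.
        return "mom_error"
    if status == 0:
        # Assumption: 0 means success, as is conventional on Unix.
        return "ok"
    # Positive values are whatever the job itself returned.
    return "job_defined"


def decode_as_wait_status(status: int) -> dict:
    """Decode a value using the wait()-style layout guessed in the thread."""
    return {
        "exit_value": status >> 8,        # actual exit value
        "signal": status & 127,           # signal number if killed
        "core_dump": bool(status & 128),  # true if a core dump happened
    }


if __name__ == "__main__":
    print(classify_exit_status(32))
    print(decode_as_wait_status(32))
```

A positive value can then be flagged for the job's owner to interpret,
rather than being mapped to a fixed meaning by the accounting tool.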
David
Ole Holm Nielsen wrote:
> Hi Garrick,
>
> OK, so negative values of "Exit_status" in the accounting logs
> are well-defined. What I'm missing are definitions of positive
> values of Exit_status. Jeroen's guess (see below) makes a lot of
> sense, except it doesn't explain values between 32 and 127.
>
> Surely there must be a well-defined algorithm in Torque which
> assigns a value to Exit_status, but I can't figure out how it's
> done. It would be a nice feature of Torque if it could inform
> us about the fate of jobs for 1) detection of user errors,
> 2) accounting purposes, 3) statistics which may point to system
> related problems. Does anyone know if this can be done ?
>
> Thanks,
> Ole
>
> Garrick Staples wrote:
>
>> Negative exit values are "special." They are generated by MOM to
>> indicate an error outside of the job. The specific meaning of each is
>> the JOB_EXEC_* defines in job.h.
>>
>> Positive exit values are from the user's job. It is just whatever the
>> job returned and can't be reliably interpreted without looking at the
>> job.
>
> ...
>
>>>> On Tue, 10 Jan 2006, Jeroen van den Muyzenberg wrote:
>>>>
>>>>> >The exit status should be (haven't checked) the return from the
>>>>> >exec'd job. We've had a look at them recently and they do seem to
>>>>> >conform to:
>>>>> >
>>>>> > Exit_status >> 8 # Actual exit value
>>>>> > Exit_status & 127 # Signal number if thus killed
>>>>> > Exit_status & 128 # True if a core dump happened
>
> ...
>
>>>>>> >> I'm working on the "pbsacct" accounting package for Torque/PBS
>>>>>> >> and would like to understand the meaning of the "Exit_status"
>>>>>> >> numbers in the server accounting files. Unfortunately, I
>>>>>> >> haven't been able to find a list of exit status values in the
>>>>>> >> Torque source tree. Going through some of our accounting files,
>>>>>> >> I find a number of jobs with non-zero "Exit_status" values
>>>>>> >> such as: 1, 126, 127, 139, 143, 265, 271.
>>>>>> >>
>>>>>> >> Question: How do I assign a meaning to these "Exit_status" values
>>>>>> >> so that I can decide whether or not to flag a job termination as OK
>>>>>> >> (or just sort of OK) or as "failed" in the accounting output?
>>>>>> >> It would also be nice to know if a job exited because of wall or
>>>>>> >> cpu time exceeded.
>
>
--
--------------------------------------------------------------------------
Dr David Singleton ANU Supercomputer Facility
HPC Systems Manager and APAC National Facility
David.Singleton at anu.edu.au Leonard Huxley Bldg (No. 56)
Phone: +61 2 6125 4389 Australian National University
Fax: +61 2 6125 8199 Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------