[torqueusers] Interpreting Exit_status in server accounting files

Jeroen van den Muyzenberg Jeroen.vandenMuyzenberg at csiro.au
Wed Jan 11 02:17:05 MST 2006


Hi Ole,

As Garrick pointed out, the positive exit codes are generated by the job
itself and otherwise have no meaning to torque.

However, while confirming this, I did have some test jobs return an exit
status of 126 for no apparent reason. Reruns of the same jobs returned the
correct exit status.
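
For what it's worth, the bit layout quoted further down in this thread can
be tried against the values Ole listed. The following is only a sketch in
Python, assuming that layout holds. It is a guess from observation, not
something confirmed from the Torque source, and the function name
decode_exit_status is made up for illustration.

```python
def decode_exit_status(exit_status):
    """Split an Exit_status value using the bit layout guessed in this thread.

    Assumption: the layout (>> 8, & 127, & 128) is an observation from
    accounting logs, not a documented Torque algorithm.
    """
    if exit_status < 0:
        # Negative values are generated by MOM; their meaning is given by
        # the JOB_EXEC_* defines in job.h and is not decoded here.
        return {"source": "mom", "code": exit_status}
    return {
        "source": "job",
        "exit_value": exit_status >> 8,        # actual exit value
        "signal": exit_status & 127,           # signal number if thus killed
        "core_dump": bool(exit_status & 128),  # true if a core dump happened
    }


if __name__ == "__main__":
    # Values reported from the accounting files earlier in the thread.
    for value in (1, 126, 127, 139, 143, 265, 271):
        print(value, decode_exit_status(value))
```

Under this layout 271 comes out as exit value 1 with signal 15, and 139 as
signal 11 with the core-dump bit set. Values such as 126 and 127 decode to
"signal" numbers well above the usual range, so the layout does not account
for them.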

Jeroen

On Wed, 11 Jan 2006, Ole Holm Nielsen wrote:

> Hi Garrick,
>
> OK, so negative values of "Exit_status" in the accounting logs
> are well-defined.  What I'm missing are definitions of positive
> values of Exit_status.  Jeroen's guess (see below) makes a lot of
> sense, except it doesn't explain values between 32 and 127.
>
> Surely there must be a well-defined algorithm in Torque which
> assigns a value to Exit_status, but I can't figure out how it's
> done.  It would be a nice feature of Torque if it could inform
> us about the fate of jobs for 1) detection of user errors,
> 2) accounting purposes, 3) statistics which may point to system
> related problems.  Does anyone know if this can be done ?
>
> Thanks,
> Ole
>
> Garrick Staples wrote:
>>  Negative exit values are "special."  They are generated by MOM to
>>  indicate an error outside of the job.  The specific meaning of each is
>>  the JOB_EXEC_* defines in job.h.
>>
>>  Positive exit values are from the user's job.  It is just whatever the
>>  job returned and can't be reliably interpreted without looking at the
>>  job.
> ...
>> > >  On Tue, 10 Jan 2006, Jeroen van den Muyzenberg wrote:
>> > > > > The exit status should be (haven't checked) the return from the
>> > > > > exec'd job. We've had a look at them recently and they do seem to
>> > > > > conform to:
>> > > > >
>> > > > >     Exit_status >> 8 # Actual exit value
>> > > > >     Exit_status & 127 # Signal number if thus killed
>> > > > >     Exit_status & 128 # True if a core dump happened
> ...
>> > > > > > >  I'm working on the "pbsacct" accounting package for Torque/PBS
>> > > > > > >  and would like to understand the meaning of the "Exit_status"
>> > > > > > >  numbers in the server accounting files.  Unfortunately, I
>> > > > > > >  haven't been able to find a list of exit status values in the
>> > > > > > >  Torque source tree.  Going through some of our accounting
>> > > > > > >  files, I find a number of jobs with non-zero "Exit_status"
>> > > > > > >  values such as: 1, 126, 127, 139, 143, 265, 271.
>> > > > > > > 
>> > > > > > >  Question: How do I assign a meaning to these "Exit_status"
>> > > > > > >  values so that I can decide whether or not to flag a job
>> > > > > > >  termination as OK (or just sort of OK) or as "failed" in the
>> > > > > > >  accounting output?  It would also be nice to know if a job
>> > > > > > >  exited because of wall or cpu time exceeded.
>
> -- 
> Ole Holm Nielsen
> Department of Physics, Technical University of Denmark
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>

