[torqueusers] Interpreting Exit_status in server accounting files

Rushton Martin JMRUSHTON at qinetiq.com
Wed Jan 11 03:51:52 MST 2006


From the BASH man page:

"...  When a command terminates on a fatal signal N, bash uses the value
128+N as the exit status.

"If a command is not found, the child process created to execute it returns
a status of 127.  If a command is found but is not executable, the return
status is 126.
...
"Bash itself returns the exit status of the last command executed, ..."

In my experience, status=1 is a general unspecified error.  By the 128+N
rule quoted above, the OP's codes 139 and 143 are 128+11 and 128+15, i.e.
signals 11 (SIGSEGV) and 15 (SIGTERM).
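
For anyone wanting to apply these rules in an accounting script, here is a
rough Python sketch.  The function name describe_exit_status is my own
invention, not anything in Torque or pbsacct, and it assumes the job script
was run by bash so that the man page conventions hold.  Signal names come
from Python's signal module and so reflect the machine running the script,
not necessarily the compute node.

```python
import signal


def describe_exit_status(status):
    """Interpret an Exit_status value using the bash conventions quoted above.

    Hypothetical helper: assumes the job was run by bash.
    """
    if status < 0:
        # Per Garrick's reply: negative values are set by MOM.
        return "set by MOM, see the JOB_EXEC_* defines in job.h"
    if status == 0:
        return "success"
    if status == 126:
        return "command found but not executable"
    if status == 127:
        return "command not found"
    if 128 < status < 256:
        # bash reports death by fatal signal N as 128+N.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal %d (%s)" % (signum, name)
    if 0 < status < 126:
        return "job-defined exit code %d" % status
    # Anything else (e.g. 128 itself, or values above 255) is not
    # explained by the bash rules, so do not guess.
    return "not covered by the bash conventions"


if __name__ == "__main__":
    for status in (1, 126, 127, 139, 143):
        print(status, describe_exit_status(status))
```

Values such as 265 and 271 fall outside what the bash rules explain, so the
sketch deliberately reports them as not covered rather than guessing.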

Martin Rushton
Lethal Mechanisms

QinetiQ
Bldg H4 Rm 6
MoD Fort Halstead
Sevenoaks
Kent, TN14 7BP

Tel:    01959 514777
Email:  jmrushton at QinetiQ.com
Fax:    01959 51 6050
Web:    www.QinetiQ.com

QinetiQ - The Global Defence and Security Experts. 

| -----Original Message-----
| From: torqueusers-bounces at supercluster.org 
| [mailto:torqueusers-bounces at supercluster.org] On Behalf Of 
| Jeroen van den Muyzenberg
| Sent: 11 January 2006 09:17
| To: Ole Holm Nielsen
| Cc: torqueusers at supercluster.org
| Subject: Re: [torqueusers] Interpreting Exit_status in server 
| accounting files
| 
| Hi Ole,
| 
| As Garrick pointed out, the positive exit codes are generated 
| by the job itself and otherwise have no meaning to torque.
| 
| However just confirming this, I did have some test jobs 
| return an exit status of 126 for no apparent reason. Reruns 
| returned the correct exit status.
| 
| Jeroen
| 
| On Wed, 11 Jan 2006, Ole Holm Nielsen wrote:
| 
| > Hi Garrick,
| >
| > OK, so negative values of "Exit_status" in the accounting logs are
| > well-defined.  What I'm missing are definitions of positive values of
| > Exit_status.  Jeroen's guess (see below) makes a lot of sense, except
| > it doesn't explain values between 32 and 127.
| >
| > Surely there must be a well-defined algorithm in Torque which assigns
| > a value to Exit_status, but I can't figure out how it's done.  It
| > would be a nice feature of Torque if it could inform us about the fate
| > of jobs for 1) detection of user errors, 2) accounting purposes,
| > 3) statistics which may point to system-related problems.  Does anyone
| > know if this can be done?
| >
| > Thanks,
| > Ole
| >
| > Garrick Staples wrote:
| >> Negative exit values are "special."  They are generated by MOM to
| >> indicate an error outside of the job.  The specific meaning of each
| >> is the JOB_EXEC_* defines in job.h.
| >>
| >> Positive exit values are from the user's job.  It is just whatever
| >> the job returned and can't be reliably interpreted without looking
| >> at the job.
| > ...
| >> > >  On Tue, 10 Jan 2006, Jeroen van den Muyzenberg wrote:
| >> > > > > The exit status should be (haven't checked) the return from
| >> > > > > the exec'd job. We've had a look at them recently and they do
| >> > > > > seem to conform to;
| >> > > > > 
| >> > > > >     Exit_status >> 8 # Actual exit value
| >> > > > >     Exit_status & 127 # Signal number if thus killed
| >> > > > >     Exit_status & 128 # True if a core dump happened
| > ...
| >> > > > > > > I'm working on the "pbsacct" accounting package for
| >> > > > > > > Torque/PBS and would like to understand the meaning of the
| >> > > > > > > "Exit_status" numbers in the server accounting files.
| >> > > > > > > Unfortunately, I haven't been able to find a list of exit
| >> > > > > > > status values in the Torque source tree.  Going through
| >> > > > > > > some of our accounting files, I find a number of jobs with
| >> > > > > > > non-zero "Exit_status" values such as: 1, 126, 127, 139,
| >> > > > > > > 143, 265, 271.
| >> > > > > > >
| >> > > > > > > Question: How do I assign a meaning to these "Exit_status"
| >> > > > > > > values so that I can decide whether or not to flag a job
| >> > > > > > > termination as OK (or just sort of OK) or as "failed" in
| >> > > > > > > the accounting output?
| >> > > > > > > It would also be nice to know if a job exited because of
| >> > > > > > > wall or cpu time exceeded.
| >
| > --
| > Ole Holm Nielsen
| > Department of Physics, Technical University of Denmark 
| > _______________________________________________
| > torqueusers mailing list
| > torqueusers at supercluster.org
| > http://www.supercluster.org/mailman/listinfo/torqueusers
| >

