TORQUE Administrator's Manual - 2.7 Job Exit Status
2.7 Job Exit Status
Once a job under TORQUE has completed, the exit_status attribute will contain the result code returned by the job script.
To see this attribute, run qstat -f, which shows the entire set of information associated with a job.
The exit_status field is found near the bottom of the set of output lines.
qstat -f (job failure example)
Job Id: 179.host
Job_Name = STDIN
Job_Owner = user@host
job_state = C
queue = batchq
server = host
Checkpoint = u
ctime = Fri Aug 29 14:55:55 2008
Error_Path = host:/opt/moab/STDIN.e179
exec_host = node1/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Fri Aug 29 14:55:55 2008
Output_Path = host:/opt/moab/STDIN.o179
Priority = 0
qtime = Fri Aug 29 14:55:55 2008
Rerunable = True
Resource_List.ncpus = 2
Resource_List.nodect = 1
Resource_List.nodes = node1
Variable_List = PBS_O_HOME=/home/user,PBS_O_LOGNAME=user,
PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:,PBS_O_SHELL=/bin/bash,PBS_O_HOST=host,
PBS_O_WORKDIR=/opt/moab,PBS_O_QUEUE=batchq
sched_hint = Post job file processing error; job 179.host on host node1/0Ba
d UID for job execution REJHOST=pala.cridomain MSG=cannot find user 'user' in password file
etime = Fri Aug 29 14:55:55 2008
exit_status = -1
This code can be useful in diagnosing problems with jobs that may have unexpectedly terminated.
If TORQUE was unable to start the job, this field will contain a negative number produced by the pbs_mom.
Otherwise, if the job script was successfully started, this field will contain the return value of the script.
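As a minimal sketch of how the field could be pulled out of that output for scripting, the helper below filters qstat -f output read from standard input. The function name job_exit_status is hypothetical (it is not part of TORQUE), and the sketch assumes the attribute is printed as "exit_status = N" on a line of its own, as in the example above.

```shell
# Hypothetical helper, not a TORQUE command.
# Reads `qstat -f` output on stdin and prints the exit_status value.
# Prints nothing if the attribute is absent (for example, the job has not finished).
job_exit_status() {
    sed -n 's/^[[:space:]]*exit_status = //p'
}
```

It might be used as: qstat -f 179.host | job_exit_status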
TORQUE Supplied Exit Codes
Name               Value  Description
JOB_EXEC_OK            0  job exec successful
JOB_EXEC_FAIL1        -1  job exec failed, before files, no retry
JOB_EXEC_FAIL2        -2  job exec failed, after files, no retry
JOB_EXEC_RETRY        -3  job execution failed, do retry
JOB_EXEC_INITABT      -4  job aborted on MOM initialization
JOB_EXEC_INITRST      -5  job aborted on MOM init, chkpt, no migrate
JOB_EXEC_INITRMG      -6  job aborted on MOM init, chkpt, ok migrate
JOB_EXEC_BADRESRT     -7  job restart failed
JOB_EXEC_CMDFAIL      -8  exec() of user command failed
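For use in scripts, the table can be turned into a small lookup. This is only a sketch: describe_exit_status is a hypothetical function name, the names and values are copied from the table above, and any other value is reported generically rather than interpreted.

```shell
# Hypothetical helper: map an exit_status value to the name listed in the
# "TORQUE Supplied Exit Codes" table. Values not in the table are not interpreted.
describe_exit_status() {
    case "$1" in
        0)  echo "JOB_EXEC_OK" ;;
        -1) echo "JOB_EXEC_FAIL1" ;;
        -2) echo "JOB_EXEC_FAIL2" ;;
        -3) echo "JOB_EXEC_RETRY" ;;
        -4) echo "JOB_EXEC_INITABT" ;;
        -5) echo "JOB_EXEC_INITRST" ;;
        -6) echo "JOB_EXEC_INITRMG" ;;
        -7) echo "JOB_EXEC_BADRESRT" ;;
        -8) echo "JOB_EXEC_CMDFAIL" ;;
        -*) echo "negative code $1 not listed in the table" ;;
        *)  echo "job script returned $1" ;;
    esac
}
```

For the failed job shown earlier, describe_exit_status -1 prints JOB_EXEC_FAIL1.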
Example of exit code from C program
$ cat error.c
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
    exit(256+11);
}
$ gcc -o error error.c
$ echo ./error | qsub
180.xxx.yyy
$ qstat -f
Job Id: 180.xxx.yyy
Job_Name = STDIN
Job_Owner = test.xxx.yyy
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = C
queue = batch
server = xxx.yyy
Checkpoint = u
ctime = Wed Apr 30 11:29:37 2008
Error_Path = xxx.yyy:/home/test/STDIN.e180
exec_host = node01/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed Apr 30 11:29:37 2008
Output_Path = xxx.yyy:/home/test/STDIN.o180
Priority = 0
qtime = Wed Apr 30 11:29:37 2008
Rerunable = True
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 01:00:00
session_id = 14107
substate = 59
Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=test,
PBS_O_PATH=/usr/local/perltests/bin:/home/test/bin:/usr/local/s
bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games,
PBS_O_SHELL=/bin/bash,PBS_SERVER=xxx.yyy,
PBS_O_HOST=xxx.yyy,PBS_O_WORKDIR=/home/test,
PBS_O_QUEUE=batch
euser = test
egroup = test
hashname = 180.xxx.yyy
queue_rank = 8
queue_type = E
comment = Job started on Wed Apr 30 at 11:29
etime = Wed Apr 30 11:29:37 2008
exit_status = 11
start_time = Wed Apr 30 11:29:37 2008
start_count = 1
Notice that the C routine exit passes only the low-order byte of its argument to the parent process.
In this case, 256+11 is 267, but the resulting exit code is only 11 (267 modulo 256), as seen in the output.
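The arithmetic can be checked with a short shell sketch. The function name low_order_byte is hypothetical, and masking with 255 only models the value the parent process receives; it does not call exit itself.

```shell
# Hypothetical helper: compute the low-order byte of an exit() argument,
# which is the value reported as the exit code.
low_order_byte() {
    echo $(( $1 & 255 ))
}

low_order_byte 267   # prints 11
low_order_byte 256   # prints 0
```

The second call shows a consequence worth keeping in mind: a program that calls exit(256) is reported with exit code 0, which is indistinguishable from success.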