[torqueusers] Exit_status always 0, a BUG??
Prabhakar R Gudla
gudlap at mail.nih.gov
Wed Nov 21 10:21:17 MST 2012
Hi,
My apologies if this message gets posted twice.
Issue:
We are trying to retrieve the "exit_status" of jobs on our cluster
(CentOS 6.3, x86_64, with Torque 4.0.0 and pbs_sched).
Everything else looks fine, but the reported "exit_status" is not what we
expect. For instance, take the test PBS job below (simple_job.sh), which
ends with "exit 11". The job runs fine and we get the expected STDOUT
and STDERR. However, the reported "exit_status" is always "0", whether we
check with "qstat" or "tracejob".
What could be wrong?
$ cat simple_job.sh
------------------------------------------------------------------
#!/bin/bash
#PBS -N TorqueTest
#PBS -l nodes=1,walltime=00:01:00
#PBS -M xx at yyy.com
#PBS -m abe
#print the time and date
date
#wait 10 seconds
sleep 10
#print the time and date again
date
# Exit code
exit 11
------------------------------------------------------------------
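As a sanity check outside Torque, the script can be run directly to confirm that it really returns 11 on its own. The following is only a minimal sketch: the temporary file and the shortened 1-second sleep are assumptions made to keep the check quick, and the #PBS directives are left out because they are plain comments to bash.

```shell
# Sanity check outside Torque: run a shortened copy of the job script
# directly and capture its exit code.
# Assumptions: temp file location from mktemp, sleep reduced to 1 second.
tmp_script=$(mktemp)
cat > "$tmp_script" <<'EOF'
#!/bin/bash
date
sleep 1
date
exit 11
EOF

bash "$tmp_script" > /dev/null   # discard the two date lines
rc=$?                            # exit code of the script itself
rm -f "$tmp_script"
echo "script exit code: $rc"
```

If this reports 11, the script is behaving as intended and the 0 is being introduced somewhere between the job's shell and what the server records.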
$ qstat -f <jobid>
See qstat_out.txt
$ tracejob <jobid>
See tracejob_out.txt
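To compare the two outputs quickly, the exit status fields can be pulled out with a small filter. This is a sketch only: extract_exit_status is a hypothetical helper name, and the patterns are based solely on the two spellings that appear in the attached output ("exit_status = 0" from qstat -f and "Exit_status=0" from tracejob).

```shell
# Hypothetical helper: print every exit-status value found on stdin.
# Matches "exit_status = N" (qstat -f style) and "Exit_status=N"
# (tracejob style), case-insensitively, and prints only the number.
extract_exit_status() {
    grep -Eio 'exit_status ?= ?-?[0-9]+' | grep -Eo -- '-?[0-9]+$'
}
```

With the job id in a shell variable (here called jobid, also an assumption), it would be used as `qstat -f "$jobid" | extract_exit_status` or `tracejob "$jobid" | extract_exit_status`. For tracejob it prints one line per matching log entry, so several values may appear for a single job.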
What could be going wrong?
Thanks,
PRG
-------------- next part --------------
Job: 165.<><>
11/21/2012 00:58:18 S Job Queued at request of <>, owner = <>@<><>, job name = TorqueTest, queue = batchq
11/21/2012 00:58:18 S Job Modified at request of Scheduler@<><>
11/21/2012 00:58:18 L Job Run
11/21/2012 00:58:18 S enqueuing into batchq, state 1 hop 1
11/21/2012 00:58:18 S Job Run at request of Scheduler@<><>
11/21/2012 00:58:18 S child reported success for job after 0 seconds (dest=???), rc=0
11/21/2012 00:58:18 A queue=batchq
11/21/2012 00:58:18 A user=<> group=domain_users jobname=TorqueTest queue=batchq ctime=1353477498 qtime=1353477498 etime=1353477498 start=1353477498 owner=<>@<><> exec_host=<><>/0
                      Resource_List.ncpus=1 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=00:01:00
11/21/2012 00:58:28 S obit received - updating final job usage info
11/21/2012 00:58:28 S job exit status 0 handled
11/21/2012 00:58:28 S preparing to send 'e' mail for job 165.<><> to <>@<><> (Exit_status=0)
11/21/2012 00:58:28 S Exit_status=0
11/21/2012 00:58:28 S on_job_exit valid pjob: 0x7f1a080008c0 (substate=50)
11/21/2012 00:58:28 S JOB_SUBSTATE_EXITING
11/21/2012 00:58:28 S on_job_exit valid pjob: 0x7f1a080008c0 (substate=52)
11/21/2012 00:58:28 A user=<> group=<> jobname=TorqueTest queue=batchq ctime=1353477498 qtime=1353477498 etime=1353477498 start=1353477498 owner=<>@<><> exec_host=<><>/0
                      Resource_List.ncpus=1 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=00:01:00 session=1942 end=1353477508 Exit_status=0
11/21/2012 00:59:28 S on_job_exit valid pjob: 0x7f1a080008c0 (substate=59)
11/21/2012 00:59:28 S dequeuing from batchq, state COMPLETE
-------------- next part --------------
Job Id: 166.<>
Job_Name = TorqueTest
Job_Owner = <>@<>
resources_used.cput = 00:00:00
resources_used.mem = 3176kb
resources_used.vmem = 333788kb
resources_used.walltime = 00:00:12
job_state = E
queue = batchq
server = <>
Checkpoint = u
ctime = Wed Nov 21 01:06:24 2012
Error_Path = <>:/mnt/xfer_scratch/torque_testing/TorqueTest.e166
exec_host = <>/0
exec_port = 15003
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = abe
Mail_Users = <>@<><><>
mtime = Wed Nov 21 01:06:36 2012
Output_Path = <>:/mnt/xfer_scratch/torque_testing/TorqueTest.o166
Priority = 0
qtime = Wed Nov 21 01:06:24 2012
Rerunable = True
Resource_List.ncpus = 1
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 00:01:00
session_id = 2110
substate = 52
Variable_List = PBS_O_QUEUE=batchq,PBS_O_HOME=/,
    PBS_O_WORKDIR=/mnt/xfer_scratch/torque_testing,
    PBS_O_HOST=<>,
    PBS_O_SERVER=<>,
    PBS_O_WORKDIR=/mnt/xfer_scratch/torque_testing
euser = <>
egroup = <>
hashname = 166.<>
queue_rank = 2
queue_type = E
comment = Job started on Wed Nov 21 at 01:06
etime = Wed Nov 21 01:06:24 2012
exit_status = 0
submit_args = ./simple_job.sh
start_time = Wed Nov 21 01:06:24 2012
start_count = 1
fault_tolerant = False
job_radix = 0
submit_host = <>
init_work_dir = /mnt/xfer_scratch/torque_testing