[torqueusers] Exit_status always 0, a BUG??

Prabhakar R Gudla gudlap at mail.nih.gov
Wed Nov 21 10:21:17 MST 2012


Hi,

My apologies if this message gets posted twice.


Issue:
We are trying to retrieve the "exit_status" of jobs on our cluster
(CentOS 6.3, x86_64, with Torque 4.0.0 and pbs_sched).

Jobs run normally, but the reported "exit_status" is not what we expect.
For instance, the test PBS job below (simple_job.sh) ends with "exit 11".
The job executes fine and we get the expected STDOUT and STDERR. However,
both "qstat" and "tracejob" always report an "exit_status" of "0".

What could be wrong?


$ cat simple_job.sh
------------------------------------------------------------------
#!/bin/bash
#PBS -N TorqueTest
#PBS -l nodes=1,walltime=00:01:00
#PBS -M xx@yyy.com
#PBS -m abe
#print the time and date
date
#wait 10 seconds
sleep 10
#print the time and date again
date
# Exit code
exit 11
------------------------------------------------------------------
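Before suspecting Torque, it may help to confirm that the script itself returns 11 when run directly under bash, outside the batch system. The following is only a minimal sketch: it writes a shortened stand-in for simple_job.sh to a temporary file (the sleep is omitted, and the temporary file is an assumption made for illustration) and prints the exit code that bash reports.

```shell
#!/bin/bash
# Sanity check outside Torque: does the script really exit with 11?
# The temporary file is a shortened stand-in for simple_job.sh.
tmp_script=$(mktemp)
cat > "$tmp_script" <<'EOF'
#!/bin/bash
# print the time and date
date
# Exit code
exit 11
EOF

# Run the script in a child bash and capture its exit code immediately.
bash "$tmp_script"
rc=$?
rm -f "$tmp_script"

echo "script exit code: $rc"
```

On the cluster, the equivalent check would be to run "bash ./simple_job.sh; echo $?" on a compute node. If that prints 11, the script is not the cause and the value is being lost somewhere between the job's shell and the server.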

$ qstat -f <jobid>

See qstat_out.txt (attached below; output shown is for job 166)
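To check many test jobs without reading the full listing each time, the exit_status attribute can be pulled out of the "qstat -f" text. This is a rough sketch, not part of Torque: extract_exit_status is a hypothetical helper name, and it assumes the "name = value" attribute layout seen in the attached output. The sample input is a few lines copied from that output.

```shell
#!/bin/bash
# Print the value of the exit_status attribute from `qstat -f` text on stdin.
# extract_exit_status is a hypothetical helper, not a Torque command.
extract_exit_status() {
    awk -F' = ' '$1 ~ /^[[:space:]]*exit_status$/ { print $2; exit }'
}

# Sample lines taken from the attached qstat output; on a cluster use:
#   qstat -f <jobid> | extract_exit_status
sample='Job Id: 166.<>
    job_state = E
    exit_status = 0
    submit_args = ./simple_job.sh'

status=$(printf '%s\n' "$sample" | extract_exit_status)
echo "exit_status reported by qstat: $status"
```

For the job in this report the helper prints 0, whereas the script's "exit 11" should produce 11.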


$ tracejob <jobid>

See tracejob_out.txt (attached below; output shown is for job 165)

What could be going wrong?

Thanks,

PRG

-------------- next part --------------
Job: 165.<><>

11/21/2012 00:58:18  S    Job Queued at request of <>, owner = <>@<><>, job name = TorqueTest, queue = batchq
11/21/2012 00:58:18  S    Job Modified at request of Scheduler@<><>
11/21/2012 00:58:18  L    Job Run
11/21/2012 00:58:18  S    enqueuing into batchq, state 1 hop 1
11/21/2012 00:58:18  S    Job Run at request of Scheduler@<><>
11/21/2012 00:58:18  S    child reported success for job after 0 seconds (dest=???), rc=0
11/21/2012 00:58:18  A    queue=batchq
11/21/2012 00:58:18  A    user=<> group=domain_users jobname=TorqueTest queue=batchq ctime=1353477498 qtime=1353477498 etime=1353477498 start=1353477498 owner=<>@<><> exec_host=<><>/0
                          Resource_List.ncpus=1 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=00:01:00
11/21/2012 00:58:28  S    obit received - updating final job usage info
11/21/2012 00:58:28  S    job exit status 0 handled
11/21/2012 00:58:28  S    preparing to send 'e' mail for job 165.<><> to <>@<><> (Exit_status=0)
11/21/2012 00:58:28  S    Exit_status=0
11/21/2012 00:58:28  S    on_job_exit valid pjob: 0x7f1a080008c0 (substate=50)
11/21/2012 00:58:28  S    JOB_SUBSTATE_EXITING
11/21/2012 00:58:28  S    on_job_exit valid pjob: 0x7f1a080008c0 (substate=52)
11/21/2012 00:58:28  A    user=<> group=<> jobname=TorqueTest queue=batchq ctime=1353477498 qtime=1353477498 etime=1353477498 start=1353477498 owner=<>@<><> exec_host=<><>/0
                          Resource_List.ncpus=1 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=00:01:00 session=1942 end=1353477508 Exit_status=0
11/21/2012 00:59:28  S    on_job_exit valid pjob: 0x7f1a080008c0 (substate=59)
11/21/2012 00:59:28  S    dequeuing from batchq, state COMPLETE
-------------- next part --------------
Job Id: 166.<>
    Job_Name = TorqueTest
    Job_Owner = <>@<>
    resources_used.cput = 00:00:00
    resources_used.mem = 3176kb
    resources_used.vmem = 333788kb
    resources_used.walltime = 00:00:12
    job_state = E
    queue = batchq
    server = <>
    Checkpoint = u
    ctime = Wed Nov 21 01:06:24 2012
    Error_Path = <>:/mnt/xfer_scratch/torque_testing/
        TorqueTest.e166
    exec_host = <>/0
    exec_port = 15003
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = abe
    Mail_Users = <>@<><><>
    mtime = Wed Nov 21 01:06:36 2012
    Output_Path = <>:/mnt/xfer_scratch/torque_testing/TorqueTest.o166
    Priority = 0
    qtime = Wed Nov 21 01:06:24 2012
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 00:01:00
    session_id = 2110
    substate = 52
    Variable_List = PBS_O_QUEUE=batchq,PBS_O_HOME=/,
        PBS_O_WORKDIR=/mnt/xfer_scratch/torque_testing,
        PBS_O_HOST=<>,
        PBS_O_SERVER=<>,
        PBS_O_WORKDIR=/mnt/xfer_scratch/torque_testing
    euser = <>
    egroup = <>
    hashname = 166.<>
    queue_rank = 2
    queue_type = E
    comment = Job started on Wed Nov 21 at 01:06
    etime = Wed Nov 21 01:06:24 2012
    exit_status = 0
    submit_args = ./simple_job.sh
    start_time = Wed Nov 21 01:06:24 2012
    start_count = 1
    fault_tolerant = False
    job_radix = 0
    submit_host = <>
    init_work_dir = /mnt/xfer_scratch/torque_testing


