[torqueusers] Error requeuing job

Rhys Hill rhys.hill at adelaide.edu.au
Sun Sep 30 04:36:14 MDT 2012


Hi everyone,

I have a particular job that I run regularly as part of a development project.
Whenever Torque gets stuck and the daemon is restarted, this job is always lost,
even though all the other jobs seem to come back OK. I always get a message along
these lines:

Unable to requeue job, queue is not defined; job XXX queue batch

whereas qstat -q says:

server: XXX

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
large              --      --    24:00:00   --    0   0 --   E R
long_running       --      --       --      --    0   0 --   E R
image_search       --      --       --      --    0   0 --   E R
batch              --      --    48:00:00   --  122   7 --   E R
                                               ----- -----
                                                 122     7

so obviously the queue is actually there. I submit the jobs using a script like this:

---

#!/bin/sh
DS_JOB=`qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G ./data_statistics.sh`

JOBS=`ls */job.sh`
DEPS=;
for j in ${JOBS}; do
        JOB_ID=`qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -W depend=afterok:${DS_JOB} -l vmem=18G $j`
        if [ "${DEPS}x" = "x" ]; then
                DEPS="afterok:${JOB_ID}"
        else
                DEPS="${DEPS},afterok:${JOB_ID}"
        fi
done
qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G -W depend=${DEPS} ./run_report.sh

---

i.e. the data_statistics.sh job runs first, followed by several instances of job.sh, then run_report.sh.
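
The dependency list for the final job could also be built with a small helper, sketched here. This is only an illustration: build_afterok is a hypothetical function name, and it assumes the colon-separated form of the qsub dependency attribute (depend=afterok:id1:id2) rather than the comma-separated list used in the script above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): join the job IDs given as
# arguments into a single "afterok:id1:id2:..." dependency string.
build_afterok() {
    deps=""
    for id in "$@"; do
        if [ -z "$deps" ]; then
            deps="afterok:${id}"
        else
            deps="${deps}:${id}"
        fi
    done
    # Prints an empty line when no job IDs were given.
    printf '%s\n' "$deps"
}

# Usage sketch (needs a working qsub, so it is left commented out):
#   DEPS=$(build_afterok "$JOB_A" "$JOB_B")
#   qsub -W depend="$DEPS" ./run_report.sh
```

The helper only builds the string; submitting the jobs and collecting their IDs stays as in the script above.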

The full server log looks like this:

09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;enqueuing into batch, state 4 hop 1
09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6614.XXX queue batch
09/30/2012 19:02:55;0001;PBS_Server;Req;;Server could not connect to MOM
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::job_abt, Unable to abort Job 6614.XXX which was in substate 42
09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;dequeuing from batch, state RUNNING
09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;enqueuing into batch, state 1 hop 1
09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6615.XXX queue batch
09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;dequeuing from batch, state EXITING
09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;enqueuing into batch, state 1 hop 1
09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6616.XXX queue batch
09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;dequeuing from batch, state EXITING
09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;enqueuing into batch, state 2 hop 1
09/30/2012 19:02:55;0080;PBS_Server;Job;6614.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6617.XXX queue batch
09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;dequeuing from batch, state EXITING
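
To summarise which jobs hit the requeue error, a filter along these lines could be run over the log. It is a rough sketch: requeue_failures is a hypothetical function name, and it assumes the message text is exactly as shown in the log lines above.

```shell
#!/bin/sh
# Hypothetical helper: read server log lines on stdin and print the unique
# job IDs that were reported with "Unable to requeue job, queue is not defined".
requeue_failures() {
    sed -n 's/.*Unable to requeue job, queue is not defined; job \([^ ]*\) queue .*/\1/p' | sort -u
}

# Usage sketch, assuming the log was saved as server.log (hypothetical name):
#   requeue_failures < server.log
```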

We're using Moab for scheduling, if that makes any difference.

Any ideas?

Cheers,

--------------------------------------------------------------------------------
Rhys Hill,                                             Senior Research Associate
Australian Centre for Visual Technologies                 University of Adelaide

Phone: +61 8 8313 6197                           Mail:
Fax:   +61 8 8313 4366                           School of Computer Science
                                                 University of Adelaide
                                                 Adelaide, Australia
http://www.cs.adelaide.edu.au/~rhys/             5005
--------------------------------------------------------------------------------