[torqueusers] Error requeuing job
Rhys Hill
rhys.hill at adelaide.edu.au
Sun Sep 30 04:36:14 MDT 2012
Hi everyone,
I have a particular job that I run regularly as part of a development project.
Whenever TORQUE gets stuck and the daemon is restarted, this job is always lost,
even though all the other jobs seem to come back fine. I always get a message
along these lines:
Unable to requeue job, queue is not defined; job XXX queue batch
while qstat -q reports:
server: XXX

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
large              --      --    24:00:00  --     0   0 --   E R
long_running       --      --       --     --     0   0 --   E R
image_search       --      --       --     --     0   0 --   E R
batch              --      --    48:00:00  --   122   7 --   E R
                                              ----- -----
                                                122   7
so the queue is clearly there. I submit the jobs using a script like this:
---
#!/bin/sh
# Submit the statistics job first; every job.sh instance depends on it.
DS_JOB=$(qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G ./data_statistics.sh)
DEPS=
for j in */job.sh; do
    # Each job.sh starts only after data_statistics.sh exits successfully.
    JOB_ID=$(qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -W depend=afterok:${DS_JOB} -l vmem=18G "$j")
    if [ -z "${DEPS}" ]; then
        DEPS="afterok:${JOB_ID}"
    else
        DEPS="${DEPS},afterok:${JOB_ID}"
    fi
done
# The report runs only after every job.sh instance has succeeded.
qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G -W depend=${DEPS} ./run_report.sh
---
i.e. the data_statistics.sh job runs first, followed by several instances of job.sh, and finally run_report.sh.
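The chain described above comes down to building one depend value from the job IDs that qsub prints. A minimal sketch of that step follows. The helper name build_deps is hypothetical (it is not part of TORQUE), and it uses the colon-separated form afterok:ID1:ID2 rather than the comma-separated form in the script. Empty IDs are skipped, on the assumption that an empty string means the corresponding qsub call failed.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): join job IDs into a single
# dependency list of the form afterok:ID1:ID2:...
build_deps() {
    deps=""
    for id in "$@"; do
        # Skip empty IDs, which would indicate a failed qsub.
        [ -n "$id" ] || continue
        deps="${deps}:${id}"
    done
    # Print nothing at all if no valid IDs were supplied.
    [ -n "$deps" ] && printf 'afterok%s\n' "$deps"
    return 0
}
```

It could then be used as, for example, qsub -W depend="$(build_deps "$ID1" "$ID2")" ./run_report.sh, where ID1 and ID2 stand for job IDs captured from earlier qsub calls.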
The complete server log looks like this:
09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;enqueuing into batch, state 4 hop 1
09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6614.XXX queue batch
09/30/2012 19:02:55;0001;PBS_Server;Req;;Server could not connect to MOM
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::job_abt, Unable to abort Job 6614.XXX which was in substate 42
09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;dequeuing from batch, state RUNNING
09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;enqueuing into batch, state 1 hop 1
09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6615.XXX queue batch
09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;dequeuing from batch, state EXITING
09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;enqueuing into batch, state 1 hop 1
09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6616.XXX queue batch
09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;dequeuing from batch, state EXITING
09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;enqueuing into batch, state 2 hop 1
09/30/2012 19:02:55;0080;PBS_Server;Job;6614.XXX;Unknown Job Id Error
09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6617.XXX queue batch
09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;dequeuing from batch, state EXITING
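To pick out only the jobs that failed to requeue from a log like the one above, something along these lines may help. It is a rough sketch that assumes the message wording is exactly as in this excerpt. The function name failed_requeues is made up for illustration. It reads the log on standard input and prints each affected job ID with its queue name, once per job.

```shell
#!/bin/sh
# Hypothetical helper: list "JOBID QUEUE" for every requeue failure
# found in a server log supplied on standard input.
failed_requeues() {
    # Keep only the matching lines and reduce each to "JOBID QUEUE".
    sed -n 's/.*Unable to requeue job, queue is not defined; job \([^ ]*\) queue \([^ ]*\).*/\1 \2/p' |
        sort -u
}
```

For example: failed_requeues < server_log_file, where server_log_file stands for whichever server log is being examined.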
We're using Moab for scheduling, if that makes any difference.
Any ideas?
Cheers,
--------------------------------------------------------------------------------
Rhys Hill, Senior Research Associate
Australian Centre for Visual Technologies, University of Adelaide
Phone: +61 8 8313 6197
Fax:   +61 8 8313 4366
Mail:  School of Computer Science, University of Adelaide, Adelaide, Australia 5005
http://www.cs.adelaide.edu.au/~rhys/
--------------------------------------------------------------------------------