[torqueusers] Error requeuing job

Clotho Tsang wytsang at clustertech.com
Thu Nov 1 03:25:26 MDT 2012


I also occasionally see "Unknown Job Id Error" when I submit jobs with
dependencies. Every time, the case turns out to be related to dependencies,
but I have not been able to figure out how to reproduce it.
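
One thing that may be worth ruling out: in the quoted script, the -W option
inside the loop is written as "-W afterok:${DS_JOB}" without the "depend="
prefix, whereas the final qsub uses "-W depend=...". I have not confirmed
that this is related to the error. Below is a minimal sketch of building the
dependency list explicitly. build_deps is a hypothetical helper name, not
part of Torque, and the qsub calls shown in the comments are assumptions
about the intended usage.

```sh
#!/bin/sh
# Hypothetical helper (not part of Torque): join job IDs into a single
# afterok dependency list of the form "afterok:ID1:ID2:...".
build_deps() {
    deps=""
    for id in "$@"; do
        # Skip empty IDs so a failed qsub does not produce "afterok::".
        [ -n "$id" ] || continue
        if [ -z "$deps" ]; then
            deps="afterok:$id"
        else
            deps="$deps:$id"
        fi
    done
    printf '%s\n' "$deps"
}

# Sketch of the intended use (assumes a working qsub; resource options
# omitted for brevity):
#   DS_JOB=`qsub ./data_statistics.sh`
#   IDS=""
#   for j in */job.sh; do
#       IDS="$IDS `qsub -W depend=afterok:${DS_JOB} $j`"
#   done
#   qsub -W depend=`build_deps $IDS` ./run_report.sh
```

If build_deps prints an empty string, none of the qsub calls returned a job
ID, and the final submission should probably be skipped.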

On 30 September 2012 18:36, Rhys Hill <rhys.hill at adelaide.edu.au> wrote:

> Hi everyone,
>
> I have a particular job that I run regularly as part of a development
> project. On the occasions where torque gets stuck, this particular job is
> always lost when the daemon is restarted, even though all the other jobs
> seem to return OK. I always get a message along these lines:
>
> Unable to requeue job, queue is not defined; job XXX queue batch
>
> where qstat -q says:
>
> server: XXX
>
> Queue            Memory CPU Time Walltime Node  Run Que Lm  State
> ---------------- ------ -------- -------- ----  --- --- --  -----
> large              --      --    24:00:00   --    0   0 --   E R
> long_running       --      --       --      --    0   0 --   E R
> image_search       --      --       --      --    0   0 --   E R
> batch              --      --    48:00:00   --  122   7 --   E R
>                                                ----- -----
>                                                  122     7
>
> so obviously the queue is actually there. I submit the jobs using a script
> like this:
>
> ---
>
> #!/bin/sh
> DS_JOB=`qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G ./data_statistics.sh`
>
> JOBS=`ls */job.sh`
> DEPS=;
> for j in ${JOBS}; do
>         JOB_ID=`qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -W afterok:${DS_JOB} -l vmem=18G $j`
>         if [ "${DEPS}x" = "x" ]; then
>                 DEPS="afterok:${JOB_ID}"
>         else
>                 DEPS="${DEPS},afterok:${JOB_ID}"
>         fi
> done
> qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G -W depend=${DEPS} ./run_report.sh
>
> ---
>
> i.e., the data_statistics.sh job runs first, followed by several instances
> of job.sh, then run_report.sh.
>
> The server log looks like this in total:
>
> 09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;enqueuing into batch, state 4 hop 1
> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6614.XXX queue batch
> 09/30/2012 19:02:55;0001;PBS_Server;Req;;Server could not connect to MOM
> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::job_abt, Unable to abort Job 6614.XXX which was in substate 42
> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
> 09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;dequeuing from batch, state RUNNING
> 09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;enqueuing into batch, state 1 hop 1
> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6615.XXX queue batch
> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
> 09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;dequeuing from batch, state EXITING
> 09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;enqueuing into batch, state 1 hop 1
> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6616.XXX queue batch
> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
> 09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;dequeuing from batch, state EXITING
> 09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;enqueuing into batch, state 2 hop 1
> 09/30/2012 19:02:55;0080;PBS_Server;Job;6614.XXX;Unknown Job Id Error
> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6617.XXX queue batch
> 09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;dequeuing from batch, state EXITING
>
> We're using Moab for scheduling, if that makes any difference.
>
> Any ideas?
>
> Cheers,
>
>
> --------------------------------------------------------------------------------
> Rhys Hill, Senior Research Associate
> Australian Centre for Visual Technologies, University of Adelaide
>
> Phone: +61 8 8313 6197
> Fax:   +61 8 8313 4366
> Mail:  School of Computer Science, University of Adelaide,
>        Adelaide, Australia 5005
> http://www.cs.adelaide.edu.au/~rhys/
>
> --------------------------------------------------------------------------------



-- 
Clotho Tsang
Senior Software Engineer
Cluster Technology Limited
Email: clotho at clustertech.com
Tel: (852) 2655-6129
Fax: (852) 2994-2101
Website: www.clustertech.com