[torqueusers] Error requeuing job

Clotho Tsang wytsang at clustertech.com
Sun Nov 4 19:26:23 MST 2012


I can reproduce the error message with the following method:

   1. qsub a.sh → job id 30
   2. qsub -W depend=afterok:30 b.sh → job id 31
   3. /etc/init.d/pbs_mom stop
   4. rm /var/spool/torque/server_priv/jobs/30.mgmt.chess.*
   5. /etc/init.d/pbs_mom start
   6. Job 30 stays queued forever and job 31 stays held.
   7. You will be able to find this error message in the log:
   "PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error),
   aux=0, type=RegisterDependency, "

See the detailed log in the attachment.
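
To check a log for the message in step 7, a minimal sketch along these lines may help. The helper name count_dep_rejects is hypothetical (it is not part of TORQUE); the two matched strings are copied from the message quoted above.

```shell
#!/bin/sh
# count_dep_rejects FILE
# Prints how many lines in FILE report a rejected RegisterDependency
# request with reply code 15001 (Unknown Job Id Error).
# Hypothetical helper for illustration, not a TORQUE command.
count_dep_rejects() {
    grep 'Reject reply code=15001' "$1" | grep -c 'type=RegisterDependency'
}
```

It would be called with the path of a server log file, for example a file under /var/spool/torque/server_logs/ (that directory is an assumption based on the spool path used in step 4).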


On 1 November 2012 17:25, Clotho Tsang <wytsang at clustertech.com> wrote:

> I also see "Unknown Job Id Error" occasionally when I submit jobs with
> dependencies.
> Every time, the case is related to a dependency, but I have not been able
> to figure out how to reproduce it.
>
>
> On 30 September 2012 18:36, Rhys Hill <rhys.hill at adelaide.edu.au> wrote:
>
>> Hi everyone,
>>
>> I have a particular job that I run regularly as part of a development
>> project. On
>> the occasions where torque gets stuck, this particular job is always lost
>> when the
>> daemon is restarted, even though all the other jobs seem to return OK. I
>> always get
>> a message along these lines:
>>
>> Unable to requeue job, queue is not defined; job XXX queue batch
>>
>> where the qstat -q says:
>>
>> server: XXX
>>
>> Queue            Memory CPU Time Walltime Node  Run Que Lm  State
>> ---------------- ------ -------- -------- ----  --- --- --  -----
>> large              --      --    24:00:00   --    0   0 --   E R
>> long_running       --      --       --      --    0   0 --   E R
>> image_search       --      --       --      --    0   0 --   E R
>> batch              --      --    48:00:00   --  122   7 --   E R
>>                                                ----- -----
>>                                                  122     7
>>
>> so obviously the queue is actually there. I submit the jobs using a
>> script like this:
>>
>> ---
>>
>> #!/bin/sh
>> DS_JOB=`qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G ./data_statistics.sh`
>>
>> JOBS=`ls */job.sh`
>> DEPS=;
>> for j in ${JOBS}; do
>>         JOB_ID=`qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -W depend=afterok:${DS_JOB} -l vmem=18G $j`
>>         if [ "${DEPS}x" = "x" ]; then
>>                 DEPS="afterok:${JOB_ID}"
>>         else
>>                 DEPS="${DEPS},afterok:${JOB_ID}"
>>         fi
>> done
>> qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G -W depend=${DEPS} ./run_report.sh
>>
>> ---
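
A failed qsub returns an empty job ID, and the loop above would then put a malformed entry into the dependency string. The sketch below is one way to guard against that. The helper name add_dep is hypothetical, and it builds the colon-separated form afterok:id1:id2 described in the qsub man page rather than the comma-joined form used in the script.

```shell
#!/bin/sh
# add_dep DEPS JOB_ID
# Prints DEPS extended with JOB_ID in the form afterok:id1:id2...
# If JOB_ID is empty (for example because qsub failed), prints DEPS
# unchanged and returns 1 so the caller can react.
# Hypothetical helper for illustration only.
add_dep() {
    deps=$1
    job_id=$2
    if [ -z "$job_id" ]; then
        printf '%s\n' "$deps"
        return 1
    fi
    if [ -z "$deps" ]; then
        printf 'afterok:%s\n' "$job_id"
    else
        printf '%s:%s\n' "$deps" "$job_id"
    fi
}
```

Inside the loop it could be used as DEPS=$(add_dep "$DEPS" "$JOB_ID") || echo "qsub failed for $j" >&2, with the final qsub still passing -W depend=${DEPS}.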
>>
>> i.e. the data_statistics.sh job runs first, followed by several instances
>> of job.sh, then run_report.sh.
>>
>> The server log looks like this in total:
>>
>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;enqueuing into batch, state 4 hop 1
>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6614.XXX queue batch
>> 09/30/2012 19:02:55;0001;PBS_Server;Req;;Server could not connect to MOM
>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::job_abt, Unable to abort Job 6614.XXX which was in substate 42
>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;dequeuing from batch, state RUNNING
>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;enqueuing into batch, state 1 hop 1
>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6615.XXX queue batch
>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;dequeuing from batch, state EXITING
>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;enqueuing into batch, state 1 hop 1
>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6616.XXX queue batch
>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;dequeuing from batch, state EXITING
>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;enqueuing into batch, state 2 hop 1
>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6614.XXX;Unknown Job Id Error
>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6617.XXX queue batch
>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;dequeuing from batch, state EXITING
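
To see at a glance which jobs the server failed to requeue in a log like this, a small filter can pull the job IDs out of the pbsd_init_reque entries. This is only a sketch: the helper name failed_requeues is hypothetical, and it assumes each entry sits on a single line, as it would in the server log file itself.

```shell
#!/bin/sh
# failed_requeues FILE
# Prints the unique job IDs found in pbsd_init_reque
# "Unable to requeue job, queue is not defined" entries in FILE,
# one ID per line, sorted.
# Hypothetical helper for illustration, not a TORQUE command.
failed_requeues() {
    sed -n 's/.*pbsd_init_reque, Unable to requeue job, queue is not defined; job \([^ ]*\) queue .*/\1/p' "$1" | sort -u
}
```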
>>
>> We're using Moab for scheduling, if that makes any difference.
>>
>> Any ideas?
>>
>> Cheers,
>>
>>
>> --------------------------------------------------------------------------------
>> Rhys Hill, Senior Research Associate
>> Australian Centre for Visual Technologies, University of Adelaide
>>
>> Phone: +61 8 8313 6197
>> Fax:   +61 8 8313 4366
>> Mail:  School of Computer Science
>>        University of Adelaide
>>        Adelaide, Australia 5005
>> http://www.cs.adelaide.edu.au/~rhys/
>>
>> --------------------------------------------------------------------------------
>>
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>>


-- 
Clotho Tsang
Senior Software Engineer
Cluster Technology Limited
Email: clotho at clustertech.com
Tel: (852) 2655-6129
Fax: (852) 2994-2101
Website: www.clustertech.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unknownjobid.zip
Type: application/zip
Size: 1202 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torqueusers/attachments/20121105/a556369d/attachment-0001.zip 

