[torqueusers] Error requeuing job
Clotho Tsang
wytsang at clustertech.com
Mon Nov 19 22:33:26 MST 2012
In the Torque 4.1.3 CHANGELOG, I found this:
e - Made it so pbs_server will come up even if a job cannot recover because of a missing job dependency. TRQ-1287
Maybe the problem is solved?
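
To tell whether an installation already includes the 4.1.3 change, the installed version can be compared against 4.1.3. The following is only a sketch: the function name version_at_least is made up for illustration, it assumes purely numeric dotted versions, and how the installed version string is obtained is left to the reader.

```shell
#!/bin/sh
# version_at_least HAVE WANT: succeed when dotted version HAVE >= WANT.
# Hypothetical helper; compares up to three numeric fields and treats
# missing fields as 0. Non-numeric fields (e.g. "4.1.3-rc1") are not handled.
version_at_least() {
    have=$1
    want=$2
    old_ifs=$IFS
    IFS=.
    set -- $have
    h1=${1:-0} h2=${2:-0} h3=${3:-0}
    set -- $want
    w1=${1:-0} w2=${2:-0} w3=${3:-0}
    IFS=$old_ifs
    [ "$h1" -gt "$w1" ] && return 0
    [ "$h1" -lt "$w1" ] && return 1
    [ "$h2" -gt "$w2" ] && return 0
    [ "$h2" -lt "$w2" ] && return 1
    [ "$h3" -ge "$w3" ]
}
```

For example, `version_at_least "$installed" 4.1.3 && echo "TRQ-1287 change should be present"`, where `$installed` is whatever version the site reports.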
On 5 November 2012 10:26, Clotho Tsang <wytsang at clustertech.com> wrote:
> I can reproduce the error message with the following steps:
>
> 1. qsub a.sh → job id 30
> 2. qsub -W depend=afterok:30 b.sh → job id 31
> 3. /etc/init.d/pbs_mom stop
> 4. rm /var/spool/torque/server_priv/jobs/30.mgmt.chess.*
> 5. /etc/init.d/pbs_mom start
> 6. Job 30 will stay queued forever, and job 31 will stay held.
> 7. You will be able to find this error message in the log:
> "PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, "
>
> See the detailed log in the attachment.
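
Step 4 above removes the files for job 30 from /var/spool/torque/server_priv/jobs while job 31 still depends on it. A rough way to spot that situation before a restart is to check that every dependency target still has a file in that directory. This is a sketch only: the function names are hypothetical, and it assumes job files are named as in step 4, that is, the job number followed by a dot and the rest of the name.

```shell
#!/bin/sh
# has_job_file DIR JOBNUM: succeed if DIR holds a file named JOBNUM.<anything>.
# (Hypothetical helper; assumes the naming seen in "30.mgmt.chess.*".)
has_job_file() {
    dir=$1
    num=$2
    for f in "$dir/$num".*; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

# missing_job_files DIR JOBNUM...: print each job number with no file in DIR.
missing_job_files() {
    dir=$1
    shift
    for num in "$@"; do
        has_job_file "$dir" "$num" || echo "$num"
    done
}
```

With the steps above, `missing_job_files /var/spool/torque/server_priv/jobs 30 31` would be expected to print 30 once step 4 has been carried out.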
>
>
>
> On 1 November 2012 17:25, Clotho Tsang <wytsang at clustertech.com> wrote:
>
>> I also see "Unknown Job Id Error" occasionally when I submit jobs with
>> dependencies. Every time, the case is related to a dependency, but I have
>> not been able to figure out how to reproduce it.
>>
>>
>> On 30 September 2012 18:36, Rhys Hill <rhys.hill at adelaide.edu.au> wrote:
>>
>>> Hi everyone,
>>>
>>> I have a particular job that I run regularly as part of a development
>>> project. On the occasions where torque gets stuck, this particular job is
>>> always lost when the daemon is restarted, even though all the other jobs
>>> seem to return OK. I always get a message along these lines:
>>>
>>> Unable to requeue job, queue is not defined; job XXX queue batch
>>>
>>> whereas qstat -q says:
>>>
>>> server: XXX
>>>
>>> Queue            Memory CPU Time Walltime Node  Run Que Lm  State
>>> ---------------- ------ -------- -------- ----  --- --- --  -----
>>> large              --      --    24:00:00   --    0   0 --   E R
>>> long_running       --      --       --      --    0   0 --   E R
>>> image_search       --      --       --      --    0   0 --   E R
>>> batch              --      --    48:00:00   --  122   7 --   E R
>>>                                               ----- -----
>>>                                                 122     7
>>>
>>> so obviously the queue is actually there. I submit the jobs using a
>>> script like this:
>>>
>>> ---
>>>
>>> #!/bin/sh
>>> # Submit the statistics job first; every job.sh depends on it.
>>> DS_JOB=`qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G \
>>>     ./data_statistics.sh`
>>>
>>> JOBS=`ls */job.sh`
>>> DEPS=
>>> for j in ${JOBS}; do
>>>     # -W takes attribute=value, so the dependency needs the depend= prefix.
>>>     JOB_ID=`qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 \
>>>         -W depend=afterok:${DS_JOB} -l vmem=18G $j`
>>>     # Collect every job ID so the report waits for all of them.
>>>     if [ "${DEPS}x" = "x" ]; then
>>>         DEPS="afterok:${JOB_ID}"
>>>     else
>>>         DEPS="${DEPS},afterok:${JOB_ID}"
>>>     fi
>>> done
>>> qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G \
>>>     -W depend=${DEPS} ./run_report.sh
>>>
>>> ---
>>>
>>> i.e., the data_statistics.sh job runs first, followed by several instances
>>> of job.sh, then run_report.sh.
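
The dependency string for the final run_report.sh job can also be built in one place. The sketch below uses the afterok:jobid[:jobid...] form that qsub accepts for -W depend= and skips empty IDs, which is what a failed qsub inside backticks would leave behind. The function name join_afterok is hypothetical and not part of Torque.

```shell
#!/bin/sh
# join_afterok ID...: print "afterok:ID1:ID2:..." for the given job IDs.
# Prints an empty line when no usable ID is given.
join_afterok() {
    deps=
    for id in "$@"; do
        [ -n "$id" ] || continue   # skip empty IDs from failed submissions
        deps="${deps:-afterok}:$id"
    done
    printf '%s\n' "$deps"
}
```

A caller could then refuse to submit the report when the result is empty, rather than passing an empty depend= value to qsub.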
>>>
>>> The server log looks like this in total:
>>>
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;enqueuing into batch, state 4 hop 1
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6614.XXX queue batch
>>> 09/30/2012 19:02:55;0001;PBS_Server;Req;;Server could not connect to MOM
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::job_abt, Unable to abort Job 6614.XXX which was in substate 42
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;dequeuing from batch, state RUNNING
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;enqueuing into batch, state 1 hop 1
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6615.XXX queue batch
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;dequeuing from batch, state EXITING
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;enqueuing into batch, state 1 hop 1
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6616.XXX queue batch
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;dequeuing from batch, state EXITING
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;enqueuing into batch, state 2 hop 1
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6614.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6617.XXX queue batch
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;dequeuing from batch, state EXITING
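
A log like this is easier to read once the affected jobs are pulled out. The sketch below lists each job ID that appears in a "queue is not defined" requeue error. The function name requeue_failures is hypothetical, and the pattern is taken only from the message text shown in this log; other Torque versions may word it differently.

```shell
#!/bin/sh
# requeue_failures LOGFILE: print every job ID named in a
# "queue is not defined; job <id> queue <name>" entry, once each, sorted.
requeue_failures() {
    sed -n 's/.*queue is not defined; job \([^ ]*\) queue .*/\1/p' "$1" | sort -u
}
```

Run against the log above, this would be expected to list 6614.XXX through 6617.XXX. The location of the server log file is site-specific and has to be supplied by the caller.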
>>>
>>> We're using Moab for scheduling, if that makes any difference.
>>>
>>> Any ideas?
>>>
>>> Cheers,
>>>
>>>
>>> --------------------------------------------------------------------------------
>>> Rhys Hill, Senior Research Associate
>>> Australian Centre for Visual Technologies, University of Adelaide
>>>
>>> Phone: +61 8 8313 6197
>>> Fax: +61 8 8313 4366
>>> Mail: School of Computer Science, University of Adelaide,
>>>       Adelaide, Australia 5005
>>> http://www.cs.adelaide.edu.au/~rhys/
>>>
>>> --------------------------------------------------------------------------------
>>>
>>
>>
>>
>> --
>> Clotho Tsang
>> Senior Software Engineer
>> Cluster Technology Limited
>> Email: clotho at clustertech.com
>> Tel: (852) 2655-6129
>> Fax: (852) 2994-2101
>> Website: www.clustertech.com
>>
>>
--
Clotho Tsang
Senior Software Engineer
Cluster Technology Limited
Email: clotho at clustertech.com
Tel: (852) 2655-6129
Fax: (852) 2994-2101
Website: www.clustertech.com