[torqueusers] Error requeuing job

Clotho Tsang wytsang at clustertech.com
Mon Nov 19 22:33:26 MST 2012


In the Torque 4.1.3 CHANGELOG, I found this:

  e - Made it so pbs_server will come up even if a job cannot recover
      because of a missing job dependency. TRQ-1287

Maybe the problem is solved?
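
To tell whether a given installation is at or past the release named in the
changelog entry above, a dotted version comparison is enough. This is only a
sketch: version_ge is a hypothetical helper, it handles purely numeric fields
(a suffix such as "-rc1" would break it), and the caller must supply the
installed version string however the site obtains it.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted version A is >= dotted version B.
# Hypothetical helper; assumes every field is a plain non-negative integer.
version_ge() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Take the leading field of each version; a missing field counts as 0.
        fa=${a%%.*}
        fb=${b%%.*}
        [ -n "$fa" ] || fa=0
        [ -n "$fb" ] || fb=0
        if [ "$fa" -gt "$fb" ]; then return 0; fi
        if [ "$fa" -lt "$fb" ]; then return 1; fi
        # Drop the field that was just compared.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    # All fields were equal.
    return 0
}
```

For example, `version_ge "$installed" 4.1.3 && echo "has the TRQ-1287 change"`,
where $installed is whatever version the server reports. Passing this check
only says the changelog entry applies; it does not prove the requeue problem
is gone.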

On 5 November 2012 10:26, Clotho Tsang <wytsang at clustertech.com> wrote:

> I can reproduce the error message with the following steps:
>
>    1. qsub a.sh → job id 30
>    2. qsub -W depend=afterok:30 b.sh → job id 31
>    3. /etc/init.d/pbs_mom stop
>    4. rm /var/spool/torque/server_priv/jobs/30.mgmt.chess.*
>    5. /etc/init.d/pbs_mom start
>    6. Job 30 stays queued forever, and job 31 stays held.
>    7. You will be able to find this error message in the log:
>    "PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error),
>    aux=0, type=RegisterDependency, "
>
> See the detailed log in the attachment.
>
>
>
> On 1 November 2012 17:25, Clotho Tsang <wytsang at clustertech.com> wrote:
>
>> I also occasionally see "Unknown Job Id Error" when I submit jobs with
>> dependencies.
>> Every time, the case is related to a dependency, but I have not been able
>> to figure out how to reproduce it.
>>
>>
>> On 30 September 2012 18:36, Rhys Hill <rhys.hill at adelaide.edu.au> wrote:
>>
>>> Hi everyone,
>>>
>>> I have a particular job that I run regularly as part of a development
>>> project. On the occasions where torque gets stuck, this particular job is
>>> always lost when the daemon is restarted, even though all the other jobs
>>> seem to return OK. I always get a message along these lines:
>>>
>>> Unable to requeue job, queue is not defined; job XXX queue batch
>>>
>>> whereas qstat -q says:
>>>
>>> server: XXX
>>>
>>> Queue            Memory CPU Time Walltime Node  Run Que Lm  State
>>> ---------------- ------ -------- -------- ----  --- --- --  -----
>>> large              --      --    24:00:00   --    0   0 --   E R
>>> long_running       --      --       --      --    0   0 --   E R
>>> image_search       --      --       --      --    0   0 --   E R
>>> batch              --      --    48:00:00   --  122   7 --   E R
>>>                                                ----- -----
>>>                                                  122     7
>>>
>>> so obviously the queue is actually there. I submit the jobs using a
>>> script like this:
>>>
>>> ---
>>>
>>> #!/bin/sh
>>> DS_JOB=`qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G \
>>>         ./data_statistics.sh`
>>>
>>> JOBS=`ls */job.sh`
>>> DEPS=;
>>> for j in ${JOBS}; do
>>>         JOB_ID=`qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 \
>>>                 -W afterok:${DS_JOB} -l vmem=18G $j`
>>>         if [ "${DEPS}x" = "x" ]; then
>>>                 DEPS="afterok:${JOB_ID}"
>>>         else
>>>                 DEPS="${DEPS},afterok:${JOB_ID}"
>>>         fi
>>> done
>>> qsub -l walltime=24:00:00 -l nodes=1:type2:ppn=16 -l vmem=18G \
>>>         -W depend=${DEPS} ./run_report.sh
>>>
>>> ---
>>>
>>> i.e. the data_statistics.sh job runs first, followed by several instances
>>> of job.sh, then run_report.sh.
>>>
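
One detail in the submission script quoted above: inside the loop the option is
written as "-W afterok:${DS_JOB}", without the "depend=" prefix that the final
qsub uses. If that is in the real script and not just a typo in the mail, the
per-directory jobs may not be getting the dependency that was intended. A
small helper that always emits the prefix and refuses empty job IDs (for
example when an earlier qsub failed) avoids both problems. This is a sketch:
build_depend is a hypothetical name, and it uses the colon-separated
"type:jobid[:jobid...]" form of the dependency list.

```shell
#!/bin/sh
# build_depend TYPE ID...: print a value for qsub -W, for example
#   depend=afterok:6614.XXX:6615.XXX
# Hypothetical helper. Fails (prints nothing) if no job ID is given or if
# any job ID is empty, so a failed qsub cannot yield a malformed dependency.
build_depend() {
    dep_type=$1
    shift
    [ "$#" -gt 0 ] || return 1
    spec="depend=${dep_type}"
    for id in "$@"; do
        [ -n "$id" ] || return 1
        spec="${spec}:${id}"
    done
    printf '%s\n' "$spec"
}

# Possible usage, mirroring the quoted script (not executed here):
#   DS_JOB=$(qsub ./data_statistics.sh) || exit 1
#   DEP=$(build_depend afterok "$DS_JOB") || exit 1
#   JOB_ID=$(qsub -W "$DEP" ./job.sh)
```

Collecting the loop's job IDs and passing them all to one build_depend call
also replaces the hand-built DEPS string for the final run_report.sh
submission.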
>>> The server log looks like this in total:
>>>
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;enqueuing into batch, state 4 hop 1
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6614.XXX queue batch
>>> 09/30/2012 19:02:55;0001;PBS_Server;Req;;Server could not connect to MOM
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::job_abt, Unable to abort Job 6614.XXX which was in substate 42
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6614.XXX;dequeuing from batch, state RUNNING
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;enqueuing into batch, state 1 hop 1
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6615.XXX queue batch
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6615.XXX;dequeuing from batch, state EXITING
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;enqueuing into batch, state 1 hop 1
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6616.XXX queue batch
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6617.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6616.XXX;dequeuing from batch, state EXITING
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;enqueuing into batch, state 2 hop 1
>>> 09/30/2012 19:02:55;0080;PBS_Server;Job;6614.XXX;Unknown Job Id Error
>>> 09/30/2012 19:02:55;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=RegisterDependency, from @XXX
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::reply_send_svr, did not find work task for local request
>>> 09/30/2012 19:02:55;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::pbsd_init_reque, Unable to requeue job, queue is not defined; job 6617.XXX queue batch
>>> 09/30/2012 19:02:55;0100;PBS_Server;Job;6617.XXX;dequeuing from batch, state EXITING
>>>
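
In a log like the one quoted above, it helps to see which job IDs the server
is failing to look up and how often. The sketch below assumes the
semicolon-separated layout visible in those lines (timestamp; event code;
server name; object type; object name; message) and one record per line, as in
the server's own log file. unknown_job_ids is a hypothetical name.

```shell
#!/bin/sh
# unknown_job_ids: read pbs_server log lines on stdin and print
# "count jobid" for every job that logged "Unknown Job Id Error".
# Assumes fields are separated by ';' as in the quoted log:
#   $4 = object type ("Job"), $5 = job ID, $6 = message text.
unknown_job_ids() {
    awk -F';' '
        $4 == "Job" && $6 == "Unknown Job Id Error" { n[$5]++ }
        END { for (id in n) print n[id], id }
    ' | sort -k2
}
```

It would be run as `unknown_job_ids < logfile`, where logfile is the server
log for the day in question (its location depends on the installation). Job
IDs that show up here but were never deleted by hand are the ones whose
dependency registration is worth checking first.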
>>> We're using Moab for scheduling, if that makes any difference.
>>>
>>> Any ideas?
>>>
>>> Cheers,
>>>
>>>
>>> --------------------------------------------------------------------------------
>>> Rhys Hill, Senior Research Associate
>>> Australian Centre for Visual Technologies, University of Adelaide
>>>
>>> Phone: +61 8 8313 6197
>>> Fax:   +61 8 8313 4366
>>> Mail:  School of Computer Science
>>>        University of Adelaide
>>>        Adelaide, Australia 5005
>>> http://www.cs.adelaide.edu.au/~rhys/
>>> --------------------------------------------------------------------------------
>>>
>>
>>
>>
>> --
>> Clotho Tsang
>> Senior Software Engineer
>> Cluster Technology Limited
>> Email: clotho at clustertech.com
>> Tel: (852) 2655-6129
>> Fax: (852) 2994-2101
>> Website: www.clustertech.com
>>
>>
>
>


-- 
Clotho Tsang
Senior Software Engineer
Cluster Technology Limited
Email: clotho at clustertech.com
Tel: (852) 2655-6129
Fax: (852) 2994-2101
Website: www.clustertech.com