[torqueusers] Unknown Job Id Behavior

Glen Beane glen.beane at gmail.com
Tue Jun 10 17:58:03 MDT 2008


On Tue, Jun 10, 2008 at 2:21 PM, Joshua Bernstein <
jbernstein at penguincomputing.com> wrote:

>
>
> Joshua Bernstein wrote:
>
>>
>>
>> Chris Samuel wrote:
>>
>>> ----- "Joshua Bernstein" <jbernstein at penguincomputing.com> wrote:
>>>
>>>>> But I've noticed in 2.3 that we seem to be hitting the
>>>>> same problem described by the OP.  :-(
>>>>>
>>>> Interesting. Are you running TORQUE in a diskless configuration like
>>>> I'm doing?
>>>>
>>>
>>> Nope, ours have 4 x 300GB drives and keep state.
>>>
>>> Does that help or hinder ?
>>>
>>
>> Doesn't help.
>>
>> I still think there is a problem with some area of the communication
>> between pbs_mom and pbs_server.
>>
>> If pbs_mom responds to pbs_server with a message saying that it doesn't
>> know anything about the job, shouldn't pbs_server just consider the job
>> dead, and either re-queue it or just notify the user?
>>
>
> I'm _STILL_ having problems with this. I've tried running version 2.3.0,
> and had the same problem. pbs_mom seems to try to respond to pbs_server's
> request, but nothing changes. pbs_mom reports:
>
> pbs_mom;Req;;Type StatusJob request received from PBS_Server at master,
> sock=10
> ...
> pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0,
> type=StatusJob, from PBS_Server at master
>
> The interesting thing, again, is that I'm running TORQUE's pbs_mom in a
> diskless configuration, so when a node reboots,
> /var/spool/torque/mom_priv/jobs is empty and no longer holds job
> information. If I NFS mount that directory to make it persistent, though,
> things seem to work.
>
> This all said, unless I'm not understanding something, I'm convinced that
> there is a bug here. When pbs_mom sends a 15001 error back to pbs_server,
> pbs_server should assume the job is dead and either requeue it, or simply
> declare the job dead.


I duplicated your diskless scenario by starting a job, shutting down
pbs_mom on that node, deleting the job files from the mom_priv/jobs
directory, then restarting pbs_mom. pbs_mom replied with the unknown job id
error, and pbs_server basically ignored the error and kept the job state as R.

So I think you are right, this is a bug. Perhaps if a job is "rerunnable"
then we requeue, otherwise say it is complete? I may take a quick crack at
just removing the job in this case, and then go from there.
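
To make the requeue-or-complete idea concrete, here is a minimal sketch of the
decision in Python. It is not TORQUE code (pbs_server is written in C), and
every name in it (Job, handle_status_reply, UNKNOWN_JOB_ID) is hypothetical.
Only the 15001 reply code and the R state come from the log and test above;
using "Q" for a requeued job and "C" for a completed one is an assumption
about how the states would be labeled.

```python
from dataclasses import dataclass

# Reply code shown in the pbs_mom log above ("Unknown Job Id").
UNKNOWN_JOB_ID = 15001


@dataclass
class Job:
    """Hypothetical server-side view of a job; not a TORQUE structure."""
    job_id: str
    state: str        # assumed labels: "Q" queued, "R" running, "C" complete
    rerunnable: bool


def handle_status_reply(job: Job, reply_code: int) -> Job:
    """Apply the proposed policy to a StatusJob reply from pbs_mom."""
    # Only act when the mom rejects a job the server still thinks is running.
    if reply_code != UNKNOWN_JOB_ID or job.state != "R":
        return job
    # The mom has lost the job: requeue it if allowed, otherwise close it out.
    job.state = "Q" if job.rerunnable else "C"
    return job


if __name__ == "__main__":
    lost = Job(job_id="123.master", state="R", rerunnable=True)
    print(handle_status_reply(lost, UNKNOWN_JOB_ID).state)
```

The point of the sketch is only that the server stops leaving the job in R
once the mom has said it knows nothing about it. Whether a non-rerunnable job
should be marked complete or simply removed is still the open question.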