[torqueusers] Unknown Job Id Behavior
glen.beane at gmail.com
Tue Jun 10 17:58:03 MDT 2008
On Tue, Jun 10, 2008 at 2:21 PM, Joshua Bernstein <
jbernstein at penguincomputing.com> wrote:
> Joshua Bernstein wrote:
>> Chris Samuel wrote:
>>> ----- "Joshua Bernstein" <jbernstein at penguincomputing.com> wrote:
>>>>> But I've noticed in 2.3 that we seem to be hitting the
>>>>> same problem described by the OP. :-(
>>>> Interesting. Are you running TORQUE in a diskless configuration like
>>>> I'm doing?
>>> Nope, ours have 4 x 300GB drives and keep state.
>>> Does that help or hinder ?
>> Doesn't help.
>> I still think there is a problem with some area of the communication
>> between pbs_mom and pbs_server.
>> If pbs_mom responds to pbs_server with a message saying that it doesn't
>> know anything about the job, shouldn't pbs_server just consider the job
>> dead, and either re-queue it or just notify the user?
> I'm _STILL_ having problems with this. I've tried running version 2.3.0,
> and had the same problem. pbs_mom seems to try to respond to pbs_server's
> request, but nothing changes. pbs_mom reports:
> pbs_mom;Req;;Type StatusJob request received from PBS_Server at master,
> pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0,
> type=StatusJob, from PBS_Server at master
> The interesting thing, again, is that I'm running TORQUE's pbs_mom in a
> diskless configuration, so when a node reboots,
> /var/spool/torque/mom_priv/jobs is empty and no longer holds job
> information. If I NFS-mount that directory to make it persistent, though,
> things seem to work.
> All that said, unless I'm misunderstanding something, I'm convinced that
> there is a bug here. When pbs_mom sends a 15001 error back to pbs_server,
> pbs_server should assume the job is dead and either requeue it, or simply
> declare the job dead.
I duplicated your diskless scenario by starting a job, shutting down
pbs_mom on that node, deleting the job files from the mom_priv/jobs
directory, and then restarting pbs_mom. pbs_mom replied with the unknown
job id error, and pbs_server basically ignored the error and kept the job
state as R.
So I think you are right: this is a bug. Perhaps if a job is "rerunnable"
we should requeue it, and otherwise mark it complete? I may take a quick
crack at just removing the job in this case, and then go from there.
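
The requeue-or-complete idea above can be sketched roughly as follows. This
is illustrative Python, not pbs_server code: the Job type, the
handle_status_reply function, and the use of the single-letter states Q and
C as the targets are assumptions made for the example. Only the 15001 reply
code and the R state come from the thread itself.

```python
from dataclasses import dataclass

# Reply code pbs_mom logged above: 15001 (Unknown Job Id).
UNKNOWN_JOB_ID = 15001


@dataclass
class Job:
    """Hypothetical stand-in for the server's record of a job."""
    job_id: str
    state: str        # single-letter state; "R" means running
    rerunnable: bool  # whether the job may be run again from the start


def handle_status_reply(job: Job, reply_code: int) -> Job:
    """Apply the proposed policy to a StatusJob reply from pbs_mom.

    If the mom no longer knows a job the server still believes is
    running, requeue a rerunnable job and mark any other job complete.
    Every other combination leaves the job untouched.
    """
    if reply_code != UNKNOWN_JOB_ID or job.state != "R":
        return job
    # Assumed target states: "Q" (queued) and "C" (complete).
    job.state = "Q" if job.rerunnable else "C"
    return job


if __name__ == "__main__":
    # "123.master" is a made-up job id for demonstration only.
    job = handle_status_reply(Job("123.master", "R", rerunnable=True),
                              UNKNOWN_JOB_ID)
    print(job.job_id, job.state)
```

Whether a non-rerunnable job should be marked complete or removed outright
is still an open question here; the sketch just picks one of the two options
mentioned above.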