[torqueusers] Unknown Job Id Behavior
Joshua Bernstein
jbernstein at penguincomputing.com
Tue Jun 10 18:19:30 MDT 2008
Glen Beane wrote:
>
>
> On Tue, Jun 10, 2008 at 2:21 PM, Joshua Bernstein
> <jbernstein at penguincomputing.com> wrote:
>
>
>
> Joshua Bernstein wrote:
>
>
>
> Chris Samuel wrote:
>
> ----- "Joshua Bernstein" <jbernstein at penguincomputing.com> wrote:
>
> But I've noticed in 2.3 that we seem to be hitting the
> same problem described by the OP. :-(
>
> Interesting. Are you running TORQUE in a diskless
> configuration like I am?
>
>
> Nope, ours have 4 x 300GB drives and keep state.
>
> Does that help or hinder?
>
>
> Doesn't help.
>
> I still think there is a problem with some area of the
> communication between pbs_mom and pbs_server.
>
> If pbs_mom responds to pbs_server with a message saying that it
> doesn't know anything about the job, shouldn't pbs_server just
> consider the job dead, and either re-queue it or just notify the
> user?
>
>
> I'm _STILL_ having problems with this. I've tried running version
> 2.3.0, and had the same problem. pbs_mom seems to try to respond to
> pbs_server's request, but nothing changes. pbs_mom reports:
>
> pbs_mom;Req;;Type StatusJob request received from PBS_Server at master,
> sock=10
> ...
> pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id),
> aux=0, type=StatusJob, from PBS_Server at master
>
> The interesting thing, again, is that I'm running TORQUE's pbs_mom in
> a diskless configuration, so when a node reboots,
> /var/spool/torque/mom_priv/jobs is empty and no longer holds any job
> information. If I NFS-mount that directory to make it persistent,
> though, things seem to work.
>
> That said, unless I'm misunderstanding something, I'm convinced
> that there is a bug here. When pbs_mom sends a 15001 error back to
> pbs_server, pbs_server should assume the job is dead and either
> requeue it or simply report it as dead.
>
>
> I duplicated your diskless scenario by starting a job, shutting down
> pbs_mom on that node, deleting the job files from the mom_priv/jobs
> directory, then restarting pbs_mom. pbs_mom replied with the unknown
> job id error, and pbs_server basically ignored the error and kept the
> job in state R.
>
> So I think you are right, this is a bug. Perhaps if a job is
> "rerunnable" then we requeue, otherwise say it is complete? I may take
> a quick crack at just removing the job in this case, and then go from there.
Excellent! I would really appreciate you taking a crack at this. If
there is anything I can do to help, or if you can give me a starting
point, I can dig through the code myself and have a look as well.
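
To make sure we mean the same thing, here is a minimal sketch in Python
of the behavior being proposed. It is not TORQUE code: the Job class,
the handle_status_reply function and the constant name are hypothetical
and exist only for this illustration. The only details taken from this
thread are the 15001 (Unknown Job Id) reply code, the job sitting in
state R, and the idea of requeueing rerunnable jobs and completing the
rest.

```python
from dataclasses import dataclass

# Reply code pbs_mom logs for "Unknown Job Id" (from the log excerpt in this thread).
# The constant name is chosen for this sketch only.
UNKNOWN_JOB_ID = 15001


@dataclass
class Job:
    """Hypothetical stand-in for the server's record of a job."""
    job_id: str
    state: str        # single-letter state: "R" running, "Q" queued, "C" complete
    rerunnable: bool  # whether the job may be run again from the start


def handle_status_reply(job: Job, reply_code: int) -> Job:
    """Sketch of how the server might react to a StatusJob reply from pbs_mom.

    If the mom says it does not know a job the server believes is running,
    treat the job as lost: requeue it when it is rerunnable, otherwise mark
    it complete. Any other reply leaves the job untouched.
    """
    if reply_code != UNKNOWN_JOB_ID or job.state != "R":
        return job
    job.state = "Q" if job.rerunnable else "C"
    return job


if __name__ == "__main__":
    lost = Job("123.master", "R", rerunnable=True)
    print(handle_status_reply(lost, UNKNOWN_JOB_ID).state)
```

With something like this, a rerunnable job whose mom lost its state
would go back to Q instead of staying in R indefinitely, and a
non-rerunnable one would end up in C so the user can see that it is gone.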
-Joshua Bernstein
Software Engineer
Penguin Computing