[torqueusers] Unknown Job Id Behavior

Joshua Bernstein jbernstein at penguincomputing.com
Tue Jun 10 18:19:30 MDT 2008



Glen Beane wrote:
> 
> 
> On Tue, Jun 10, 2008 at 2:21 PM, Joshua Bernstein
> <jbernstein at penguincomputing.com> wrote:
> 
> 
> 
>     Joshua Bernstein wrote:
> 
> 
> 
>         Chris Samuel wrote:
> 
>             ----- "Joshua Bernstein" <jbernstein at penguincomputing.com> wrote:
> 
>                     But I've noticed in 2.3 that we seem to be hitting the
>                     same problem described by the OP.  :-(
> 
>                 Interesting. Are you running TORQUE in a diskless
>                 configuration like
>                 I'm doing?
> 
> 
>             Nope, ours have 4 x 300GB drives and keep state.
> 
>             Does that help or hinder?
> 
> 
>         Doesn't help.
> 
>         I still think there is a problem with some area of the
>         communication between pbs_mom and pbs_server.
> 
>         If pbs_mom responds to pbs_server with a message saying that it
>         doesn't know anything about the job, shouldn't pbs_server just
>         consider the job dead, and either re-queue it or just notify the
>         user?
> 
> 
>     I'm _STILL_ having problems with this. I've tried running version
>     2.3.0, and had the same problem. pbs_mom seems to try to respond to
>     pbs_server's request, but nothing changes. pbs_mom reports:
> 
>     pbs_mom;Req;;Type StatusJob request received from PBS_Server at master,
>     sock=10
>     ...
>     pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id),
>     aux=0, type=StatusJob, from PBS_Server at master
> 
>     The interesting thing, again, is that I'm running TORQUE's pbs_mom in
>     a diskless configuration, so when a node reboots,
>     /var/spool/torque/mom_priv/jobs is empty and no longer holds job
>     information. If I NFS-mount that directory to make it persistent,
>     though, things seem to work.
> 
>     That said, unless I'm misunderstanding something, I'm convinced
>     that there is a bug here. When pbs_mom sends a 15001 error back to
>     pbs_server, pbs_server should assume the job is dead and either
>     requeue it or simply declare the job dead.
> 
> 
> I duplicated your diskless scenario by starting a job, shutting down
> pbs_mom on that node, deleting the job files from the mom_priv/jobs
> directory, then restarting pbs_mom. pbs_mom replied with the unknown
> job id error, and pbs_server basically ignored the error and kept the
> job state as R.
> 
> So I think you are right, this is a bug. Perhaps if a job is
> "rerunnable" then we requeue, otherwise say it is complete? I may take
> a quick crack at just removing the job in this case, and then go from there.

Excellent! I would really appreciate you taking a crack at this. If 
there is anything I can do to help, or if you can give me a starting 
point, I can dig through the code myself and have a look as well.
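
To make sure we mean the same thing, here is a rough Python sketch of the
decision I would expect pbs_server to make when pbs_mom rejects a StatusJob
request with code 15001. This is not TORQUE code: the names Action,
parse_reject and action_for_reject are made up for illustration, and the
only details taken from this thread are the reject log line, the 15001
code, the R job state, and your "rerunnable" suggestion.

```python
import re
from enum import Enum

# Reply code seen in the pbs_mom log quoted in this thread.
UNKNOWN_JOB_ID = 15001


class Action(Enum):
    """Hypothetical outcomes for the server-side decision."""
    KEEP = "keep"          # leave the job state untouched
    REQUEUE = "requeue"    # put a rerunnable job back in the queue
    COMPLETE = "complete"  # mark a non-rerunnable job as finished


# Matches the "Reject reply" part of the pbs_mom log line (unwrapped).
REJECT_RE = re.compile(
    r"Reject reply code=(\d+)\([^)]*\),\s*aux=\d+,\s*type=(\w+)"
)


def parse_reject(line):
    """Return (code, request_type) from a reject log line, or None."""
    match = REJECT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def action_for_reject(code, request_type, job_state, rerunnable):
    """Decide what to do with a job after the mom rejects a request.

    Only a running job (state R) whose StatusJob request came back as
    "Unknown Job Id" is treated as dead; anything else is left alone.
    """
    if code != UNKNOWN_JOB_ID or request_type != "StatusJob":
        return Action.KEEP
    if job_state != "R":
        return Action.KEEP
    return Action.REQUEUE if rerunnable else Action.COMPLETE


if __name__ == "__main__":
    log_line = ("pbs_mom;Req;req_reject;Reject reply code=15001"
                "(Unknown Job Id), aux=0, type=StatusJob, "
                "from PBS_Server at master")
    code, request_type = parse_reject(log_line)
    print(action_for_reject(code, request_type, "R", rerunnable=True))
```

Whether a non-rerunnable job should be completed or removed outright is
still the open question; the sketch just picks "complete" as you suggested.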

-Joshua Bernstein
Software Engineer
Penguin Computing

