[torqueusers] Unknown Job Id Behavior
jbernstein at penguincomputing.com
Tue Jun 10 12:21:13 MDT 2008
Joshua Bernstein wrote:
> Chris Samuel wrote:
>> ----- "Joshua Bernstein" <jbernstein at penguincomputing.com> wrote:
>>>> But I've noticed in 2.3 that we seem to be hitting the
>>>> same problem described by the OP. :-(
>>> Interesting. Are you running TORQUE in a diskless configuration like
>>> I'm doing?
>> Nope, ours have 4 x 300GB drives and keep state.
>> Does that help or hinder ?
> Doesn't help.
> I still think there is a problem with some area of the communication
> between pbs_mom and pbs_server.
> If pbs_mom responds to pbs_server with a message saying that it doesn't
> know anything about the job, shouldn't pbs_server just consider the job
> dead, and either re-queue it or just notify the user?
I'm _STILL_ having problems with this. I've tried running version 2.3.0
and had the same problem. pbs_mom seems to try to respond to
pbs_server's request, but nothing changes. The pbs_mom log shows:
pbs_mom;Req;;Type StatusJob request received from PBS_Server at master, sock=10
pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0,
type=StatusJob, from PBS_Server at master
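To check how often this happens on a node, the reject entries can be pulled out of the pbs_mom log with a short script. This is only a sketch: the regular expression is written against the two log lines as they appear in this message (including the mail-wrapped line break), the function name find_rejects is made up for illustration, and real log files carry extra leading fields such as a timestamp, which the search simply skips over.

```python
import re

# Matches the reject line as quoted in this message. \s* also spans the
# line break that the mail client introduced after "aux=0,".
REJECT_RE = re.compile(
    r"Reject reply code=(?P<code>\d+)\((?P<text>[^)]*)\),\s*"
    r"aux=(?P<aux>-?\d+),\s*"
    r"type=(?P<type>\w+),\s*"
    r"from (?P<peer>[^\n]+)"
)


def find_rejects(log_text, code=15001):
    """Return one dict per reject entry whose reply code equals `code`."""
    hits = []
    for m in REJECT_RE.finditer(log_text):
        if int(m.group("code")) == code:
            hits.append({
                "code": int(m.group("code")),
                "text": m.group("text"),
                "type": m.group("type"),
                "peer": m.group("peer").strip(),
            })
    return hits


if __name__ == "__main__":
    sample = (
        "pbs_mom;Req;;Type StatusJob request received from PBS_Server at master, sock=10\n"
        "pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0,\n"
        "type=StatusJob, from PBS_Server at master\n"
    )
    for hit in find_rejects(sample):
        print(hit)
```

Counting the hits per node after a reboot would show whether every StatusJob request for the lost job is being rejected the same way.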
The interesting thing, again, is that I'm running TORQUE's pbs_mom in a
diskless configuration, so when a node reboots,
/var/spool/torque/mom_priv/jobs is empty and no longer holds job
information. If I NFS-mount that directory to make it persistent, though,
things seem to work.
All that said, unless I'm misunderstanding something, I'm convinced
that there is a bug here. When pbs_mom sends a 15001 error back to
pbs_server, pbs_server should assume the job is dead and either requeue
it or simply declare it dead.
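To make the behavior I'm asking for concrete, here is a rough sketch of the decision in Python. It is not TORQUE code and not how pbs_server is implemented. The names Job, Action and handle_status_reply are hypothetical, and so is the idea of choosing between the two outcomes based on a per-job "rerunnable" setting. The only value taken from the log above is the 15001 reply code.

```python
from dataclasses import dataclass
from enum import Enum

# Reply code from the pbs_mom log above (Unknown Job Id).
UNKNOWN_JOB_ID = 15001


class Action(Enum):
    KEEP = "keep"            # leave the job record untouched
    REQUEUE = "requeue"      # put the job back in the queue
    MARK_DEAD = "mark_dead"  # end the job and tell the user


@dataclass
class Job:
    job_id: str
    state: str        # "R" is used here to mean "server believes it is running"
    rerunnable: bool  # hypothetical per-job setting, an assumption of this sketch


def handle_status_reply(job, reply_code):
    """Decide what the server side should do with a StatusJob reply."""
    # Any other reply, or a job the server does not think is running,
    # needs no special handling in this sketch.
    if reply_code != UNKNOWN_JOB_ID or job.state != "R":
        return Action.KEEP
    # The mom has no record of a job the server thinks is running there,
    # so the job cannot still be alive on that node.
    return Action.REQUEUE if job.rerunnable else Action.MARK_DEAD


if __name__ == "__main__":
    job = Job(job_id="example-job", state="R", rerunnable=True)
    print(handle_status_reply(job, UNKNOWN_JOB_ID))
```

The point is only that an "Unknown Job Id" answer for a job the server believes is running is enough information to act on, instead of leaving the job in place indefinitely.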
Chris, you said you had this working correctly with a snapshot. Which
snapshot are you running? The 2.4.0 snapshot from June 3 won't build due to a
Garrick, I would appreciate any more comments you might have on this.
What am I missing?