[torqueusers] Unknown Job Id Behavior

Joshua Bernstein jbernstein at penguincomputing.com
Tue Jun 10 12:21:13 MDT 2008



Joshua Bernstein wrote:
> 
> 
> Chris Samuel wrote:
>> ----- "Joshua Bernstein" <jbernstein at penguincomputing.com> wrote:
>>
>>>> But I've noticed in 2.3 that we seem to be hitting the
>>>> same problem described by the OP.  :-( 
>>> Interesting. Are you running TORQUE in a diskless configuration like
>>> I'm doing?
>>
>> Nope, ours have 4 x 300GB drives and keep state.
>>
>> Does that help or hinder ?
> 
> Doesn't help.
> 
> I still think there is a problem with some area of the communication 
> between pbs_mom and pbs_server.
> 
> If pbs_mom responds to pbs_server with a message saying that it doesn't 
> know anything about the job, shouldn't pbs_server just consider the job 
> dead, and either re-queue it or just notify the user?

I'm _STILL_ having problems with this. I tried version 2.3.0 and hit
the same problem. pbs_mom seems to respond to pbs_server's request,
but nothing changes. pbs_mom reports:

pbs_mom;Req;;Type StatusJob request received from PBS_Server at master, sock=10
...
pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, 
type=StatusJob, from PBS_Server at master
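
For anyone who wants to check their own mom logs for the same rejection, here is a minimal Python sketch of a scan. It assumes the record layout of the two lines quoted above and that each record sits on a single line in the real log (the wrap above comes from the mail client). The name find_rejects is made up for illustration and is not part of TORQUE.

```python
import re

# Pattern derived only from the reject line quoted above; other pbs_mom
# versions may format this record differently (assumption).
REJECT_RE = re.compile(
    r"Reject reply code=(?P<code>\d+)\((?P<text>[^)]*)\),\s*"
    r"aux=(?P<aux>-?\d+),\s*type=(?P<type>\w+)"
)

def find_rejects(lines):
    """Yield (code, text, request_type) for each reject record found."""
    for line in lines:
        m = REJECT_RE.search(line)
        if m:
            yield int(m.group("code")), m.group("text"), m.group("type")

# The two log lines from this message, with the wrapped one rejoined.
sample = [
    "pbs_mom;Req;;Type StatusJob request received from PBS_Server at master, sock=10",
    "pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, "
    "type=StatusJob, from PBS_Server at master",
]
rejects = list(find_rejects(sample))
print(rejects)
```

Passing an open log file instead of the sample list works the same way, since the function only iterates over lines.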

The interesting thing, again, is that I'm running TORQUE's pbs_mom in a
diskless configuration, so when a node reboots,
/var/spool/torque/mom_priv/jobs is empty and no longer holds job
information. If I NFS-mount that directory to make it persistent,
though, things seem to work.

That said, unless I'm misunderstanding something, I'm convinced
that there is a bug here. When pbs_mom sends a 15001 error back to
pbs_server, pbs_server should assume the job is gone and either requeue
it or simply declare it dead.
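
A minimal sketch of the behavior proposed above, written in Python rather than TORQUE's C and not taken from the TORQUE source. The Job class, its state strings, the rerunnable flag, and handle_status_reply are all hypothetical names used only to illustrate the decision. The only value taken from this thread is the 15001 reply code.

```python
from dataclasses import dataclass

UNKNOWN_JOB_ID = 15001  # reply code seen in the pbs_mom log above

@dataclass
class Job:
    job_id: str
    state: str = "R"         # hypothetical states: "R" running, "Q" queued, "dead"
    rerunnable: bool = True  # hypothetical stand-in for the job's rerun setting

def handle_status_reply(job, reply_code):
    """Proposed server-side reaction to a mom's StatusJob reply (illustrative only)."""
    if reply_code != UNKNOWN_JOB_ID:
        return job.state  # other reply codes are out of scope for this sketch
    if job.state == "R":
        # The mom has no record of a job the server believes is running:
        # requeue it if allowed, otherwise give up on it.
        job.state = "Q" if job.rerunnable else "dead"
    return job.state

print(handle_status_reply(Job("101.master"), UNKNOWN_JOB_ID))
print(handle_status_reply(Job("102.master", rerunnable=False), UNKNOWN_JOB_ID))
```

The point is only that an "Unknown Job Id" reply for a running job gives the server enough information to stop waiting on it. How the server would notify the user is left out.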

Chris, you said you had this working correctly with a snapshot. Which
snapshot are you running? The 2.4.0 snapshot from June 3 won't build
due to a bad cast.

Garrick, I would appreciate any more comments you might have on this. 
What am I missing?

-Joshua Bernstein
Software Engineer
Penguin Computing

