[torquedev] Should a communication error between pbs_mom's kill a job ?
David.Singleton at anu.edu.au
Fri May 22 15:22:00 MDT 2009
Ken Nielson wrote:
>> I could be wrong but I believe the logic to kill jobs on network problems
>> was there because the MOM task management code was not set up for recovery
>> after MOMs lost connection. There are "corner case" race conditions etc.
>> that are difficult to cover so the PBS logic was simply to avoid getting
>> confused. I know Torque has code to reconnect sisters but I don't know if
>> that handles all these task management issues. Maybe it does. Even if it
>> doesn't, it's possible sites may not see any problems from reconnecting.
>> Has anyone tried to stress test this? Lots of restarting sisters (with
>> long and short outages) in the presence of a variable tm load (lots of
>> pbsdsh's etc of varying lengths) with this mod? Throw job suspend/resume
>> and any other running job operation into the mix. The issue is not whether
>> MOMs crash but whether they get confused about job states.
> The PBS ERS has the following to say about communications failure between client and server.
> "Server Side Recovery - Failure of Client
> The following recovery procedures should be followed by a server receiving a job upon
> loss of communications with the client.
> 1. If the failure occurs before the Ready to Commit is received, the server discards the
> job. The client must restart from the beginning.
> 2. If the failure occurs after the Ready to Commit is received, the server should have
> recorded the job in permanent storage. The server keeps the job until (a) a request
> is received to delete it, or (b) the client resends the Ready to Commit, Commit
> sequence. If after a ‘‘site defined’’ period of time, the server has not received any
> directions, it may notify the batch administrator and request instruction."
> (PBS External Reference Specification sec. 11 pg. 9)
> This comes from section 22.214.171.124 Error Recovery of the PBS ERS doc. A communications failure should not kill a job on the sister. The sister should continue processing the job and later when communications are re-established the job information is updated on the server.
Note that 11.7.1 is about queuing jobs only and the error recovery
section is about making sure a job does not get dropped on the
floor during a queuing process.
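
To make that scope concrete, here is a rough sketch of the two recovery
rules quoted above, written as a decision made by the server that is
receiving a job. It is not PBS or Torque code. The names QueueState and
on_client_lost are made up for illustration, and the COMMITTED case is
my own assumption, since the quoted text only covers the two earlier
stages.

```python
from enum import Enum, auto


class QueueState(Enum):
    """Stages of receiving one job from a client (illustrative names)."""
    RECEIVING = auto()        # job data still arriving, no Ready to Commit yet
    READY_TO_COMMIT = auto()  # Ready to Commit seen, job is in permanent storage
    COMMITTED = auto()        # Commit seen, queuing is complete (assumed stage)


def on_client_lost(state: QueueState) -> str:
    """Action for the receiving server when the client connection drops."""
    if state is QueueState.RECEIVING:
        # Rule 1: discard the job; the client must restart from the beginning.
        return "discard"
    if state is QueueState.READY_TO_COMMIT:
        # Rule 2: keep the job until a delete request arrives or the client
        # resends the Ready to Commit, Commit sequence.
        return "keep"
    # Assumption: once committed, the queuing transaction is finished and
    # the quoted recovery rules no longer apply.
    return "no_action"
```

The only input to this decision is how far the queuing handshake got.
Nothing in it refers to a running job or to traffic between MOMs, so it
does not settle what a sister should do when it loses contact.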