[torquedev] Should a communication error between pbs_mom's kill a job ?

David Singleton David.Singleton at anu.edu.au
Fri May 22 15:22:00 MDT 2009

Ken Nielson wrote:
>> I could be wrong but I believe the logic to kill jobs on network problems
>> was there because the MOM task management code was not set up for recovery
>> after MOMs lost connection.  There are "corner case" race conditions etc.
>> that are difficult to cover so the PBS logic was simply to avoid getting
>> confused. I know Torque has code to reconnect sisters but I don't know if
>> that handles all these task management issues. Maybe it does. Even if it
>> doesn't, it's possible sites may not see any problems from reconnecting
>> sisters.
>> Has anyone tried to stress test this?  Lots of restarting sisters (with
>> long and short outages) in the presence of a variable tm load (lots of
>> pbsdsh's etc of varying lengths) with this mod?  Throw job suspend/resume
>> and any other running job operation into the mix.  The issue is not whether
>> MOMs crash but whether they get confused about job states.
> The PBS ERS has the following to say about communications failure between client and server.
> "Server Side Recovery - Failure of Client
>      The following recovery procedures should be followed by a server receiving a job upon
>      loss of communications with the client.
>      1.   If the failure occurs before the Ready to Commit is received, the server discards the
>           job. The client must restart from the beginning.
>      2.   If the failure occurs after the Ready to Commit is received, the server should have
>           recorded the job in permanent storage. The server keeps the job until (a) a request
>           is received to delete it, or (b) the client resends the Read to Commit, Commit
>           sequence. If after a ‘‘site defined’’ period of time, the server has not received any
>           directions, it may notify the batch administrator and request instruction."
> (PBS External Reference Specification sec. 11 pg. 9)
> This comes from the Error Recovery section of the PBS ERS doc. A communications failure should not kill a job on the sister. The sister should continue processing the job and, later, when communications are re-established, the job information is updated on the server.

Note that 11.7.1 is about queuing jobs only and the error recovery
section is about making sure a job does not get dropped on the
floor during a queuing process.
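
To make the scope of the quoted rules concrete, here is a minimal sketch in
Python of the decision they describe. Every name in it (QueueState,
IncomingJob, on_client_lost, site_timeout) is hypothetical and made up for
illustration; none of this is TORQUE or PBS code, and the COMMITTED state is
an assumption added only to round out the example.

```python
from dataclasses import dataclass
from enum import Enum, auto


class QueueState(Enum):
    """Hypothetical stages of a job being queued on a receiving server."""
    RECEIVING = auto()        # job data arriving, no Ready to Commit yet
    READY_TO_COMMIT = auto()  # Ready to Commit received, job saved to disk
    COMMITTED = auto()        # Commit received (assumed state, not in the quote)


@dataclass
class IncomingJob:
    job_id: str
    state: QueueState = QueueState.RECEIVING
    idle_seconds: float = 0.0  # time since contact with the client was lost


def on_client_lost(job: IncomingJob, site_timeout: float) -> str:
    """Return the action a receiving server takes when the client goes away."""
    if job.state is QueueState.RECEIVING:
        # Rule 1: failure before Ready to Commit, so the job is discarded
        # and the client must restart from the beginning.
        return "discard"
    if job.state is QueueState.READY_TO_COMMIT:
        # Rule 2: the job is in permanent storage. Keep it until a delete
        # request or a resent commit sequence arrives; after the
        # site-defined period, the administrator may be notified.
        if job.idle_seconds > site_timeout:
            return "notify_admin"
        return "keep"
    # Queuing already finished, so there is nothing to recover here.
    return "keep"
```

Note that the sketch only covers a job in the process of being queued. It
has nothing to say about a running job whose MOMs lose contact with each
other, which is the case this thread is about.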


