[torquedev] Should a communication error between pbs_mom's kill a job ?

Ken Nielson knielson at clusterresources.com
Fri May 22 10:25:38 MDT 2009

>I could be wrong, but I believe the logic to kill jobs on network problems
>was there because the MOM task management code was not set up for recovery
>after MOMs lost connection. There are "corner case" race conditions etc.
>that are difficult to cover, so the PBS logic was simply to avoid getting
>confused. I know Torque has code to reconnect sisters, but I don't know if
>that handles all these task management issues. Maybe it does. Even if it
>doesn't, it's possible sites may not see any problems from reconnecting.
>Has anyone tried to stress test this? Lots of restarting sisters (with
>long and short outages) in the presence of a variable tm load (lots of
>pbsdsh's etc. of varying lengths) with this mod? Throw job suspend/resume
>and any other running job operation into the mix. The issue is not whether
>MOMs crash but whether they get confused about job states.

The PBS ERS has the following to say about communications failure between client and server.

"Server Side Recovery - Failure of Client
     The following recovery procedures should be followed by a server receiving a job upon
     loss of communications with the client.
     1.   If the failure occurs before the Ready to Commit is received, the server discards the
          job. The client must restart from the beginning.
     2.   If the failure occurs after the Ready to Commit is received, the server should have
          recorded the job in permanent storage. The server keeps the job until (a) a request
          is received to delete it, or (b) the client resends the Ready to Commit, Commit
          sequence. If after a 'site defined' period of time, the server has not received any
          directions, it may notify the batch administrator and request instruction."
(PBS External Reference Specification sec. 11 pg. 9)
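
To make the two rules above concrete, here is a minimal Python sketch of the decision they describe. It is only an illustration of the quoted text, not Torque or PBS code: the names Phase, Action, on_client_failure, and resolve_kept_job, as well as the event strings, are hypothetical and chosen for this example.

```python
from enum import Enum, auto


class Phase(Enum):
    """Where the job transfer was when the client connection failed."""
    BEFORE_READY_TO_COMMIT = auto()
    AFTER_READY_TO_COMMIT = auto()


class Action(Enum):
    """What the receiving server does next (hypothetical names)."""
    DISCARD_JOB = auto()
    KEEP_JOB = auto()
    DELETE_JOB = auto()
    COMMIT_JOB = auto()
    NOTIFY_ADMIN = auto()


def on_client_failure(phase):
    """Rule 1 and the first half of rule 2 from the quoted ERS text."""
    if phase is Phase.BEFORE_READY_TO_COMMIT:
        # Rule 1: nothing durable yet, so the client must start over.
        return Action.DISCARD_JOB
    # Rule 2: the job is already in permanent storage, so keep it.
    return Action.KEEP_JOB


def resolve_kept_job(event, waited_s, site_timeout_s):
    """Second half of rule 2: decide what happens to a kept job.

    event is None while nothing has arrived from the client.
    """
    if event == "delete_request":
        return Action.DELETE_JOB          # case (a)
    if event == "commit_sequence_resent":
        return Action.COMMIT_JOB          # case (b)
    if waited_s >= site_timeout_s:
        # The ERS says the server "may" notify the administrator;
        # this sketch assumes that it always does.
        return Action.NOTIFY_ADMIN
    return Action.KEEP_JOB


if __name__ == "__main__":
    print(on_client_failure(Phase.BEFORE_READY_TO_COMMIT))
    print(on_client_failure(Phase.AFTER_READY_TO_COMMIT))
    print(resolve_kept_job(None, waited_s=10, site_timeout_s=3600))
```

Note that none of the branches for a job that has passed Ready to Commit ends with the server killing the job on its own because of the lost connection.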

This comes from the Error Recovery section of the PBS ERS. A communications failure should not kill a job on the sister. The sister should continue processing the job, and once communications are re-established, the job information is updated on the server.
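
As a rough sketch of that behavior (again hypothetical, not how pbs_mom is actually implemented), a sister could leave the job running when the connection drops, hold its status updates locally, and deliver them after reconnecting. The Sister class and its method names are assumptions made for this example, and the delivered list stands in for whatever the other side has received.

```python
class Sister:
    """Toy model of a sister that survives a communications outage."""

    def __init__(self):
        self.connected = True
        self.job_running = True
        self.pending = []     # updates produced while disconnected
        self.delivered = []   # stand-in for updates the peer has received

    def on_connection_lost(self):
        # Only the link state changes; the job is deliberately left running.
        self.connected = False

    def report(self, update):
        """Send an update now if possible, otherwise hold it for later."""
        if self.connected:
            self.delivered.append(update)
        else:
            self.pending.append(update)

    def on_reconnect(self):
        # Deliver held updates in the order they were produced.
        self.connected = True
        self.delivered.extend(self.pending)
        self.pending.clear()


if __name__ == "__main__":
    sister = Sister()
    sister.report("task 1 started")
    sister.on_connection_lost()
    sister.report("task 1 exited")
    sister.on_reconnect()
    print(sister.job_running, sister.delivered)
```

This ignores the race conditions raised in the quoted message (for example, suspend or resume requests issued during the outage); it only shows the basic idea of continuing the job and catching up afterwards.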
