[torquedev] Should a communication error between pbs_mom's kill a job ?

Glen Beane glen.beane at gmail.com
Mon May 18 13:02:10 MDT 2009

On Mon, May 18, 2009 at 2:57 PM, Joshua Bernstein
<jbernstein at penguincomputing.com> wrote:
> Glen Beane wrote:
>> By the way, I was already working on a job attribute called
>> "fault_tolerant" that prevents TORQUE from killing a job if a sister
>> node goes down. I've just about wrapped this up. A system admin
>> could set the default value of this to true (I was going to make this
>> a torque.cfg option).
> Interesting. I like that idea. In what cases do you think this is useful?

Some kinds of distributed programs can survive the loss of one or more
processes. There have also been fault-tolerant MPI implementations
(like FT-MPI) that allow a program to survive the loss of one or more
ranks if it is designed to. The OpenMPI folks have been talking about
adding some of these fault-tolerant features.

