[torquedev] Should a communication error between pbs_moms kill a job?

Joshua Bernstein jbernstein at penguincomputing.com
Mon May 18 13:53:37 MDT 2009



Glen Beane wrote:
> On Mon, May 18, 2009 at 2:57 PM, Joshua Bernstein
> <jbernstein at penguincomputing.com> wrote:
>>
>> Glen Beane wrote:
>>> By the way, I was already working on a job attribute called
>>> "fault_tolerant" that prevents TORQUE from killing a job if a sister
>>> node goes down. I've just about wrapped this up. A system admin
>>> could set the default value of this to true (I was going to make this
>>> a torque.cfg option).
>> Interesting. I like that idea. In what cases do you think this is useful?
> 
> Some kinds of distributed programs can survive the loss of one or more
> processes. There have also been fault-tolerant MPI implementations
> (like FT-MPI) that would allow a program to survive the loss of one
> or more ranks if it were designed to. The OpenMPI folks have been
> talking about adding some of these fault-tolerant features.

Exactly, I was thinking along the same lines. HP-MPI can do
interconnect failover, which is helpful in some circumstances.

-Josh

