[torquedev] Should a communication error between pbs_mom's kill a job ?
jbernstein at penguincomputing.com
Mon May 18 13:53:37 MDT 2009
Glen Beane wrote:
> On Mon, May 18, 2009 at 2:57 PM, Joshua Bernstein
> <jbernstein at penguincomputing.com> wrote:
>> Glen Beane wrote:
>>> by the way, I was already working on a job attribute called
>>> "fault_tolerant" that prevents TORQUE from killing a job if a sister
>>> node goes down. I've just about wrapped this up. A system admin
>>> could set the default value of this to true (I was going to make this
>>> a torque.cfg option)
>> Interesting. I like that idea. In what cases do you think this is useful?
> Some kinds of distributed programs can survive the loss of one or more
> processes. There have also been fault-tolerant MPI implementations
> (like FT-MPI) that would allow a program to survive the loss of one
> or more rank(s) if it were designed to. The OpenMPI folks have been
> talking about adding in some of the fault-tolerant features.
Exactly. I was thinking along the same lines. HP-MPI can do
interconnect failover, which is helpful in some circumstances.