[torquedev] Should a communication error between pbs_mom's kill a job ?

Joshua Bernstein jbernstein at penguincomputing.com
Mon May 18 12:57:13 MDT 2009



Glen Beane wrote:
> by the way,  I was already working on a job attribute called
> "fault_tolerant" that prevents TORQUE from killing a job if a sister
> node goes down.  I've just about wrapped this up.  A system admin
> could set the default value of this to true (I was going to make this
> a torque.cfg option)

Interesting. I like that idea. In what cases do you think this is useful?

-Joshua Bernstein
Software Engineer
Penguin Computing

