[torquedev] Should a communication error between pbs_mom's kill a job ?

Glen Beane glen.beane at gmail.com
Fri May 22 21:48:17 MDT 2009


I'm pretty close to checking in the changes I mentioned below.  I
started with a community patch that added a mom config file option to
control whether pjob->ji_nodekill is set after a POLL request to a
sister fails in mom_comm.c.  If it is not set, job_over_limit does not
kill the job with the "node X requested the job terminate" error.
Someone requested that this be controllable on a per-job basis, so
instead of a mom config file option it is now controlled via a job
attribute (currently called fault_tolerant).  The user who asked for
this wanted certain jobs to be able to survive the complete loss of a
sister.
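
A rough sketch of the logic described above, in C.  Only ji_nodekill,
fault_tolerant and job_over_limit are names from this message; the
struct, the NO_NODE sentinel and both function names are hypothetical
stand-ins, and the real code in mom_comm.c will differ.

```c
#include <stdbool.h>

/* Assumption: a sentinel meaning "no node has asked for the job to die". */
#define NO_NODE (-1)

/* Hypothetical, stripped-down stand-in for the real job structure. */
struct job_sketch {
    int  ji_nodekill;     /* node that requested termination, or NO_NODE */
    bool fault_tolerant;  /* per-job attribute (name may still change) */
};

/* Called when a POLL request to sister node_id fails.
 * A fault-tolerant job is left alone; any other job is marked for kill. */
void on_poll_failure(struct job_sketch *pjob, int node_id)
{
    if (!pjob->fault_tolerant)
        pjob->ji_nodekill = node_id;
}

/* Stands in for the check job_over_limit is described as doing:
 * the job is killed only if some node has been recorded in ji_nodekill. */
bool should_kill_for_node(const struct job_sketch *pjob)
{
    return pjob->ji_nodekill != NO_NODE;
}
```

With fault_tolerant false, a failed POLL marks the job and the later
check kills it; with fault_tolerant true, ji_nodekill is never touched
and the job keeps running.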

This attribute defaults to false, so the default behavior is the
opposite of what you all want.  I was going to add a torque.cfg option
to make the default true instead.  torque.cfg is only used by qsub, so
if this option is set in torque.cfg and fault_tolerant is not specified
to qsub, then qsub would set it to true.
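
The defaulting rule for qsub could look roughly like this.  The enum,
the function name and the cfg_default_true flag are all hypothetical;
no torque.cfg option name has been chosen yet, and this is not code
from qsub.

```c
#include <stdbool.h>

/* Hypothetical tri-state: did the user say anything about
 * fault_tolerant on the qsub command line? */
enum ft_request { FT_UNSET, FT_FALSE, FT_TRUE };

/* Decide the value qsub would submit.
 * An explicit user choice always wins; otherwise fall back to the
 * site default from torque.cfg (false when the option is absent). */
bool resolve_fault_tolerant(enum ft_request from_user, bool cfg_default_true)
{
    if (from_user != FT_UNSET)
        return from_user == FT_TRUE;
    return cfg_default_true;
}
```

The point of the tri-state is that "not specified" has to be kept
distinct from "explicitly false", otherwise a site default of true
would override users who ask for the old behavior.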


However, based on this conversation, I think the best thing to do
would be to get rid of this new attribute and change the mom code so
that the mother superior never sets pjob->ji_nodekill when it gets an
error from a POLL request...




On Mon, May 18, 2009 at 9:07 AM, Glen Beane <glen.beane at gmail.com> wrote:
> by the way,  I was already working on a job attribute called
> "fault_tolerant" that prevents TORQUE from killing a job if a sister
> node goes down.  I've just about wrapped this up.  A system admin
> could set the default value of this to true (I was going to make this
> a torque.cfg option)
>
> Of course removing this check might make my work thus far a waste of time.

