[torquedev] Should a communication error between pbs_moms kill a job?

Glen Beane glen.beane at gmail.com
Mon May 18 07:07:48 MDT 2009


By the way, I was already working on a job attribute called
"fault_tolerant" that prevents TORQUE from killing a job if a sister
node goes down.  I've just about wrapped this up.  A system admin
could set the default value of this to true (I was going to make this
a torque.cfg option).

Of course, removing this check might make my work thus far a waste of time.



On Mon, May 18, 2009 at 8:54 AM, Michael Barnes <barnes at jlab.org> wrote:
> On Sat, May 16, 2009 at 11:12:08PM -0400, Glen Beane wrote:
>> On Wed, May 6, 2009 at 7:17 PM, Chris Samuel <csamuel at vpac.org> wrote:
>> >
>> > ----- "Bas van der Vlies" <basv at sara.nl> wrote:
>> >
>> >> Chris Samuel wrote:
>> >>
>> >> Could this be an option in the mom config to turn this on or off?
>> >
>> > I don't mind that, though I'm still struggling to
>> > think of an instance when this check is useful!  :-)
>>
>> so what is the consensus?  Remove the behavior, or create a mom config
>> option to control it?  I don't mind doing the work to create the
>> config option.
>
> I can't think of a reason why a network error should terminate a job.
>
> The pbs_mom already has:
>
> $node_check_script
>
> and
>
> $down_on_error
>
> Which can test anything that a system administrator wants to test
> regarding the fitness of a machine.
>
> I would say take that check out of the mom completely.  It's unnecessary.
>
> -mb
>
> --
> +-----------------------------------------------
> | Michael Barnes
> |
> | Thomas Jefferson National Accelerator Facility
> | 12000 Jefferson Ave.
> | Newport News, VA 23606
> | (757) 269-7634
> +-----------------------------------------------
>

