[torquedev] Should a communication error between pbs_mom's kill a job ?

Michael Barnes barnes at jlab.org
Mon May 18 06:54:08 MDT 2009


On Sat, May 16, 2009 at 11:12:08PM -0400, Glen Beane wrote:
> On Wed, May 6, 2009 at 7:17 PM, Chris Samuel <csamuel at vpac.org> wrote:
> >
> > ----- "Bas van der Vlies" <basv at sara.nl> wrote:
> >
> >> Chris Samuel wrote:
> >>
> >> Could this be an option in the mom config to turn this on or off?
> >
> > I don't mind that, though I'm still struggling to
> > think of an instance when this check is useful!  :-)
> 
> so what is the consensus?  Remove the behavior, or create a mom config
> option to control it?  I don't mind doing the work to create the
> config option.

I can't think of a reason why a network error should terminate a job.

The pbs_mom already has:

$node_check_script

and

$down_on_error

which together let a system administrator test anything they want
regarding the fitness of a machine.
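
As a rough sketch of what I mean (not tested against any particular
TORQUE release): as I understand the docs, the health check script
prints a line beginning with ERROR when something is wrong, and with
$down_on_error set the mom then reports the node as down. The script
path, the SCRATCH_DIR variable, and the check_dir helper below are
made up for illustration.

```shell
#!/bin/sh
# Hypothetical health check for pbs_mom's $node_check_script.
# Assumed mom config (the path is illustrative only):
#   $node_check_script /usr/local/sbin/node_health.sh
#   $down_on_error 1

# Print a line starting with "ERROR" if the given directory is missing
# or not writable; print nothing when the directory looks healthy.
check_dir() {
    dir=$1
    if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
        echo "ERROR $dir is missing or not writable"
    fi
}

# SCRATCH_DIR is a hypothetical site-specific variable; default to /tmp.
check_dir "${SCRATCH_DIR:-/tmp}"
```

Any site-specific test (network reachability included) could be added
the same way, which is why the mom does not need its own built-in check.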

I would say take that check out of the mom completely.  It's unnecessary.

-mb

-- 
+-----------------------------------------------
| Michael Barnes
|
| Thomas Jefferson National Accelerator Facility
| 12000 Jefferson Ave.
| Newport News, VA 23606
| (757) 269-7634
+-----------------------------------------------

