[torquedev] Should a communication error between pbs_moms kill a job?
Michael Barnes
barnes at jlab.org
Mon May 18 06:54:08 MDT 2009
On Sat, May 16, 2009 at 11:12:08PM -0400, Glen Beane wrote:
> On Wed, May 6, 2009 at 7:17 PM, Chris Samuel <csamuel at vpac.org> wrote:
> >
> > ----- "Bas van der Vlies" <basv at sara.nl> wrote:
> >
> >> Chris Samuel wrote:
> >>
> >> Could this be an option in the mom config to turn this on or off?
> >
> > I don't mind that, though I'm still struggling to
> > think of an instance when this check is useful! :-)
>
> so what is the consensus? Remove the behavior, or create a mom config
> option to control it? I don't mind doing the work to create the
> config option.
I can't think of a reason why a network error should terminate a job.
The pbs_mom already has:
$node_check_script
and
$down_on_error
which together can test anything that a system administrator wants to
test regarding the fitness of a machine.
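
For illustration, here is a minimal sketch of the kind of script that
$node_check_script could point at. It is an assumption-laden example, not
anything shipped with TORQUE: the scratch path, the free-space threshold,
and the function name health_message are all hypothetical. It relies on
the health-check convention, as I understand it from the TORQUE
documentation, that a failure is signalled by output beginning with
"ERROR", and that setting $down_on_error to 1 then makes the mom report
itself as down.

```python
#!/usr/bin/env python3
"""Hypothetical node health check for use with $node_check_script."""
import shutil

# Assumed values for illustration only; tune them for the real site.
SCRATCH_PATH = "/tmp"
MIN_FREE_BYTES = 1 * 1024**3  # 1 GiB


def health_message(path=SCRATCH_PATH, min_free=MIN_FREE_BYTES,
                   usage=shutil.disk_usage):
    """Return an 'ERROR ...' line if the node looks unfit, else None."""
    try:
        free = usage(path).free
    except OSError as exc:
        return f"ERROR cannot check {path}: {exc}"
    if free < min_free:
        return f"ERROR only {free} bytes free on {path}"
    return None


def main():
    message = health_message()
    if message is not None:
        # Output starting with ERROR is what flags the failed check.
        print(message)


if __name__ == "__main__":
    main()
```

The point of the sketch is that fitness tests like this live in a script
the administrator controls, so they can be as strict or as lenient as the
site needs without the mom itself killing jobs.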
I would say take that check out of the mom completely. It's unnecessary.
-mb
--
+-----------------------------------------------
| Michael Barnes
|
| Thomas Jefferson National Accelerator Facility
| 12000 Jefferson Ave.
| Newport News, VA 23606
| (757) 269-7634
+-----------------------------------------------