[torquedev] Should a communication error between pbs_mom's kill a job ?

Michael Barnes barnes at jlab.org
Wed May 6 06:23:10 MDT 2009


On Wed, May 06, 2009 at 11:13:26AM +0200, Bas van der Vlies wrote:
> Chris Samuel wrote:
> > ----- "Michael Barnes" <barnes at jlab.org> wrote:
> > 
> >> I've always modified the code so that a mom could not kill a job
> >> whenever I install PBS/TORQUE. I cannot think of a reason why one mom
> >> should terminate a job unless the job has actually gone over resource
> >> limits as the function name implies.
> > 
> > Thanks - nice to know I'm not the only one who feels that way!
> > 
> > Does anyone else have any thoughts on this ?
> > 
> > Any objections to me submitting a patch to revert
> > this behaviour ?
> > 
> 
> Could this be an option in the mom config to turn this on or off?

This could be a configurable option. For what it's worth, every time
I've patched pbs_mom to keep it from telling the mother superior to kill
a job, I've left the log message in the code. I've done this since
OpenPBS back in 2001 or 2002, and although I've seen the log message,
I've never seen a failed job coincide with it. Obviously, I've seen the
inverse, and that is why I've always patched the code.

If this were to be a configurable option, my experience says that the
default should be to not kill the job, and to log the communication
error as a warning.
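
A rough sketch of that default, for illustration only. This is not
actual TORQUE code: the flag kill_job_on_comm_error, the job_action
enum and the function on_mom_comm_error are all made-up names, and a
real patch would hook into pbs_mom's own config parsing and logging.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical setting, NOT an existing pbs_mom config parameter.
 * Proposed default: a communication error does not kill the job. */
bool kill_job_on_comm_error = false;

typedef enum { JOB_KEEP_RUNNING, JOB_KILL } job_action;

/* Hypothetical handler for a communication error between moms.
 * The warning is always logged; the job is only killed if the
 * admin has explicitly turned the option on. */
job_action on_mom_comm_error(const char *job_id, const char *peer)
{
    fprintf(stderr,
            "WARNING: job %s: communication error with mom on %s\n",
            job_id, peer);

    return kill_job_on_comm_error ? JOB_KILL : JOB_KEEP_RUNNING;
}
```

With the flag left at its default the job keeps running and only the
warning shows up in the log; sites that want the current behaviour
would have to switch it on themselves.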

-mb

-- 
+-----------------------------------------------
| Michael Barnes
|
| Thomas Jefferson National Accelerator Facility
| 12000 Jefferson Ave.
| Newport News, VA 23606
| (757) 269-7634
+-----------------------------------------------

