[torquedev] Should a communication error between pbs_moms kill a job?

Bas van der Vlies basv at sara.nl
Wed May 6 07:30:03 MDT 2009


Michael Barnes wrote:
> On Wed, May 06, 2009 at 11:13:26AM +0200, Bas van der Vlies wrote:
>> Chris Samuel wrote:
>>> ----- "Michael Barnes" <barnes at jlab.org> wrote:
>>>
>>>> I've always modified the code so that a mom could not kill a job
>>>> whenever I install PBS/TORQUE. I cannot think of a reason why one mom
>>>> should terminate a job unless the job has actually gone over resource
>>>> limits as the function name implies.
>>> Thanks - nice to know I'm not the only one who feels that way!
>>>
>>> Does anyone else have any thoughts on this?
>>>
>>> Any objections to me submitting a patch to revert
>>> this behaviour?
>>>
>> Could this be an option in the mom config to turn this on or off?
> 
> This could be a configurable option, but every time I've patched the
> pbs_mom to keep it from telling the mother superior to kill a job, I've
> left the log message in the code. I've done this since OpenPBS back in
> 2001 or 2002 and I've seen the log message, but I've never seen a failed
> job coincide with the log message. Obviously, I've seen the inverse, and
> that is why I've always patched the code.
> 
> If this becomes a configurable option, my experience says the default
> should be not to kill the job and to log the communication error as a
> warning.
> 

I agree
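
To make the proposal concrete, here is a rough sketch of the behaviour being discussed, written in Python for brevity rather than in the C of pbs_mom. Everything in it is an assumption for illustration: the option name `kill_job_on_comm_error`, the simple "name value" config format, and the function names are hypothetical and are not taken from the TORQUE source. The idea is only that the option defaults to off, the error is always logged as a warning, and the job is killed only when the admin has explicitly asked for that.

```python
import logging

logger = logging.getLogger("pbs_mom_sketch")

# Hypothetical option name; not an actual TORQUE mom config parameter.
OPTION = "kill_job_on_comm_error"

# Proposed default: a communication error does NOT kill the job.
DEFAULT_CONFIG = {OPTION: False}


def parse_mom_config(text):
    """Parse 'name value' lines (a made-up format) on top of the defaults."""
    config = dict(DEFAULT_CONFIG)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == OPTION:
            config[OPTION] = parts[1].strip().lower() in ("1", "true", "yes", "on")
    return config


def on_sister_comm_error(job_id, config):
    """Decide what to do when a sister mom cannot be reached.

    Always logs a warning; returns "kill" only if the option is enabled,
    otherwise "continue".
    """
    logger.warning("communication error with sister mom for job %s", job_id)
    if config[OPTION]:
        return "kill"
    return "continue"


# With an empty config the job keeps running; the admin must opt in to kills.
default_action = on_sister_comm_error("123.server", parse_mom_config(""))
opt_in_action = on_sister_comm_error(
    "123.server", parse_mom_config("kill_job_on_comm_error true")
)
```

Keeping the warning unconditional matches what Michael describes doing in his patches: the log message stays, only the kill is made optional.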

-- 
********************************************************************
*  Bas van der Vlies                    e-mail: basv at sara.nl       *
*  SARA - Academic Computing Services   Amsterdam, The Netherlands *
********************************************************************

