[torquedev] Should a communication error between pbs_mom's kill a job ?

David Singleton David.Singleton at anu.edu.au
Tue May 19 15:29:07 MDT 2009


Michael Barnes wrote:
> On Tue, May 19, 2009 at 10:37:20AM +0200, Bas van der Vlies wrote:
>> Chris Samuel wrote:
>>> ----- "Glen Beane" <glen.beane at gmail.com> wrote:
>>>
>>>> so what is the consensus?  Remove the behavior, or create a mom
>>>> config option to control it?  I don't mind doing the work to create
>>>> the config option.
>>> Personally I'd be happier if it just went away, but
>>> at least one person has asked for it to be configurable.
>>>
>> +1 to remove the code.
> 
> remove++
> 

I could be wrong, but I believe the logic to kill jobs on network problems
was there because the MOM task management code was not set up for recovery
after MOMs lost their connection.  There are "corner case" race conditions,
etc., that are difficult to cover, so the PBS logic was simply to avoid
getting confused.  I know Torque has code to reconnect sisters, but I don't
know if that handles all these task management issues.  Maybe it does.  Even
if it doesn't, it's possible sites may not see any problems from
reconnecting sisters.

Has anyone tried to stress test this?  Lots of restarting sisters (with
long and short outages) in the presence of a variable TM load (lots of
pbsdsh runs, etc., of varying lengths) with this mod?  Throw job
suspend/resume and any other running-job operation into the mix.  The issue
is not whether MOMs crash but whether they get confused about job states.
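
For what it's worth, a randomized but reproducible schedule for that kind
of test could be sketched roughly as below.  This is only an illustration:
the event names, the make_schedule helper and the timing ranges are all
made up here, and how each event is actually carried out on a cluster
(restarting a sister MOM, launching a TM task, suspending a job) is
site-specific and deliberately left out.

```python
import random

# Hypothetical event names, not Torque identifiers. Mapping each one to a real
# action on the cluster is left to whoever drives the test.
EVENTS = ("restart_sister", "start_tm_task", "suspend_job", "resume_job")


def make_schedule(n_events, sisters, seed=0,
                  short_outage=(1, 5), long_outage=(60, 300)):
    """Return a list of (time_offset_s, event, details) tuples.

    The same seed always yields the same schedule, so a run that leaves
    the MOMs confused about job state can be replayed.
    """
    rng = random.Random(seed)
    t = 0.0
    suspended = False
    schedule = []
    for _ in range(n_events):
        # Random gap between events, in seconds (arbitrary range).
        t += rng.uniform(0.5, 30.0)
        # Never resume a running job or suspend an already suspended one.
        excluded = "suspend_job" if suspended else "resume_job"
        event = rng.choice([e for e in EVENTS if e != excluded])
        if event == "restart_sister":
            # Mix short and long outages, as suggested above.
            low, high = rng.choice((short_outage, long_outage))
            details = {"node": rng.choice(sisters),
                       "outage_s": rng.randint(low, high)}
        elif event == "start_tm_task":
            # A TM task of varying length on a random sister.
            details = {"node": rng.choice(sisters),
                       "run_s": rng.randint(1, 120)}
        else:
            details = {}
            suspended = (event == "suspend_job")
        schedule.append((round(t, 1), event, details))
    return schedule


if __name__ == "__main__":
    for when, event, details in make_schedule(10, ["node01", "node02"], seed=42):
        print(when, event, details)
```

The interesting part is what gets checked after each step: comparing the job
and task state that each MOM reports against what the schedule says should
be true, rather than just checking that the daemons are still up.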

Cheers,
David

