[torquedev] Should a communication error between pbs_mom's kill a job ?

Michael Barnes barnes at jlab.org
Thu May 21 09:10:02 MDT 2009


On Wed, May 20, 2009 at 07:29:07AM +1000, David Singleton wrote:
> Has anyone tried to stress test this? Lots of restarting sisters
> (with long and short outages) in the presence of a variable tm load
> (lots of pbsdsh's etc of varying lengths) with this mod? Throw job
> suspend/resume and any other running job operation into the mix. The
> issue is not whether MOMs crash but whether they get confused about
> job states.

I have not stress tested these specific cases, but I have always taken
the code in question out of all of the clusters that I've administered
since the OpenPBS days.

I always left the logging part of the code in place, so that the MOM
still records a message along the lines of "node x has requested job to
die, but this is ignored".

I would suspect that the worst case in all of the above tests would be
that the job dies, which is already the default behavior of the code
that is in dispute. In years, I have never seen the MOMs fail to
recover from a network problem, server outage, or anything like that
between job runs.

If it's important enough for someone to add this as a configurable
option, that is fine. I believe the default should be to *not* kill
the job on a network problem.
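
As a rough illustration of the configurable behavior described above,
here is a minimal sketch in Python rather than the actual pbs_mom C
code. The setting name KILL_JOB_ON_SISTER_ERROR and the function
handle_sister_kill_request are hypothetical and do not exist in TORQUE;
the only things taken from this thread are the "log it but ignore it"
message and the proposed default of not killing the job.

```python
import logging

log = logging.getLogger("mom_sketch")

# Hypothetical setting, NOT an existing pbs_mom config option.
# False matches the default argued for in this message.
KILL_JOB_ON_SISTER_ERROR = False


def handle_sister_kill_request(node, job_id,
                               kill_on_error=KILL_JOB_ON_SISTER_ERROR):
    """Decide what to do when a sister asks for a job to die.

    Returns "kill" or "ignore". The request is logged in both cases,
    so the logging facility stays in place either way.
    """
    if kill_on_error:
        log.warning("node %s has requested job %s to die; killing job",
                    node, job_id)
        return "kill"
    log.warning("node %s has requested job %s to die, but this is ignored",
                node, job_id)
    return "ignore"
```

With the default, a call such as handle_sister_kill_request("node01",
"123.server") only logs the request and leaves the job running; a site
that wants the old behavior would pass kill_on_error=True.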

-mb

-- 
+-----------------------------------------------
| Michael Barnes
|
| Thomas Jefferson National Accelerator Facility
| 12000 Jefferson Ave.
| Newport News, VA 23606
| (757) 269-7634
+-----------------------------------------------

