[torqueusers] Question About Desired Behavior
Kevin Van Workum
vanw at sabalcore.com
Tue Mar 26 12:44:24 MDT 2013
On Tue, Mar 26, 2013 at 2:34 PM, John Valdes <valdes at anl.gov> wrote:
> Glen Beane wrote:
> > David Beer wrote:
> > > Our QA tests have exposed that when a job file is loaded saying that
> > > the state is running but there is no exec host list defined [...]
> > > I can think of two different behaviors:
> > >
> > > 1. delete the job
> > > 2. requeue the job
> > >
> > > Which one would you all prefer?
> > Of course, the best solution is to find the bug that caused the
> > corruption and keep the job file consistent. In my opinion, it is
> > hard to know what the "right thing" to do is in this case. Is this
> > something you see often? Silently handling it (rerunning) may not be
> > the right thing -- this should definitely be brought to someone's
> > attention. But at the same time, "disappearing jobs" aren't good
> > either [...]
> > I guess I would lean towards rerunning if the job is "rerunnable",
> > emailing the user with some kind of error message if it is not, and
> > in all cases log the error. But I'm not 100% convinced.
> My sentiments match Glen's. How about a 3rd option:
> 3. place a system hold on the job
> Then the admin can investigate to see why the job got into this state
> and determine the proper course of action.
I agree with option 3, system hold it, then requeue.
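
To make the options concrete, here is a rough sketch of the decision logic discussed in this thread. It is only an illustration in Python, not TORQUE code: the Job fields, the Action names, the "R" state code, and the recover() function are all hypothetical names chosen for this example. The default policy is option 3 (system hold); the "requeue" policy follows Glen's suggestion of rerunning only rerunnable jobs and notifying the user otherwise. The error is logged in every case.

```python
import logging
from dataclasses import dataclass
from enum import Enum

log = logging.getLogger("job_recovery")


class Action(Enum):
    NONE = "none"                # job file is consistent, nothing to do
    DELETE = "delete"            # option 1
    REQUEUE = "requeue"          # option 2
    SYSTEM_HOLD = "system_hold"  # option 3
    NOTIFY_USER = "notify_user"  # Glen's fallback for non-rerunnable jobs


@dataclass
class Job:
    # Hypothetical, simplified view of a loaded job file.
    job_id: str
    state: str              # assumed: "R" means running
    exec_host: str = ""     # empty when no exec host list is defined
    rerunnable: bool = True


def is_inconsistent(job: Job) -> bool:
    """True when the job claims to be running but has no exec host list."""
    return job.state == "R" and not job.exec_host.strip()


def recover(job: Job, policy: str = "hold") -> Action:
    """Pick a recovery action for a job loaded from disk."""
    if not is_inconsistent(job):
        return Action.NONE

    # Log in all cases so the problem is brought to someone's attention.
    log.error("job %s is marked running but has no exec host list", job.job_id)

    if policy == "hold":
        return Action.SYSTEM_HOLD
    if policy == "requeue":
        return Action.REQUEUE if job.rerunnable else Action.NOTIFY_USER
    if policy == "delete":
        return Action.DELETE
    raise ValueError(f"unknown policy: {policy!r}")
```

With the hold policy, the admin would investigate and then release and requeue the job by hand, which is what I mean by "system hold it, then requeue".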