[torqueusers] Question About Desired Behavior

Kevin Van Workum vanw at sabalcore.com
Tue Mar 26 12:44:24 MDT 2013


On Tue, Mar 26, 2013 at 2:34 PM, John Valdes <valdes at anl.gov> wrote:

> Glen Beane wrote:
> > David Beer wrote:
> > > Our QA tests have exposed that when a job file is loaded saying that
> > > its state is running but there is no exec host list defined [...]
> > > I can think of two different behaviors:
> > >
> > > 1. delete the job
> > > 2. requeue the job
> > >
> > > Which one would you all prefer?
> >
> > of course, the best solution is to find the bug that caused the
> > corruption and keep the job file consistent.  In my opinion, it is
> > hard to know what the "right thing" to do is in this case.  Is this
> > something you see often?  Silently handling it (rerunning) may not be
> > the right thing -- this should definitely be brought to someone's
> > attention. But at the same time, "disappearing jobs" aren't good
> > either [...]
> >
> > I guess I would lean towards rerunning if the job is "rerunnable",
> > emailing the user with some kind of error message if it is not, and
> > logging the error in all cases.  But I'm not 100% convinced.
>
> My sentiments match Glen's.  How about a 3rd option:
>
> 3. place a system hold on the job
>
> Then the admin can investigate to see why the job got into this state
> and determine the proper course of action.
>
> John
>

I agree with option 3, system hold it, then requeue.
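
To make the combined proposal concrete, here is a rough sketch of the decision logic as I read this thread: log the inconsistency in all cases, place a system hold, then requeue if the job is rerunnable and notify the user if it is not. This is not TORQUE code. The `Job` fields, the `Action` names, and the `recover` function are all hypothetical and exist only to illustrate the policy.

```python
import logging
from dataclasses import dataclass
from enum import Enum
from typing import List

log = logging.getLogger("recovery_sketch")


class Action(Enum):
    # Hypothetical action names, not TORQUE identifiers.
    SYSTEM_HOLD = "system_hold"
    REQUEUE = "requeue"
    NOTIFY_USER = "notify_user"


@dataclass
class Job:
    # Hypothetical, simplified view of a job file loaded at startup.
    job_id: str
    state: str            # "R" is assumed to mean running
    exec_host: str = ""   # empty string stands for "no exec host list"
    rerunnable: bool = True


def is_inconsistent(job: Job) -> bool:
    """A job marked running with no exec host list is the case under discussion."""
    return job.state == "R" and not job.exec_host


def recover(job: Job) -> List[Action]:
    """Return the actions to take for a job loaded from disk."""
    if not is_inconsistent(job):
        return []

    # Log in all cases so the problem is never handled silently.
    log.error("job %s is marked running but has no exec host list", job.job_id)

    # Option 3: hold first so an admin can investigate.
    actions = [Action.SYSTEM_HOLD]
    if job.rerunnable:
        actions.append(Action.REQUEUE)
    else:
        # A job that cannot be rerun stays held and the owner is told why.
        actions.append(Action.NOTIFY_USER)
    return actions
```

With this sketch a rerunnable job ends up held and requeued, so it can run again once the hold is released, while a non-rerunnable job stays held and its owner gets a message instead of the job simply disappearing.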
