[torqueusers] Question About Desired Behavior
glen.beane at gmail.com
Tue Mar 26 12:58:15 MDT 2013
On Tue, Mar 26, 2013 at 2:44 PM, Kevin Van Workum <vanw at sabalcore.com> wrote:
> On Tue, Mar 26, 2013 at 2:34 PM, John Valdes <valdes at anl.gov> wrote:
>> Glen Beane wrote:
>> > David Beer wrote:
>> > > Our QA tests have exposed that when a job file is loaded saying that
>> > > state is running but there is no exec host list defined [...]
>> > > I can think of two different behaviors:
>> > >
>> > > 1. delete the job
>> > > 2. requeue the job
>> > >
>> > > Which one would you all prefer?
>> > Of course, the best solution is to find the bug that caused the
>> > corruption and keep the job file consistent. In my opinion, it is
>> > hard to know what the "right thing" to do is in this case. Is this
>> > something you see often? Silently handling it (rerunning) may not be
>> > the right thing -- this should definitely be brought to someone's
>> > attention. But at the same time, "disappearing jobs" aren't good
>> > either [...]
>> > I guess I would lean towards rerunning if the job is "rerunnable",
>> > emailing the user with some kind of error message if it is not, and
>> > in all cases log the error. But I'm not 100% convinced.
>> My sentiments match Glen's. How about a 3rd option:
>> 3. place a system hold on the job
>> Then the admin can investigate to see why the job got into this state
>> and determine the proper course of action.
> I agree with option 3, system hold it, then requeue.
That seems reasonable to me.
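
For concreteness, here is a minimal sketch of the "system hold, then requeue" policy discussed above. It is not TORQUE code: the `Job` class, the `State` enum, and `recover_loaded_job` are hypothetical names made up for illustration, and the only thing borrowed from TORQUE is the `s` letter that qhold uses for a system hold.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum

log = logging.getLogger("job_recovery")


class State(Enum):
    QUEUED = "Q"
    RUNNING = "R"


@dataclass
class Job:
    # Hypothetical stand-in for a job record loaded from disk.
    job_id: str
    state: State
    exec_hosts: list = field(default_factory=list)
    holds: set = field(default_factory=set)


def recover_loaded_job(job: Job) -> bool:
    """Apply option 3 to a job loaded at server startup.

    Returns True if the job was inconsistent (running, but with no
    exec host list) and was therefore held and requeued.
    """
    if job.state is State.RUNNING and not job.exec_hosts:
        # Always log, so the corruption is brought to someone's attention.
        log.error("job %s is marked running but has no exec host list; "
                  "placing a system hold and requeueing", job.job_id)
        job.holds.add("s")        # system hold: the job will not be scheduled
        job.state = State.QUEUED  # requeue, but held until an admin releases it
        return True
    return False
```

The point of holding before requeueing is that the job neither disappears nor silently reruns; an admin has to look at it and release the hold.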