[torqueusers] Question About Desired Behavior

Glen Beane glen.beane at gmail.com
Tue Mar 26 12:58:15 MDT 2013


On Tue, Mar 26, 2013 at 2:44 PM, Kevin Van Workum <vanw at sabalcore.com> wrote:

> On Tue, Mar 26, 2013 at 2:34 PM, John Valdes <valdes at anl.gov> wrote:
>
>> Glen Beane wrote:
>> > David Beer wrote:
>> > > Our QA tests have exposed that when a job file is loaded saying that its
>> > > state is running but there is no exec host list defined [...]
>> > > I can think of two different behaviors:
>> > >
>> > > 1. delete the job
>> > > 2. requeue the job
>> > >
>> > > Which one would you all prefer?
>> >
>> > Of course, the best solution is to find the bug that caused the
>> > corruption and keep the job file consistent.  In my opinion, it is
>> > hard to know what the "right thing" to do is in this case.  Is this
>> > something you see often?  Silently handling it (rerunning) may not be
>> > the right thing -- this should definitely be brought to someone's
>> > attention. But at the same time, "disappearing jobs" aren't good
>> > either [...]
>> >
>> > I guess I would lean towards rerunning if the job is "rerunnable",
>> > emailing the user with some kind of error message if it is not, and
>> > logging the error in all cases.  But I'm not 100% convinced.
>>
>> My sentiments match Glen's.  How about a 3rd option:
>>
>> 3. place a system hold on the job
>>
>> Then the admin can investigate to see why the job got into this state
>> and determine the proper course of action.
>>
>> John
>>
>
> I agree with option 3, system hold it, then requeue.
>


That seems reasonable to me.
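
For illustration only, here is a rough sketch of the "system hold, then
requeue" handling discussed above. It is not TORQUE code: the Job class, the
state and hold strings, and the recover_loaded_job function are all
hypothetical names made up for this example, and it assumes the only check
needed is "state says running but the exec host list is empty".

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Job:
    """Hypothetical stand-in for a job record loaded from disk."""
    job_id: str
    state: str                      # "running", "queued" or "held" in this sketch
    exec_host: str = ""             # empty string: no exec host list was recorded
    holds: Set[str] = field(default_factory=set)
    log: List[str] = field(default_factory=list)


def recover_loaded_job(job: Job) -> str:
    """Apply option 3 to a job whose saved state is inconsistent.

    Returns "ok" if the job is left alone, or "held" if a system hold
    was placed on it so an admin can investigate before it runs again.
    """
    inconsistent = job.state == "running" and not job.exec_host
    if not inconsistent:
        return "ok"

    # Always record the problem so it is brought to someone's attention.
    job.log.append(
        f"{job.job_id}: state is running but no exec host list; placing system hold"
    )
    # Hold the job in the queue rather than deleting or silently rerunning it.
    job.holds.add("system")
    job.state = "held"
    return "held"
```

With this sketch, a job loaded as Job("1.example", "running") comes back as
"held" with one log entry, while a job that has an exec host recorded is left
untouched. Releasing the hold and letting the job run again stays a manual
admin decision.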