[torqueusers] Question About Desired Behavior
dbeer at adaptivecomputing.com
Tue Mar 26 13:29:42 MDT 2013
On Tue, Mar 26, 2013 at 12:44 PM, Kevin Van Workum <vanw at sabalcore.com> wrote:
> On Tue, Mar 26, 2013 at 2:34 PM, John Valdes <valdes at anl.gov> wrote:
>> Glen Beane wrote:
>> > David Beer wrote:
>> > > Our QA tests have exposed that when a job file is loaded saying that
>> > > state is running but there is no exec host list defined [...]
>> > > I can think of two different behaviors:
>> > >
>> > > 1. delete the job
>> > > 2. requeue the job
>> > >
>> > > Which one would you all prefer?
>> > Of course, the best solution is to find the bug that caused the
>> > corruption and keep the job file consistent. In my opinion, it is
>> > hard to know what the "right thing" to do is in this case. Is this
>> > something you see often? Silently handling it (rerunning) may not be
>> > the right thing -- this should definitely be brought to someone's
>> > attention. But at the same time, "disappearing jobs" aren't good
>> > either [...]
>> > I guess I would lean towards rerunning if the job is "rerunnable",
>> > emailing the user with some kind of error message if it is not, and
>> > in all cases log the error. But I'm not 100% convinced.
Writing code to handle error cases is not a substitute for trying to prevent
them. However, for many things, such as dealing with a filesystem or a
network, it is virtually impossible to prevent every error, and if you don't
write error-handling code you are setting yourself up for failure.
>> My sentiments match Glen's. How about a 3rd option:
>> 3. place a system hold on the job
>> Then the admin can investigate to see why the job got into this state
>> and determine the proper course of action.
> I agree with option 3, system hold it, then requeue.
Sounds like the consensus is to place a system hold on the job.
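
For illustration only, here is a minimal sketch of that policy in Python. It is
not TORQUE's actual recovery code: the Job class, the JobState values, the
system_hold flag, and recover_job() are all hypothetical names chosen for the
example. It assumes the check runs once per job as the job file is loaded, and
that the error is always logged, as suggested earlier in the thread.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum

log = logging.getLogger("job_recovery")


class JobState(Enum):
    # Hypothetical states for the sketch, not TORQUE's internal constants.
    QUEUED = "Q"
    RUNNING = "R"
    HELD = "H"


@dataclass
class Job:
    # Hypothetical in-memory view of a job file after it has been loaded.
    job_id: str
    state: JobState
    exec_hosts: list = field(default_factory=list)
    system_hold: bool = False


def recover_job(job: Job) -> Job:
    """Hold a job that claims to be running but has no exec host list."""
    if job.state is JobState.RUNNING and not job.exec_hosts:
        # Always log, so the inconsistency reaches an admin's attention.
        log.error(
            "job %s is marked running but has no exec host list; "
            "placing a system hold",
            job.job_id,
        )
        job.system_hold = True
        job.state = JobState.HELD
    return job
```

The job is neither deleted nor silently rerun; it stays visible in a held
state until an admin decides whether to release, requeue, or delete it.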
David Beer | Senior Software Engineer