[torqueusers] Question About Desired Behavior

John Valdes valdes at anl.gov
Tue Mar 26 12:34:49 MDT 2013


Glen Beane wrote:
> David Beer wrote:
> > Our QA tests have exposed that when a job file is loaded saying that its
> > state is running but there is no exec host list defined [...]
> > I can think of two different behaviors:
> >
> > 1. delete the job
> > 2. requeue the job
> >
> > Which one would you all prefer?
> 
> Of course, the best solution is to find the bug that caused the
> corruption and keep the job file consistent.  In my opinion, it is
> hard to know what the "right thing" to do is in this case.  Is this
> something you see often?  Silently handling it (rerunning) may not be
> the right thing -- this should definitely be brought to someone's
> attention. But at the same time, "disappearing jobs" aren't good
> either [...]
> 
> I guess I would lean towards rerunning if the job is "rerunnable",
> emailing the user with some kind of error message if it is not, and
> logging the error in all cases.  But I'm not 100% convinced.

My sentiments match Glen's.  How about a 3rd option:

3. place a system hold on the job

Then the admin can investigate to see why the job got into this state
and determine the proper course of action.
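
To make the options concrete, here is a rough sketch of the decision
logic discussed in this thread.  It is not TORQUE code: the Job class,
its fields, the Action names, and the prefer_hold switch are all
hypothetical and exist only to illustrate the policy (hold by default,
otherwise requeue if rerunnable or notify the user, and log in all
cases).

```python
import logging
from dataclasses import dataclass
from enum import Enum

log = logging.getLogger("job_recovery_sketch")


class Action(Enum):
    # Hypothetical action names; they do not correspond to TORQUE internals.
    NONE = "none"
    REQUEUE = "requeue"
    NOTIFY_USER = "notify_user"
    SYSTEM_HOLD = "system_hold"


@dataclass
class Job:
    # Hypothetical stand-in for a job record loaded from a job file.
    job_id: str
    state: str        # "R" is assumed to mean running
    exec_host: str    # empty string when no exec host list is defined
    rerunnable: bool


def is_inconsistent(job: Job) -> bool:
    """A job marked running with no exec host list is treated as corrupt."""
    return job.state == "R" and not job.exec_host


def recover(job: Job, prefer_hold: bool = True) -> Action:
    """Pick a recovery action for a job loaded in an inconsistent state."""
    if not is_inconsistent(job):
        return Action.NONE

    # Log in all cases so the problem is brought to someone's attention.
    log.error("job %s is marked running but has no exec host list", job.job_id)

    if prefer_hold:
        # Option 3: hold the job and let the admin investigate.
        return Action.SYSTEM_HOLD

    # Glen's suggestion: rerun if possible, otherwise tell the user.
    return Action.REQUEUE if job.rerunnable else Action.NOTIFY_USER
```

With prefer_hold left at its default, every inconsistent job ends up
held; passing prefer_hold=False gives the requeue-or-notify behavior
instead.  Either way the job is never silently deleted.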

John

----------------------------------------------------------------------
John Valdes                  Mathematics and Computer Science Division
valdes at anl.gov                             Argonne National Laboratory

