[torqueusers] Question About Desired Behavior

John Valdes valdes at anl.gov
Tue Mar 26 12:34:49 MDT 2013

Glen Beane wrote:
> David Beer wrote:
> > Our QA tests have exposed that when a job file is loaded saying that its
> > state is running but there is no exec host list defined [...]
> > I can think of two different behaviors:
> >
> > 1. delete the job
> > 2. requeue the job
> >
> > Which one would you all prefer?
> Of course, the best solution is to find the bug that caused the
> corruption and keep the job file consistent.  In my opinion, it is
> hard to know what the "right thing" to do is in this case.  Is this
> something you see often?  Silently handling it (rerunning) may not be
> the right thing -- this should definitely be brought to someone's
> attention. But at the same time, "disappearing jobs" aren't good
> either [...]
> I guess I would lean towards rerunning if the job is "rerunnable",
> emailing the user with some kind of error message if it is not, and
> in all cases log the error.  But I'm not 100% convinced.

My sentiments match Glen's.  How about a 3rd option:

3. place a system hold on the job

Then the admin can investigate to see why the job got into this state
and determine the proper course of action.
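
To make option 3 concrete, here is a rough sketch of the check I have in
mind.  It is illustrative Python, not TORQUE code: the Job record, the
recover() function, and the single-letter state codes are all made-up
names for this example, and the real server's data structures will differ.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum

log = logging.getLogger("recovery_sketch")


class Action(Enum):
    HOLD = "system hold"
    NONE = "no action"


@dataclass
class Job:
    # Hypothetical job record; field names are assumptions for illustration.
    job_id: str
    state: str                 # "R" = running, "Q" = queued, "H" = held
    exec_host: str = ""        # empty when no exec host list was recorded
    holds: set = field(default_factory=set)


def recover(job: Job) -> Action:
    """Decide what to do with a job record loaded at server startup."""
    if job.state == "R" and not job.exec_host:
        # Inconsistent record: always log it, then hold the job instead of
        # deleting or requeueing it, so an admin can look before anything
        # is lost or silently rerun.
        log.error("job %s is marked running but has no exec host list; "
                  "placing a system hold", job.job_id)
        job.state = "H"
        job.holds.add("s")     # "s" stands for a system hold in this sketch
        return Action.HOLD
    return Action.NONE
```

The point of the sketch is only that the job is neither dropped nor rerun
automatically.  Once the admin has looked at it, the hold can be released
or the job deleted by hand (presumably with qrls -h s or qdel).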


John Valdes                  Mathematics and Computer Science Division
valdes at anl.gov                             Argonne National Laboratory
