[torqueusers] Question About Desired Behavior

Glen Beane glen.beane at gmail.com
Tue Mar 26 12:24:31 MDT 2013


On Tue, Mar 26, 2013 at 2:14 PM, David Beer <dbeer at adaptivecomputing.com> wrote:
>
>
> On Tue, Mar 26, 2013 at 11:33 AM, Glen Beane <glen.beane at gmail.com> wrote:
>>
>> On Tue, Mar 26, 2013 at 12:41 PM, David Beer
>> <dbeer at adaptivecomputing.com> wrote:
>> > All,
>> >
>> > Our QA tests have exposed that when a job file is loaded saying that its
>> > state is running but there is no exec host list defined, we don't handle
>> > this state. That is, we attempt to perform actions on the job that assume
>> > it is running, but we can't talk to the mom because we don't know which
>> > mom it is.
>> > I can think of two different behaviors:
>> >
>> > 1. delete the job
>> > 2. requeue the job
>> >
>> > Which one would you all prefer?
>>
>>
>> How does a job get into this state in the first place?
>
>
> At this point it appears to be a corrupted job file. More than that we don't
> know, but we need to handle this.


Of course, the best solution is to find the bug that caused the
corruption and keep the job file consistent. In my opinion, it is
hard to know what the "right thing" to do is in this case. Is this
something you see often? Silently handling it (rerunning) may not be
the right thing; this should definitely be brought to someone's
attention. But at the same time, "disappearing jobs" aren't good
either. (We have diskless nodes, and right now if a node reboots with a
job running, that job gets deleted by pbs_server and the user never
gets any kind of notification, such as an abort email, saying why it
was deleted.)

I guess I would lean towards rerunning the job if it is "rerunnable",
emailing the user with some kind of error message if it is not, and
logging the error in all cases. But I'm not 100% convinced.
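
To make that concrete, here is a rough sketch of the policy in Python
(not TORQUE's C). Every name in it (Job, recover_job, the log and notify
callbacks, the state strings) is made up for illustration and is not
pbs_server's actual API. Aborting the non-rerunnable job after sending
the email is also an assumption on my part.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Job:
    # Hypothetical stand-in for a job record; not TORQUE's real job struct.
    job_id: str
    state: str                                   # e.g. "RUNNING", "QUEUED"
    exec_hosts: List[str] = field(default_factory=list)
    rerunnable: bool = True
    owner: str = ""


def recover_job(job: Job,
                log: Callable[[str], None],
                notify: Callable[[str, str], None]) -> str:
    """Handle a job loaded from disk; returns "ok", "requeued" or "aborted"."""
    # A record is only inconsistent if it claims to be running
    # while naming no exec host.
    if job.state != "RUNNING" or job.exec_hosts:
        return "ok"

    # In all cases, log the error so that somebody sees it.
    log(f"job {job.job_id}: state is RUNNING but exec host list is empty")

    if job.rerunnable:
        # Rerunnable: put the job back in the queue.
        job.state = "QUEUED"
        return "requeued"

    # Not rerunnable: tell the owner why the job is going away
    # (assumed here to mean the job is then aborted).
    notify(job.owner,
           f"Job {job.job_id} was aborted: its saved state was inconsistent "
           "(marked running with no exec host) and it is not rerunnable.")
    job.state = "ABORTED"
    return "aborted"
```

The point of passing log and notify in as callbacks is only that the
decision can be exercised without a real server or mailer; the order
(log first, then requeue or notify) is what matters.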

