[torquedev] pbs_mom -p and rerunnable jobs
Wendy Lin
hclin at lbl.gov
Mon Jan 11 09:38:47 MST 2010
>> From reading it sounds to me like the problem is not that pbs_mom
>> tries to recover jobs. The problem is that it kills re-runnable
>> jobs, which have indeed been restarted.
>
> If pbs_mom is restarted with jobs running and those jobs are still
> running when pbs_mom is initialized, the mom tracks those jobs even
> though she no longer owns them. This is the behavior we want. Glen
> points out there is a possibility pbs_mom could pick up the wrong
> job because the pid of a currently running job is the same as the
> pid of a previous job. For the case of a pbs_mom restart this is
> not likely. In the case where a machine is restarted this scenario
> is more likely. In this case it seems the -q or -r option should be
> used to tell the mom to discard any TK files. Of course -q and -r
> could be used on a restart of pbs_mom as well.
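To make the pid-collision concern concrete, here is a minimal sketch of the check involved. It is not Torque code and I am not claiming pbs_mom works this way; the function name, the start-time record and the tolerance are all assumptions made up for illustration. The idea is that a saved pid alone cannot identify a job after a reboot, but a saved pid plus a saved process start time can.

```python
from typing import Callable, Optional


def is_same_process(
    recorded_pid: int,
    recorded_start: float,
    lookup_start: Callable[[int], Optional[float]],
    tolerance: float = 1.0,
) -> bool:
    """Return True only if recorded_pid is alive AND started when recorded.

    lookup_start is a hypothetical callback that returns the start time of
    a live pid, or None if no such process exists. On a real system it
    would read the process table; here it is injected so the sketch can
    run anywhere.
    """
    live_start = lookup_start(recorded_pid)
    if live_start is None:
        # No process with that pid: the old job is gone.
        return False
    # Same pid but a different start time means the pid was reused.
    return abs(live_start - recorded_start) <= tolerance


# Hypothetical process table after a reboot: pid 1234 now belongs to a
# process that started at t=500.0, not to the old job from t=100.0.
process_table = {1234: 500.0}

print(is_same_process(1234, 500.0, process_table.get))
print(is_same_process(1234, 100.0, process_table.get))
print(is_same_process(4321, 100.0, process_table.get))
```

With only the pid to go on, the second case would look like a match, which is the scenario Glen describes after a machine restart.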
From the pbs_mom man page:

  -q ...... With the -q option, The MOM will mark the jobs as
            terminated and notify the batch server which owns
            the job......
  -r ...... With the -r option, MOM will kill any processes
            belonging to jobs, mark the jobs as terminated, and
            notify the batch server which owns the job.

I don't see how either "-q" or "-r" can work, since the server purges
these jobs in response to MOM's notifications. The -q option was new
in Torque, so I tested it, and it did not help.
> At this point I am going to work on the problem where re-runnable
> jobs are killed after they have been restarted. Please respond if
> there is something wrong with this analysis.
MOM's default behavior was changed at some point in Torque. Would it
be easier to just restore the old code?
--
Wendy Lin
Computation Systems Group
hclin at lbl.gov