[torquedev] pbs_mom -p and rerunnable jobs
hclin at lbl.gov
Mon Jan 11 09:38:47 MST 2010
>> From reading it sounds to me like the problem is not that pbs_mom
>> tries to recover jobs. The problem is that it kills re-runnable
>> jobs, which have indeed been restarted.
> If pbs_mom is restarted with jobs running and those jobs are still
> running when pbs_mom is initialized the mom tracks those jobs even
> though she no longer owns them. This is the behavior we want. Glen
> points out there is a possibility pbs_mom could pick up the wrong
> job because the pid of a currently running job is the same as the
> pid of a previous job. For the case of a pbs_mom restart this is
> not likely. In the case where a machine is restarted this scenario
> is more likely. In this case it seems the -q or -r option should be
> used to tell the mom to discard any TK files. Of course -q and -r
> could be used on a restart of pbs_mom as well.
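To make the pid-reuse concern above concrete, here is a minimal sketch in Python. It is not Torque code and not how pbs_mom recovers jobs; the record layout, the function name, and the idea of storing a process start time next to the pid are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TrackedTask:
    """Hypothetical record a daemon might persist for each job task."""
    job_id: str
    pid: int
    start_time: int  # process start time recorded when the task was launched


def recover_task(task: TrackedTask, live_start_times: Mapping[int, int]) -> str:
    """Classify a persisted task against the processes alive right now.

    live_start_times maps each running pid to that process's start time.
    """
    current: Optional[int] = live_start_times.get(task.pid)
    if current is None:
        return "gone"        # no process has that pid: the task has ended
    if current != task.start_time:
        return "pid-reused"  # same pid, different process: do not adopt it
    return "same-process"    # pid and start time both match: safe to track


# Hypothetical task record saved before the daemon went down.
task = TrackedTask(job_id="1234.server", pid=4321, start_time=1000)

# Daemon restart only: the original process is usually still running.
print(recover_task(task, {4321: 1000}))
# Machine reboot: pid 4321 may now belong to an unrelated process.
print(recover_task(task, {4321: 5000}))
# The process exited while the daemon was down.
print(recover_task(task, {}))
```

Comparing on the pid alone would treat the second case as the original job, which is the mix-up described in the quoted text.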
From the pbs_mom man page:
    -q ...... With the -q option,
              The MOM will mark the jobs as terminated and
              batch server which owns the job......
    -r ...... With the -r option,
              MOM will kill any processes belonging to
              jobs, mark the
              jobs as terminated, and notify the batch
              owns the job.
I don't see how either "-q" or "-r" can work, since the server purges
these jobs in response to MOM's notifications. The -q option is new in
Torque, so I tested it, and it did not help.
> At this point I am going to work on the problem where re-runnable
> jobs are killed after they have been restarted. Please respond if
> there is something wrong with this analysis.
MOM's default behavior was changed at some point in Torque. Would it
be easier to just restore the old code?
Computation Systems Group
hclin at lbl.gov