[torquedev] pbs_mom -p and rerunnable jobs
knielson at adaptivecomputing.com
Mon Jan 11 08:21:07 MST 2010
I was out Friday with a tooth infection so I have not been able to respond until this morning.
From reading the thread, it sounds to me like the problem is not that pbs_mom tries to recover jobs. The problem is that it kills re-runnable jobs that have in fact been restarted.
If pbs_mom is restarted with jobs running, and those jobs are still running when pbs_mom is initialized, the mom tracks those jobs even though she no longer owns them. This is the behavior we want. Glen points out that pbs_mom could pick up the wrong job if the pid of a currently running process happens to be the same as the pid of a previous job. For a plain pbs_mom restart this is not likely. When the whole machine is restarted it is more likely. In that case it seems the -q or -r option should be used to tell the mom to discard any TK files. Of course -q and -r could be used on a restart of pbs_mom as well.
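To illustrate the pid-reuse concern, here is a minimal sketch of one way a recorded pid could be checked before it is trusted. This is not what pbs_mom does today. It assumes a Linux-style /proc filesystem and assumes the process start time (field 22 of /proc/<pid>/stat) was saved next to the pid when the job started. The function names and the proc_root parameter are hypothetical and exist only for this example.

```python
import os


def parse_start_ticks(stat_line):
    """Return the starttime field (field 22) of a /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the LAST ')' before tokenizing.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 22 sits at index 19.
    return int(rest[19])


def is_same_process(pid, recorded_start_ticks, proc_root="/proc"):
    """True only if pid exists and started at the recorded time.

    recorded_start_ticks is assumed to have been saved with the pid when
    the job was launched (hypothetical; not an existing TORQUE field).
    """
    stat_path = os.path.join(proc_root, str(pid), "stat")
    try:
        with open(stat_path) as stat_file:
            line = stat_file.read()
    except OSError:
        # No such process: the old job is gone.
        return False
    # Same pid but a different start time means the pid was reused.
    return parse_start_ticks(line) == recorded_start_ticks
```

The start time is counted in clock ticks since boot, so after a machine restart a match is still possible in principle, just much less likely than a bare pid match. Discarding the saved state with -q or -r remains the simpler answer for that case.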
At this point I am going to work on the problem where re-runnable jobs are killed after they have been restarted. Please respond if there is something wrong with this analysis.