[torquedev] pbs_mom -p and rerunnable jobs

Ken Nielson knielson at adaptivecomputing.com
Mon Jan 11 08:21:07 MST 2010


Hi all,

I was out Friday with a tooth infection, so I have not been able to respond until this morning.

From reading the thread, it sounds to me like the problem is not that pbs_mom tries to recover jobs. The problem is that it kills re-runnable jobs that have already been restarted.

If pbs_mom is restarted while jobs are running, and those jobs are still running when pbs_mom initializes, the mom tracks those jobs even though she no longer owns them. This is the behavior we want. Glen points out that pbs_mom could pick up the wrong job because the pid of a currently running job is the same as the pid of a previous job. When only pbs_mom is restarted, this is unlikely. When the whole machine is restarted, it is more likely. In that case it seems the -q or -r option should be used to tell the mom to discard any TK files. Of course, -q and -r could also be used when restarting pbs_mom alone.
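
To make the pid-reuse concern concrete, here is a minimal Python sketch of one way a recovered pid could be checked before it is trusted. Nothing here is taken from the TORQUE source: TaskRecord, is_same_process, the tolerance value, and the idea that a saved task record carries a start time are all assumptions for illustration only.

```python
from typing import Dict, NamedTuple


class TaskRecord(NamedTuple):
    """Hypothetical saved state for one task (not the real TK file layout)."""
    job_id: str
    pid: int
    start_time: float  # seconds since epoch, recorded when the task was launched


def is_same_process(record: TaskRecord,
                    live_start_times: Dict[int, float],
                    tolerance: float = 1.0) -> bool:
    """Return True only if the recorded pid is alive AND started when we expect.

    live_start_times maps each currently running pid to its start time.
    A bare pid match is not enough, because pids can be reused, most
    plausibly after a machine reboot.
    """
    live_start = live_start_times.get(record.pid)
    if live_start is None:
        return False  # no process with that pid is running any more
    # Same pid but a different start time means the pid was reused.
    return abs(live_start - record.start_time) <= tolerance


record = TaskRecord(job_id="123.server", pid=4242, start_time=1000.0)
print(is_same_process(record, {4242: 1000.4}))  # pid alive, start time matches
print(is_same_process(record, {4242: 5000.0}))  # pid reused by a later process
print(is_same_process(record, {}))              # pid no longer exists
```

With a check of this kind, a reused pid would be rejected rather than tracked as the old job. Without one, discarding the saved state with -q or -r after a reboot remains the safer choice.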

At this point I am going to work on the problem where re-runnable jobs are killed after they have been restarted. Please respond if there is something wrong with this analysis.

Thanks

Ken Nielson
Adaptive Computing




