[torquedev] pbs_mom -p and rerunnable jobs
knielson at adaptivecomputing.com
Mon Jan 11 13:52:28 MST 2010
Wendy Lin wrote:
> From pbs_mom man page:
> -q ...... With the -q option,
> The MOM will mark the jobs as terminated and notify the
> batch server which owns the job......
> -r ...... With the -r option,
> MOM will kill any processes belonging to jobs, mark the
> jobs as terminated, and notify the batch server which
> owns the job.
> I don't see how either "-q" or "-r" can work as the server purges
> these jobs in response to MOM's notifications. The -q was new in
> Torque, so I tested it, and it did not help.
After studying the code and TORQUE's behavior further, I see that the
default behavior (-p) will either "adopt" any running jobs or report the
end of jobs whose pids do not match what is in the TK files. In both
cases pbs_mom will report the end of the job and create an obituary, and
pbs_server believes that the jobs ran to completion. At first it looks
like TORQUE is deleting re-runnable jobs, but it is merely reporting
that the jobs finished and removing them from the queue.
If pbs_mom is restarted with the -q option and there are no jobs still
running, pbs_server will re-queue the jobs so they can be run again
later. If jobs are still running when pbs_mom is started with -q, the
jobs are terminated and go to an E state, but are then re-queued so they
can be run again.
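The two cases above can be summarized as a small decision table. The
Python sketch below is only a model of the behavior described in this
message; the names restart_outcome and Outcome are hypothetical and do
not come from the TORQUE source, and -r is left out because its outcome
is not described here.

```python
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels for the outcomes described in this message.
    REPORTED_FINISHED = "reported as finished and removed from the queue"
    REQUEUED = "re-queued to run again later"
    TERMINATED_THEN_REQUEUED = "terminated, goes to state E, then re-queued"


def restart_outcome(flag: str, job_still_running: bool) -> Outcome:
    """Model what happens to a job when pbs_mom is restarted.

    This is an illustration of the observations in this thread,
    not a reimplementation of pbs_mom.
    """
    if flag == "-p":
        # Default: whether the job is adopted or its pids no longer
        # match the TK files, pbs_mom ends up reporting the end of the
        # job, so pbs_server treats it as having run to completion.
        return Outcome.REPORTED_FINISHED
    if flag == "-q":
        if job_still_running:
            return Outcome.TERMINATED_THEN_REQUEUED
        return Outcome.REQUEUED
    # Other flags (including -r) are not modeled in this sketch.
    raise ValueError(f"flag not modeled in this sketch: {flag}")


if __name__ == "__main__":
    for flag in ("-p", "-q"):
        for running in (False, True):
            outcome = restart_outcome(flag, running)
            print(f"{flag} running={running}: {outcome.value}")
```

Under these assumptions, only the -q rows end with the job back in the
queue; both -p rows end with the job reported as finished.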
> MOM's default behavior got changed at some point in Torque. Would it
> be easier if you just recovered the old code?
At this point I would say that there is some misunderstanding about what
the default behavior of pbs_mom should be if pbs_mom is terminated with
jobs running. Right now I would say that if you want to be able to rerun
jobs, you should start pbs_mom with -r. If you want the ability to
recover jobs which may still be running, use the default behavior (-p).
What was the default behavior previously?