[torquedev] pbs_mom -p and rerunnable jobs

Ken Nielson knielson at adaptivecomputing.com
Mon Jan 11 13:52:28 MST 2010

Wendy Lin wrote:
> From pbs_mom man page:
>       -q              ......  With the -q option,
>                        The MOM will mark the jobs as terminated and notify the
>                        batch server which owns the job......
>       -r              ......  With the -r option,
>                        MOM will kill any processes belonging to jobs, mark the
>                        jobs as terminated, and notify the batch server which
>                        owns the job.
> I don't see how either "-q" or "-r" can work as the server purges 
> these jobs in response to MOM's notifications. The -q was new in 
> Torque, so I tested it, and it did not help.
After studying the code and TORQUE's behavior further, I see that the 
default behavior (-p) will either "adopt" any jobs that are still running 
or report the end of jobs whose pids do not match what is in the TK 
files. In both cases pbs_mom reports the end of the job and creates an 
obituary, and pbs_server believes that the jobs ran to completion. At 
first it looks like TORQUE is deleting re-runnable jobs, but it is merely 
reporting that the jobs finished and removing them from the queue.

If pbs_mom is restarted with the -q option and no jobs are still 
running, pbs_server will re-queue the jobs so they can be run again 
later.  If jobs are still running when pbs_mom is started with -q, the 
jobs are terminated and go to an E state, but are then re-queued so they 
can be run again.
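
To summarize the cases above, here is a small Python sketch of what 
happens to a job when pbs_mom is restarted. It is only a model of the 
behavior described in this message and in the man page excerpt, not 
TORQUE code; the function name and the state strings are made up for 
illustration.

```python
def restart_outcome(flag, processes_alive):
    """Model the steps a job goes through after pbs_mom restarts.

    flag            -- the pbs_mom option used at restart: "-p", "-q" or "-r"
    processes_alive -- True if the job's processes survived the restart
    Returns a list of hypothetical step names, in order.
    """
    if flag == "-p":
        # Default: adopt jobs that are still running, otherwise report the
        # end of jobs whose pids no longer match the TK files. Either way
        # the server ends up believing the job ran to completion.
        first = "adopted" if processes_alive else "end reported"
        return [first, "obituary sent", "removed from queue as completed"]
    if flag == "-q":
        # Observed behavior: the job is re-queued; if it was still running
        # it is terminated and passes through the E state first.
        if processes_alive:
            return ["terminated", "E state", "requeued"]
        return ["requeued"]
    if flag == "-r":
        # Per the man page excerpt quoted above.
        return ["processes killed", "marked terminated", "server notified"]
    raise ValueError("unknown pbs_mom restart flag: %r" % (flag,))


if __name__ == "__main__":
    for flag in ("-p", "-q", "-r"):
        for alive in (True, False):
            print(flag, alive, restart_outcome(flag, alive))
```

Only the -q rows end in a re-queue here, because that is the only case 
I have described above; the -r row just restates the man page.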
> MOM's default behavior got changed at some point in Torque. Would it 
> be easier if you just recovered the old code?
At this point I would say that there is some misunderstanding about what 
the default behavior of pbs_mom should be when pbs_mom is terminated with 
jobs running. Right now I would say that if you want to be able to rerun 
jobs, you should start pbs_mom with -r. If you want the ability to 
recover jobs which may still be running, use the default behavior, -p.

What was the default behavior previously?



