[torquedev] pbs_mom -p and rerunnable jobs
knielson at adaptivecomputing.com
Mon Jan 11 14:23:52 MST 2010
Ken Nielson wrote:
> Wendy Lin wrote:
>> From pbs_mom man page:
>>   -q ...... With the -q option, the MOM will mark the jobs as
>>      terminated and notify the batch server which owns the job......
>>   -r ...... With the -r option, MOM will kill any processes belonging
>>      to jobs, mark the jobs as terminated, and notify the batch server
>>      which owns the job.
>> I don't see how either "-q" or "-r" can work as the server purges
>> these jobs in response to MOM's notifications. The -q was new in
>> Torque, so I tested it, and it did not help.
> After studying the code and TORQUE's behavior more, I see that the
> default behavior (-p) will either "adopt" any running jobs or report the
> end of jobs that do not have pids that match what is in the TK files. In
> both cases pbs_mom will report the end of the job and create an obituary,
> and pbs_server believes that the jobs ran to completion. At first it
> looks like TORQUE is deleting re-runnable jobs, but it is merely
> reporting the jobs as finished and removing them from the queue.
> If pbs_mom is restarted with the -q option and there are no jobs still
> running, pbs_server will re-queue the jobs so they can be run again
> later. If jobs are still running when pbs_mom is started with -q, the
> jobs are terminated and go to an E state, but are then re-queued so they
> can be run again.
>> MOM's default behavior got changed at some point in Torque. Would it
>> be easier if you just recovered the old code?
> At this point I would say that there is some misunderstanding about what
> the default behavior of pbs_mom should be if pbs_mom is terminated with
> jobs running. Right now I would say that if you want to be able to rerun
> jobs you should start pbs_mom with a -r.
Correct that. You should run pbs_mom with a -q. Sorry for not proofing
> If you want the ability to
> recover jobs which may still be running use the default behavior or -p.
> What was the default behavior previously?
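For reference, the three restart options discussed above can be summarized in a small shell sketch. This is only an illustration: the function name mom_restart_flag and the intent keywords (requeue, recover, kill) are made up here, and only the -p, -q and -r flags themselves come from the man page excerpt and the discussion in this thread. The script prints the command instead of running it.

```shell
# Hypothetical helper: map an intent to the pbs_mom restart flag
# described in this thread. Names and keywords are illustrative only.
mom_restart_flag() {
    case "$1" in
        requeue) echo "-q" ;;  # per this thread: jobs are re-queued to run again
        recover) echo "-p" ;;  # default: try to adopt jobs that are still running
        kill)    echo "-r" ;;  # per man page: kill job processes, mark jobs terminated
        *)       echo "unknown intent: $1" >&2; return 1 ;;
    esac
}

# Show the command that would be used to restart the MOM so that
# rerunnable jobs are re-queued; nothing is actually started here.
echo "pbs_mom $(mom_restart_flag requeue)"
```

Whether -q re-queues jobs on a given TORQUE version should be checked on a test node first, since Wendy's test above suggests the behavior may differ between versions.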