[torquedev] pbs_mom -p and rerunnable jobs
Ken Nielson
knielson at adaptivecomputing.com
Tue Jan 12 12:24:41 MST 2010
Wendy Lin wrote:
>
> From Torque 2.3.7 man page:
>
>
> ----
> Normally the mini-server is started from the system boot file without
> the -p or the -r option. The mini-server will make no attempt to
> signal the former session of any job which may have been running
> when the mini-server terminated. It is assumed that on reboot, all
> processes have been killed. The MOM will mark the jobs
> as terminated and notify the batch server which owns the job.
> ----
>
> From OpenPBS man page:
>
> ----
> Normally the mini-server is started from the system boot file without
> the -p or the -r option. The mini-server will make no attempt to
> signal the former session of any job which may have been running when
> the mini-server terminated. It is assumed that on reboot, all
> processes have been killed.
> ----
>
> Looks like Torque added the MOM's sending obit part, which I believe
> is the problem.
I do not think that sending an obit is the problem. When the mom
restarts after dying, it needs to tell the server what it believes the
status of each of its jobs is, so the server can update its queue.

It seems we have some middle ground that is not being covered. We need
two more options: one where the mom re-queues jobs but does not kill
them, and one where the mom reports jobs as done but does not adopt them.
Either option could be used at any time at the administrator's
discretion, but the idea is to use them after a system reboot or when the
mom has been down for a long time. In either case it is assumed the job
processes are no longer running. With the first option all jobs are
re-queued so they can run again. With the second option all jobs are
simply deleted.
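
To make the difference between the two proposed options concrete, here is a minimal sketch in Python. It is only an illustration of the behavior described above, not TORQUE code: the mode names, the RecoveryPlan type and plan_recovery() are all hypothetical, and no option letters have been chosen for the new modes.

```python
from dataclasses import dataclass
from enum import Enum


class StartupMode(Enum):
    # Hypothetical names: the new options have not been named yet.
    REQUEUE_NO_KILL = "requeue_no_kill"
    REPORT_DONE_NO_ADOPT = "report_done_no_adopt"


@dataclass(frozen=True)
class RecoveryPlan:
    signal_processes: bool  # does the mom signal or kill old job processes?
    adopt_processes: bool   # does the mom take over tracking old processes?
    server_action: str      # what the server is asked to do with each job


def plan_recovery(mode: StartupMode) -> RecoveryPlan:
    """Map a proposed start-up mode to what the mom does with old jobs.

    Both modes assume the job processes are already gone, so neither
    one signals or adopts anything. They differ only in what the
    server is told.
    """
    if mode is StartupMode.REQUEUE_NO_KILL:
        # Jobs go back in the queue so they can be run again.
        return RecoveryPlan(False, False, "requeue")
    if mode is StartupMode.REPORT_DONE_NO_ADOPT:
        # Jobs are reported as done and dropped from the queue.
        return RecoveryPlan(False, False, "delete")
    raise ValueError(f"unknown start-up mode: {mode!r}")
```

For example, plan_recovery(StartupMode.REQUEUE_NO_KILL).server_action is "requeue", while the other mode yields "delete". In both cases the mom leaves any surviving processes alone.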

Changing the default behavior has already been tried, and that is what
brought us to this point. From the responses I received, it sounds like
many users have come to depend on -p as the default. I will leave that
behavior in place, add the two new options, and document them in the man
page and on the TORQUE documentation site at www.clusterresources.com.

Ken Nielson
Adaptive Computing