[torqueusers] Default job recovery behavior for pbs_mom
Ken Nielson
knielson at adaptivecomputing.com
Fri Jan 15 14:57:47 MST 2010
Hi all,
Forgive me for spending so much bandwidth on this problem. However, I
have a final story on what is to be the default behavior for the MOM
when it is initialized after going down with running jobs. I'm sorry to
say there is something in it for everyone to dislike. But bugs have
been fixed and the behavior is now well defined, so hopefully any
inconvenience will be short-lived.
To make sense of the explanation below I need to define two terms:
terminate and kill.
When the MOM terminates a job, it deletes its record of the job and
informs the batch server. Terminate does not apply to a running
process. Kill describes how a running process is ended. It applies to
the process, not to the job.
In 2.3.x and before, when the pbs_mom initializes it by default
terminates any previously running jobs and informs the batch server.
Re-runnable jobs are re-queued by the batch server. The MOM is not
supposed to try to kill any running job processes. Before the fix I
checked in today, however, the MOM would terminate the job and also
kill any running processes by default.
In 2.4.x and beyond, the default behavior changed to that of the -p
option, which tries to preserve jobs. That is to say, when the MOM
reinitializes it looks for processes with the same pid as any of the
recovered jobs and then tracks those jobs, assuming they are the same
running processes that existed before the MOM shut down.
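To make the pid matching concrete, here is a minimal Python sketch of
the idea. It is not the pbs_mom code: the function names and the
job-id-to-pid mapping are hypothetical, and it assumes a POSIX system.
It also shows why this is only an assumption on the MOM's part: a live
pid says nothing about whether it is still the original job process,
since pids can be reused.

```python
import os
from typing import Dict


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid currently exists (POSIX)."""
    if pid <= 0:
        return False
    try:
        # Signal 0 delivers nothing; it only checks existence and permission.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def recover_jobs(recorded_jobs: Dict[str, int]) -> Dict[str, bool]:
    """Map each recorded job id to whether its saved pid is still running.

    recorded_jobs is a hypothetical {job_id: pid} mapping standing in for
    whatever the MOM saved before it went down.
    """
    return {job_id: pid_is_alive(pid) for job_id, pid in recorded_jobs.items()}
```

A job whose pid is still alive would be tracked again; one whose pid is
gone has nothing left to track.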
The -q flag was added to 2.4.x to let users get the 2.3.x default
behavior.
The -r flag terminates all jobs that were running when the mom shut down
and then kills any running processes with a pid that matches the pid of
the recovered jobs. Re-runnable jobs are re-queued by the batch server.
I added a -P (cap P) option to 2.4.x which is similar to the -p option.
The difference is that the -P option terminates all jobs and does not
try to recover running processes.
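The options described above can be summarized as a small lookup table.
This is only a sketch of my reading of this message, written in Python
for convenience. The type and function names are made up for
illustration, and where the message does not say whether an option
kills processes the value is left as None rather than guessed.

```python
from typing import Dict, NamedTuple, Optional


class StartupBehavior(NamedTuple):
    terminates_jobs: bool             # MOM drops its job record and tells the server
    kills_processes: Optional[bool]   # None: not stated in this message
    tracks_processes: bool            # MOM re-attaches to processes by pid


# Behavior per pbs_mom option, as described in this message.
BEHAVIOR: Dict[str, StartupBehavior] = {
    "-p": StartupBehavior(False, False, True),
    "-q": StartupBehavior(True, False, False),
    "-r": StartupBehavior(True, True, False),
    "-P": StartupBehavior(True, None, False),
}

# Option whose behavior matches the default of each release series.
# (-q itself exists only in 2.4.x; 2.3.x simply behaves that way by default.)
DEFAULT_EQUIVALENT: Dict[str, str] = {"2.3": "-q", "2.4": "-p"}


def effective_behavior(series: str, flag: Optional[str] = None) -> StartupBehavior:
    """Return the startup behavior for a release series and optional flag."""
    return BEHAVIOR[flag if flag is not None else DEFAULT_EQUIVALENT[series]]
```

For example, effective_behavior("2.4") gives the preserving -p behavior,
while effective_behavior("2.4", "-q") gives the 2.3.x default.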
I have updated the pbs_mom man pages for both 2.3 and 2.4 versions.
I have created a snapshot for 2.3 and one for 2.4. They can be found at
the following:
http://www.clusterresources.com/downloads/torque/snapshots/torque-2.3.10-snap.201001151340.tar.gz
http://www.clusterresources.com/downloads/torque/snapshots/torque-2.4.5-snap.201001151416.tar.gz
Please feel free to download these and try them out. Any feedback is
welcome.
Regards
Ken Nielson
Adaptive Computing