cfr100 at psu.edu
Thu Jul 8 02:31:46 MDT 2010
On Tue, Jul 6, 2010 at 10:09 AM, Peter Kruse <pk at q-leap.com> wrote:
> currently I see no way to checkpoint the complete PBS job
> with an OpenMPI job. It is however possible to checkpoint
> the Job itself with ompi-checkpoint. If you also want to
> terminate the job then you could give the $signalNum the default
> value of "15" then it will terminate and free the resources.
> In this case the restart_script can resume the job.
That is interesting. I will have to look into ompi-checkpoint; I did
not know it existed. Presumably it depends on BLCR itself?
So what you are saying is that the PBS job itself cannot be preserved
(currently), but that the MPI job can be checkpointed and restarted
under a new PBS job id. Is that correct?
If that is the case, I guess it would be possible to cause
ompi-checkpoint to run as part of some job preemption script, and then
save the job information in a "to be restarted" queue (external to
PBS). One problem with that approach is that ompi-checkpoint is
specific to Open MPI, so support for other MPI libraries (e.g.
MVAPICH) would be kludgy.
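
To make the idea concrete, here is a rough sketch of what such a
preemption hook might look like. Everything in it is an assumption on my
part: the function names, the JSON file used as the external "to be
restarted" queue, and the record layout are all hypothetical, and I am
assuming ompi-checkpoint accepts the PID of mpirun plus a --term option
that terminates the job after the checkpoint is taken. The command
runner is passed in as a parameter so the logic can be tried without
Open MPI installed.

```python
import json
import subprocess
from pathlib import Path


def checkpoint_command(mpirun_pid, terminate=True):
    """Build the ompi-checkpoint command line for a running mpirun.

    Assumes ompi-checkpoint takes the mpirun PID as its argument and
    that --term stops the job once the checkpoint has been written.
    """
    cmd = ["ompi-checkpoint"]
    if terminate:
        cmd.append("--term")
    cmd.append(str(mpirun_pid))
    return cmd


def enqueue_for_restart(queue_file, record):
    """Append one job record to a JSON list kept outside PBS (hypothetical format)."""
    path = Path(queue_file)
    pending = json.loads(path.read_text()) if path.exists() else []
    pending.append(record)
    path.write_text(json.dumps(pending, indent=2))
    return pending


def preempt(job_id, mpirun_pid, queue_file, run=subprocess.run):
    """Checkpoint the MPI job, then record it for later resubmission."""
    cmd = checkpoint_command(mpirun_pid)
    # check=True raises if the checkpoint fails, so a job is only
    # queued for restart when a checkpoint was actually taken.
    result = run(cmd, capture_output=True, text=True, check=True)
    record = {
        "old_job_id": job_id,
        "checkpoint_output": result.stdout.strip(),
    }
    return enqueue_for_restart(queue_file, record)
```

A separate restart step would then read the queue file and submit a new
PBS job for each entry, restarting from the recorded snapshot. That part
is left out here because it depends on how the site wants to resubmit.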