pk at q-leap.com
Thu Jul 8 08:12:50 MDT 2010
Chuck Ritter wrote:
> That is interesting. I will have to look into ompi-checkpoint. I did
> not know it existed. It must itself depend on blcr??
Yes, ompi-checkpoint is part of OpenMPI.
> So what you are saying is the PBS job cannot be preserved (currently),
> but that it can be checkpoint/restarted under a new PBS job id. Is
> that correct?
Yes. To checkpoint the PBS job itself you would have to find a way to
checkpoint the script, or better, the shell that runs the script. On
restart you would then want to resume the script and, with it, the MPI
job and any other commands in that script (post-processing, another
mpirun, ...). So far I haven't found a way to achieve that. What we
can do is checkpoint a single job started with orterun.
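
As a rough illustration of checkpointing a single orterun job and noting
it for a later restart, here is a minimal Python sketch. It assumes the
job was launched with checkpoint/restart support enabled and that
ompi-checkpoint takes the PID of the orterun/mpirun process, with --term
to stop the job after the snapshot. The function names and the JSON
"to be restarted" file are hypothetical and exist only for this example;
they are not part of OpenMPI or Torque.

```python
import json
import subprocess
from pathlib import Path


def checkpoint_cmd(mpirun_pid, terminate=False):
    """Build the ompi-checkpoint command line for a running orterun/mpirun."""
    cmd = ["ompi-checkpoint"]
    if terminate:
        cmd.append("--term")  # assumption: stop the job once the snapshot is written
    cmd.append(str(mpirun_pid))
    return cmd


def restart_cmd(snapshot_ref):
    """Build the ompi-restart command line for a saved global snapshot."""
    return ["ompi-restart", snapshot_ref]


def remember_for_restart(queue_file, job_name, snapshot_ref):
    """Append an entry to a hypothetical external 'to be restarted' list."""
    path = Path(queue_file)
    entries = json.loads(path.read_text()) if path.exists() else []
    entries.append({"job": job_name, "snapshot": snapshot_ref})
    path.write_text(json.dumps(entries, indent=2))
    return entries


def checkpoint(mpirun_pid, terminate=True):
    """Run ompi-checkpoint; raises if the tool is missing or reports failure."""
    result = subprocess.run(checkpoint_cmd(mpirun_pid, terminate),
                            check=True, capture_output=True, text=True)
    return result.stdout  # the tool's output names the snapshot reference
```

A preemption hook could call checkpoint() with the orterun PID, record the
snapshot reference with remember_for_restart(), and later submit a new
PBS job that runs the command from restart_cmd(). The restarted job gets
a new PBS job id, as discussed above.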
> If that is the case, I guess it would be possible to cause
> ompi-checkpoint to run as part of some job preemption script, and then
> save the job information in a "to be restarted" queue (external to
> pbs). One problem with that approach would be the kludgy support of
> other MPI libraries (eg. mvapich).
I haven't tested mvapich yet, but according to the user's guide
it should be straightforward: