[torquedev] torque+blcr+openmpi

Chuck Ritter cfr100 at psu.edu
Thu Jul 8 02:31:46 MDT 2010


On Tue, Jul 6, 2010 at 10:09 AM, Peter Kruse <pk at q-leap.com> wrote:
> Hi,
>
> Currently I see no way to checkpoint the complete PBS job
> with an OpenMPI job.  It is, however, possible to checkpoint
> the job itself with ompi-checkpoint.  If you also want to
> terminate the job, you could give $signalNum the default
> value of "15"; it will then terminate and free the resources.
> In this case the restart_script can resume the job.
>
> Regards,
>
>   Peter

That is interesting. I will have to look into ompi-checkpoint; I did
not know it existed. Presumably it depends on BLCR itself?

So what you are saying is that the PBS job itself cannot be preserved
(currently), but that the MPI job can be checkpointed and then
restarted under a new PBS job id. Is that correct?

If that is the case, I guess it would be possible to run
ompi-checkpoint as part of some job preemption script, and then
save the job information in a "to be restarted" queue (external to
PBS). One problem with that approach would be the kludgy support for
other MPI libraries (e.g. mvapich).
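
To make the idea concrete, here is a rough sketch of such a preemption
helper. It is only an illustration, not something I have tested against
a real cluster. The assumptions: ompi-checkpoint takes the PID of mpirun
as its argument and accepts --term to stop the job after checkpointing;
the "to be restarted" queue is just a JSON-lines file; and the function
names, the record layout, and the example job id are all hypothetical.

```python
import json
import subprocess
from pathlib import Path


def build_checkpoint_cmd(mpirun_pid, terminate=True):
    """Build the ompi-checkpoint command line for a running mpirun."""
    cmd = ["ompi-checkpoint"]
    if terminate:
        # Assumption: --term stops the job once the checkpoint is taken,
        # which frees the resources for the preempting job.
        cmd.append("--term")
    cmd.append(str(mpirun_pid))
    return cmd


def enqueue_restart(queue_file, job_id, snapshot_info):
    """Append one record to the external restart queue (JSON lines)."""
    record = {"old_job_id": job_id, "snapshot": snapshot_info}
    with Path(queue_file).open("a") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def load_restart_queue(queue_file):
    """Return all queued records, oldest first."""
    path = Path(queue_file)
    if not path.exists():
        return []
    with path.open() as fh:
        return [json.loads(line) for line in fh if line.strip()]


def preempt(mpirun_pid, job_id, queue_file):
    """Checkpoint the MPI job, then remember it for later resubmission."""
    result = subprocess.run(
        build_checkpoint_cmd(mpirun_pid),
        capture_output=True, text=True, check=True,
    )
    # The raw tool output is stored as-is; extracting the snapshot
    # handle from it is left out because the format is site specific.
    return enqueue_restart(queue_file, job_id, result.stdout.strip())
```

A separate resubmission script could then walk load_restart_queue()
and submit a new PBS job per record that restarts from the stored
snapshot. Supporting another MPI library would mean swapping out
build_checkpoint_cmd(), which is where the kludginess would show up.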

