[torquedev] torque+blcr+openmpi

Eric Roman ERoman at lbl.gov
Thu Jul 8 12:09:24 MDT 2010


Here's some more info you should be aware of.  I think this is the way
we'll want everything to work -- it seems to address all of the requirements.

1.

The default setup with BLCR is to checkpoint the entire shell script.
I believe that right now, Torque uses

cr_checkpoint (--tree) PID-OF-SHELL

to checkpoint the shell and all of its child processes (everything
underneath it in pstree).

This includes the mpirun process.

2.

When mpirun receives a checkpoint signal, it can ignore it.  Checkpointing
mpirun only checkpoints the mpirun process, not the MPI application
processes (ranks).

The ranks themselves get confused when they are sent a checkpoint signal
directly by cr_checkpoint, since they expect to receive a coordination
message before they checkpoint.

ompi-checkpoint sends this checkpoint message.

3.

What we need is a process to act as a wrapper between the cr_checkpoint in
step 1 and the ompi-checkpoint in step 2.  (The same goes for restart.)

The desired behavior is this:

The 'wrapper' process is started by the job script and wraps mpirun.

The wrapper must execute mpirun outside of the process tree of the job,
to ensure that the mpirun process and ranks are NOT checkpointed directly
by BLCR.  

Instead, the wrapper must ensure that ompi-checkpoint is called when the
shell script is checkpointed.  On receiving a checkpoint request (via
notification by a BLCR handler), the wrapper invokes ompi-checkpoint to
start the checkpoint of the MPI application.

4.

On restart, we'll have two different sets of context files.  The Torque
context file contains a checkpointed job script and a checkpoint of the
wrapper process.  The other set is the one written by ompi-checkpoint for
the MPI application.

Once restarted, the wrapper process invokes ompi-restart (rather than mpirun)
to restart the MPI application, again in a separate process tree, to ensure
that the restarted job remains checkpointable.
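
A minimal sketch of the wrapper's decision logic from steps 3 and 4, assuming
the wrapper sees three kinds of events.  The event names ("start",
"checkpoint", "restart") and the function wrapper_command are hypothetical
and only illustrate which command the wrapper would launch in each case; the
commands themselves are the ones named above.

```python
def wrapper_command(event, app_argv=None, mpirun_pid=None, snapshot=None):
    """Return the argv the wrapper would launch for a given event.

    The event names are made up for this sketch; they are not part of
    Torque, BLCR, or Open MPI.
    """
    if event == "start":
        # First launch: run the MPI application under mpirun.
        return ["mpirun"] + list(app_argv or [])
    if event == "checkpoint":
        # A BLCR handler reported a checkpoint request for the job script:
        # ask Open MPI to checkpoint the application, addressed by the
        # PID of its mpirun process.
        return ["ompi-checkpoint", str(mpirun_pid)]
    if event == "restart":
        # The wrapper itself was restarted from the Torque context file:
        # bring the MPI application back with ompi-restart, not mpirun.
        return ["ompi-restart", snapshot]
    raise ValueError(f"unknown event: {event!r}")
```

For example, wrapper_command("checkpoint", mpirun_pid=1234) gives
["ompi-checkpoint", "1234"].  The sketch does not launch anything; in the
design above each command still has to be started outside the job's process
tree.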

5.

This should do it.  I discussed this with the openmpi developers a while
back, and that's how they wanted this integration to work.  (The checkpoint
support in openmpi isn't capable of dealing with process coordination while
a BLCR checkpoint request is active on a rank process -- so we decided to
decouple the checkpoints, and use a wrapper to pass notifications back and
forth.)

Once this is done, you'll be able to checkpoint both serial and parallel
jobs, with their shell scripts.  I expect to use more or less the same
scheme for MVAPICH.

6.

I'm planning to finish this once I've finished porting BLCR to the 2.6.33
kernels.  There seems to be a lot of interest on this mailing list in getting
everything working together.  Most of the tricky code is written and working,
and the wrapper seems to run fine outside of Torque, but it needs two pieces
of work:

- First, it needs to spawn the mpirun and ompi-restart in a separate process
  tree.

- Second, it needs to hold a pipe open to that separate tree to pass
  checkpoint, restart, job exit, and regular signal notifications back and
  forth.
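
One possible way to approach these two pieces is sketched below.  It is not
the actual wrapper code: it assumes a classic POSIX double fork with setsid()
is enough to move the child out of the wrapper's process tree, and it uses a
pair of pipes for the notifications.  The names spawn_detached and demo_child
are hypothetical, and the demo child only echoes each notification back with
its parent PID where a real one would run mpirun or ompi-restart.

```python
import os


def spawn_detached(child_main):
    """Run child_main(cmd_r, evt_w) in a process that is not a descendant
    of the caller.  Returns (cmd_w, evt_r) for the caller's side."""
    cmd_r, cmd_w = os.pipe()  # wrapper -> detached process
    evt_r, evt_w = os.pipe()  # detached process -> wrapper
    pid = os.fork()
    if pid == 0:
        # Intermediate process: start a new session, fork again, then exit
        # so that the grandchild is reparented away from the wrapper.
        os.close(cmd_w)
        os.close(evt_r)
        os.setsid()
        if os.fork() == 0:
            try:
                child_main(cmd_r, evt_w)
            finally:
                os._exit(0)
        os._exit(0)
    # Wrapper side: keep only our ends of the pipes.
    os.close(cmd_r)
    os.close(evt_w)
    os.waitpid(pid, 0)  # reap the intermediate process
    return cmd_w, evt_r


def demo_child(cmd_r, evt_w):
    """Stand-in for the detached side: answer each notification line with
    the same word plus this process's current parent PID."""
    with os.fdopen(cmd_r) as cmds, os.fdopen(evt_w, "w") as events:
        for line in cmds:
            note = line.strip()
            events.write(f"{note} {os.getppid()}\n")
            events.flush()
            if note == "exit":
                break
```

The wrapper would call spawn_detached(demo_child), write lines such as
"checkpoint" to the first descriptor, and read replies from the second.
Because the intermediate process has already exited by the time the first
notification is sent, the parent PID reported back is no longer the
wrapper's.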

7.

For now, I think you're best off checkpointing the orterun job and restarting
it in the way you've described.  It's not ideal.  I expect to be able to finish
this and get simultaneous BLCR support for Open MPI and shell scripts within
the month.

Eric

On Thu, Jul 08, 2010 at 04:12:50PM +0200, Peter Kruse wrote:
> Chuck Ritter wrote:
> > That is interesting. I will have to look into ompi-checkpoint. I did
> > not know it existed. It must itself depend on blcr??
> 
> yes, it's part of OpenMPI.
> 
> > So what you are saying is the PBS job cannot be preserved (currently),
> > but that it can be checkpoint/restarted under a new PBS job id. Is
> > that correct?
> 
> yes, if you wanted to be able to checkpoint the PBS job then you must
> find a way to checkpoint the Script, or better the shell that runs
> the script.  On restart you then want to resume the script and with
> it the MPI job of course and any other commands that might be in
> that script (post processing, another mpirun, ...).  Currently
> I haven't found a way to achieve that.  But we can checkpoint
> one job started with orterun.
> 
> > If that is the case, I guess it would be possible to cause
> > ompi-checkpoint to run as part of some job preemption script, and then
> > save the job information in a "to be restarted" queue (external to
> > pbs). One problem with that approach would be the kludgy support of
> > other MPI libraries (eg. mvapich).
> 
> I haven't tested mvapich yet, but according to the user's guide
> it should be straightforward:
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.4rc1.html#x1-310006.3
> 
> Regards,
> 
>    Peter
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev

