[torquedev] torque+blcr+openmpi

Eric Roman ERoman at lbl.gov
Wed Jun 30 13:50:00 MDT 2010


Ok, I see what you're doing now.  These scripts look up the mpirun PID and
checkpoint that process, rather than the job, so they won't work for serial
jobs, and the job script won't finish running if it's ever restarted.
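
For what it's worth, a dispatcher along these lines could fall back to plain
BLCR for serial jobs.  This is a rough sketch, assuming the job script records
mpirun's PID in a well-known file; the function name and file layout are
illustrative, not Torque's actual interface:

```shell
# Sketch: choose between MPI-aware and plain BLCR checkpointing.
# Assumes MPI jobs leave a ${checkpointDir}/mpirun.pid.${jobId} file
# behind; everything else is treated as a serial job.
checkpoint_job() {
    local sessionId=$1 jobId=$2 checkpointDir=$3 checkpointName=$4
    if [ -r "${checkpointDir}/mpirun.pid.${jobId}" ]; then
        # MPI job: let Open MPI coordinate the checkpoint
        local mpirun_pid
        read -r mpirun_pid < "${checkpointDir}/mpirun.pid.${jobId}"
        ompi-checkpoint -mca snapc_base_global_snapshot_dir "$checkpointDir" "$mpirun_pid"
    else
        # Serial job: checkpoint the whole process tree directly with BLCR
        cr_checkpoint --tree --file "${checkpointDir}/${checkpointName}" "$sessionId"
    fi
}
```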

Do you know how slurm handles this?  I haven't had a chance to look into that.

Can you manually run ompi-restart on those files?  Does that work?  It's qrls
that's causing the problems, no?  Maybe take a look at the checkpoint-side code
in MOM.  And make sure the job is going into a hold state successfully.
Your problem might be related to not using the checkpoint file that Torque
asked you to create.
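
One way to close that gap, purely as a sketch, is to bundle the snapshot
directory that ompi-checkpoint produces into the single file name Torque handed
the script, e.g. with tar (the helper name here is made up):

```shell
# Sketch: ompi-checkpoint writes a snapshot *directory*, but Torque
# expects the single checkpoint file it named.  Pack the directory
# into that file after the checkpoint completes.
pack_snapshot() {
    local checkpointDir=$1 checkpointName=$2 mpirun_pid=$3
    local snap="ompi_global_snapshot_${mpirun_pid}.ckpt"
    # Bundle the per-rank context files and metadata into the expected file
    tar -C "$checkpointDir" -cf "${checkpointDir}/${checkpointName}" "$snap"
}
```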

I haven't decided how to handle the pbs_demux process.  I think when MOM
restarts a job, a new pbs_demux is started.  We'll need to make sure that
the new pbs_demux is used.  There's no reason to try to checkpoint it,
though we'll need to know if it has any output or input buffering.

ompi-checkpoint sends a command (through mpirun) out to the rest of the
openmpi runtime that does the necessary coordination to checkpoint the job.
When the coordination is completed, each openmpi rank starts a new BLCR
request to checkpoint itself.  I believe ompi-checkpoint returns once
that step is complete, and prints out a name that can be passed to
ompi-restart.
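
Judging by the snapshot names in the attached scripts, the reference that
ompi-checkpoint reports follows a fixed pattern derived from mpirun's PID; a
small helper makes the round trip explicit (a sketch, not Open MPI's API):

```shell
# The global snapshot reference for a given mpirun PID (pattern taken
# from the attached checkpoint/restart scripts)
snapshot_ref() {
    echo "ompi_global_snapshot_$1.ckpt"
}
# Manual round trip:
#   ompi-checkpoint -mca snapc_base_global_snapshot_dir "$dir" "$MPIRUN_PID"
#   ompi-restart    -mca snapc_base_global_snapshot_dir "$dir" "$(snapshot_ref "$MPIRUN_PID")"
```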

It's not so much the batch system that sends the signal to mpi.  openmpi
really wants the user to use ompi-checkpoint, not a signal.  This is important.
The trick to making this whole thing work is to ensure that no checkpoint
signals are sent directly to the mpirun process or to the MPI ranks
on the node.  When that signal is received, we need something to respond by
calling ompi-checkpoint, and to block the checkpoint request until the
ompi-checkpoint is completed.  On restart, this same thing needs to call
ompi-restart.
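
A minimal sketch of that interception, assuming the checkpoint request arrives
as SIGTSTP (the signal choice and the wrapper name are assumptions on my part):

```shell
# Sketch: run mpirun under a shim that translates an incoming checkpoint
# signal into an ompi-checkpoint call, so no BLCR signal ever reaches
# mpirun or the ranks directly.
run_wrapped() {
    local ckptdir=$1; shift
    "$@" &                     # launch mpirun in the background
    local pid=$!
    # On SIGTSTP, checkpoint through Open MPI instead of BLCR
    trap 'ompi-checkpoint -mca snapc_base_global_snapshot_dir "$ckptdir" "$pid"' TSTP
    # wait returns early when a trapped signal arrives; loop until exit
    while kill -0 "$pid" 2>/dev/null; do
        wait "$pid"
    done
}
```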

I've written code to nest the checkpoints correctly, but need to write some
more to send the signals to the right places before parallel jobs will work
correctly.  The plan is basically to start a new session within the job,
and run mpirun in that second session.

Session 1
pbs_mom -- sh -- proxy_process

Session 2
slave_process -- mpirun -- local ranks
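
The split can be sketched with setsid(1): session 2 is created by launching the
slave in a fresh session, so a --tree checkpoint of session 1 stops at the
proxy.  The helper names are illustrative:

```shell
# Sketch: put the mpirun side of the job into its own session so that
# cr_checkpoint --tree on session 1 never touches mpirun or the ranks.
session_of() {
    # Session id of a process (Linux procps ps)
    ps -o sid= -p "$1" | tr -d ' '
}
start_slave() {
    # Launch the slave command line (e.g. mpirun ...) in the background,
    # detached into a new session
    setsid "$@" &
}
```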

Session 1 gets checkpointed by Torque.  The proxy_process and slave_process
are connected with a pipe (or 2), and all of the standard IO streams
are connected.  The pipes are there to pass checkpoint and restart requests
(and completion of those requests) between session 1 and session 2.
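
A sketch of that control channel with named pipes (the FIFO names and the
DONE token are made up):

```shell
# Sketch: proxy (session 1) and slave (session 2) exchange checkpoint /
# restart requests and completion acks over a pair of FIFOs.
setup_channel() {
    mkfifo "$1/req.fifo" "$1/done.fifo"
}
proxy_request() {
    # Proxy side: forward a request, then block until the slave acks.
    # The blocking read is what holds the BLCR checkpoint request open
    # until ompi-checkpoint has finished.
    local dir=$1 cmd=$2 ack
    echo "$cmd" > "$dir/req.fifo"
    read -r ack < "$dir/done.fifo"
    echo "$ack"
}
slave_serve_one() {
    # Slave side: handle one request (this is where ompi-checkpoint or
    # ompi-restart would run), then acknowledge completion
    local dir=$1 cmd
    read -r cmd < "$dir/req.fifo"
    echo "DONE:$cmd" > "$dir/done.fifo"
}
```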

And as I recall, I have the checkpoint/restart of the proxy working correctly
outside of torque, but not inside.  

I'm not sure yet how to fit pbs_demux into here.  Is blcr_restart_script
connected to pbs_demux when it starts?  (meaning, are the file descriptors
already open?)  If so, BLCR should be able to reconnect everything correctly.
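
That question can be answered empirically from /proc: dump the restart
script's open descriptors and see where stdin/stdout/stderr point.  This is
Linux-specific and just a diagnostic sketch:

```shell
# Diagnostic: list every open file descriptor of a process and what it is
# connected to (descriptors feeding pbs_demux would show as pipe:[inode]
# or socket:[inode] targets)
list_fds() {
    ls -l /proc/"${1:-$$}"/fd
}
```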

I'd appreciate comments or questions on this.  Right now, I don't see too
many other ways to get everything working.  


On Tue, Jun 29, 2010 at 11:25:54AM +0200, Danny Sternkopf wrote:
> Eric,
> thanks for your answer.
> Please find my scripts attached. They are quite simple at the
> moment. They use ompi-checkpoint and ompi-restart for OpenMPI apps.
> They must use it at the moment.  For multi-node jobs, Torque starts a
> pbs_demux process to capture stdout/stderr of all MOMs, which is not
> checkpointable, probably because it is not started with cr_run or
> under a similar BLCR environment.  Therefore cr_checkpoint --tree
> doesn't work for these kinds of jobs.
> I do not have deep knowledge of what exactly ompi-checkpoint does,
> but I can imagine that it brings the MPI app into a proper state,
> ready for checkpointing.  Therefore I'm not sure you could
> successfully checkpoint/restart the MPI app without knowledge of
> the app's requirements.  For example, communication is one of the
> major challenges and plays an important role in that context.
> In general one would expect that the batch system sends a signal to
> MPI, which then brings the app into a proper state for checkpointing,
> and then the batch system just checkpoints all the involved
> processes using a general format.  This could be one approach.
> However, that's why ompi-checkpoint/ompi-restart exist: they take
> care of this, and they can make use of the BLCR tools of course.
> As I said, the checkpointing works for me because ompi-checkpoint
> does a good job: it checkpoints, then terminates, the MPI-related
> processes.  Torque takes care of pbs_demux and the batch job script.
> The restart does not work.  I don't see that the blcr_restart_script
> is called, so I guess Torque has a problem finding the expected
> checkpoint file.
> The following article gives useful information about the current
> Torque/BLCR integration:
>  http://www.clusterresources.com/pipermail/torquedev/2010-May/002054.html
> Regards,
> Danny
> On 6/28/2010 8:08 PM, Eric Roman wrote:
> >
> >Danny,
> >
> >I worked on this a while ago, but it's been a long standing todo item to get
> >everything to work properly.
> >
> >Can you tell me what your scripts do?
> >
> >Can you restart the application manually from the context file you created?
> >(With cr_restart?)
> >
> >Normally, torque tries to checkpoint the shell it spawned for the job
> >using cr_checkpoint (--tree) to capture all of the children, including the
> >mpirun and the MPI ranks.  Last time I checked, mpirun wouldn't respond to a
> >cr_checkpoint.  (I think it omitted itself from the checkpoint, but I don't
> >remember).  openmpi required a user to invoke ompi-checkpoint to checkpoint an
> >app, and ompi-restart to bring the app back, but torque wants to use
> >cr_checkpoint and cr_restart on the context file.  So, I needed to wrap
> >the original openmpi mpirun with another program that would intercept the
> >checkpoint signals.
> >
> >Part of the problem is that openmpi puts some of the MPI ranks into the same
> >process tree (or session) as the mpirun, and this messes everything up.  I
> >left off at the point where I needed to write startup code to ensure that
> >the ranks were in a separate process tree from the mpirun.  (The way things
> >are implemented right now, the checkpoint deadlocks, so we need to break
> >one of the dependencies to fix it.)
> >
> >The root issue is a little bit messy.  Those checkpoint/restart scripts
> >need root privileges to open the context file.  Those scripts need to open
> >the context file (as root), and then call setuid() to change into the user,
> >making sure that they pass the context file as a file descriptor to
> >cr_checkpoint and cr_restart.
> >
> >I do want to go in and fix all of this.  Right now I'm trying to get BLCR to
> >work with compressed context files, and chasing a bug with using it on
> >the 2.6.33 kernel.
> >
> >Eric
> >
> >
> >
> >On Mon, Jun 28, 2010 at 09:43:14AM +0200, Danny Sternkopf wrote:
> >>Hi,
> >>
> >>maybe someone here can comments on this.
> >>
> >>Regards,
> >>
> >>Danny
> >>
> >>-------- Original Message --------
> >>Subject: Re: [torqueusers] torque+blcr+openmpi
> >>Date: Fri, 25 Jun 2010 16:58:59 +0200
> >>From: Danny Sternkopf<dsternkopf at hpce.nec.com>
> >>Reply-To: dsternkopf at hpce.nec.com
> >>Organization: NEC Deutschland GmbH
> >>To: torqueusers at supercluster.org
> >>
> >>Hi,
> >>
> >>any news about this? I have the following setup:
> >>o torque 2.4.8
> >>o openmpi 1.4.2
> >>o blcr 0.8.2
> >>
> >>The checkpoint/restart scripts from Torque's contrib/blcr work for
> >>single-node applications without MPI.  I created new scripts for OpenMPI
> >>applications.  The checkpoint works, but the release does not.  The issue
> >>might be that ompi-checkpoint writes a directory, with checkpoint
> >>files for each process plus metadata, while Torque expects one single
> >>checkpoint file.  Any experiences?
> >>
> >>Btw, another issue is that the checkpoint/restart scripts run as root.
> >>ompi-checkpoint doesn't allow root to checkpoint user jobs, so you
> >>have to run ompi-checkpoint as the user.  The restart script of course
> >>needs this as well, to restart the process under the corresponding user id.
> >>
> >>Furthermore, any comments on handling MPI and single-process applications
> >>with the same checkpoint/restart scripts?
> >>
> >>Regards,
> >>
> >>Danny
> >>---
> >>_______________________________________________
> >>torquedev mailing list
> >>torquedev at supercluster.org
> >>http://www.supercluster.org/mailman/listinfo/torquedev
> >
> -- 
> Danny Sternkopf http://www.nec.de/hpc        dsternkopf at hpce.nec.com
> HPCE Division  Germany phone: +49-711-78055-33 fax: +49-711-78055-25
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> NEC Deutschland GmbH, Hansaallee 101, 40549 Düsseldorf
> Geschäftsführer Richard Hanscott
> Handelsregister Düsseldorf HRB 57941; VAT ID DE129424743

> #!/bin/bash
> usage() {
>         echo -e "usage: $0 <session_id> <job_id> <user_id> <group_id> <checkpoint_dir> <checkpoint_name>\n"
>         exit 1
> }
> ## main ##
> if [ $# -eq 6 ]; then
>         sessionId=$1
>         jobId=$2
>         userId=$3
>         groupId=$4
>         checkpointDir=$5
>         checkpointName=$6
> else
>         usage
> fi
> ## main ##
> # Recover the PID mpirun had at checkpoint time; the snapshot name is derived from it
> read -r MPIRUN_PID < "${checkpointDir}/mpirun.pid.${jobId}"
> echo "ompi-restart -mca snapc_base_global_snapshot_dir $checkpointDir ompi_global_snapshot_${MPIRUN_PID}.ckpt" 1>&2
> # ompi-restart refuses to run as root, so run it as the job owner
> su - "$userId" -c "ompi-restart -mca snapc_base_global_snapshot_dir $checkpointDir ompi_global_snapshot_${MPIRUN_PID}.ckpt"

> #!/bin/bash
> usage() {
> 	echo -e "usage: $0 <session_id> <job_id> <user_id> <group_id> <checkpoint_dir> <checkpoint_name> <signal_num> <checkpoint_depth>\n"
> 	exit 1
> }
> ## main ##
> if [ $# -eq 8 ]; then
> 	sessionId=$1
> 	jobId=$2
> 	userId=$3
> 	groupId=$4
> 	checkpointDir=$5
> 	checkpointName=$6
> 	signalNum=$7
> 	checkpointDepth=$8
> else
> 	usage
> fi
> # Recover the PID mpirun had at launch time; ompi-checkpoint targets it
> read -r MPIRUN_PID < "${checkpointDir}/mpirun.pid.${jobId}"
> echo "ompi-checkpoint --term -mca snapc_base_global_snapshot_dir $checkpointDir $MPIRUN_PID" 1>&2
> # ompi-checkpoint refuses to run as root, so run it as the job owner
> su - "$userId" -c "ompi-checkpoint --term -mca snapc_base_global_snapshot_dir $checkpointDir $MPIRUN_PID"

More information about the torquedev mailing list