[torquedev] torque+blcr+openmpi

Peter Kruse pk at q-leap.com
Thu Jul 1 05:29:08 MDT 2010


I attach the two scripts that we use.  They are based on scripts we
found, with these additions:


1. Support both ompi-checkpoint and cr_checkpoint:
     the script checks whether orterun is a parent process; if so it
     uses ompi-checkpoint, otherwise cr_checkpoint.
2. For ompi-checkpoint the checkpoint directory cannot be given on the
     command line; orterun uses the parameter snapc_base_global_snapshot_dir,
     which is already set.  Therefore $checkpointDir/$checkpointName is
     ignored.  Instead a mapping of $JOBID:$Snapref is stored in a file
     (where $Snapref is returned by the ompi-checkpoint command), along
     with the node geometry, which is used by a script that restarts the job.


3. If the given $jobid is found in the jobid2ompi_snap_ref file, the
     restart script uses "ompi-restart $ref"; otherwise it uses
     cr_restart with the given checkpoint file.
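The dispatch in item 1 can be sketched roughly like this (a minimal illustration, not the attached script itself; the variable names and the cr_checkpoint fallback arguments are assumptions):

```shell
#!/bin/bash
# Illustrative sketch only: walk up the process tree from the job's
# session leader; if an ancestor is orterun, checkpoint via
# ompi-checkpoint, otherwise fall back to plain cr_checkpoint.

# Return success if any ancestor of PID $1 is named orterun.
has_orterun_ancestor() {
    local pid=$1 name
    while [ "$pid" -gt 1 ] 2>/dev/null; do
        name=$(ps -o comm= -p "$pid" 2>/dev/null) || return 1
        [ "$name" = "orterun" ] && return 0
        pid=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
    done
    return 1
}

# Only dispatch when invoked with the real job parameters.
if [ -n "${sessionId:-}" ]; then
    if has_orterun_ancestor "$sessionId"; then
        ompi-checkpoint --term "$MPIRUN_PID"
    else
        cr_checkpoint --tree --file "$checkpointDir/$checkpointName" "$sessionId"
    fi
fi
```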


This script is meant to be run in a Torque job, so that
$PBS_NODEFILE is set.  Given the job ID to restart, it first
checks whether the node geometry matches that of the original
job; if it matches, it calls ompi-restart with the snapshot
reference.
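The restart-side lookup and geometry check might look like this (a sketch, not the attached script; the map-file format, `mapfile_path`, and `savedNodefile` are assumptions):

```shell
#!/bin/bash
# Illustrative sketch: look up the job in the jobid2ompi_snap_ref map,
# verify the node geometry against the current $PBS_NODEFILE, and only
# then call ompi-restart; otherwise fall back to cr_restart.

mapfile_path=${mapfile_path:-jobid2ompi_snap_ref}

# Print the snapshot reference stored for job $1 (empty if unknown).
lookup_snapref() {
    awk -F: -v job="$1" '$1 == job { print $2 }' "$mapfile_path" 2>/dev/null
}

# Succeed only if the two node lists contain the same nodes.
same_geometry() {
    diff <(sort "$1") <(sort "$2") > /dev/null
}

# Only dispatch when invoked with the real job parameters.
if [ -n "${jobId:-}" ]; then
    ref=$(lookup_snapref "$jobId")
    if [ -n "$ref" ] && same_geometry "$savedNodefile" "$PBS_NODEFILE"; then
        ompi-restart "$ref"
    else
        cr_restart "$checkpointDir/$checkpointName"
    fi
fi
```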

I hope they may be useful for you.



Eric Roman wrote:
> Danny,
> Ok, I see what you're doing now.  These scripts look for the mpirun PID, and
> checkpoint that, rather than the job, so these won't work for serial jobs, and
> they won't finish running the script if it's ever restarted.
> Do you know how slurm handles this?  I haven't had a chance to look into that.
> Can you manually run ompi-restart on those files?  Does that work?  It's qrls
> that's causing the problems, no?  Maybe take a look at the checkpoint-side code
> in MOM.  And make sure the job is going into a hold state successfully.
> Your problem might be related to not using the checkpoint file that Torque
> asked you to create.
> I haven't decided how to handle the pbs_demux process.  I think when MOM
> restarts a job, a new pbs_demux is started.  We'll need to make sure that
> the new pbs_demux is used.  There's no reason to try to checkpoint it,
> though we'll need to know if it has any output or input buffering.
> ompi-checkpoint sends a command (through mpirun) out to the rest of the
> openmpi runtime that does the necessary coordination to checkpoint the job.
> When the coordination is completed, each openmpi rank starts a new BLCR
> request to checkpoint itself.  I believe openmpi-checkpoint returns once
> that step is complete, and prints out a name that can be passed to 
> ompi-restart.
> It's not so much the batch system that sends the signal to mpi.  openmpi
> really wants the user to use ompi-checkpoint, not a signal.  This is important.
> The trick to making this whole thing work is to ensure that no checkpoint
> signals are sent directly to the mpirun process or to the MPI ranks
> on the node.  When that signal is received, we need something to respond by
> calling ompi-checkpoint, and blocking the checkpoint request until the
> ompi-checkpoint is completed.  On restart, this same thing needs to call
> ompi-restart.
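The wrapper Eric describes, which owns the checkpoint signal and translates it into an ompi-checkpoint call, could be sketched like this (entirely illustrative; the signal choice and function names are assumptions):

```shell
#!/bin/bash
# Illustrative sketch: nothing signals mpirun or the ranks directly;
# the wrapper traps the checkpoint signal, calls ompi-checkpoint for
# the mpirun it started, and keeps waiting until mpirun exits.

# Handler invoked when the batch system requests a checkpoint.
on_ckpt_request() {
    ompi-checkpoint "$mpirun_pid"
}

# Run a command (e.g. mpirun ...) under the checkpoint trap.
run_wrapped() {
    "$@" &
    mpirun_pid=$!
    trap on_ckpt_request USR1
    # wait returns early when a trap fires; loop until the child exits
    until wait "$mpirun_pid"; do
        kill -0 "$mpirun_pid" 2>/dev/null || break
    done
}
```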
> I've written code to nest the checkpoints correctly, but need to write some
> more to send the signals to the right places before parallel jobs will work
> correctly.  The plan is basically to start a new session within the job,
> and run mpirun in that second session.
> Session 1
> pbs_mom -- sh -- proxy_process
> Session 2
> slave_process -- mpirun -- local ranks
> Session 1 gets checkpointed by Torque.  The proxy_process and slave_process
> are connected with a pipe (or 2), and all of the standard IO streams
> are connected.  The pipes are there to pass checkpoint and restart requests,
> (and completion of those requests), between session 1 to session 2.  
> And as I recall, I have the checkpoint/restart of the proxy working correctly
> outside of torque, but not inside.  
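The two-session layout with its request/completion pipes might be sketched as follows (all names illustrative; in the real design the slave would exec mpirun and run ompi-checkpoint / ompi-restart on request):

```shell
#!/bin/bash
# Illustrative sketch: the proxy stays in the session Torque
# checkpoints; the slave runs in a second session created by setsid;
# requests and their completions travel over a pair of FIFOs.

req=$(mktemp -u); ack=$(mktemp -u)
mkfifo "$req" "$ack"

# Session 2: the slave.  Here it just acknowledges one request.
setsid bash -c '
    read request < "$1"            # block until the proxy sends one
    echo "done:$request" > "$2"    # report completion back
' _ "$req" "$ack" &

# Session 1: the proxy, checkpointed by Torque.  On a checkpoint
# request it forwards the request and blocks until the slave is done.
echo checkpoint > "$req"
read reply < "$ack"
rm -f "$req" "$ack"
```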
> I'm not sure yet how to fit pbs_demux into here.  Is blcr_restart_script
> connected to pbs_demux when it starts?  (meaning, are the file descriptors
> already open?)  If so, BLCR should be able to reconnect everything correctly.
> I'd appreciate comments or questions on this.  Right now, I don't see too
> many other ways to get everything working.  
> Eric
> On Tue, Jun 29, 2010 at 11:25:54AM +0200, Danny Sternkopf wrote:
>> Eric,
>> thanks for your answer.
>> Please find my scripts attached. They are quite simple at the
>> moment. They use ompi-checkpoint and ompi-restart for OpenMPI apps.
>> They must use it at the moment. For Multi-node jobs Torque starts a
>> pbs_demux process for capturing stdout/err of all MOMs which is not
>> checkpointable, probably because it is not started with cr_run or
>> in a similar BLCR environment. Therefore cr_checkpoint --tree
>> doesn't work for this kind of job.
>> I do not have deep knowledge of what exactly ompi-checkpoint does,
>> but I can imagine that it brings the MPI app into a proper state,
>> ready for checkpointing. Therefore I'm not sure if you could
>> successfully checkpoint/restart the MPI app without knowledge of
>> the app's requirements. For example, communication is one of the
>> major challenges and plays an important role in that context.
>> In general one would expect that the batch system sends a signal to
>> MPI, which then brings the app into a proper state for checkpointing,
>> and then the batch system just performs checkpoints for all the
>> involved processes using a general format. This could be one
>> approach.
>> However, that's why ompi-checkpoint/ompi-restart exist: they take
>> care of this and can of course make use of the BLCR tools.
>> As I said, the checkpointing works for me because ompi-checkpoint
>> does a good job: it checkpoints, then terminates the MPI-related
>> processes. Torque takes care of pbs_demux and the batch job script.
>> The restart does not work. I don't see that the blcr_restart_script
>> is called, so I guess Torque has a problem finding the expected
>> checkpoint file.
>> The following article gives useful information about the current
>> Torque/BLCR integration:
>>  http://www.clusterresources.com/pipermail/torquedev/2010-May/002054.html
>> Regards,
>> Danny
>> On 6/28/2010 8:08 PM, Eric Roman wrote:
>>> Danny,
>>> I worked on this a while ago, but it's been a long standing todo item to get
>>> everything to work properly.
>>> Can you tell me what your scripts do?
>>> Can you restart the application manually from the context file you created?
>>> (With cr_restart?)
>>> Normally, torque tries to checkpoint the shell it spawned assoc'd with the job
>>> using cr_checkpoint (--tree) to capture all of the children, including the
>>> mpirun and the MPI ranks.  Last time I checked, mpirun wouldn't respond to a
>>> cr_checkpoint.  (I think it omitted itself from the checkpoint, but I don't
>>> remember).  openmpi required a user to invoke ompi-checkpoint to checkpoint an
>>> app, and ompi-restart to bring the app back, but torque wants to use
>>> cr_checkpoint and cr_restart on the context file.  So, I needed to wrap
>>> the original openmpi mpirun with another program that would intercept the
>>> checkpoint signals.
>>> Part of the problem is that openmpi puts some of the MPI rank into the same
>>> process tree (or session) as the mpirun, and this messes everything up.  I
>>> left off at the point where I needed to write startup code to ensure that
>>> the ranks were in a separate process tree from the mpirun.  (The way things
>>> are implemented right now, the checkpoint deadlocks, so we need to break
>>> one of the dependencies to fix it.)
>>> The root issue is a little bit messy.  Those checkpoint/restart scripts
>>> need root privileges to open the context file.  Those scripts need to open
>>> the context file (as root), and then call setuid() to change into the user,
>>> making sure that they pass the context file as a file descriptor to
>>> cr_checkpoint and cr_restart.
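The open-as-root-then-setuid pattern Eric describes might look like the following (my assumptions throughout: the setpriv invocation and the /dev/fd trick are not existing code; only cr_restart's context-file argument is BLCR's documented usage):

```shell
#!/bin/bash
# Illustrative sketch, run as root: open the context file while still
# root, then drop to the job owner and hand cr_restart the
# already-open descriptor via /dev/fd.

if [ $# -eq 2 ]; then
    ctxfile=$1 jobuser=$2
    exec 9< "$ctxfile"    # opened as root; fd 9 survives the exec below
    exec setpriv --reuid "$jobuser" --regid "$jobuser" --init-groups \
        cr_restart /dev/fd/9
fi
```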
>>> I do want to go in and fix all of this.  Right now I'm trying to get BLCR to
>>> work with compressed context files, and chasing a bug with using it on
>>> the 2.6.33 kernel.
>>> Eric
>>> On Mon, Jun 28, 2010 at 09:43:14AM +0200, Danny Sternkopf wrote:
>>>> Hi,
>>>> maybe someone here can comment on this.
>>>> Regards,
>>>> Danny
>>>> -------- Original Message --------
>>>> Subject: Re: [torqueusers] torque+blcr+openmpi
>>>> Date: Fri, 25 Jun 2010 16:58:59 +0200
>>>> From: Danny Sternkopf<dsternkopf at hpce.nec.com>
>>>> Reply-To: dsternkopf at hpce.nec.com
>>>> Organization: NEC Deutschland GmbH
>>>> To: torqueusers at supercluster.org
>>>> Hi,
>>>> any news about this? I have the following setup:
>>>> o torque 2.4.8
>>>> o openmpi 1.4.2
>>>> o blcr 0.8.2
>>>> The checkpoint/restart scripts from Torque's contrib/blcr work for
>>>> single node application without MPI. I created new scripts for OpenMPI
>>>> applications. The checkpoint works, but the release does not. The issue
>>>> might be that ompi-checkpoint writes a directory including checkpoint
>>>> files for each process plus metadata and Torque expects one single
>>>> checkpoint file. Any experiences?
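One way to reconcile that directory-vs-file mismatch (purely a suggestion, not something the attached scripts do) would be to pack the ompi snapshot directory into the single checkpoint file Torque expects, and unpack it again before ompi-restart:

```shell
#!/bin/bash
# Suggested workaround sketch for the mismatch described above: tar
# the snapshot directory into one file, and untar it before restart.

pack_snapshot() {    # $1 = checkpoint dir, $2 = snapshot name, $3 = out file
    tar -C "$1" -cf "$3" "$2"
}

unpack_snapshot() {  # $1 = checkpoint file, $2 = destination dir
    tar -C "$2" -xf "$1"
}
```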
>>>> Btw another issue is that the checkpoint/restart scripts run as root.
>>>> ompi-checkpoint doesn't allow root to checkpoint user jobs, so you
>>>> have to run ompi-checkpoint as the user. The restart script of course
>>>> needs this as well, to restart the process under the corresponding user id.
>>>> Furthermore, any comments on handling MPI and single-process
>>>> applications with the same checkpoint/restart scripts?
>>>> Regards,
>>>> Danny
>>>> ---
>>>> _______________________________________________
>>>> torquedev mailing list
>>>> torquedev at supercluster.org
>>>> http://www.supercluster.org/mailman/listinfo/torquedev
>> -- 
>> Danny Sternkopf http://www.nec.de/hpc        dsternkopf at hpce.nec.com
>> HPCE Division  Germany phone: +49-711-78055-33 fax: +49-711-78055-25
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> NEC Deutschland GmbH, Hansaallee 101, 40549 Düsseldorf
>> Geschäftsführer Richard Hanscott
>> Handelsregister Düsseldorf HRB 57941; VAT ID DE129424743
>> #!/bin/bash
>> usage() {
>>         echo -e "usage: $0 <session_id> <job_id> <user_id> <group_id> <checkpoint_dir> <checkpoint_name>\n"
>>         exit 1
>> }
>> ## main ##
>> if [ $# -eq 6 ]; then
>>         sessionId=$1
>>         jobId=$2
>>         userId=$3
>>         groupId=$4
>>         checkpointDir=$5
>>         checkpointName=$6
>> else
>>         usage
>> fi
>> read MPIRUN_PID < ${checkpointDir}/mpirun.pid.$jobId
>> echo "ompi-restart -mca snapc_base_global_snapshot_dir $checkpointDir ompi_global_snapshot_${MPIRUN_PID}.ckpt" 1>&2
>> su - nectest -c "ompi-restart -mca snapc_base_global_snapshot_dir $checkpointDir ompi_global_snapshot_${MPIRUN_PID}.ckpt"
>> #!/bin/bash
>> usage() {
>> 	echo -e "usage: $0 <session_id> <job_id> <user_id> <group_id> <checkpoint_dir> <checkpoint_name> <signal_num> <checkpoint_depth>\n"
>> 	exit 1
>> }
>> ## main ##
>> if [ $# -eq 8 ]; then
>> 	sessionId=$1
>> 	jobId=$2
>> 	userId=$3
>> 	groupId=$4
>> 	checkpointDir=$5
>> 	checkpointName=$6
>> 	signalNum=$7
>> 	checkpointDepth=$8
>> else
>> 	usage
>> fi
>> read MPIRUN_PID < ${checkpointDir}/mpirun.pid.$jobId
>> echo "ompi-checkpoint -mca snapc_base_global_snapshot_dir $checkpointDir $MPIRUN_PID" 1>&2
>> su - nectest -c "ompi-checkpoint --term -mca snapc_base_global_snapshot_dir $checkpointDir $MPIRUN_PID"

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: blcr_checkpoint_script
Url: http://www.supercluster.org/pipermail/torquedev/attachments/20100701/5f8c2a17/attachment.pl 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: blcr_restart_script
Url: http://www.supercluster.org/pipermail/torquedev/attachments/20100701/5f8c2a17/attachment-0001.pl 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ql-restart-torque-ompi-job
Url: http://www.supercluster.org/pipermail/torquedev/attachments/20100701/5f8c2a17/attachment-0002.pl 
