[torquedev] torque+blcr+openmpi

rishi pathak mailmaverick666 at gmail.com
Tue Jul 6 02:12:00 MDT 2010


Hi Danny,

Is there a need to checkpoint the mpirun/mpiexec processes at all?
(Please correct me if I am wrong.) They only spawn the MPI program on the
defined nodes. To restart a checkpointed MPI program, a fresh instance
of mpirun, mpiexec or pbsdsh can be used.
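
To make the idea concrete, here is a minimal sketch (not taken from the
attached scripts) of separating launcher processes from the MPI ranks, so
that only the ranks are checkpointed and the launcher is simply started
again on restart. The function name, the (pid, name) input format and the
sample process table are hypothetical; the launcher names are the ones
mentioned in this thread.

```python
# Launcher command names mentioned in this thread. mpirun and mpiexec are
# synonyms for orterun in Open MPI; pbsdsh is Torque's launcher.
LAUNCHERS = {"orterun", "mpirun", "mpiexec", "pbsdsh"}


def split_job_processes(processes):
    """Split (pid, command_name) pairs into launcher PIDs and rank PIDs.

    `processes` is a hypothetical, already-collected process table for one
    batch job; how it is gathered (pgrep, /proc, ...) is left open here.
    """
    launchers, ranks = [], []
    for pid, name in processes:
        if name in LAUNCHERS:
            launchers.append(pid)
        else:
            ranks.append(pid)
    return launchers, ranks


if __name__ == "__main__":
    # Made-up process table, for illustration only.
    job = [(100, "mpirun"), (101, "a.out"), (102, "a.out")]
    launchers, ranks = split_job_processes(job)
    print("checkpoint these PIDs:", ranks)
    print("relaunch instead of checkpointing:", launchers)
```

Whether skipping the launcher is actually safe depends on the MPI
implementation; the sketch only shows the selection step.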

On Mon, Jul 5, 2010 at 8:09 PM, Danny Sternkopf <dsternkopf at hpce.nec.com> wrote:

> Hi,
>
> thank you Peter!
>
> I have the impression that PBS Mom calls the checkpoint and restart
> scripts only once. Therefore the scripts must take care of the total
> batch job and all processes belonging to it, right?
>
> Do these scripts really work for you?
>
> I see that the MPI processes are checkpointed, but mpirun and the
> batch script keep running.
>
> Btw., mpiexec and mpirun are synonyms for orterun. They should
> be matched in the pgrep as well.
>
> Best regards,
>
> Danny
>
> On 7/1/2010 1:29 PM, Peter Kruse wrote:
> > Hello,
> >
> > I attach the two scripts that we use. They are based on the scripts
> > found on
> > http://www.clusterresources.com/products/torque/docs/2.6jobcheckpoint.shtml
> > but with these additions:
> >
> > blcr_checkpoint_script:
> >
> > 1. Support both ompi-checkpoint and cr_checkpoint: the script checks
> > whether orterun is a parent process; if so it uses ompi-checkpoint,
> > otherwise it uses cr_checkpoint.
> > 2. For ompi-checkpoint the checkpoint directory cannot be given on the
> > command line; orterun uses the parameter snapc_base_global_snapshot_dir,
> > which is already set. Therefore the script ignores
> > $checkpointDir/$checkpointName and instead stores a mapping of
> > $JOBID:$Snapref in a file (where $Snapref is returned by the
> > ompi-checkpoint command). It additionally stores the node geometry,
> > which is used by a script that restarts the job.
> >
> > blcr_restart_script:
> >
> > 3. If the given $jobid is found in the jobid2ompi_snap_ref file, then
> > use "ompi-restart $ref"; otherwise use cr_restart with the given
> > checkpointFile.
> >
> > ql-restart-torque-ompi-job:
> >
> > This script is meant to be run in a Torque job, so that
> > $PBS_NODEFILE is set. Given the JobID to restart,
> > it first checks whether the node geometry matches that
> > of the original job; if it matches, it calls ompi-restart with
> > the snapshot reference.
> >
> > I hope they may be useful for you.
> >
> > Regards,
> >
> > Peter
> >
> _______________________________________________
> torquedev mailing list
> torquedev at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torquedev
>
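
For readers without the attachments, the restart logic Peter describes
might be sketched roughly as below. This is an illustration under stated
assumptions, not the attached scripts: the one-entry-per-line JOBID:Snapref
file layout, all function names, and the definition of "node geometry" as
the sorted per-host slot counts from the node file are assumptions. Only
the jobid2ompi_snap_ref file name, the $JOBID:$Snapref mapping and the
ompi-restart / cr_restart choice come from the description above.

```python
from collections import Counter


def parse_snapshot_map(text):
    """Parse lines of the assumed form JOBID:SNAPREF into a dict."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so the reference may contain colons.
        jobid, _, ref = line.partition(":")
        mapping[jobid] = ref
    return mapping


def restart_command(jobid, checkpoint_file, mapping):
    """Choose the restart command as described for blcr_restart_script."""
    ref = mapping.get(jobid)
    if ref:
        # Job was checkpointed through ompi-checkpoint.
        return ["ompi-restart", ref]
    # Otherwise fall back to plain BLCR restart of the context file.
    return ["cr_restart", checkpoint_file]


def node_geometry(nodefile_text):
    """Assumed geometry: sorted slot counts per host, ignoring host names.

    A Torque node file lists one host name per allocated slot.
    """
    hosts = [h.strip() for h in nodefile_text.splitlines() if h.strip()]
    return sorted(Counter(hosts).values())


if __name__ == "__main__":
    # Made-up contents of a jobid2ompi_snap_ref file and node files.
    snap_map = parse_snapshot_map("42.head:ompi_global_snapshot_4321.ckpt\n")
    print(restart_command("42.head", "/ckpt/ckpt.42", snap_map))
    print(restart_command("43.head", "/ckpt/ckpt.43", snap_map))
    saved = node_geometry("n1\nn1\nn2\nn2\n")
    current = node_geometry("n7\nn7\nn9\nn9\n")
    print("geometry matches:", saved == current)
```

If the geometry check needs to be stricter (same host names, same order),
the comparison in node_geometry would have to change accordingly.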



-- 
Regards--
Rishi Pathak
National PARAM Supercomputing Facility
Center for Development of Advanced Computing(C-DAC)
Pune University Campus,Ganesh Khind Road
Pune-Maharastra