mailmaverick666 at gmail.com
Tue Jul 6 02:12:00 MDT 2010
Is there any need to checkpoint the mpirun/mpiexec processes themselves?
(Please correct me if I am wrong.) They only spawn the MPI program on the
defined nodes. To restart a checkpointed MPI program, a fresh instance
of mpirun, mpiexec or pbsdsh can be used.
On Mon, Jul 5, 2010 at 8:09 PM, Danny Sternkopf <dsternkopf at hpce.nec.com> wrote:
> thank you Peter!
> I have the impression that PBS Mom calls the checkpoint and restart
> scripts only once. Therefore the scripts must take care of the total
> batch job and all processes belonging to it, right?
> Do these scripts really work for you?
> I see that the MPI processes are checkpointed, but mpirun and the
> batch script keep running.
> Btw, mpiexec and mpirun are synonyms for orterun. They should
> be matched in the pgrep as well.
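
To illustrate the point about launcher names, here is a minimal sketch of how a checkpoint script might treat all three names alike when deciding between ompi-checkpoint and cr_checkpoint. The function names is_ompi_launcher and checkpoint_tool_for are hypothetical and do not come from the attached scripts; note also that mpiexec can belong to other MPI implementations, so this assumes an Open MPI installation.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not from the attached scripts.

# Succeed if the given command name (or path) is one of the Open MPI
# launcher names. Assumes mpiexec/mpirun on this system are Open MPI's.
is_ompi_launcher() {
    case "$(basename "$1")" in
        orterun|mpirun|mpiexec) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the checkpoint tool to use for a job whose parent command is $1.
checkpoint_tool_for() {
    if is_ompi_launcher "$1"; then
        echo "ompi-checkpoint"
    else
        echo "cr_checkpoint"
    fi
}
```

With procps pgrep the same set of names can be written as one pattern, for example pgrep -x 'orterun|mpirun|mpiexec'.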
> Best regards,
> On 7/1/2010 1:29 PM, Peter Kruse wrote:
> > Hello,
> > I attach the two scripts that we use. They are based on the scripts
> > found on
> > But with these additions:
> > blcr_checkpoint_script:
> > 1. support ompi-checkpoint and cr_checkpoint
> > it checks if orterun is a parent process, if so uses ompi-checkpoint
> > otherwise uses cr_checkpoint
> > 2. for ompi-checkpoint the checkpoint directory cannot be given on
> > commandline, orterun uses the parameter snapc_base_global_snapshot_dir
> > which is already set. Therefore ignore the
> > Instead store a mapping of $JOBID:$Snapref in a file (where $Snapref is
> > returned by the ompi-checkpoint command). Additionally store the
> > node geometry which is used in a script that restarts the job.
> > blcr_restart_script:
> > 3. if the given $jobid is found in the jobid2ompi_snap_ref file then
> > use "ompi-restart $ref" otherwise use cr_restart with the given
> > checkpointFile.
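
The lookup described in item 3 could look roughly like the sketch below. It assumes the mapping file holds one JOBID:SNAPREF entry per line and that the most recent entry for a job should win; the file location, the MAPFILE variable and the function names are assumptions for illustration, not taken from the attached scripts.

```shell
#!/bin/sh
# Sketch only: the path and the function names are hypothetical.
MAPFILE="${MAPFILE:-./jobid2ompi_snap_ref}"

# Print the snapshot reference recorded for job ID $1.
# Returns non-zero if the file is missing or has no entry for the job.
lookup_snapref() {
    [ -f "$MAPFILE" ] || return 1
    found=""
    while IFS=: read -r id ref; do
        # Later entries overwrite earlier ones, so the newest checkpoint wins.
        if [ "$id" = "$1" ]; then
            found=$ref
        fi
    done < "$MAPFILE"
    [ -n "$found" ] && printf '%s\n' "$found"
}

# Print the restart command for job ID $1 with BLCR checkpoint file $2.
restart_command() {
    if ref=$(lookup_snapref "$1"); then
        echo "ompi-restart $ref"
    else
        echo "cr_restart $2"
    fi
}
```

The sketch only prints the command line; a real restart script would execute it and handle its exit status.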
> > ql-restart-torque-ompi-job:
> > this script is meant to be run in a Torque job, so that
> > $PBS_NODEFILE is set. If given the JobID to restart,
> > it will first check whether the node geometry matches the one
> > of that job; if it matches, it calls ompi-restart with
> > the snapshot reference.
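
One possible way to do such a geometry check is sketched below. The message does not say how the geometry is stored, so this assumes it is recorded as the sorted list of slot counts per node (host names are ignored, only the shape is compared); node_geometry and geometry_matches are hypothetical names.

```shell
#!/bin/sh
# Sketch only: the geometry format is an assumption, not the attached script's.

# Summarise a node file (one host name per allocated slot, as in
# $PBS_NODEFILE) as space-separated per-host slot counts in ascending order.
node_geometry() {
    sort "$1" | uniq -c | awk '{ print $1 }' | sort -n | tr '\n' ' ' | sed 's/ $//'
}

# Succeed if the node file $1 has the stored geometry string $2.
geometry_matches() {
    [ "$(node_geometry "$1")" = "$2" ]
}
```

Inside a job this would be called as geometry_matches "$PBS_NODEFILE" "$stored", where $stored is whatever the checkpoint script saved for that JobID.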
> > I hope they may be useful for you.
> > Regards,
> > Peter
National PARAM Supercomputing Facility
Center for Development of Advanced Computing(C-DAC)
Pune University Campus,Ganesh Khind Road