dsternkopf at hpce.nec.com
Mon Jul 5 08:39:40 MDT 2010
Thank you, Peter!
I have the impression that PBS Mom calls the checkpoint and restart
scripts only once. Therefore the scripts must take care of the total
batch job and all processes belonging to it, right?
Do these scripts really work for you?
I see that the MPI processes are checkpointed, but mpirun and the
batch script keep running.
Btw., mpiexec and mpirun are synonyms for orterun. They should
be matched in the pgrep as well.
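
A minimal sketch of what matching all three launcher names could look like, assuming the scripts locate the launcher by process name. The names LAUNCHER_RE, is_mpi_launcher and find_launchers are hypothetical and not taken from the attached scripts:

```shell
#!/bin/sh
# Hypothetical helpers; not part of the attached scripts.

# Extended regex covering the three names of the Open MPI launcher.
LAUNCHER_RE='orterun|mpirun|mpiexec'

# Return 0 if the given command name is exactly one of the launcher names.
# grep -x only accepts a match that spans the whole line.
is_mpi_launcher() {
    printf '%s\n' "$1" | grep -Eqx "$LAUNCHER_RE"
}

# Print the PIDs of launcher processes owned by the given user.
# pgrep -x requires the whole process name to match the pattern.
find_launchers() {
    pgrep -x -u "$1" "$LAUNCHER_RE"
}
```

With an exact match, a wrapper named e.g. "mpirun_wrapper" would not be picked up by accident.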
On 7/1/2010 1:29 PM, Peter Kruse wrote:
> I attach the two scripts that we use. They are based on the scripts
> found on
> But with these additions:
> 1. support ompi-checkpoint and cr_checkpoint
> it checks if orterun is a parent process; if so, it uses ompi-checkpoint,
> otherwise it uses cr_checkpoint
> 2. for ompi-checkpoint the checkpoint directory cannot be given on
> the command line; orterun uses the parameter snapc_base_global_snapshot_dir,
> which is already set. Therefore ignore the $checkpointDir/$checkpointName.
> Instead store a mapping of $JOBID:$Snapref in a file (where $Snapref is
> returned by the ompi-checkpoint command). Additionally store the
> node geometry which is used in a script that restarts the job.
> 3. if the given $jobid is found in the jobid2ompi_snap_ref file then
> use "ompi-restart $ref" otherwise use cr_restart with the given
> This script is meant to be run in a Torque job, so that
> $PBS_NODEFILE is set. If given the JobID to restart,
> it will first check if the node geometry matches the one
> of that job; if it matches, it calls ompi-restart with
> the snapshot reference.
> I hope they may be useful for you.
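
The bookkeeping described in the quoted points 2 and 3 might be sketched as follows. This is an illustration under assumptions, not the attached scripts: the one-entry-per-line "JOBID:SNAPREF" layout, the function names, and the reading of "node geometry" as the number of slots per host are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch; file layout and function names are assumptions.

# Mapping file with one "JOBID:SNAPREF" entry per line.
MAP_FILE="${MAP_FILE:-./jobid2ompi_snap_ref}"

# Record the snapshot reference returned by ompi-checkpoint for a job.
store_snap_ref() {
    printf '%s:%s\n' "$1" "$2" >> "$MAP_FILE"
}

# Print the most recent snapshot reference for a job; fail if none is stored.
lookup_snap_ref() {
    [ -f "$MAP_FILE" ] || return 1
    awk -F: -v id="$1" '
        ($1 "") == (id "") { ref = substr($0, length(id) + 2) }
        END { if (ref == "") exit 1; print ref }
    ' "$MAP_FILE"
}

# Summarize a node file (such as $PBS_NODEFILE) as sorted "count host" lines.
# Assumption: "node geometry" means the number of slots per host.
node_geometry() {
    sort "$1" | uniq -c | awk '{ print $1, $2 }'
}

# Return 0 if two node files describe the same geometry.
same_geometry() {
    [ "$(node_geometry "$1")" = "$(node_geometry "$2")" ]
}

# Print which restart tool would be used for a job ID.
# The arguments to cr_restart are not shown here.
restart_command() {
    if ref=$(lookup_snap_ref "$1"); then
        echo "ompi-restart $ref"
    else
        echo "cr_restart"
    fi
}
```

A restart script along these lines would call same_geometry with the saved node file and "$PBS_NODEFILE" before running the command printed by restart_command.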