pk at q-leap.com
Tue Jul 6 05:04:44 MDT 2010
Danny Sternkopf wrote:
> thank you Peter!
I'd be glad if you find them useful.
> I have the impression that PBS Mom calls the checkpoint and restart
> scripts only once. Therefore the scripts must take care of the total
> batch job and all processes belonging to it, right?
That does not seem possible, though, because with ompi-checkpoint you
have to checkpoint the orterun process.
> Do these scripts really work for you?
Yes, they do, but possibly not the way you outlined above.
Having said that, the scripts are most likely not perfect, and
it would be great if we could improve them.
> I see that the MPI processes are checkpointed, but the mpirun and the
> batch script keep running.
This is right: checkpointing a job does not mean terminating it as
well. You get a checkpoint file which you can use to resume the job
later. If the seventh argument to the checkpoint script is "15", then
the process is also terminated.
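To illustrate the seventh-argument rule just described, here is a rough sketch, not taken from the actual scripts. The helper names terminate_after_checkpoint and checkpoint_command are made up for this example, and using the --term option of ompi-checkpoint to end the job is an assumption on my part; the real script may just as well send the signal itself.

```perl
use strict;
use warnings;

# Hypothetical helper: decide from the script's argument list whether
# the job should also be terminated once the checkpoint is written.
sub terminate_after_checkpoint {
    my @args = @_;
    # The seventh argument lives at index 6 and may be missing.
    my $signal = $args[6];
    my $terminate = (defined $signal && $signal eq '15') ? 1 : 0;
    return $terminate;
}

# Hypothetical helper: build the ompi-checkpoint command as a list.
# Assumption: --term is used to stop the job after checkpointing.
sub checkpoint_command {
    my ($orterun_pid, $terminate) = @_;
    my @cmd = ('ompi-checkpoint');
    push @cmd, '--term' if $terminate;
    push @cmd, $orterun_pid;
    return @cmd;
}

# Example with made-up arguments; only the seventh one matters here.
my @example_args = ('a', 'b', 'c', 'd', 'e', 'f', '15');
my @cmd = checkpoint_command(4711, terminate_after_checkpoint(@example_args));
print "@cmd\n";
```

Building the command as a list rather than a single string means it can be handed to system(@cmd) without going through the shell.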
> Btw. mpiexec and mpirun are equivalent synonyms for orterun. They should
> be matched in the pgrep as well.
Good point, the line should then look like:
my $ortpid=`pgrep -g $sessionId 'orterun|mpiexec|mpirun'`;
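One thing to keep in mind with the backtick call above: the result still carries a trailing newline, and it can hold more than one PID if several launchers match. A minimal sketch of how the output could be cleaned up follows. The helper name parse_pids is made up for this example and is not part of the actual scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw pgrep output into a list of numeric PIDs.
sub parse_pids {
    my ($raw) = @_;
    return () unless defined $raw;
    # pgrep prints one PID per line; keep only fields that are all digits.
    return grep { /^\d+$/ } split /\s+/, $raw;
}

# Example with canned output instead of a live pgrep call.
my @pids = parse_pids("4711\n4712\n");
print "found: @pids\n";    # prints "found: 4711 4712"
```

In the checkpoint script this would be called as parse_pids($ortpid), and an empty list would mean that no orterun, mpiexec or mpirun process was found for the job.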