dsternkopf at hpce.nec.com
Fri Jun 25 08:58:59 MDT 2010
Any news about this? I have the following setup:
o torque 2.4.8
o openmpi 1.4.2
o blcr 0.8.2
The checkpoint/restart scripts from Torque's contrib/blcr work for
single-node applications without MPI. I created new scripts for OpenMPI
applications. The checkpoint works, but the release does not. The issue
might be that ompi-checkpoint writes a directory containing a checkpoint
file for each process plus metadata, whereas Torque expects a single
checkpoint file. Does anyone have experience with this?
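
One possible workaround for the directory-versus-file mismatch, sketched in
Python and not tested against Torque: pack the snapshot directory into a
single tar archive after the checkpoint, and unpack it again before the
restart. The function names and paths are hypothetical, and whether Torque
accepts such an archive as its checkpoint file is an assumption.

```python
import tarfile
from pathlib import Path


def pack_snapshot(snapshot_dir: str, archive_path: str) -> None:
    """Bundle a snapshot directory into one uncompressed tar file."""
    src = Path(snapshot_dir)
    with tarfile.open(archive_path, "w") as tar:
        # Store entries relative to the directory name so the archive
        # can be unpacked at any location.
        tar.add(src, arcname=src.name)


def unpack_snapshot(archive_path: str, target_dir: str) -> Path:
    """Restore the snapshot directory below target_dir and return its path."""
    with tarfile.open(archive_path, "r") as tar:
        # The first member is the top-level snapshot directory itself.
        top = tar.getnames()[0].split("/")[0]
        tar.extractall(target_dir)
    return Path(target_dir) / top
```

The checkpoint script would call pack_snapshot() once ompi-checkpoint has
finished, and the restart script would call unpack_snapshot() before handing
the restored directory to ompi-restart.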
Btw, another issue is that the checkpoint/restart scripts run as root.
ompi-checkpoint does not allow root to checkpoint user jobs, so you
have to run ompi-checkpoint as the user. The restart script of course
needs this as well, to restart the processes under the corresponding user ID.
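
A minimal sketch of the "run it as the user" step, assuming the script
starts as root and that su is available on the node. The helper name, the
user name and the PID are made up for illustration; ompi-checkpoint is given
the PID of the job's mpirun.

```python
import shlex


def as_user_command(user: str, argv: list[str]) -> list[str]:
    """Wrap argv so that it runs in a login shell of `user` via su."""
    # shlex.join quotes each argument so the shell sees them unchanged.
    return ["su", "-", user, "-c", shlex.join(argv)]


# Example with made-up values: checkpoint the job whose mpirun has PID 12345.
cmd = as_user_command("alice", ["ompi-checkpoint", "12345"])
# A script running as root could then execute it with
# subprocess.run(cmd, check=True).
```

The same wrapper would apply to the restart command, so that the restarted
processes belong to the job owner rather than to root.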
Furthermore, any comments on how to handle MPI and single-process applications
with the same checkpoint/restart scripts?
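
One way a single script might serve both cases, as a rough sketch: look at
the name of the process to be checkpointed and choose the tool accordingly.
This assumes the script already knows the PID of the job's top-level process
and that MPI jobs are started through mpirun (or its aliases orterun and
mpiexec). Both function names are hypothetical, and locating the right PID
inside a Torque job is left open here.

```python
def checkpoint_command(name: str, pid: int) -> list[str]:
    """Pick the checkpoint tool based on the name of the job's top process."""
    if name in ("mpirun", "orterun", "mpiexec"):
        # MPI job: ompi-checkpoint is given the PID of mpirun.
        return ["ompi-checkpoint", str(pid)]
    # Anything else is treated as a single process checkpointed by BLCR.
    return ["cr_checkpoint", str(pid)]


def read_process_name(pid: int) -> str:
    """Read the command name of a PID from /proc (Linux only)."""
    with open(f"/proc/{pid}/comm") as fh:
        return fh.read().strip()
```

A script would then call checkpoint_command(read_process_name(pid), pid) and
run the resulting command under the job owner's user ID.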
On 3/13/2010 8:39 AM, Chris Samuel wrote:
> On Tue, 23 Feb 2010 09:15:27 pm Anton Starikov wrote:
>> Can anyone provide example of checkpoint script for torque which deals with
>> open-mpi checkpointing?
> I too would be very interested in this - I am pondering trying BLCR on our new
> clusters at VLSCI..