[torqueusers] torque+blcr+openmpi

Danny Sternkopf dsternkopf at hpce.nec.com
Fri Jun 25 08:58:59 MDT 2010


Any news on this? I have the following setup:
o torque 2.4.8
o openmpi 1.4.2
o blcr 0.8.2

The checkpoint/restart scripts from Torque's contrib/blcr work for 
single-node applications without MPI. I created new scripts for OpenMPI 
applications. The checkpoint works, but the release does not. The issue 
might be that ompi-checkpoint writes a directory containing checkpoint 
files for each process plus metadata, whereas Torque expects a single 
checkpoint file. Does anyone have experience with this?
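
One workaround I could imagine (untested with Torque, just a sketch) is 
to bundle the snapshot directory into a single tar file after the 
checkpoint and to unpack it again before the restart. The function names 
pack_snapshot and unpack_snapshot are made up for illustration, and 
whether Torque accepts such a file as the job's checkpoint is an 
assumption:

```python
import tarfile
from pathlib import Path


def pack_snapshot(snapshot_dir, checkpoint_file):
    """Bundle a snapshot directory into one tar file (hypothetical helper)."""
    snapshot_dir = Path(snapshot_dir)
    with tarfile.open(checkpoint_file, "w") as tar:
        # Store the tree under the directory's own name so that the
        # snapshot name survives the round trip.
        tar.add(snapshot_dir, arcname=snapshot_dir.name)
    return Path(checkpoint_file)


def unpack_snapshot(checkpoint_file, target_dir):
    """Recreate the snapshot directory under target_dir and return its path."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(checkpoint_file, "r") as tar:
        # The first member is the top-level snapshot directory itself.
        top = tar.getnames()[0].split("/")[0]
        tar.extractall(target_dir)
    return target_dir / top
```

The restart script would then hand the unpacked directory to 
ompi-restart instead of the tar file.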

Btw, another issue is that the checkpoint/restart scripts run as root. 
ompi-checkpoint does not allow root to checkpoint user jobs, so you 
have to run ompi-checkpoint as the job's user. The restart script of 
course needs the same, so that the processes are restarted under the 
corresponding user ID.
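
As a rough sketch of that privilege drop: the helper below (the name 
as_job_owner is hypothetical) wraps a command in su when the script runs 
as root and leaves it untouched otherwise. It assumes the script already 
knows the job owner's user name and the PID of mpirun, and that the 
user's login environment has the OpenMPI tools in its PATH:

```python
import os
import shlex


def as_job_owner(user, argv, euid=None):
    """Return argv, wrapped in `su` when the caller is root (sketch)."""
    if euid is None:
        euid = os.geteuid()
    if euid != 0:
        # Already running unprivileged: use the command as given.
        return list(argv)
    # `su -c` takes a single shell string, so quote every argument.
    command = " ".join(shlex.quote(arg) for arg in argv)
    return ["su", "-", user, "-c", command]
```

For example, as_job_owner("alice", ["ompi-checkpoint", "4242"]) run as 
root yields an su call for user alice, which can then be passed to 
subprocess.run. The same wrapper would apply to ompi-restart.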

Furthermore, any comments on how to handle MPI and single-process 
applications with the same checkpoint/restart scripts?
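
One idea, only a sketch: let a common script pick the checkpoint tool 
from the name of the process it is about to checkpoint. The function 
checkpoint_command and the MPI_LAUNCHERS set are my own invention here. 
How the script finds the launcher process and its PID is left out, and 
the cr_checkpoint options (--tree, -f) should be checked against the 
installed BLCR version:

```python
import os

# Names under which the OpenMPI launcher is usually installed.
MPI_LAUNCHERS = {"mpirun", "mpiexec", "orterun"}


def checkpoint_command(launcher, pid, checkpoint_file):
    """Pick the checkpoint command for one job process (hypothetical)."""
    name = os.path.basename(launcher)
    if name in MPI_LAUNCHERS:
        # MPI job: ompi-checkpoint takes the PID of mpirun and writes
        # its own snapshot directory.
        return ["ompi-checkpoint", str(pid)]
    # Anything else: plain BLCR checkpoint of the process tree into a file.
    return ["cr_checkpoint", "--tree", "-f", checkpoint_file, str(pid)]
```

The restart side would need the mirror image, choosing between 
ompi-restart and cr_restart depending on what the checkpoint produced.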


On 3/13/2010 8:39 AM, Chris Samuel wrote:
> On Tue, 23 Feb 2010 09:15:27 pm Anton Starikov wrote:
>> Can anyone provide example of checkpoint script for torque which deals with
>> open-mpi checkpointing?
> I too would be very interested in this - I am pondering trying BLCR on our new
> clusters at VLSCI..
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
