[torquedev] torque+blcr+openmpi

Danny Sternkopf dsternkopf at hpce.nec.com
Mon Jun 28 01:43:14 MDT 2010


Maybe someone here can comment on this.



-------- Original Message --------
Subject: Re: [torqueusers] torque+blcr+openmpi
Date: Fri, 25 Jun 2010 16:58:59 +0200
From: Danny Sternkopf <dsternkopf at hpce.nec.com>
Reply-To: dsternkopf at hpce.nec.com
Organization: NEC Deutschland GmbH
To: torqueusers at supercluster.org


Any news about this? I have the following setup:
o torque 2.4.8
o openmpi 1.4.2
o blcr 0.8.2

The checkpoint/restart scripts from Torque's contrib/blcr work for
single-node applications without MPI. I created new scripts for OpenMPI
applications. The checkpoint works, but the release does not. The issue
might be that ompi-checkpoint writes a directory containing checkpoint
files for each process plus metadata, whereas Torque expects a single
checkpoint file. Has anyone had experience with this?
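
If the directory-versus-single-file mismatch really is the cause, one
workaround that might be worth trying is to pack the snapshot directory
into one archive after the checkpoint and unpack it before the restart.
The following is only a sketch under that assumption: pack_snapshot and
unpack_snapshot are hypothetical helper names, and whether Torque accepts
such an archive as its checkpoint file has not been verified.

```shell
#!/bin/sh
# Sketch only: pack_snapshot and unpack_snapshot are hypothetical helpers,
# not part of Torque, BLCR, or Open MPI.

# Pack a snapshot directory into a single tar file.
pack_snapshot() {
    snapshot_dir=$1   # directory written by ompi-checkpoint
    single_file=$2    # the one file handed to Torque (assumption)
    tar -cf "$single_file" \
        -C "$(dirname "$snapshot_dir")" "$(basename "$snapshot_dir")"
}

# Unpack the archive again before calling ompi-restart.
unpack_snapshot() {
    single_file=$1
    target_dir=$2     # directory that will contain the snapshot directory
    mkdir -p "$target_dir" && tar -xf "$single_file" -C "$target_dir"
}
```

The checkpoint script would call pack_snapshot once ompi-checkpoint has
finished, and the restart script would call unpack_snapshot first.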

Btw, another issue is that the checkpoint/restart scripts run as root.
ompi-checkpoint doesn't allow root to checkpoint user jobs, so you
have to run ompi-checkpoint as the job's user. The restart script of
course needs the same, so that the processes are restarted under the
corresponding user ID.
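
Dropping from root to the job owner could be sketched roughly as follows.
run_as_user is a hypothetical helper, and the variables in the usage
comment (job_owner, mpirun_pid) are placeholders for whatever the
checkpoint script actually receives from Torque.

```shell
#!/bin/sh
# Sketch only: run_as_user is a hypothetical helper name.

# Run a command string as the given user. If the script already runs as
# that user, run it directly; otherwise switch with su (needs root).
run_as_user() {
    user=$1
    cmd=$2            # command passed as one string
    if [ "$(id -un)" = "$user" ]; then
        sh -c "$cmd"
    else
        su "$user" -c "$cmd"
    fi
}

# Possible use inside a checkpoint script (placeholder variables):
#   run_as_user "$job_owner" "ompi-checkpoint $mpirun_pid"
```

Note that su runs the command through the target user's login shell, so
this assumes job owners have a usable shell on the compute node.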

Furthermore, any comments on how to handle MPI and single-process
applications with the same checkpoint/restart scripts?
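
One conceivable approach is to let a single script decide which tool to
use based on the name of the job's top-level command. This is a sketch
under assumptions: checkpoint_tool_for is a hypothetical helper, and how
the script learns the command name (for example from ps) is left open.

```shell
#!/bin/sh
# Sketch only: checkpoint_tool_for is a hypothetical helper name.

# Print the checkpoint tool to use for a given command name:
# ompi-checkpoint for Open MPI launchers, BLCR's cr_checkpoint otherwise.
checkpoint_tool_for() {
    case "$1" in
        mpirun|mpiexec|orterun) echo "ompi-checkpoint" ;;
        *)                      echo "cr_checkpoint" ;;
    esac
}
```

The restart script would need a matching decision between ompi-restart
and cr_restart, for instance by recording the chosen tool next to the
checkpoint.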


