ERoman at lbl.gov
Wed Jul 28 19:29:22 MDT 2010
I have some code that fixes most of the issues here. I've been able to
checkpoint and restart OpenMPI jobs using qhold and qrls.
The bugzilla bug that you filed is still causing me problems:
Did you ever work out a fix for this?
If you're interested in doing some testing, I can send you the code. It's
still in rough shape, but I could use more eyes on this.
On Mon, Jun 28, 2010 at 09:43:14AM +0200, Danny Sternkopf wrote:
> maybe someone here can comment on this.
> -------- Original Message --------
> Subject: Re: [torqueusers] torque+blcr+openmpi
> Date: Fri, 25 Jun 2010 16:58:59 +0200
> From: Danny Sternkopf <dsternkopf at hpce.nec.com>
> Reply-To: dsternkopf at hpce.nec.com
> Organization: NEC Deutschland GmbH
> To: torqueusers at supercluster.org
> any news about this? I have the following setup:
> o torque 2.4.8
> o openmpi 1.4.2
> o blcr 0.8.2
> The checkpoint/restart scripts from Torque's contrib/blcr work for
> single-node applications without MPI. I created new scripts for OpenMPI
> applications. The checkpoint works, but the release does not. The issue
> might be that ompi-checkpoint writes a directory containing checkpoint
> files for each process plus metadata, while Torque expects a single
> checkpoint file. Any experience with this?
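If Torque really does expect a single checkpoint file, one possible
workaround is to pack the snapshot directory into one archive after the
checkpoint and unpack it again before the restart. A minimal Python
sketch of that idea follows. The function names and paths are
hypothetical, nothing here is part of Torque or Open MPI, and it has not
been tested with either.

```python
import tarfile
from pathlib import Path


def pack_snapshot(snapshot_dir, archive_path):
    """Pack a checkpoint directory into one uncompressed tar file."""
    snapshot_dir = Path(snapshot_dir)
    with tarfile.open(archive_path, "w") as tar:
        # Store entries relative to the directory name so that
        # unpacking recreates the same top-level directory.
        tar.add(snapshot_dir, arcname=snapshot_dir.name)
    return Path(archive_path)


def unpack_snapshot(archive_path, dest_dir):
    """Unpack the archive and return the restored snapshot directory."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r") as tar:
        # The first member is the top-level directory added above.
        top = tar.getnames()[0].split("/")[0]
        tar.extractall(dest_dir)
    return dest_dir / top
```

A checkpoint script could call pack_snapshot() on the directory written
by ompi-checkpoint, and a restart script could call unpack_snapshot()
before handing the directory to the restart command.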
> Btw, another issue is that the checkpoint/restart scripts run as root.
> ompi-checkpoint doesn't allow root to checkpoint user jobs, so you
> have to run ompi-checkpoint as the user. The restart script of course
> needs this as well, to restart the processes under the corresponding
> user id.
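One way a root-owned script could hand the work to the job owner is to
wrap the command in su. The Python sketch below only illustrates the
idea: the helper names are hypothetical, and it assumes the script
already knows the job owner and the PID of the job's mpirun process.

```python
import shlex
import subprocess


def as_user_command(user, argv):
    """Wrap argv so that it runs under `user` via su."""
    # shlex.join quotes each argument for the shell that su starts.
    return ["su", "-", user, "-c", shlex.join(argv)]


def run_as_user(user, argv):
    """Run argv as `user`; the caller must already be root."""
    return subprocess.run(as_user_command(user, argv), check=True)
```

For example, run_as_user(owner, ["ompi-checkpoint", str(mpirun_pid)])
would run the checkpoint under the owner's user id, where owner and
mpirun_pid are placeholders for values the script has to look up.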
> Furthermore, any comments on how to handle MPI and single-process
> applications with the same checkpoint/restart scripts?
> torquedev mailing list
> torquedev at supercluster.org