efocht at gmail.com
Thu Jul 29 04:21:24 MDT 2010
My colleague Danny is currently on an extended vacation, so I'll take over
the testing. So yes, please send me the code for testing!
Regarding the bug: as far as I remember, the comparison was simply hardcoded
to be bypassed for testing. That was not meant as a proper fix... We weren't
sure what the proper fix would be, or whether we actually need to make sure
that the host name is the same at all.
On 07/29/2010 03:29 AM, Eric Roman wrote:
> I have some code that fixes most of the issues here. I've been able to
> checkpoint and restart openMPI jobs using qhold and qrls.
> The bugzilla bug that you filed is still causing me problems:
> Did you ever work out a fix for this?
> If you're interested in doing some testing, I can send you the code. It's
> still in rough shape, but I could use more eyes on this.
> On Mon, Jun 28, 2010 at 09:43:14AM +0200, Danny Sternkopf wrote:
>> Maybe someone here can comment on this.
>> -------- Original Message --------
>> Subject: Re: [torqueusers] torque+blcr+openmpi
>> Date: Fri, 25 Jun 2010 16:58:59 +0200
>> From: Danny Sternkopf <dsternkopf at hpce.nec.com>
>> Reply-To: dsternkopf at hpce.nec.com
>> Organization: NEC Deutschland GmbH
>> To: torqueusers at supercluster.org
>> Any news about this? I have the following setup:
>> o torque 2.4.8
>> o openmpi 1.4.2
>> o blcr 0.8.2
>> The checkpoint/restart scripts from Torque's contrib/blcr work for
>> single-node applications without MPI. I created new scripts for OpenMPI
>> applications. The checkpoint works, but the release does not. The issue
>> might be that ompi-checkpoint writes a directory containing checkpoint
>> files for each process plus metadata, whereas Torque expects one single
>> checkpoint file. Has anyone had experience with this?
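One possible way to bridge the directory versus single file mismatch described
above is to pack the snapshot directory into one tar archive after the
checkpoint and unpack it again before the restart. The following is only a
minimal sketch under that assumption. The function names and paths are
hypothetical, and nothing in this thread confirms that Torque would accept
such an archive as its checkpoint file.

```python
import tarfile
from pathlib import Path


def pack_snapshot(snapshot_dir, archive_path):
    """Bundle a snapshot directory into one uncompressed tar file."""
    snapshot_dir = Path(snapshot_dir)
    with tarfile.open(archive_path, "w") as tar:
        # Store entries relative to the directory name so the tree
        # is recreated unchanged when the archive is unpacked.
        tar.add(snapshot_dir, arcname=snapshot_dir.name)
    return Path(archive_path)


def unpack_snapshot(archive_path, dest_dir):
    """Recreate the snapshot directory under dest_dir before a restart."""
    with tarfile.open(archive_path, "r") as tar:
        # The first member is the top-level snapshot directory itself.
        top_level = tar.getnames()[0]
        tar.extractall(dest_dir)
    return Path(dest_dir) / top_level
```

A checkpoint script could call pack_snapshot() once ompi-checkpoint has
finished, and a restart script could call unpack_snapshot() before the
restart command is run.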
>> Btw, another issue is that the checkpoint/restart scripts run as root.
>> ompi-checkpoint doesn't allow root to checkpoint user jobs, so you
>> have to run ompi-checkpoint as the user. The restart script of course
>> needs the same, to restart the processes under the corresponding user ID.
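The root versus user point above could be handled by having the root-owned
script wrap the command in `su`. This is a rough sketch, not the scripts
discussed in this thread. The helper name, the user name and the PID are
made-up placeholders, and it assumes a `su` that accepts `-c`.

```python
import shlex


def as_user_command(user, argv):
    """Wrap argv so that it is executed by `su` as the given user."""
    # shlex.join quotes each argument, so the command survives the
    # extra shell that `su -c` starts.
    return ["su", "-", user, "-c", shlex.join(argv)]


# Hypothetical values: the job owner and the PID of the job's mpirun.
cmd = as_user_command("alice", ["ompi-checkpoint", "12345"])
print(cmd)
```

From a script running as root, the resulting list could be passed to
subprocess.run(cmd, check=True). Note that `su -` starts a login shell, so
whether the Open MPI tools are on the PATH there depends on the site setup.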
>> Furthermore, any comments on how to handle MPI and single-process
>> applications with the same checkpoint/restart scripts?
>> torquedev mailing list
>> torquedev at supercluster.org