[torquedev] torque+blcr+openmpi

Erich Focht efocht at gmail.com
Thu Jul 29 04:21:24 MDT 2010


Hi Eric,

my colleague Danny is currently on an extended vacation, so I'll take over
the testing. So yes, please send me the code for testing!

Regarding the bug, I remember that it was just hardcoded to bypass the
comparison for testing. That was not meant as a proper fix... We weren't
sure what the proper fix would be, whether we actually need to make sure
that the host name is the same or not.

Best regards,
Erich


On 07/29/2010 03:29 AM, Eric Roman wrote:
> 
> Danny,
> 
> I have some code that fixes most of the issues here.  I've been able to
> checkpoint and restart OpenMPI jobs using qhold and qrls.
> 
> The bugzilla bug that you filed is still causing me problems:
> http://www.clusterresources.com/bugzilla/show_bug.cgi?id=68
> 
> Did you ever work out a fix for this?
> 
> If you're interested in doing some testing, I can send you the code.  It's
> still in rough shape, but I could use more eyes on this.
> 
> Eric
> 
> On Mon, Jun 28, 2010 at 09:43:14AM +0200, Danny Sternkopf wrote:
>> Hi,
>>
>> maybe someone here can comment on this.
>>
>> Regards,
>>
>> Danny
>>
>> -------- Original Message --------
>> Subject: Re: [torqueusers] torque+blcr+openmpi
>> Date: Fri, 25 Jun 2010 16:58:59 +0200
>> From: Danny Sternkopf <dsternkopf at hpce.nec.com>
>> Reply-To: dsternkopf at hpce.nec.com
>> Organization: NEC Deutschland GmbH
>> To: torqueusers at supercluster.org
>>
>> Hi,
>>
>> any news about this? I have the following setup:
>> o torque 2.4.8
>> o openmpi 1.4.2
>> o blcr 0.8.2
>>
>> The checkpoint/restart scripts from Torque's contrib/blcr work for
>> single-node applications without MPI. I created new scripts for OpenMPI
>> applications. The checkpoint works, but the release does not. The issue
>> might be that ompi-checkpoint writes a directory containing checkpoint
>> files for each process plus metadata, whereas Torque expects a single
>> checkpoint file. Has anyone had experience with this?
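
One possible workaround for the directory-versus-file mismatch described
above is to pack the snapshot directory into a single archive at checkpoint
time and unpack it again before restart. The following is only a minimal
sketch in Python, not the contrib/blcr scripts and not the code discussed in
this thread. The function names are hypothetical, and whether Torque would
accept such an archive as its checkpoint file is an assumption that would
need testing.

```python
import tarfile
from pathlib import Path


def pack_snapshot(snapshot_dir, archive_path):
    """Pack a checkpoint directory (hypothetical layout) into one tar file."""
    snapshot_dir = Path(snapshot_dir)
    if not snapshot_dir.is_dir():
        raise NotADirectoryError(f"{snapshot_dir} is not a directory")
    with tarfile.open(archive_path, "w") as tar:
        # Store entries relative to the snapshot's own name so the archive
        # unpacks to the same directory name anywhere.
        tar.add(snapshot_dir, arcname=snapshot_dir.name)
    return Path(archive_path)


def unpack_snapshot(archive_path, dest_dir):
    """Unpack the archive under dest_dir and return the restored directory."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r") as tar:
        # The first member is the top-level snapshot directory itself.
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(dest_dir)
    return dest_dir / top_level
```

A checkpoint script could call pack_snapshot() after ompi-checkpoint has
finished, and a restart script could call unpack_snapshot() before handing
the restored directory to ompi-restart.
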
>>
>> Btw, another issue is that the checkpoint/restart scripts run as root.
>> ompi-checkpoint doesn't allow root to checkpoint user jobs, so you
>> have to run ompi-checkpoint as the user. The restart script of course
>> needs this as well, to restart the processes under the corresponding user ID.
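
Since the scripts run as root but ompi-checkpoint has to run as the job
owner, the script needs to switch users before calling it. Below is a rough
sketch of one way to do that from a Python wrapper, assuming a POSIX system
with su available. The helper names as_user_command and checkpoint_as_user
are made up for illustration, and the PID passed to ompi-checkpoint is
assumed to be that of the job's mpirun process.

```python
import shlex
import subprocess


def as_user_command(user, argv):
    """Build an argv that runs `argv` as `user` via su (assumes POSIX su)."""
    if not user or user.startswith("-"):
        raise ValueError("invalid user name")
    # Quote each argument so the command survives su's shell unchanged.
    command = " ".join(shlex.quote(arg) for arg in argv)
    return ["su", user, "-c", command]


def checkpoint_as_user(user, mpirun_pid):
    """Run ompi-checkpoint as the job owner; must itself be started as root."""
    cmd = as_user_command(user, ["ompi-checkpoint", str(mpirun_pid)])
    return subprocess.run(cmd, check=True)
```

The same wrapper idea would apply to the restart side, with ompi-restart in
place of ompi-checkpoint.
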
>>
>> Furthermore, any comments on handling MPI and single-process applications
>> with the same checkpoint/restart scripts?
>>
>> Regards,
>>
>> Danny
>> ---
>> _______________________________________________
>> torquedev mailing list
>> torquedev at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torquedev

