[torquedev] BLCR checkpoint and restart checked into trunk

Dave Jackson jacksond at clusterresources.com
Thu Feb 14 13:04:47 MST 2008


Eric,

  Thank you for your offer.  We would very much like to take you up on
this.  Do you have time tomorrow?

Thanks,
Dave

On Thu, 2008-02-14 at 11:09 -0800, Eric Roman wrote:
> Hi,
> 
> I'm Eric Roman.  I work at LBL on the BLCR project with Paul Hargrove.
> I'm writing to offer help from our side to make BLCR and Torque work
> together.  This has been one of our goals for a long time now.
> 
> I sent in a patch a few months ago to the torquedev list with some basic
> checkpoint/restart support for Torque 2.1.9.  I used library preloading
> to bypass cr_run by assuming that libcr.so was in /etc/ld.so.preload, so
> there's no need to use cr_run to prelink everything.  (An alternative
> might be to set the LD_PRELOAD environment variable in MOM while
> forking.)  This should help with keeping things checkpointable.
> 
> I ran into some issues with MOM's internal state with respect to the
> checkpoints (they should be in the patch), but in the end I was able to
> qhold and qrls a bash script that counted to 100 and some NAS Serial
> Benchmarks.
> 
> We've done some work with the Open MPI guys to get parallel
> checkpoint/restart working.  I haven't followed up on this recently, but
> there's some code out there somewhere to integrate ompi-checkpoint and
> ompi-restart into BLCR's checkpoint sequence.  We needed to set some
> flags up to ignore spurious checkpoint signals, and wrap ompi-checkpoint
> and mpirun with a handler that forwards the BLCR checkpoint request to
> the MPI application.
> 
> I modified BLCR to restore process credentials (UID, GID, UNIX
> groups) on 2.6 kernels.  I'm not sure if that's in a released version,
> but it is in our CVS head.
> 
> I have some library calls that you can use now to request checkpoints
> from user space without using the cr_checkpoint executable.  We also
> have plans to do something similar for restart.  That should make it
> much easier to distinguish abnormal termination of cr_restart itself
> from abnormal termination of the restarted process.
> 
> I've attached the old patch for 2.1.9.
> 
> Should we talk on the phone?
> 
> Best Wishes,
> Eric
> 
> On Wed, Feb 13, 2008 at 06:59:07PM -0700, Steve Snelgrove wrote:
> >
> > I checked the BLCR checkpoint and restart changes into the trunk (2.3)
> > and made a snapshot release.  There might still be a few small problems
> > with this code but it does at least seem to work.
> >
> > I have done a documentation page for anyone who is interested in this
> > feature.
> >
> >  http://www.clusterresources.com/torquedocs21/2.6jobcheckpoint.shtml
> >
> >
> > _______________________________________________
> > torquedev mailing list
> > torquedev at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torquedev
> 
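
The LD_PRELOAD alternative Eric mentions (setting the variable in MOM
while forking, rather than relying on /etc/ld.so.preload) could look
roughly like the sketch below.  It is written in Python for brevity, not
in MOM's C, and it is not taken from the patch or from Torque.  The
helper names preload_env and spawn_job are hypothetical; the library
name libcr.so comes from the message above.

```python
import os
import subprocess


def preload_env(base_env, library="libcr.so"):
    """Return a copy of base_env with `library` first in LD_PRELOAD.

    Hypothetical helper, not part of Torque or BLCR.  The caller's
    mapping is left untouched, so only the child sees the change.
    """
    env = dict(base_env)
    existing = env.get("LD_PRELOAD", "")
    # ld.so accepts spaces or colons as separators in LD_PRELOAD.
    entries = [e for e in existing.replace(":", " ").split() if e]
    if library not in entries:
        entries.insert(0, library)
    env["LD_PRELOAD"] = ":".join(entries)
    return env


def spawn_job(argv, library="libcr.so"):
    """Start a job with the preload applied to the child only.

    Hypothetical stand-in for the point where MOM forks and execs
    the job; the parent's own environment is not modified.
    """
    return subprocess.Popen(argv, env=preload_env(os.environ, library))
```

Setting the variable only in the child's environment keeps the preload
scoped to job processes, whereas an /etc/ld.so.preload entry applies to
every dynamically linked process on the node.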


