[torquedev] BLCR checkpoint and restart checked into trunk
Dave Jackson
jacksond at clusterresources.com
Thu Feb 14 13:04:47 MST 2008
Eric,
Thank you for your offer. We would very much like to take you up on
it. Do you have time tomorrow?
Thanks,
Dave
On Thu, 2008-02-14 at 11:09 -0800, Eric Roman wrote:
> Hi,
>
> I'm Eric Roman. I work at LBL on the BLCR project with Paul Hargrove.
> I'm writing to offer help from our side to make BLCR and Torque work
> together. This has been one of our goals for a long time now.
>
> I sent in a patch a few months ago to the torquedev list with some basic
> checkpoint/restart support for Torque 2.1.9. I used library preloading
> to bypass cr_run by assuming that libcr.so was in /etc/ld.so.preload, so
> there's no need to use cr_run to prelink everything. (An alternative
> might be to set the LD_PRELOAD environment variable in MOM while
> forking.) This should help with keeping things checkpointable.
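
A minimal sketch of the LD_PRELOAD alternative mentioned above, written in Python for brevity. MOM itself is C and would presumably set the variable between fork and exec. The library path and the helper names `preload_env` and `spawn_job` are assumptions for illustration only; nothing here is taken from the patch.

```python
import os
import subprocess

# Hypothetical install location; the real path to libcr.so depends on the site.
LIBCR_PATH = "/usr/lib/libcr.so"


def preload_env(base_env, lib_path=LIBCR_PATH):
    """Return a copy of base_env with lib_path placed first in LD_PRELOAD."""
    env = dict(base_env)
    existing = env.get("LD_PRELOAD", "")
    # ld.so accepts a colon- or space-separated list, so keep existing entries.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


def spawn_job(argv, lib_path=LIBCR_PATH):
    """Start argv as a child process with the checkpoint library preloaded."""
    return subprocess.Popen(argv, env=preload_env(os.environ, lib_path))
```

Setting the variable only in the child's environment keeps the preload scoped to job processes, whereas an entry in /etc/ld.so.preload applies to every process on the node.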
>
> I ran into some issues with MOM's internal state with respect to the
> checkpoints (they should be in the patch), but in the end I was able to
> qhold and qrls a bash script that counted to 100 and some of the NAS
> Serial Benchmarks.
>
> We've done some work with the Open MPI guys to get parallel
> checkpoint/restart working. I haven't followed up on this recently, but
> there's some code out there somewhere to integrate ompi-checkpoint and
> ompi-restart into BLCR's checkpoint sequence. We needed to set some
> flags up to ignore spurious checkpoint signals, and wrap ompi-checkpoint
> and mpirun with a handler that forwards the BLCR checkpoint request to
> the MPI application.
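
The forwarding wrapper described above can be pictured roughly as follows. This is only a sketch of the general pattern: a plain POSIX signal (SIGUSR1) stands in for the checkpoint request, because how BLCR actually delivers the request is not described in this thread. The names `make_forwarder` and `run_wrapped` are hypothetical and do not come from BLCR or Open MPI.

```python
import os
import signal
import subprocess

# Stand-in for the checkpoint request (an assumption for illustration only).
REQUEST_SIGNAL = signal.SIGUSR1


def make_forwarder(child_pid, kill=os.kill):
    """Build a signal handler that relays the received signal to child_pid."""
    def handler(signum, frame):
        kill(child_pid, signum)
    return handler


def run_wrapped(argv):
    """Launch argv and relay REQUEST_SIGNAL to it until it exits."""
    child = subprocess.Popen(argv)
    # A request arriving before this line is not forwarded; a real wrapper
    # would need to close that window.
    signal.signal(REQUEST_SIGNAL, make_forwarder(child.pid))
    return child.wait()
```

The `kill` parameter exists only so the relay logic can be exercised without real processes.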
>
> I modified BLCR to restore process credentials (UID, GID, UNIX
> groups) on 2.6 kernels. I'm not sure if that's in a released version,
> but it is in our CVS head.
>
> I have some library calls that you can use now to request checkpoints
> from user space without using the cr_checkpoint executable. We also
> have plans to do something similar for restart. That should make it
> much easier to distinguish abnormal termination of cr_restart itself
> from abnormal termination of the restarted process.
>
> I've attached the old patch for 2.1.9.
>
> Should we talk on the phone?
>
> Best Wishes,
> Eric
>
> On Wed, Feb 13, 2008 at 06:59:07PM -0700, Steve Snelgrove wrote:
> >
> > I checked the BLCR checkpoint and restart changes into the trunk (2.3)
> > and made a snapshot release. There might still be a few small problems
> > with this code but it does at least seem to work.
> >
> > I have written a documentation page for anyone who is interested in this
> > feature.
> >
> > http://www.clusterresources.com/torquedocs21/2.6jobcheckpoint.shtml
> >
> >
> > _______________________________________________
> > torquedev mailing list
> > torquedev at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torquedev
>