[torqueusers] MPI checkpointing and Torque/Moab
Chris Samuel
csamuel at vpac.org
Thu Dec 7 16:40:51 MST 2006
On Thursday 07 December 2006 03:49, Chris Wilk wrote:
> I would rather avoid using kernel-level checkpointing e.g. BLCR.
Sounds like what you really want is for the scientific code to implement
checkpointing itself. Some codes (such as NAMD) already do this: if your job
gets killed, you can restart it later from the last checkpoint it saved.
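To illustrate the idea of a code checkpointing itself, here is a minimal
sketch in Python. It is not how NAMD or any particular code does it; the
function names, the JSON file format and the toy "computation" (a running
sum) are all made up for the example. The job periodically writes its state
to a file, and on startup it resumes from that file if one exists.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it over `path`,
    so a kill during the write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, every=10, stop_at=None):
    """Toy job: sum the integers 0..n_steps-1, checkpointing every `every`
    steps. `stop_at` (hypothetical, for demonstration only) simulates the
    job being killed before it finishes."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if stop_at is not None and state["step"] >= stop_at:
            return state  # simulated kill: work since last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the job is killed partway through, simply running it again with the same
checkpoint path picks up from the last saved step rather than from step 0.
In a real MPI code each rank would typically save its own portion of the
state at an agreed point, but that is beyond this sketch.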
The problem with doing MPI checkpoint/restart outside of the application is
that if you have special interconnects (such as Myrinet) you may find that
the device driver is pinning DMA memory on the cards for the MPI tasks'
buffers. My guess is that you'd need support in both the MPI stack and the
kernel driver to save, manipulate, release and reload these areas, and I know
of none that can at the moment (though I'd be *really* interested & *very*
happy to find out that I'm wrong!).
cheers,
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia