[torqueusers] MPI checkpointing and Torque/Moab

Chris Samuel csamuel at vpac.org
Thu Dec 7 16:40:51 MST 2006


On Thursday 07 December 2006 03:49, Chris Wilk wrote:

> I would rather avoid using kernel-level checkpointing e.g. BLCR.

Sounds like what you really want is for the scientific code to implement 
checkpointing itself.  Some codes (such as NAMD) already do this: if your job 
gets killed, you can restart it later from the last checkpoint it saved.
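
As a rough illustration of what application-level checkpointing looks like, 
here is a minimal single-process sketch.  It is not how NAMD or any other 
particular code does it; the file name, the state layout and the function 
names (save_checkpoint, load_checkpoint, run) are all made up for the 
example, and the stop_after argument exists only to simulate the job being 
killed partway through.

```python
import json
import os
import tempfile

CHECKPOINT = "state.ckpt"  # hypothetical file name, chosen for this example


def save_checkpoint(state, path=CHECKPOINT):
    """Write the state to a temp file, then atomically rename it into place,
    so a kill during the write never leaves a half-written checkpoint."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path=CHECKPOINT):
    """Return the last saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}


def run(n_steps, every=100, path=CHECKPOINT, stop_after=None):
    """Toy 'simulation': accumulate a running sum, checkpointing every
    `every` steps. `stop_after` simulates the job being killed early."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # "killed": work since the last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state
```

Calling run() a second time after an interrupted run picks up from the last 
saved step instead of starting from zero.  A real MPI code would also have to 
make sure that the state saved by the different ranks is mutually consistent, 
which this sketch does not attempt.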

The problem with doing MPI checkpoint/restart outside of the application is 
that if you have special interconnects (such as Myrinet), you may find that 
the device driver is pinning DMA memory on the cards for the MPI tasks' 
buffers.  My guess is that you'd need support in the MPI stack and the kernel 
driver to save, manipulate, release and reload these areas, and I know of 
none that can at the moment (though I'd be *really* interested & *very* happy 
to find out that I'm wrong!).

cheers,
Chris
-- 
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
