[torqueusers] Torque/maui node failure policy revisited again

Bogdan Costescu Bogdan.Costescu at iwr.uni-heidelberg.de
Tue Dec 16 05:39:35 MST 2008


On Tue, 16 Dec 2008, Glen Beane wrote:

> I think maybe PACX MPI allowed a programmer to recover if a rank was 
> lost (I never used it so I'm not 100% sure).

LA MPI has fault tolerance:

http://public.lanl.gov/lampi/

> I believe that Jeff Squyres was talking about adding fault tolerance 
> to OpenMPI.

The LANL group behind LA MPI now takes part in developing OpenMPI. For
the moment, OpenMPI has no fault tolerance like that of LA MPI, but the
plan is indeed to add something similar.

-- 
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de

