[torqueusers] Torque/maui node failure policy revisted again

Glen Beane glen.beane at gmail.com
Tue Dec 16 05:29:17 MST 2008


On Mon, Dec 15, 2008 at 4:54 PM, Chris Samuel <csamuel at vpac.org> wrote:
>
> ----- Charles at Schwieters.org wrote:
>
>>   I saw a message with this subject from June, 2007 along with a
>> patch creating a fatal_job_poll_failure mom_priv/config option.
>> This option prevents the failure of a single node deleting an
>> entire job.
>
> Could you explain why you would not want the existing
> behaviour please ?
>
> For all codes that I'm aware of at present losing a
> single node of a parallel job means the parallel job
> has failed so I'd be really interested to hear where
> that's not necessarily the case.


Some MPI implementations allow the job to recover if a rank is lost
(it is up to the programmer to recover, but the application doesn't
crash).  I think PACX-MPI allowed a programmer to recover if a rank
was lost (I never used it, so I'm not 100% sure).  I have work-pool
type MPI applications that I've developed that could be modified to
cope with losing any rank other than rank 0.  I believe Jeff Squyres
was talking about adding fault tolerance to Open MPI.  On a modern
cluster, losing a node means you're probably losing 8 ranks - lots of
programs would not be able to recover from this (or the code would
just get too nasty, I think), but work-pool type programs just need
to send the lost work units back out to surviving ranks - not too big
of a deal.
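To make the work-pool idea concrete, here is a minimal sketch in plain
Python (not MPI) of the recovery pattern described above: a master hands
out work units, and when a worker "rank" is lost its outstanding unit is
simply requeued for a surviving rank.  All the names and the failure
simulation here are illustrative assumptions, not code from any actual
application mentioned in this thread.

```python
import queue

def run_work_pool(work_units, n_workers, doomed_worker=None):
    """Process every unit even if one worker 'rank' dies mid-run."""
    pending = queue.Queue()
    for unit in work_units:
        pending.put(unit)
    in_flight = {}                 # worker id -> unit currently assigned
    alive = set(range(n_workers))
    results = []

    while len(results) < len(work_units):
        # hand out work to idle, live workers
        for w in list(alive):
            if w not in in_flight and not pending.empty():
                in_flight[w] = pending.get()

        # one round of "computation"; one worker may be lost this round
        for w, unit in list(in_flight.items()):
            if w == doomed_worker:
                # simulated node failure: drop the rank, requeue its unit
                alive.discard(w)
                pending.put(in_flight.pop(w))
                doomed_worker = None
            else:
                results.append(unit * unit)   # the actual "work"
                del in_flight[w]
    return sorted(results)

# Worker 1 is lost mid-run, but every unit still gets processed.
print(run_work_pool(list(range(6)), n_workers=3, doomed_worker=1))
# -> [0, 1, 4, 9, 16, 25]
```

The same bookkeeping (tracking which unit is in flight on which rank,
and requeueing on failure) is what lets a real MPI work-pool survive a
lost node, provided rank 0 itself survives and the MPI layer reports
the failure instead of aborting the whole job.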


This may be a good feature to add, triggered by a per-job setting.

