[torqueusers] Torque/maui node failure policy revisited again
Craig Macdonald
craigm at dcs.gla.ac.uk
Tue Dec 16 05:39:12 MST 2008
Hadoop on Demand (HOD), which allows a Hadoop cluster to be
instantiated on top of Torque, is resilient to node failures.
Craig
Chris Samuel wrote:
> ----- Charles at Schwieters.org wrote:
>
>
>> I saw a message with this subject from June, 2007 along with a
>> patch creating a fatal_job_poll_failure mom_priv/config option.
>> This option prevents the failure of a single node deleting an
>> entire job.
>>
>
> Could you explain why you would not want the existing
> behaviour, please?
>
> For all codes that I'm aware of at present losing a
> single node of a parallel job means the parallel job
> has failed so I'd be really interested to hear where
> that's not necessarily the case.
>
> cheers,
> Chris
>