[torqueusers] Torque/maui node failure policy revisited again

Craig Macdonald craigm at dcs.gla.ac.uk
Tue Dec 16 05:39:12 MST 2008


Hadoop on Demand (HOD), which allows a Hadoop cluster to be
instantiated on top of Torque, is resilient to node failures.

Craig

Chris Samuel wrote:
> ----- Charles at Schwieters.org wrote:
>
>   
>>   I saw a message with this subject from June, 2007 along with a
>> patch creating a fatal_job_poll_failure mom_priv/config option.
>> This option prevents the failure of a single node deleting an
>> entire job.
>>     
>
> Could you explain why you would not want the existing
> behaviour please ?
>
> For all codes that I'm aware of at present losing a
> single node of a parallel job means the parallel job
> has failed so I'd be really interested to hear where
> that's not necessarily the case.
>
> cheers,
> Chris
>   


