[torqueusers] Torque/maui node failure policy revisited again

Chris Samuel csamuel at vpac.org
Mon Dec 15 14:54:06 MST 2008


----- Charles at Schwieters.org wrote:

>   I saw a message with this subject from June 2007, along with a
> patch creating a fatal_job_poll_failure mom_priv/config option.
> This option prevents the failure of a single node from deleting an
> entire job.

Could you explain why you would not want the existing
behaviour, please?

For all the codes that I'm aware of at present, losing a
single node of a parallel job means the whole job has
failed, so I'd be really interested to hear where
that's not necessarily the case.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency

