[torqueusers] Torque/maui node failure policy revisited again
csamuel at vpac.org
Mon Dec 15 14:54:06 MST 2008
----- Charles at Schwieters.org wrote:
> I saw a message with this subject from June, 2007 along with a
> patch creating a fatal_job_poll_failure mom_priv/config option.
> This option prevents the failure of a single node from
> deleting an entire job.
Could you explain why you would not want the existing
behaviour, please?
For all codes that I'm aware of at present, losing a
single node of a parallel job means the parallel job
has failed, so I'd be really interested to hear where
that's not necessarily the case.
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency