[torqueusers] Torque/maui node failure policy revisited again

Charles at Schwieters.org Charles at Schwieters.org
Tue Dec 16 02:21:34 MST 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Hello Chris--

> 
> ----- Charles at Schwieters.org wrote:
> 
> >   I saw a message with this subject from June, 2007 along with a
> > patch creating a fatal_job_poll_failure mom_priv/config option.
> > This option prevents the failure of a single node from deleting an
> > entire job.
> 
> Could you explain why you would not want the existing
> behaviour please ?
> 

My code is robust to the failure of any node except the first.  I
have encountered many clusters with relatively unreliable nodes. Because
of the current behavior, we avoid using torque/pbs on such clusters.

I am not the first to request this feature, as attested by the existence
of the patch. A google search on fatal_job_poll_failure shows that a
patched version of torque has been deployed in multiple locations.

thanks--
Charles
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.8+ <http://mailcrypt.sourceforge.net/>

iD8DBQFJR3MePK2zrJwS/lYRAgBwAKCIPjcq2eocf+9uoyw8nZIu+I53HgCffqLG
3BonTTuytQPu78ltwEEKUss=
=+KMc
-----END PGP SIGNATURE-----

