[torqueusers] Torque/maui node failure policy revisited again
Charles at Schwieters.org
Tue Dec 16 02:21:34 MST 2008
Hello Chris--
>
> ----- Charles at Schwieters.org wrote:
>
> > I saw a message with this subject from June, 2007 along with a
> > patch creating a fatal_job_poll_failure mom_priv/config option.
> > This option prevents the failure of a single node from deleting an
> > entire job.
>
> Could you explain why you would not want the existing
> behaviour please ?
>
My code is robust to the failure of any node except the first. I
have encountered many clusters with relatively unreliable nodes. Given
the current behavior, we avoid using Torque/PBS on such clusters.
I am not the first to request this feature, as the existence of the
patch attests. A Google search for fatal_job_poll_failure shows that a
patched version of Torque has been deployed in multiple locations.
thanks--
Charles