[torqueusers] Torque/maui node failure policy revisited again

Glen Beane glen.beane at gmail.com
Fri Jan 2 12:37:43 MST 2009

On Sun, Dec 28, 2008 at 1:19 PM, Glen Beane <glen.beane at gmail.com> wrote:
> On Wed, Dec 17, 2008 at 6:21 PM, Chris Samuel <csamuel at vpac.org> wrote:
>> ----- "Glen Beane" <glen.beane at gmail.com> wrote:
>>> This may be a good feature to add triggered by a
>>> per job setting.
>> Hmm, good point, I was thinking that we couldn't
>> use the feature here, but if it defaulted to off
>> and could be enabled on request it would be more
>> useful.
>> Probably would be far more invasive though.
> I'll probably go ahead and make this a job attribute.  I'll add
> support for it to the disallowed_types queue attribute so you can
> disallow fault tolerant or fault intolerant jobs from a particular
> queue. I will also add an option to specify the default value of the
> attribute to torque.cfg (it will default to false (not fault tolerant)
> unless overridden in torque.cfg).

I'm almost ready to commit this change to trunk (the 2.4.0 dev
branch).  Instead of a mom config option enabling this behavior, it
is enabled on a per-job basis.  The qsub option is -f
(for fault_tolerant),  or if you build torque with
fault_tolerant=[0,1,yes,no,true,false] (this is case insensitive, and
only the first character of the string is actually examined)
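
A minimal sketch of the first-character rule described above, in
Python rather than Torque's own C.  The function name
parse_fault_tolerant is hypothetical, and two things are assumptions
not stated in this post: that 1/y/t mean true and 0/n/f mean false,
and that any other value is rejected with an error.

```python
def parse_fault_tolerant(value: str) -> bool:
    """Interpret a fault_tolerant value by its first character only.

    Hypothetical helper, not Torque code. The mapping of characters
    to true/false is an assumption based on the accepted values
    0, 1, yes, no, true, false.
    """
    if not value:
        raise ValueError("empty fault_tolerant value")

    # Case insensitive: only the lowercased first character matters.
    first = value[0].lower()
    if first in ("1", "y", "t"):
        return True
    if first in ("0", "n", "f"):
        return False

    # Assumption: anything else is treated as an error.
    raise ValueError(f"unrecognized fault_tolerant value: {value!r}")
```

Under this rule "YES", "y" and "Trueish" would all be read as true,
since nothing after the first character is inspected.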

The fault_tolerant job attribute defaults to false, but the default
value can be changed with the FAULT_TOLERANT_BY_DEFAULT torque.cfg
parameter (add FAULT_TOLERANT_BY_DEFAULT true to

I will also be adding fault_tolerant to the queue disallowed_types
attribute.  This will cause any job with fault_tolerant set to true to
be rejected from a queue that disallows it.  I will probably also add
fault_intolerant to disallowed_types for completeness.
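
As a rough illustration of how the default and the queue check could
fit together, here is a small Python sketch.  It is not the Torque
implementation: the function names are hypothetical, and it assumes
that fault_intolerant (which is only a probable addition) rejects
jobs whose attribute is false.

```python
def effective_fault_tolerant(job_value, cfg_default=False):
    """Return the job's fault_tolerant setting.

    job_value is None when the job did not set the attribute;
    cfg_default stands in for FAULT_TOLERANT_BY_DEFAULT and is
    false unless overridden.
    """
    return cfg_default if job_value is None else job_value


def queue_accepts(disallowed_types, fault_tolerant):
    """Check a job against a queue's disallowed_types list."""
    if fault_tolerant and "fault_tolerant" in disallowed_types:
        return False
    # Assumption: fault_intolerant rejects jobs that are not
    # fault tolerant.
    if not fault_tolerant and "fault_intolerant" in disallowed_types:
        return False
    return True
```

For example, a job that does not set the attribute, on a system with
no torque.cfg override, resolves to false and would still be accepted
by a queue whose disallowed_types contains only fault_tolerant.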

I'm looking for someone willing to test this out, since I don't have
a cluster I can really test this on. Please email me directly if you
are willing to test it a bit.
