[torqueusers] Torque/maui node failure policy revisited again
glen.beane at gmail.com
Fri Jan 2 12:37:43 MST 2009
On Sun, Dec 28, 2008 at 1:19 PM, Glen Beane <glen.beane at gmail.com> wrote:
> On Wed, Dec 17, 2008 at 6:21 PM, Chris Samuel <csamuel at vpac.org> wrote:
>> ----- "Glen Beane" <glen.beane at gmail.com> wrote:
>>> This may be a good feature to add triggered by a
>>> per job setting.
>> Hmm, good point, I was thinking that we couldn't
>> use the feature here, but if it defaulted to off
>> and could be enabled on request it would be more
>> Probably would be far more invasive though.
> I'll probably go ahead and make this a job attribute. I'll add
> support for it to the disallowed_types queue attribute so you can
> disallow fault tolerant or fault intolerant jobs from a particular
> queue. I will also add an option to specify the default value of the
> attribute to torque.cfg (it will default to false (not fault tolerant)
> unless overridden in torque.cfg).
I'm almost ready to commit this change to trunk (the 2.4.0 dev
branch). Instead of a mom config option, the behavior is enabled
on a per-job basis. The qsub option is -f (for fault_tolerant) or,
if you build torque with PBS_NO_POSIX_VIOLATION defined,
-W fault_tolerant=[0,1,yes,no,true,false]. The value is case
insensitive, and only the first character of the string is
actually examined.
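To make the first-character rule concrete, here is a rough sketch in
Python (not the actual torque C code). The function name
parse_fault_tolerant is made up for illustration, and raising an error
on an unrecognized value is my assumption; the real code may handle
that case differently.

```python
def parse_fault_tolerant(value: str) -> bool:
    """Interpret a fault_tolerant value from its first character only.

    Hypothetical helper, not part of torque. Accepts 0/1, yes/no,
    true/false in any case, as described above.
    """
    if not value:
        # Assumption: an empty value is treated as an error.
        raise ValueError("empty fault_tolerant value")

    first = value[0].lower()  # only the first character matters
    if first in ("1", "y", "t"):
        return True
    if first in ("0", "n", "f"):
        return False

    # Assumption: anything else is rejected rather than silently ignored.
    raise ValueError(f"unrecognized fault_tolerant value: {value!r}")
```

One consequence of looking only at the first character is that a value
such as "Yup" or "TRUE" is accepted as true, and "nope" as false.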
The fault_tolerant job attribute defaults to false, but the default
value can be changed with the FAULT_TOLERANT_BY_DEFAULT torque.cfg
parameter (add FAULT_TOLERANT_BY_DEFAULT true to
I will also be adding fault_tolerant to the queue disallowed_types
attribute. A queue that lists it will reject any job that has the
attribute set to true. I will probably also add fault_intolerant to
disallowed_types for completeness.
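As a sketch of how the queue check and the default could fit together
(Python for illustration only, not the torque implementation): the
function name queue_accepts, the use of None for "job did not set the
attribute", and the set-of-strings representation of disallowed_types
are all my assumptions.

```python
from typing import Iterable, Optional


def queue_accepts(job_fault_tolerant: Optional[bool],
                  disallowed_types: Iterable[str],
                  default_fault_tolerant: bool = False) -> bool:
    """Decide whether a queue would accept a job (hypothetical sketch).

    job_fault_tolerant: True/False if the job set the attribute,
        None if it did not (assumed representation).
    disallowed_types: the queue's disallowed_types entries.
    default_fault_tolerant: stands in for FAULT_TOLERANT_BY_DEFAULT;
        false unless overridden.
    """
    # Fall back to the configured default when the job did not say.
    effective = (default_fault_tolerant
                 if job_fault_tolerant is None
                 else job_fault_tolerant)

    disallowed = set(disallowed_types)
    if effective and "fault_tolerant" in disallowed:
        return False  # queue refuses fault tolerant jobs
    if not effective and "fault_intolerant" in disallowed:
        return False  # queue refuses fault intolerant jobs
    return True
```

With this layout a queue that disallows fault_tolerant still accepts a
job that leaves the attribute unset, unless the default has been
switched to true.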
I'm looking for someone willing to test this, since I don't have a
cluster I can really test it on. Please email me directly if you
can help.