[torqueusers] Torque/maui node failure policy revisited again

Glen Beane glen.beane at gmail.com
Fri Jan 2 12:37:43 MST 2009


On Sun, Dec 28, 2008 at 1:19 PM, Glen Beane <glen.beane at gmail.com> wrote:
> On Wed, Dec 17, 2008 at 6:21 PM, Chris Samuel <csamuel at vpac.org> wrote:
>>
>> ----- "Glen Beane" <glen.beane at gmail.com> wrote:
>>
>>> This may be a good feature to add triggered by a
>>> per job setting.
>>
>> Hmm, good point, I was thinking that we couldn't
>> use the feature here, but if it defaulted to off
>> and could be enabled on request it would be more
>> useful.
>>
>> Probably would be far more invasive though.
>
>
> I'll probably go ahead and make this a job attribute.  I'll add
> support for it to the disallowed_types queue attribute so you can
> disallow fault tolerant or fault intolerant jobs from a particular
> queue. I will also add an option to specify the default value of the
> attribute to torque.cfg (it will default to false (not fault tolerant)
> unless overridden in torque.cfg).



I'm almost ready to commit this change to trunk (the 2.4.0 dev
branch).  Instead of a mom config option, this behavior is enabled
on a per-job basis.  The qsub option is -f (for fault_tolerant), or,
if you build torque with PBS_NO_POSIX_VIOLATION defined,
-W fault_tolerant=[0,1,yes,no,true,false].  The value is case
insensitive, and only the first character of the string is actually
examined.
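To make the "first character only" rule concrete, here is a minimal
sketch in Python (not the actual torque C code).  The function name
parse_fault_tolerant is hypothetical, and rejecting unrecognized or
empty values with an error is my assumption; the real implementation
may handle those differently.

```python
def parse_fault_tolerant(value: str) -> bool:
    """Interpret a fault_tolerant value by its first character only.

    Hypothetical helper: accepts 0/1, yes/no, true/false in any case.
    Raising on anything else is an assumption, not documented behavior.
    """
    if not value:
        raise ValueError("empty fault_tolerant value")
    first = value[0].lower()  # case insensitive, first character only
    if first in ("1", "y", "t"):
        return True
    if first in ("0", "n", "f"):
        return False
    raise ValueError(f"unrecognized fault_tolerant value: {value!r}")
```

One consequence of looking only at the first character is that a
misspelled value such as "tru" or "Nope" would still be accepted in
this sketch.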

The fault_tolerant job attribute defaults to false, but the default
value can be changed with the FAULT_TOLERANT_BY_DEFAULT torque.cfg
parameter (add FAULT_TOLERANT_BY_DEFAULT true to
/var/spool/torque/torque.cfg).
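The way the default and the per-job setting interact could be sketched
roughly as follows.  This is illustrative Python, not torque code: the
function names are hypothetical, and I am assuming torque.cfg lines are
whitespace-separated "NAME value" pairs with "#" starting a comment.

```python
from typing import Optional


def default_fault_tolerant(cfg_text: str) -> bool:
    """Return the default taken from torque.cfg contents (assumed format)."""
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        parts = line.split(None, 1)
        if parts[0] == "FAULT_TOLERANT_BY_DEFAULT" and len(parts) == 2:
            # Same first-character convention as the job attribute (assumed).
            return parts[1].strip()[:1].lower() in ("1", "y", "t")
    return False  # not fault tolerant unless overridden in torque.cfg


def effective_fault_tolerant(job_value: Optional[bool], cfg_text: str) -> bool:
    """An explicit per-job value wins; otherwise fall back to the default."""
    if job_value is not None:
        return job_value
    return default_fault_tolerant(cfg_text)
```

For example, with "FAULT_TOLERANT_BY_DEFAULT true" in the file, a job
submitted without the option would come out fault tolerant, while a
job that explicitly asks for false would not.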

I will also be adding fault_tolerant to the queue disallowed_types
attribute.  This will cause any job with fault_tolerant set to true to
be rejected from that queue.  I will probably also add
fault_intolerant to disallowed_types for completeness.
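The intended queue check amounts to something like the sketch below.
Again this is illustrative Python rather than the server code; the
function name queue_accepts is hypothetical, and the fault_intolerant
branch assumes that type does get added.

```python
from typing import Iterable


def queue_accepts(job_fault_tolerant: bool, disallowed_types: Iterable[str]) -> bool:
    """Return False if the queue's disallowed_types rules out this job."""
    disallowed = set(disallowed_types)
    if job_fault_tolerant and "fault_tolerant" in disallowed:
        return False  # queue rejects fault tolerant jobs
    if not job_fault_tolerant and "fault_intolerant" in disallowed:
        return False  # assumed: queue rejects jobs that are not fault tolerant
    return True
```

Other entries in disallowed_types are simply ignored here, since they
concern other job properties.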

I'm looking for someone who is willing to test this, since I don't
have a cluster I can really test it on.  Please email me directly if
you can help.

