[torqueusers] Job Nanny Poll

glen.beane at gmail.com glen.beane at gmail.com
Mon Nov 21 14:39:57 MST 2011

On Nov 21, 2011, at 4:04 PM, David Beer <dbeer at adaptivecomputing.com> wrote:

> ----- Original Message -----
>> David,
>> We currently have job_nanny set in our qmgr.  I don't remember many of
>> the details, but we were struggling a long time ago with some of the
>> issues related to stray jobs not being deleted, and someone at Adaptive
>> (or maybe it was still CRI at the time), recommended we set it, as an
>> additional check.  If there's a better way to do it, that's fine, but
>> many of us try really hard not to change things once they're working.
>> That and we're not aware of the new features.  I mean according to the
>> changelog, that parameter was added in 2.4.9, and I think we've been
>> using job_nanny much longer than that.  I wasn't aware of it until your
>> email.
>> Lloyd Brown
>> Systems Administrator
>> Fulton Supercomputing Lab
>> Brigham Young University
>> http://marylou.byu.edu
>> On 11/21/2011 12:06 PM, David Beer wrote:
>>> All,
>>> Just a quick poll question - do people use the job delete nanny
>>> functionality in TORQUE? If you do, in qmgr you would have the
>>> line:
>>> set job_nanny = True
>>> I'm curious how many people are using it - this seems like very
>>> repetitive functionality to me (pbs_mom does pretty much the same
>>> thing already) and I personally think job_force_cancel_time is
>>> better, but I may be biased.
> Allow me to re-ask my question in a different way - what is the use case for which you all are using the job_nanny feature? 
> Just to dispel any fears, I'm only asking this for my own curiosity. I'm working on something that has me looking at that code and I'm just curious if people use it and what they use it for. This code is not being deleted or removed.

I think the use case is a mom that doesn't respond to the first delete request but is actually still up.

With TCP instead of UDP, I don't see this being much of a problem.

For a temporary failure or a dropped message, I think the nanny is better than a purge, since the normal termination/cleanup still happens.
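
To make the distinction concrete, here is a rough Python sketch of the idea, not TORQUE code. Everything in it (FlakyMom, nanny_delete, purge) is a hypothetical name made up for illustration, and the model is deliberately simplified: the mom just drops the first delete request, the nanny resends until one gets through, and a purge only drops the server's record.

```python
class FlakyMom:
    """Hypothetical stand-in for a mom that is up but loses early delete requests."""

    def __init__(self, drop_first=1):
        self.drop_first = drop_first   # how many delete requests get lost
        self.requests_seen = 0
        self.job_running = True
        self.cleanup_ran = False

    def delete_job(self):
        """Return True if the request was handled, False if it was dropped."""
        self.requests_seen += 1
        if self.requests_seen <= self.drop_first:
            return False               # message lost; job keeps running
        self.cleanup_ran = True        # normal termination/cleanup path
        self.job_running = False
        return True


def nanny_delete(mom, max_attempts=5):
    """Resend the delete until the mom handles it; return the attempt number."""
    for attempt in range(1, max_attempts + 1):
        if mom.delete_job():
            return attempt
    return None                        # mom never answered


def purge(server_jobs, job_id):
    """Assumed purge behavior in this model: forget the job on the server only."""
    server_jobs.discard(job_id)
```

In this model, nanny_delete on a mom that drops one request succeeds on the second attempt and the cleanup runs, whereas purge removes the job from the server's set and leaves the mom's job_running flag untouched.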

> -- 
> David Beer 
> Direct Line: 801-717-3386 | Fax: 801-717-3738
>     Adaptive Computing
>     1712 S East Bay Blvd, Suite 300
>     Provo, UT 84606
