[torqueusers] Job Nanny Poll

glen.beane at gmail.com glen.beane at gmail.com
Mon Nov 21 14:32:07 MST 2011



On Nov 21, 2011, at 3:55 PM, David Beer <dbeer at adaptivecomputing.com> wrote:

> 
> 
> ----- Original Message -----
>> 
>> 
>> ----- Original Message -----
>>> On Mon, Nov 21, 2011 at 2:06 PM, David Beer
>>> <dbeer at adaptivecomputing.com> wrote:
>>>> All,
>>>> 
>>>> Just a quick poll question - do people use the job delete nanny
>>>> functionality in TORQUE? If you do, in qmgr you would have the
>>>> line:
>>>> 
>>>> set server job_nanny = True
>>>> 
>>>> I'm curious how many people are using it - this seems like very
>>>> repetitive functionality to me (pbs_mom does pretty much the same
>>>> thing already) and I personally think job_force_cancel_time is
>>>> better, but I may be biased.
>>> 
>>> 
>>> I use the job delete nanny, but I am not familiar with
>>> job_force_cancel_time. I have been using the job delete nanny for a
>>> long time.
>>> 
>>> What exactly does it do? I presume some of the multi-threading in
>>> pbs_server in TORQUE 4.0 can clean up some of this code a little,
>>> since pbs_server can spawn a thread to hang out and manage the job
>>> delete (rather than needing to set a work task to check the status
>>> of the delete in the future).
>> 
>> I'm also using job_nanny; this is the first I've heard of
>> job_force_cancel_time. Following a quick search it looks like it
>> might have been undocumented for a short period. So it takes an int
>> but there's not a recommended value, maybe 300? Is this the
>> automation of 'qdel -p [jobid]' or similar for jobs *stuck* when a
>> node stops responding?
>> 
>> Steve
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
>> 
> 
> job_force_cancel_time is different from the nanny (some sites may even wish to use both). I implemented job_force_cancel_time at the request of a customer for this use case:
> 
> - Multi-node job is running
> - mother superior goes down
> - Moab, pbs_server or an admin tries to delete the job
> - the job cannot be deleted because pbs_server cannot talk to mother superior.
> 
> job_force_cancel_time, as you suspected, purges the job (qdel -p) if it still exists the configured number of seconds after the delete was requested. The right value will probably differ from site to site, but if I were setting the parameter, I would estimate how long a job delete could reasonably take, multiply that by 3 or 4, and use the result.
> 
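A minimal sketch of that sizing rule, in case it helps anyone trying this out. The 75-second delete time is a made-up figure, and force_cancel_directive is a hypothetical helper name, not part of TORQUE; the only real pieces are qmgr and the job_force_cancel_time server parameter. The script only prints the directive so it can be reviewed before it is applied.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): print the qmgr directive for a
# force-cancel timeout derived from a typical delete time and a multiplier.
force_cancel_directive() {
    typical_secs=$1   # assumed: how long a normal job delete takes at your site
    multiplier=$2     # 3 or 4, per the rule of thumb quoted above
    echo "set server job_force_cancel_time = $((typical_secs * multiplier))"
}

# Example: deletes normally finish within about 75 seconds (assumed figure).
directive=$(force_cancel_directive 75 4)
echo "$directive"

# To apply it, run the following as a TORQUE manager:
# qmgr -c "$directive"
```

With those inputs it prints `set server job_force_cancel_time = 300`, which happens to match the 300 suggested earlier in the thread.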

I would rather have the job requeued (if it is rerunnable) than just purged.
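
For comparison, here is a rough sketch of making that choice by hand. It assumes `qstat -f` reports the attribute as `Rerunable = True` and that `qrerun -f` is available in your TORQUE version (please check the man pages); stuck_job_action and the job IDs are hypothetical. It only prints the command it would choose and runs nothing.

```shell
#!/bin/sh
# Hypothetical helper: read `qstat -f <jobid>` output on stdin and print the
# command matching the preference above. It prints only; it runs nothing.
stuck_job_action() {
    jobid=$1
    if grep -q 'Rerunable = True'; then
        echo "qrerun -f $jobid"   # requeue a rerunnable job
    else
        echo "qdel -p $jobid"     # otherwise fall back to a purge
    fi
}

# Example with canned qstat -f lines (assumed format) instead of a live server:
printf '    Rerunable = True\n'  | stuck_job_action 1234.headnode
printf '    Rerunable = False\n' | stuck_job_action 1235.headnode
```

On a live system the input would come from `qstat -f <jobid>` piped into the function.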


> -- 
> David Beer 
> Direct Line: 801-717-3386 | Fax: 801-717-3738
>     Adaptive Computing
>     1712 S East Bay Blvd, Suite 300
>     Provo, UT 84606
> 

