[torqueusers] Job Nanny Poll

David Beer dbeer at adaptivecomputing.com
Mon Nov 21 13:55:12 MST 2011



----- Original Message -----
> 
> 
> ----- Original Message -----
> > On Mon, Nov 21, 2011 at 2:06 PM, David Beer
> > <dbeer at adaptivecomputing.com> wrote:
> > > All,
> > >
> > > Just a quick poll question - do people use the job delete nanny
> > > functionality in TORQUE? If you do, in qmgr you would have the
> > > line:
> > >
> > > set job_nanny = True
> > >
> > > I'm curious how many people are using it - this seems like very
> > > repetitive functionality to me (pbs_mom does pretty much the same
> > > thing already) and I personally think job_force_cancel_time is
> > > better, but I may be biased.
> > 
> > 
> > I use the job delete nanny, but I am not familiar with
> > job_force_cancel_time. I have been using the job delete nanny for a
> > long time.
> > 
> > What exactly does it do? I presume some of the multi-threading in
> > pbs_server in TORQUE 4.0 can clean up some of this code a little
> > since
> > pbs_server can spawn a thread to hang out and manage the job delete
> > (rather than needing to set a work task to check the status of the
> > delete in the future)
> 
> I'm also using job_nanny, this is the first I've heard of
> job_force_cancel_time. Following a quick search it looks like it
> might have been undocumented for a short period. So it takes an int
> but there's not a recommended value, maybe 300? Is this the
> automation of 'qdel -p [jobid]' or similar for jobs *stuck* when a
> node stops responding?
> 
> Steve
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
> 

job_force_cancel_time is different from the nanny (some sites may even wish to use both). I implemented job_force_cancel_time at the request of a customer for this use case:

- A multi-node job is running
- Mother superior goes down
- Moab, pbs_server, or an admin tries to delete the job
- The job cannot be deleted because pbs_server cannot talk to mother superior

job_force_cancel_time, as you suspected, purges the job (qdel -p) if it still exists the configured number of seconds after it was deleted. The right value will probably differ from site to site, but if I were setting the parameter, I would estimate how long deleting a job could reasonably take, multiply that by 3 or 4, and use the result.
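
To make the sizing rule concrete, here is a rough sketch of the arithmetic, not anything shipped with TORQUE. The helper names (force_cancel_time, qmgr_command) and the 100-second estimate are made up for illustration, and the generated line assumes the usual "set server <attribute> = <value>" form of a qmgr command.

```python
def force_cancel_time(expected_delete_seconds: int, safety_factor: int = 3) -> int:
    """Return a candidate job_force_cancel_time value in seconds.

    expected_delete_seconds is a site-specific estimate (an assumption,
    not something TORQUE reports) of how long a normal job delete takes.
    safety_factor follows the "multiply by 3 or 4" rule of thumb.
    """
    if expected_delete_seconds <= 0 or safety_factor <= 0:
        raise ValueError("both arguments must be positive")
    return expected_delete_seconds * safety_factor


def qmgr_command(seconds: int) -> str:
    """Format a qmgr command line that would set the server parameter."""
    return f'qmgr -c "set server job_force_cancel_time = {seconds}"'


if __name__ == "__main__":
    # Hypothetical site where a clean delete takes about 100 seconds.
    print(qmgr_command(force_cancel_time(100, safety_factor=3)))
```

With those made-up numbers the parameter comes out at 300 seconds, which happens to match the value Steve guessed at.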

-- 
David Beer 
Direct Line: 801-717-3386 | Fax: 801-717-3738
     Adaptive Computing
     1712 S East Bay Blvd, Suite 300
     Provo, UT 84606


