[torqueusers] Job Nanny Poll
stevejones at stanford.edu
Mon Nov 21 14:24:57 MST 2011
> Allow me to re-ask my question in a different way - what is the use
> case for which you all are using the job_nanny feature?
> Just to dispel any fears, I'm only asking this for my own curiosity.
> I'm working on something that has me looking at that code and I'm just
> curious if people use it and what they use it for. This code is not
> being deleted or removed.
We implemented it hoping for a better job cleanup and removal strategy, since it keeps sending KILL signals. What we've found is that when a node completely stops responding, the job *hangs* in the queue in a cancel-in-progress state. job_force_cancel_time should help with this; I'm thinking we'll set it to 5 minutes or so to allow temporary node failures to resolve themselves. I'm planning on using both.
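For anyone wanting to try the same combination, here is a rough sketch of the qmgr directives involved. It assumes job_force_cancel_time is given in seconds (so 5 minutes becomes 300) and that the directives are entered at the qmgr prompt or passed with qmgr -c; check the server parameter documentation for your TORQUE version before relying on it.

```
set server job_nanny = True
set server job_force_cancel_time = 300
```

The 300 is only the "5 minutes or so" figure mentioned above, not a tested recommendation.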
More generally, I still have issues with processes left behind on compute nodes here and there. We're also using mom_job_sync along with epilogue and epilogue.parallel scripts, all in an effort to kill unassigned (ghost) processes. I'd like to see more examples of how people are dealing with this.
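As one possible starting point, here is a minimal sketch of the kind of check an epilogue or cron script could make. It is not our actual script. The function name find_ghost_pids, the way the list of users with jobs on the node is supplied, and the UID cutoff of 1000 for system accounts are all assumptions for illustration. It only prints candidate PIDs and does not kill anything.

```shell
# find_ghost_pids (hypothetical helper): reads "pid uid user" lines on stdin
# and prints the PIDs owned by ordinary users who have no job on this node.
#   $1 = space-separated list of users with jobs assigned to this node
#   $2 = lowest UID treated as an ordinary user (assumed default: 1000)
find_ghost_pids() {
    active_users=" $1 "
    min_uid="${2:-1000}"
    while read -r pid uid user; do
        [ -n "$pid" ] || continue               # skip blank lines
        [ "$uid" -lt "$min_uid" ] && continue   # never report system accounts
        case "$active_users" in
            *" $user "*) ;;                     # user has a job here; leave alone
            *) printf '%s\n' "$pid" ;;          # no job for this user: ghost candidate
        esac
    done
}

# Example (prints only; review the list before sending any signals):
#   ps -eo pid=,uid=,user= | find_ghost_pids "alice bob"
```

How the list of active users is obtained is left open here; it would have to come from the batch system's view of which jobs are assigned to the node.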