[torqueusers] Cleaning up stray processes from defunct jobs
tbaer at utk.edu
Mon Oct 8 14:32:45 MDT 2012
On Mon, 2012-10-08 at 15:26 -0500, Dave Ulrick wrote:
> On Thu, 27 Sep 2012, Troy Baer wrote:
> > On Thu, 2012-09-27 at 16:27 -0500, Dave Ulrick wrote:
> >> On occasion I see a user run an MPI job via TORQUE that doesn't shut down
> >> cleanly and as a result leaves running processes behind to interfere with
> >> subsequent jobs that are assigned to its nodes. Any suggestions on how I
> >> might go about simplifying the task of finding and killing these
> >> processes?
> > I would recommend running something like reaver in your
> > epilogue.parallel on each node.
> >  http://svn.nics.tennessee.edu/repos/pbstools/trunk/sbin/reaver
> > --Troy
> I've deployed reaver to my compute nodes and have run some test jobs. It
> appears that TORQUE runs 'epilogue' on the job head node and
> 'epilogue.parallel' on the sister nodes so I've got both scripts set up to
> run reaver. I don't have a job at hand that will create stray processes so
> I'll just wait and see what reaver does the next time such a job runs.
Be aware that reaver doesn't kill processes unless you specifically tell
it to do so with the -k option. I would recommend running it in the
default identification-only mode for a while, until you're sure that it's
consistently identifying processes that need to be killed.
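For reference, here's a minimal sketch of what an epilogue.parallel wrapper
might look like. The reaver and log paths are assumptions for your site;
TORQUE passes the job id and job owner as the first two epilogue arguments.

```shell
#!/bin/sh
# Hypothetical epilogue.parallel sketch -- paths are site assumptions.
# TORQUE invokes the epilogue with the job id as $1 and the owner as $2.
JOBID="$1"
JOBUSER="$2"
REAVER="${REAVER:-/usr/local/sbin/reaver}"
LOG="${LOG:-/tmp/reaver.log}"

# Default identification-only mode: reaver just reports stray processes.
# Once you trust its reports, add -k here to actually kill them.
"$REAVER" >> "$LOG" 2>&1

# Always succeed so a reaver hiccup doesn't cause TORQUE to hold the node.
true
```

Logging the identification-only output for a week or two gives you a record
to check against known jobs before you turn on -k.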
Troy Baer, Senior HPC System Administrator
National Institute for Computational Sciences, University of Tennessee