[torqueusers] Cleaning up stray processes from defunct jobs

Troy Baer tbaer at utk.edu
Mon Oct 8 14:32:45 MDT 2012


On Mon, 2012-10-08 at 15:26 -0500, Dave Ulrick wrote:
> On Thu, 27 Sep 2012, Troy Baer wrote:
> > On Thu, 2012-09-27 at 16:27 -0500, Dave Ulrick wrote:
> >> On occasion I see a user run an MPI job via TORQUE that doesn't shut down
> >> cleanly and as a result leaves running processes behind to interfere with
> >> subsequent jobs that are assigned to its nodes. Any suggestions on how I
> >> might go about simplifying the task of finding and killing these
> >> processes?
> >
> > I would recommend running something like reaver [1] in your
> > epilogue.parallel on each node.
> >
> > [1] http://svn.nics.tennessee.edu/repos/pbstools/trunk/sbin/reaver
> >
> > 	--Troy
> 
> I've deployed reaver to my compute nodes and have run some test jobs. It 
> appears that TORQUE runs 'epilogue' on the job head node and 
> 'epilogue.parallel' on the sister nodes so I've got both scripts set up to 
> run reaver. I don't have a job at hand that will create stray processes so 
> I'll just wait and see what reaver does the next time such a job runs.

Be aware that reaver doesn't kill processes unless you specifically tell
it to do so with the -k option.  I would recommend running in the
default identification-only mode for a while until you're sure that it's
consistently identifying processes that need to be killed.
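
To illustrate the report-first, kill-later approach, here is a minimal
Python sketch. It is not reaver's implementation and does not use its
options. It assumes, purely for illustration, that a "stray" is any process
owned by a non-system user (UID at or above a hypothetical min_uid of 1000)
who has no active job on the node, and that the caller supplies that set of
UIDs from the resource manager. The names Proc, list_processes, find_strays
and reap are all hypothetical.

```python
import os
import signal
import subprocess
from typing import Iterable, List, NamedTuple, Set


class Proc(NamedTuple):
    pid: int
    uid: int
    command: str


def list_processes() -> List[Proc]:
    """Snapshot every process on this node using ps.

    Separate -o options with empty headers suppress the header line.
    """
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "uid=", "-o", "comm="],
        check=True, capture_output=True, text=True,
    ).stdout
    procs = []
    for line in out.splitlines():
        fields = line.split(None, 2)  # the command name may contain spaces
        if len(fields) == 3:
            procs.append(Proc(int(fields[0]), int(fields[1]), fields[2]))
    return procs


def find_strays(procs: Iterable[Proc], active_uids: Set[int],
                min_uid: int = 1000) -> List[Proc]:
    """Assumed rule: non-system UID with no active job on this node."""
    return [p for p in procs
            if p.uid >= min_uid and p.uid not in active_uids]


def reap(strays: Iterable[Proc], kill: bool = False) -> List[str]:
    """Report each stray; send SIGKILL only when kill=True."""
    report = []
    for p in strays:
        if kill:
            try:
                os.kill(p.pid, signal.SIGKILL)
                action = "killed"
            except ProcessLookupError:
                action = "already gone"  # exited between snapshot and kill
        else:
            action = "would kill"
        report.append(f"{action}: pid={p.pid} uid={p.uid} {p.command}")
    return report
```

An epilogue could log the output of reap(find_strays(list_processes(),
active_uids)) for a few weeks and pass kill=True only once the reports look
right.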

	--Troy
-- 
Troy Baer, Senior HPC System Administrator
National Institute for Computational Sciences, University of Tennessee
http://www.nics.tennessee.edu/
Phone:  865-241-4233



