[torqueusers] Cleaning up stray processes from defunct jobs

Dave Ulrick d-ulrick at comcast.net
Wed Oct 10 09:48:46 MDT 2012

On Mon, 8 Oct 2012, Troy Baer wrote:

> On Mon, 2012-10-08 at 15:26 -0500, Dave Ulrick wrote:
>> On Thu, 27 Sep 2012, Troy Baer wrote:
>>> On Thu, 2012-09-27 at 16:27 -0500, Dave Ulrick wrote:
>>>> On occasion I see a user run an MPI job via TORQUE that doesn't shut down
>>>> cleanly and as a result leaves running processes behind to interfere with
>>>> subsequent jobs that are assigned to its nodes. Any suggestions on how I
>>>> might go about simplifying the task of finding and killing these
>>>> processes?
>>> I would recommend running something like reaver [1] in your
>>> epilogue.parallel on each node.
>>> [1] http://svn.nics.tennessee.edu/repos/pbstools/trunk/sbin/reaver
>>> 	--Troy
>> I've deployed reaver to my compute nodes and have run some test jobs. It
>> appears that TORQUE runs 'epilogue' on the job head node and
>> 'epilogue.parallel' on the sister nodes so I've got both scripts set up to
>> run reaver. I don't have a job at hand that will create stray processes so
>> I'll just wait and see what reaver does the next time such a job runs.
> Be aware that reaver doesn't kill processes unless you specifically tell
> it to do so with the -k option.  I would recommend running in the
> default identification-only mode for a while until you're sure that it's
> consistently identifying processes that need killed.

I've been running reaver for a few days now and have found a case where a
job left behind stray processes that reaver didn't remove. This apparently
happened because the job's /var/spool/torque/mom_priv/jobs/foo.JB file
wasn't removed when the job ended. My epilogue and epilogue.parallel
scripts write to a log file whenever they run. For some reason, only the
'epilogue' script ran when the job with stray processes ended;
'epilogue.parallel' wasn't run on any node. Many other jobs left no stray
processes, and all of them appear to have run both the 'epilogue' and
'epilogue.parallel' scripts. Any idea what went wrong and how to fix it?
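
For reference, the kind of identification-only check discussed above can be
sketched in a few lines of Python. This is not reaver and makes no claim
about how reaver works; it is a minimal illustration that assumes a Linux
/proc filesystem. The function names, the MIN_USER_UID cutoff, and the way
the set of "active" UIDs is obtained are all hypothetical and would have to
be adapted to the site.

```python
import os

# Assumption: UIDs below this value belong to system accounts and are ignored.
MIN_USER_UID = 500


def find_strays(processes, active_uids, min_uid=MIN_USER_UID):
    """Return the processes owned by regular users with no active job here.

    processes   -- iterable of (pid, uid, command) tuples
    active_uids -- set of UIDs that currently have a job on this node
    """
    return [p for p in processes
            if p[1] >= min_uid and p[1] not in active_uids]


def list_processes(proc_root="/proc"):
    """Yield (pid, uid, command) for each process visible under proc_root."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        path = os.path.join(proc_root, entry)
        try:
            # The owner of /proc/<pid> is normally the UID the process runs as.
            uid = os.stat(path).st_uid
            with open(os.path.join(path, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # cmdline is NUL-separated; it is empty for kernel threads.
        command = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        yield int(entry), uid, command
```

Calling find_strays(list_processes(), active_uids) on a node only reports
candidates; nothing is killed. That matches the advice to run in
identification-only mode until the results can be trusted.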

Dave Ulrick
d-ulrick at comcast.net
