[torqueusers] Cleaning up stray processes from defunct jobs

Dave Ulrick d-ulrick at comcast.net
Mon Oct 8 14:26:31 MDT 2012


On Thu, 27 Sep 2012, Troy Baer wrote:

> On Thu, 2012-09-27 at 16:27 -0500, Dave Ulrick wrote:
>> On occasion I see a user run an MPI job via TORQUE that doesn't shut down
>> cleanly and as a result leaves running processes behind to interfere with
>> subsequent jobs that are assigned to its nodes. Any suggestions on how I
>> might go about simplifying the task of finding and killing these
>> processes?
>
> I would recommend running something like reaver [1] in your
> epilogue.parallel on each node.
>
> [1] http://svn.nics.tennessee.edu/repos/pbstools/trunk/sbin/reaver
>
> 	--Troy

I've deployed reaver to my compute nodes and have run some test jobs. It
appears that TORQUE runs 'epilogue' on the job's head node and
'epilogue.parallel' on the sister nodes, so I've set up both scripts to
run reaver. I don't have a job at hand that creates stray processes, so
I'll wait and see what reaver does the next time such a job runs.
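
For anyone curious about the general idea behind this kind of cleanup,
here is a rough sketch in Python. It is not reaver's actual code, and I
have not checked how reaver decides what to kill. The assumptions are
mine: Linux with /proc mounted, regular users having UIDs of 1000 or
higher, and the list of users who still have jobs on the node being
passed in by the caller (the function names and the MIN_USER_UID cutoff
are made up for illustration). The sketch only reports candidate PIDs and
does not kill anything.

```python
import os
import pwd
import sys

# Assumption: UIDs below this belong to system accounts and are never touched.
MIN_USER_UID = 1000


def find_stray_pids(processes, active_users, min_uid=MIN_USER_UID):
    """Return PIDs owned by regular users who have no active job on this node.

    processes    -- iterable of (pid, uid, username) tuples
    active_users -- set of usernames that still have a job on this node
    """
    return [pid for pid, uid, user in processes
            if uid >= min_uid and user not in active_users]


def list_processes(proc_root="/proc"):
    """Yield (pid, uid, username) for each process visible under /proc.

    The owner of /proc/<pid> is used as an approximation of the process
    owner. Yields nothing if /proc is not available.
    """
    if not os.path.isdir(proc_root):
        return
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            uid = os.stat(os.path.join(proc_root, entry)).st_uid
        except OSError:
            continue  # process exited while we were scanning
        try:
            user = pwd.getpwuid(uid).pw_name
        except KeyError:
            user = str(uid)  # no passwd entry for this UID
        yield int(entry), uid, user


if __name__ == "__main__":
    # Hypothetical usage: usernames with active jobs are given as arguments.
    active = set(sys.argv[1:])
    for pid in find_stray_pids(list_processes(), active):
        print("stray candidate:", pid)
```

Run as a regular user with no arguments, this lists every process owned
by any non-system user, so a real script would need a reliable source
for the active-user list before sending any signals.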

Thanks,
Dave
-- 
Dave Ulrick
d-ulrick at comcast.net

