[torqueusers] Re: jobs completing with processes still running - SOLVED

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Thu May 8 12:47:55 MDT 2008


Hi Mike,

We have some nodes with a non-TM MPI (one that does not launch its
processes through Torque's TM interface), so we get leftover processes
that use no CPU time but are still bothersome. I didn't find any neat
way to kill these in epilogue scripts. My solution was to write a small
script that identifies user processes not belonging to any Torque job
and kills them. The script is run from cron twice per hour:
# Kill rogue user processes
5,35 * * * * /root/killbaduser -k -s >> /var/log/killbaduser.log 2>&1

This has solved our problem. Get the script from
ftp://ftp.fysik.dtu.dk/pub/Torque/killbaduser-1.5

Caveat: If the same user has other valid Torque jobs on the same node,
the script won't kill the leftover processes. They will, however, be
killed by a later cron run once the user no longer has any jobs on the
node.
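
To make the selection rule and the caveat concrete, here is a rough
sketch in Python. It is an illustration only, not the killbaduser script
itself: the names Proc, rogue_pids and MIN_USER_UID are hypothetical, the
UID cutoff for system accounts is an assumption, and how the set of users
with active Torque jobs on the node is obtained is left out.

```python
from typing import Iterable, List, NamedTuple, Set


class Proc(NamedTuple):
    """One entry from the node's process table (hypothetical structure)."""
    pid: int
    uid: int
    user: str


# Assumption: system accounts have UIDs below this value and are never touched.
MIN_USER_UID = 500


def rogue_pids(procs: Iterable[Proc], users_with_jobs: Set[str],
               min_uid: int = MIN_USER_UID) -> List[int]:
    """Return PIDs of ordinary users who have no Torque job on this node.

    A user with at least one valid job on the node is skipped entirely,
    which is why that user's leftovers survive until the job is gone.
    """
    return sorted(
        p.pid for p in procs
        if p.uid >= min_uid and p.user not in users_with_jobs
    )


# Made-up process table: alice has leftovers, bob has a stray process.
procs = [
    Proc(1, 0, "root"),
    Proc(4021, 1001, "alice"),
    Proc(4022, 1001, "alice"),
    Proc(5100, 1002, "bob"),
]

# alice still has a valid job on the node, so only bob's process is selected.
print(rogue_pids(procs, {"alice"}))   # [5100]

# Once alice has no jobs left, a later run selects her leftovers too.
print(rogue_pids(procs, set()))       # [4021, 4022, 5100]
```

Deciding per user rather than per process keeps the check simple, at
the cost of the delay described in the caveat.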

Michael Robbert <mrobbert at mines.edu> wrote:
> We do still have the problem of leftover processes when jobs are 
> canceled. I will need to go through and validate all of our MPI 
> implementations, but the current known problem is with mpirun when used 
> with the MVAPICH that came bundled with the Cisco OFED ROLL on ROCKS+.
> So, unless anybody knows off the top of their head if there is a known 
> workaround for this issue I'll probably need to open up a ticket with 
> Cluster Corp.

/Ole H. Nielsen
Technical University of Denmark

