[torqueusers] jobs completing with processes still running

Glen Beane glen.beane at gmail.com
Thu May 8 05:43:52 MDT 2008


On Wed, May 7, 2008 at 1:09 PM, Michael Robbert <mrobbert at mines.edu> wrote:

>
> I would also like to figure out why these processes continue to run after
> this false exit or after a canceljob. The code is being run with mpirun and
> he is using mvapich. We do not have an mpiexec in our mvapich path. I know
> that mpirun works fine for OpenMPI, and OpenMPI has an mpiexec. They are
> currently seeing huge speed advantages with mvapich so until we work out any
> issues with OpenMPI and their code I can't tell them to use OpenMPI.


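If processes continue to run after a canceljob / qdel, or after the job hits
its walltime limit, then the problem is almost certainly that you are using a
non-tm job launcher.  tm is the PBS/TORQUE task manager API.  You want a job
launcher that uses tm to spawn all of the processes, rather than something
else like rsh/ssh.  Processes spawned through tm are started by the pbs_mom
on each node, so TORQUE knows about them and can kill them when the job ends;
processes started over rsh/ssh are outside its control.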
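
As a rough sketch of what that looks like from the user's side, a job script
built around a tm-aware launcher could be as short as the one below.  The job
name, the resource request, and ./my_mpi_app are placeholders I made up for
illustration; adjust them to your site.

```shell
#!/bin/sh
# Hypothetical TORQUE job script; names and resource values are placeholders.
#PBS -N tm_launch_example
#PBS -l nodes=2:ppn=4
#PBS -l walltime=00:10:00

# PBS_O_WORKDIR is set by TORQUE to the directory qsub was run from.
cd "$PBS_O_WORKDIR"

# A tm-aware mpiexec asks pbs_mom to start the ranks on the allocated nodes,
# so no machinefile or rsh/ssh setup appears here.
# ./my_mpi_app stands in for the real MPI binary.
mpiexec ./my_mpi_app
```

Because every rank is started through tm, a qdel or a walltime kill should
reach all of them, not just the processes on the first node.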

OpenMPI has native tm support; you just have to make sure it can find the tm
library when you run configure.  For mvapich you can use mpiexec from Pete
Wyckoff at OSC: http://www.osc.edu/~pw/mpiexec.  Hide your mvapich mpirun
from your users and make them use Pete's mpiexec.  A non-tm job launcher is
nothing but trouble with PBS or TORQUE.
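
For the OpenMPI side, the configure step might look roughly like this.  The
TORQUE prefix /usr/local/torque is only an assumption for the example;
substitute wherever your TORQUE headers and libraries actually live.

```shell
# Point OpenMPI's configure at the TORQUE install so it can find the tm
# library.  /usr/local/torque is an assumed prefix, not a required one.
./configure --with-tm=/usr/local/torque
make all install

# Check the result: if tm support was built, ompi_info should list
# components with "tm" in their names.
ompi_info | grep tm
```

If the grep prints nothing, configure most likely did not find the tm library
and the build fell back to rsh/ssh launching.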