[torqueusers] jobs completing with processes still running - SOLVED

Michael Robbert mrobbert at mines.edu
Thu May 8 11:26:45 MDT 2008


Thank you all for your comments and suggestions. It has been a great 
introductory lesson. I can't wait to get properly schooled at Moab Con 
in a few weeks. The original problem turned out to be caused by a few 
specific nodes: whenever one of them was assigned to a job, the job 
would return immediately with no results. I don't know why this was 
happening, but since we're using ROCKS I just rebuilt those nodes and 
they seem to be working now.

We do still have the problem of leftover processes when jobs are 
canceled. I will need to go through and validate all of our MPI 
implementations, but the one known problem so far is with mpirun when 
used with the MVAPICH that came bundled with the Cisco OFED Roll on 
ROCKS+. So, unless anybody knows a workaround for this off the top of 
their head, I'll probably need to open a ticket with Cluster Corp.
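In the meantime, a quick-and-dirty loop like this (the compute-0-* 
names and the username are just examples from our setup) at least lets 
us find the strays and kill them by hand:

    # check each compute node for processes left behind by a user;
    # compute-0-0 through compute-0-31 and "someuser" are examples
    for n in $(seq 0 31); do
        ssh compute-0-$n "ps -u someuser -o pid,etime,cmd"
    done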

Thanks for all your help,
Mike Robbert
Colorado School of Mines

Glen Beane wrote:
>
>
> On Wed, May 7, 2008 at 1:09 PM, Michael Robbert
> <mrobbert at mines.edu> wrote:
>
>
>     I would also like to figure out why these processes continue to
>     run after this false exit or after a canceljob. The code is being
>     run with mpirun, and the user is using mvapich. We do not have an
>     mpiexec in our mvapich path. I know that mpirun works fine for
>     OpenMPI, and OpenMPI has an mpiexec. They are currently seeing
>     huge speed advantages with mvapich, so until we work out any
>     issues between OpenMPI and their code I can't tell them to use
>     OpenMPI.
>
>
> If processes continue to run after a canceljob/qdel or after hitting 
> a walltime limit, then the problem is almost certainly that you are 
> using a non-tm job launcher.  tm is the PBS/TORQUE task manager API. 
> You want a job launcher that uses tm to spawn all of the processes, 
> rather than something else like rsh/ssh.
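>
> As a quick sanity check (just an illustration), you can verify that 
> tm spawning works at all from inside a job with pbsdsh, the tm-based 
> launcher that ships with TORQUE:
>
>     # run inside a job script; pbsdsh uses the tm API to start
>     # one instance of the command per allocated execution slot
>     pbsdsh hostname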
>
> OpenMPI has native tm support; you just have to make sure it can 
> find the tm library when you run configure.  For mvapich you can use 
> mpiexec from Pete Wyckoff at OSC: http://www.osc.edu/~pw/mpiexec. 
> Hide your mvapich mpirun from your users and make them use Pete's 
> mpiexec.  A non-tm job launcher is nothing but trouble with PBS or 
> TORQUE.
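>
> As a rough sketch (the install path and program name are just 
> placeholders), building OpenMPI against tm looks something like:
>
>     # point configure at the TORQUE tree that holds include/tm.h
>     # and the tm library so the tm launcher gets built in
>     ./configure --with-tm=/usr/local/torque
>     make && make install
>
> and a job script using Pete's mpiexec instead of mvapich's mpirun 
> would be along these lines:
>
>     #PBS -l nodes=4:ppn=2
>     cd $PBS_O_WORKDIR
>     # OSC mpiexec gets the node list through the tm API itself,
>     # so no machinefile argument is needed
>     mpiexec ./my_mpi_program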
>

