[torqueusers] jobs completing with processes still running
jdsmit at sandia.gov
Thu May 8 11:41:56 MDT 2008
I would have to second this thought (OpenMPI, as well as OSC's mpiexec
for your current setup).
Have you looked into the various epilogue scripts
that float around on this list as a way to make sure processes that may
have been started outside of the TM interface get cleaned up?
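For what it's worth, a minimal epilogue along those lines might look like the sketch below. It assumes TORQUE passes the job owner's username as the second argument to the epilogue, that ordinary users on your nodes have UIDs of 1000 or above, and a path like /var/spool/torque/mom_priv/epilogue; check all of that against your own install before using it.

```shell
#!/bin/sh
# Hypothetical TORQUE epilogue sketch. TORQUE is assumed to pass the
# job owner's username as $2; verify the argument order for your version.
JOB_USER="$2"

# Only clean up ordinary users, never system accounts. The UID cutoff
# of 1000 is an assumption about this site's user numbering.
is_cleanable() {
    uid=$(id -u "$1" 2>/dev/null) || return 1
    [ "$uid" -ge 1000 ]
}

if is_cleanable "$JOB_USER"; then
    # Kill any processes the job owner still has on this node.
    pkill -9 -u "$JOB_USER" || true
fi
```

Note this kills everything the user owns on the node, so it only makes sense on clusters where nodes are allocated exclusively to one job per user at a time.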
Brock Palen wrote:
> MVAPICH is just MPICH with IB support added, so using mpiexec from
> OSC will work.
> Again, even Cisco is pushing toward OpenMPI in the future, which has TM
> support built in. One of the primary devs of OpenMPI is paid by
> Cisco to work on it and make sure it works with their IB. So I would
> push you toward that solution. (It's what we use.)
> Brock Palen
> Center for Advanced Computing
> brockp at umich.edu
> On May 8, 2008, at 1:26 PM, Michael Robbert wrote:
>> Thank you all for your comments and suggestions. It has been a
>> great introductory lesson. I can't wait to get properly schooled at
>> Moab Con in a few weeks. The original problem turned out to be
>> problems with a few specific nodes. Whenever these particular nodes
>> were assigned to a job the job would return immediately with no
>> results. I don't know why this was happening, but since we're using
>> ROCKS I just rebuilt them all and they seem to be working now.
>> We do still have the problem of leftover processes when jobs are
>> canceled. I will need to go through and validate all of our MPI
>> implementations, but the current known problem is with mpirun when
>> used with the MVAPICH that came bundled with the Cisco OFED ROLL on
>> ROCKS+. So, unless anybody knows off the top of their head if there
>> is a known workaround for this issue I'll probably need to open up
>> a ticket with Cluster Corp.
>> Thanks for all your help,
>> Mike Robbert
>> Colorado School of Mines
>> Glen Beane wrote:
>>> On Wed, May 7, 2008 at 1:09 PM, Michael Robbert
>>> <mrobbert at mines.edu <mailto:mrobbert at mines.edu>> wrote:
>>> I would also like to figure out why these processes continue to
>>> run after this false exit or after a canceljob. The code is being
>>> run with mpirun and he is using mvapich. We do not have an mpiexec
>>> in our mvapich path. I know that mpirun works fine for OpenMPI,
>>> and OpenMPI has an mpiexec. They are currently seeing huge speed
>>> advantages with mvapich so until we work out any issues with
>>> OpenMPI and their code I can't tell them to use OpenMPI.
>>> If processes continue to run after a canceljob/qdel or after hitting a
>>> walltime limit, then the problem is almost certainly that you are
>>> using a non-TM job launcher. TM is the PBS/TORQUE task manager
>>> API. You want to use a job launcher that uses TM to spawn all of
>>> the processes, rather than something else like rsh/ssh.
>>> OpenMPI has native TM support; you just have to make sure it can
>>> find the TM library when you run configure. For MVAPICH you can
>>> use mpiexec from Pete Wyckoff at OSC: http://www.osc.edu/~pw/mpiexec.
>>> Hide your MVAPICH mpirun from your users and make them use Pete's
>>> mpiexec. A non-TM job launcher is nothing but trouble with PBS or
>>> TORQUE.
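The configure step Glen mentions is just a matter of pointing OpenMPI at your TORQUE install. As a sketch, assuming TORQUE lives under /usr/local and you want OpenMPI in /opt/openmpi (both paths are site-specific guesses):

```shell
# Point OpenMPI's configure at the TORQUE install prefix so it can find
# the TM library and headers; adjust both paths for your site.
./configure --with-tm=/usr/local --prefix=/opt/openmpi
make && make install
```

Afterwards, running ompi_info and grepping its output for "tm" should show the TM launch components if support was actually built in.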
>> torqueusers mailing list
>> torqueusers at supercluster.org