[torqueusers] jobs completing with processes still running - SOLVED

Jerry Smith jdsmit at sandia.gov
Thu May 8 11:41:56 MDT 2008

I would have to second this thought (OpenMPI, as well as OSC's mpiexec 
for your current setup). Have you looked into the various epilogue 
scripts that float around on this list as a way to make sure that 
processes which end up outside the TM interface get cleaned up?
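
The basic idea behind most of those epilogue scripts is simply "when the 
job exits, kill anything the job's user still has running on the node." 
A minimal sketch of that in C (assuming the usual epilogue argument 
order, job id first and user name second, and nodes that are allocated 
to a single user at a time -- an illustration, not a drop-in epilogue):

/* epilogue sketch: kill leftover processes owned by the job's user.
 * Assumes argv[1] = job id and argv[2] = user name, and that a node
 * is never shared between users.  Runs as root from pbs_mom. */
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <dirent.h>
#include <signal.h>
#include <pwd.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc < 3)
        return 0;                      /* never fail the epilogue itself */

    struct passwd *pw = getpwnam(argv[2]);
    if (pw == NULL || pw->pw_uid == 0)
        return 0;                      /* unknown user or root: do nothing */

    DIR *proc = opendir("/proc");
    if (proc == NULL)
        return 0;

    struct dirent *de;
    while ((de = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                  /* not a pid directory */

        char path[64];
        struct stat st;
        snprintf(path, sizeof(path), "/proc/%s", de->d_name);
        if (stat(path, &st) == 0 && st.st_uid == pw->pw_uid) {
            pid_t pid = (pid_t)atoi(de->d_name);
            fprintf(stderr, "epilogue %s: killing leftover pid %d (user %s)\n",
                    argv[1], (int)pid, argv[2]);
            kill(pid, SIGKILL);        /* stray MPI rank, orphaned shell, ... */
        }
    }
    closedir(proc);
    return 0;
}

The same cleanup is often done in a couple of lines of shell with 
pkill -9 -u on the user name; the point is just to run it on the job's 
nodes after the job exits, which is what the epilogue hooks give you.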



Brock Palen wrote:
> MVAPICH is just MPICH with InfiniBand support added, so mpiexec from
> OSC will work.
> Again, even Cisco is pushing toward OpenMPI for the future, which has
> TM support built in.  One of the primary developers of OpenMPI is paid
> by Cisco to work on it and make sure it works with their InfiniBand
> hardware, so I would push you toward that solution (it's what we use).
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
> On May 8, 2008, at 1:26 PM, Michael Robbert wrote:
>> Thank you all for your comments and suggestions. It has been a
>> great introductory lesson. I can't wait to get properly schooled at
>> Moab Con in a few weeks. The original problem turned out to be
>> caused by a few specific nodes. Whenever these particular nodes
>> were assigned to a job, the job would return immediately with no
>> results. I don't know why this was happening, but since we're using
>> ROCKS I just rebuilt them all and they seem to be working now.
>> We do still have the problem of leftover processes when jobs are
>> canceled. I will need to go through and validate all of our MPI
>> implementations, but the current known problem is with mpirun when
>> used with the MVAPICH that came bundled with the Cisco OFED ROLL on
>> ROCKS+. So, unless anybody knows of a workaround for this issue off
>> the top of their head, I'll probably need to open a ticket with
>> Cluster Corp.
>> Thanks for all your help,
>> Mike Robbert
>> Colorado School of Mines
>> Glen Beane wrote:
>>> On Wed, May 7, 2008 at 1:09 PM, Michael Robbert
>>> <mrobbert at mines.edu> wrote:
>>>     I would also like to figure out why these processes continue to
>>>     run after this false exit or after a canceljob. The code is being
>>>     run with mpirun, and he is using MVAPICH. We do not have an
>>>     mpiexec in our MVAPICH path. I know that mpirun works fine for
>>>     OpenMPI, and OpenMPI has an mpiexec. They are currently seeing
>>>     huge speed advantages with MVAPICH, so until we work out any
>>>     issues with OpenMPI and their code I can't tell them to use
>>>     OpenMPI.
>>> If processes continue to run after a canceljob / qdel, or after
>>> hitting a walltime limit, then the problem is almost certainly that
>>> you are using a non-TM job launcher.  TM is the PBS/TORQUE task
>>> manager API.  You want a job launcher that uses TM to spawn all of
>>> the processes, rather than something else like rsh/ssh.
>>> OpenMPI has native TM support; you just have to make sure it can
>>> find the TM library when you run configure.  For MVAPICH you can
>>> use mpiexec from Pete Wyckoff at OSC: http://www.osc.edu/~pw/mpiexec.
>>> Hide your MVAPICH mpirun from your users and make them use Pete's
>>> mpiexec.  A non-TM job launcher is nothing but trouble with PBS or
>>> TORQUE.
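
To make Glen's point concrete: a TM-aware launcher asks the pbs_mom on 
each allocated node to start the remote processes, instead of forking 
them over rsh/ssh, so the mom knows about every task and can kill them 
all on a qdel or when the walltime expires.  Below is a rough sketch of 
what such a launcher does with the TM calls from TORQUE's tm.h (the same 
calls pbsdsh and OSC's mpiexec use).  Treat it as an illustration and 
check your local tm.h for the exact signatures before building against 
the TORQUE library.

/* tm sketch: spawn one task on the first node of the job through the
 * PBS/TORQUE task manager, so pbs_mom (not ssh) is the parent. */
#include <stdio.h>
#include <tm.h>

int main(void)
{
    struct tm_roots roots;
    tm_node_id *nodes;
    int nnodes;
    tm_task_id tid;
    tm_event_t event, result;
    int tm_errno;

    /* must be called from inside a job, i.e. under pbs_mom */
    if (tm_init(NULL, &roots) != TM_SUCCESS) {
        fprintf(stderr, "tm_init failed: not running inside a TORQUE job?\n");
        return 1;
    }

    /* list of node ids allocated to this job */
    if (tm_nodeinfo(&nodes, &nnodes) != TM_SUCCESS || nnodes < 1) {
        tm_finalize();
        return 1;
    }

    char *task_argv[] = { "/bin/hostname", NULL };

    /* ask the mom on the first node to start the task; NULL environment
     * here for brevity -- a real launcher passes the job environment */
    if (tm_spawn(1, task_argv, NULL, nodes[0], &tid, &event) == TM_SUCCESS) {
        /* block until the spawn event completes */
        tm_poll(TM_NULL_EVENT, &result, 1, &tm_errno);
    }

    tm_finalize();
    return 0;
}

pbsdsh, which ships with TORQUE, is essentially this loop run over every 
node of the job, so running "pbsdsh hostname" inside a test job is a 
quick way to confirm the TM side is healthy before pointing OpenMPI 
(built with TM support) or OSC's mpiexec at it.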
