[torqueusers] jobs completing with processes still running - SOLVED

Jerry Smith jdsmit at sandia.gov
Thu May 8 11:41:56 MDT 2008


I would have to second this thought (OpenMPI, as well as OSC's mpiexec
for your current setup).  Have you also looked into the various epilogue
scripts that float around on this list as a way to make sure that
processes which end up outside of the TM interface get cleaned up?  A
rough sketch of that approach follows the link below.

http://www.clusterresources.com/wiki/doku.php?id=torque:appendix:g_prologue_and_epilogue_scripts
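
For what it's worth, a minimal epilogue along those lines could look
something like the sketch below.  It is only a sketch, not a tested
drop-in: it assumes the second epilogue argument is the job owner's user
name (check the argument list on the wiki page above against your Torque
version), that a user never has more than one job on a node at a time,
and that regular accounts start at uid 1000.

#!/usr/bin/env python
# Hypothetical epilogue sketch: kill anything the job owner left running
# on this node after the job exits.  The argument positions, the
# one-job-per-user-per-node assumption, and the uid cutoff are all
# site-specific assumptions -- adjust before using anything like this.
import os
import pwd
import signal
import sys

def main(argv):
    if len(argv) < 3:
        return 0
    try:
        uid = pwd.getpwnam(argv[2]).pw_uid   # argv[2]: job owner's user name
    except KeyError:
        return 0
    if uid < 1000:                           # never touch system accounts
        return 0
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            if os.stat('/proc/' + entry).st_uid == uid:
                os.kill(int(entry), signal.SIGKILL)
        except OSError:
            pass                             # process already exited
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv))

The epilogue is run by pbs_mom as root (typically installed as
mom_priv/epilogue, owned by root and executable), so it can signal the
user's stray processes.  Whether it runs on every node of a multi-node
job or only on the mother superior depends on your Torque version, so
check the wiki page for the parallel-aware variants and the exact
permission requirements as well.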

--Jerry

Brock Palen wrote:
> MVAPICH is just MPICH with IB support added, so using mpiexec from
> OSC will work.
>
> Again, even Cisco is pushing toward OpenMPI for the future, which has TM
> support built in.  One of the primary developers of OpenMPI is paid by
> Cisco to work on it and make sure it works with their IB hardware.  So I
> would push you toward that solution.  (It's what we use.)
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
>
>
>
> On May 8, 2008, at 1:26 PM, Michael Robbert wrote:
>   
>> Thank you all for your comments and suggestions. It has been a
>> great introductory lesson. I can't wait to get properly schooled at
>> Moab Con in a few weeks. The original problem turned out to be
>> problems with a few specific nodes. Whenever these particular nodes
>> were assigned to a job the job would return immediately with no
>> results. I don't know why this was happening, but since we're using
>> ROCKS I just rebuilt them all and they seem to be working now.
>> We do still have the problem of leftover processes when jobs are
>> canceled. I will need to go through and validate all of our MPI
>> implementations, but the current known problem is with mpirun when
>> used with the MVAPICH that came bundled with the Cisco OFED ROLL on
>> ROCKS+. So, unless anybody knows of a workaround for this issue off the
>> top of their head, I'll probably need to open up
>> a ticket with Cluster Corp.
>>
>> Thanks for all your help,
>> Mike Robbert
>> Colorado School of Mines
>>
>> Glen Beane wrote:
>>     
>>> On Wed, May 7, 2008 at 1:09 PM, Michael Robbert
>>> <mrobbert at mines.edu> wrote:
>>>
>>>
>>>     I would also like to figure out why these processes continue to
>>>     run after this false exit or after a canceljob. The code is being
>>>     run with mpirun and he is using mvapich. We do not have an
>>> mpiexec
>>>     in our mvapich path. I know that mpirun works fine for OpenMPI,
>>>     and OpenMPI has an mpiexec. They are currently seeing huge speed
>>>     advantages with mvapich so until we work out any issues with
>>>     OpenMPI and their code I can't tell them to use OpenMPI.
>>>
>>> if processes continue to run after a canceljob / qdel or hitting a
>>> walltime limit, then the problem is almost certainly that you are
>>> using a non-tm job launcher.  tm is the PBS/TORQUE task manager
>>> API.  You want to use a job launcher that uses tm to spawn all of
>>> the processes rather than something else like rsh/ssh.
>>>
>>> OpenMPI has native tm support, you just have to make sure it can
>>> find the tm library when you run configure.  For mvapich you can
>>> use mpiexec from Pete Wyckoff at OSC: http://www.osc.edu/~pw/mpiexec.
>>> Hide your mvapich mpirun from your users and make them use Pete's
>>> mpiexec.  A non-tm job launcher is nothing but trouble with PBS or
>>> TORQUE.
>>>
>>>       
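
To make Glen's point above concrete: ranks spawned through tm are
children of pbs_mom and die with the job, while rsh/ssh-launched ranks
are not, which is exactly why they survive a qdel.  The little diagnostic
below is not part of Torque, just a rough sketch (the pbs_mom process
name and the uid-1000 cutoff are assumptions for a typical Linux compute
node) that walks /proc and prints user processes whose ancestry does not
lead back to pbs_mom.

#!/usr/bin/env python
# Rough diagnostic sketch (not part of Torque): print user-owned processes
# on this node whose ancestry does not lead back to pbs_mom.  Ranks spawned
# through the TM API are descendants of pbs_mom and die with the job;
# rsh/ssh-launched ranks are not, so they show up here.
import os
import pwd

def comm_and_ppid(pid):
    """Return (command name, parent pid) from /proc/<pid>/stat, or None."""
    try:
        with open('/proc/%d/stat' % pid) as f:
            data = f.read()
    except (IOError, OSError):
        return None
    # The command name is parenthesised and may itself contain spaces.
    comm = data[data.index('(') + 1:data.rindex(')')]
    ppid = int(data[data.rindex(')') + 2:].split()[1])
    return comm, ppid

def descends_from_pbs_mom(pid):
    while pid > 1:
        info = comm_and_ppid(pid)
        if info is None:
            return False
        comm, ppid = info
        if comm == 'pbs_mom':
            return True
        pid = ppid
    return False

for entry in os.listdir('/proc'):
    if not entry.isdigit():
        continue
    try:
        uid = os.stat('/proc/' + entry).st_uid
    except OSError:
        continue                       # process exited while scanning
    if uid < 1000:                     # skip system accounts
        continue
    if not descends_from_pbs_mom(int(entry)):
        try:
            owner = pwd.getpwuid(uid).pw_name
        except KeyError:
            owner = str(uid)
        print('%6s %-12s outside the TM process tree' % (entry, owner))

Run on a node right after a canceljob, anything it prints was started
outside the TM interface.  With a tm-aware launcher, such as OpenMPI
built against the tm library or OSC's mpiexec, the list should come up
empty apart from ordinary login shells.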

