[torqueusers] jobs completing with processes still running - SOLVED

Brock Palen brockp at umich.edu
Thu May 8 11:31:22 MDT 2008


MVAPICH is just MPICH with InfiniBand support added, so using the
mpiexec from OSC will work.
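
For reference, a job script using OSC's mpiexec ends up looking
roughly like the sketch below.  The install path and the binary name
are only placeholders, and when you build mpiexec you point it at
TORQUE and pick the default comm type that matches your MVAPICH
(check its README for the exact configure flags, I don't have them
memorized):

    #PBS -l nodes=4:ppn=2,walltime=1:00:00
    #PBS -N mvapich-test

    cd $PBS_O_WORKDIR

    # OSC mpiexec starts every rank through TORQUE's tm API (via
    # pbs_mom), so qdel/canceljob and walltime limits can actually
    # kill the remote processes.  No machinefile or -np is needed;
    # it uses the nodes TORQUE assigned to the job.
    /usr/local/mpiexec/bin/mpiexec ./my_mpi_app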

Again, even Cisco is pushing toward OpenMPI going forward, and OpenMPI
has TM support built in.  One of the primary OpenMPI developers is
paid by Cisco to work on it and make sure it works with their IB
hardware.  So I would push you toward that solution.  (It's what we
use.)
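
For what it's worth, this is roughly how we build OpenMPI against
TORQUE so that mpirun goes through tm (the prefix and the TORQUE
path below are just examples):

    # point configure at the TORQUE install that has tm.h and the
    # tm library so the tm support gets compiled in
    ./configure --prefix=/opt/openmpi --with-tm=/usr/local/torque
    make all install

Once it is built that way, ompi_info | grep tm should show the tm
components, and inside a job script a plain "mpirun ./my_mpi_app"
needs no hostfile or -np: it gets the node list from tm and pbs_mom
starts (and can kill) every rank.  A quick way to check what a job
actually used is to run something like pstree -ap $(pgrep pbs_mom)
on one of its compute nodes; tm-launched ranks show up as children
of pbs_mom, while ssh-launched ones hang off sshd instead.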

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985



On May 8, 2008, at 1:26 PM, Michael Robbert wrote:
> Thank you all for your comments and suggestions. It has been a  
> great introductory lesson. I can't wait to get properly schooled at  
> Moab Con in a few weeks. The original problem turned out to be caused
> by a few specific nodes: whenever these particular nodes were assigned
> to a job, the job would return immediately with no results. I don't
> know why this was happening, but since we're using ROCKS I just
> rebuilt them all, and they seem to be working now.
> We do still have the problem of leftover processes when jobs are  
> canceled. I will need to go through and validate all of our MPI  
> implementations, but the current known problem is with mpirun when  
> used with the MVAPICH that came bundled with the Cisco OFED ROLL on
> ROCKS+. So, unless anybody knows off the top of their head whether
> there is a known workaround for this issue, I'll probably need to
> open a ticket with Cluster Corp.
>
> Thanks for all your help,
> Mike Robbert
> Colorado School of Mines
>
> Glen Beane wrote:
>>
>>
>> On Wed, May 7, 2008 at 1:09 PM, Michael Robbert
>> <mrobbert at mines.edu> wrote:
>>
>>
>>     I would also like to figure out why these processes continue to
>>     run after this false exit or after a canceljob. The code is being
>>     run with mpirun and he is using mvapich. We do not have an mpiexec
>>     in our mvapich path. I know that mpirun works fine for OpenMPI,
>>     and OpenMPI has an mpiexec. They are currently seeing huge speed
>>     advantages with mvapich, so until we work out any issues with
>>     OpenMPI and their code I can't tell them to use OpenMPI.
>>
>> If processes continue to run after a canceljob / qdel or hitting a
>> walltime limit, then the problem is almost certainly that you are
>> using a non-tm job launcher.  tm is the PBS/TORQUE task manager
>> API.  You want to use a job launcher that uses tm to spawn all of
>> the processes rather than something else like rsh/ssh.
>>
>> OpenMPI has native tm support; you just have to make sure it can
>> find the tm library when you run configure.  For mvapich you can
>> use mpiexec from Pete Wyckoff at OSC:
>> http://www.osc.edu/~pw/mpiexec.  Hide your mvapich mpirun from your
>> users and make them use Pete's mpiexec.  A non-tm job launcher is
>> nothing but trouble with PBS or TORQUE.
>>


