[torqueusers] problem with jobs sharing cores

Coyle, James J [ITACD] jjc at iastate.edu
Thu Feb 9 16:20:44 MST 2012


   We had this issue with OpenMPI and the MCA parameter mpi_paffinity_alone.

Setting mpi_paffinity_alone gives somewhat better performance than
not setting it, due to better cache hits, when there is only one job
running on a node.

However, it places the N MPI processes of a job on cores 0 to N-1,
so with three 4-process MPI programs running on a 12-core node,
each of cores 0 through 3 would be running 3 processes.

Doing what you are doing, launching 3 jobs of 4 processes each with
OpenMPI and having mpi_paffinity_alone set on (perhaps by default), would
cause exactly the behavior you are seeing: you would have the three rank 0
processes running on core 0, the three rank 1 processes running on core 1,
etc., and no MPI processes running on cores 4-11.
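
One way to check this on a node is to look at which cores each MPI process is allowed to run on. The following is only a sketch, assuming a Linux node with /proc; the helper names affinity_of and show_affinity are made up for illustration, and the process name wrf.exe is a guess at the executable name.

```shell
# affinity_of PID: print "PID <allowed-core-list>" for one process.
# Hypothetical helper; relies on the Linux /proc filesystem.
affinity_of() {
    awk -v pid="$1" '/^Cpus_allowed_list/ { print pid, $2 }' "/proc/$1/status"
}

# show_affinity NAME: report every process whose name is exactly NAME.
show_affinity() {
    for pid in $(pgrep -x "$1"); do
        affinity_of "$pid"
    done
}

# The executable name wrf.exe is an assumption; substitute your own.
show_affinity wrf.exe
```

A process that is not bound should list every core on the node (for example 0-11), while bound processes list only the core or cores they are pinned to. Several processes from different jobs all listing the same single core would match the collision described above.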

Perhaps mvapich has a similar mechanism to mpi_paffinity_alone that you are
encountering.  man mpirun should help you figure this out, or you could ask
the cluster admin, or whoever is an expert in using mvapich in your environment.
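
As one possibility: MVAPICH2 documents an environment variable, MV2_ENABLE_AFFINITY, that controls whether it pins processes to cores. Whether that applies here is an assumption, since I do not know which MPI library or version is installed on your cluster. A sketch of what to try in the job script before the launch line:

```shell
# Assumption: the MPI library is MVAPICH2. Its user guide documents the
# MV2_ENABLE_AFFINITY variable; 0 asks the library not to pin processes.
export MV2_ENABLE_AFFINITY=0

# Launch as usual afterwards; whether the variable reaches the remote ranks
# depends on the launcher in use, so check it in your own environment.
echo "MV2_ENABLE_AFFINITY=${MV2_ENABLE_AFFINITY}"
```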

Below, I have included part of the General run-time tuning portion of the FAQ for OpenMPI
from http://www.open-mpi.org/faq/

I hope this helps

- Jim

James Coyle, PhD
High Performance Computing Group
 Iowa State Univ.
 web: http://jjc.public.iastate.edu/<http://www.public.iastate.edu/~jjc>

Open MPI 1.2 offers only crude control, with the MCA parameter "mpi_paffinity_alone". For example:
$ mpirun --mca mpi_paffinity_alone 1 -np 4 a.out

(Just like any other MCA parameter, mpi_paffinity_alone can be set
via any of the normal MCA parameter mechanisms <http://www.open-mpi.org/faq/?category=tuning#setting-mca-params>.)

On each node where your job is running, your job's MPI processes will be bound, one-to-one, in the order of their global MPI ranks, to the lowest-numbered processing units (for example, cores or hardware threads) on the node as identified by the OS. Further, memory affinity will also be enabled if it is supported on the node, as described in a different FAQ entry <http://www.open-mpi.org/faq/?category=tuning#maffinity-defs>.

If multiple jobs are launched on the same node in this manner, they will compete for the same processing units and severe performance degradation will likely result. Therefore, this MCA parameter is best used when you know your job will be "alone" on the nodes where it will run.

Since each process is bound to a single processing unit, performance will likely suffer catastrophically if processes are multi-threaded.
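
(Not part of the FAQ text.) If the parameter turns out to be on by default on nodes that are shared between jobs, it can be switched off through the same MCA mechanisms. A minimal sketch, assuming the Open MPI 1.2-era parameter name quoted above; a.out is the FAQ's placeholder executable:

```shell
# Per-session: Open MPI reads MCA parameters from OMPI_MCA_<name> variables.
export OMPI_MCA_mpi_paffinity_alone=0
echo "OMPI_MCA_mpi_paffinity_alone=${OMPI_MCA_mpi_paffinity_alone}"

# Per-run alternative, same flag syntax as the FAQ example above:
#   mpirun --mca mpi_paffinity_alone 0 -np 4 a.out
#
# Per-user alternative: add this line to $HOME/.openmpi/mca-params.conf
#   mpi_paffinity_alone = 0
```

Setting it in the job script keeps the change local to the jobs that share nodes, so single-job runs can still use the binding for its cache benefit.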

From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Zulauf, Michael
Sent: Thursday, February 09, 2012 12:30 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] problem with jobs sharing cores

Hi all. . .

I apologize if this message appears more than once - there was an issue with my email address and list registration (which I hope is now fixed), and so I'm having to resend this. . .

Anyway, where I work, we've had a problem for a while that we haven't been able to resolve.  I'm not certain of the cause - if it's related to Torque, or Maui, or something else.  But here goes. . .

We've got a small cluster of 16 nodes, each with dual hex-core processors.  12 cores per node, 192 cores total.  The problem is that if I launch small jobs, where multiple jobs should be able to share a node without sharing cores, I instead get cores that are running more than one process, while other cores are idle.  The primary executable is WRF (weather prediction model), but the problem occurs for other parallel codes.  The codes have been built to utilize MPI (not OpenMP, or MPI/OpenMP).

As an example, if I launch a series of jobs which request 4 cores each, I get 3 jobs assigned to each node.  That should be fine, as each node has 12 cores, and there should be no need to share cores.  Instead, I get 4 "overloaded" cores (each running 3 processes) and 8 idle cores.  Obviously not an ideal situation.  If I submit only a single small job, in which case it's alone on a node, then it runs great.  Similarly, if I launch a large job which spans more than one node, it also works well - as long as it's not sharing nodes with other jobs.  The problem only occurs (and always occurs) when parallel jobs share a node.  BTW, the qsub command does not explicitly request specific cores, or anything like that.

I'm not the administrator - just the primary user.  The administrator (who was not previously familiar with Torque/Maui) has been struggling with this for a bit, and is rather busy with other duties, so I thought I'd check in here to see if anybody had suggestions I could pass along.

Here are some specifics, as far as I know them:
      HP blade hardware
      dual Intel Xeon X5670 processors
      Infiniband interconnect (not an issue in this case?)
      the CentOS equivalent of Red Hat 4.1.2-48 (not sure of what that is exactly)
      Torque 3.0.2
      PGI 7.2-5 compilers
      WRF 3.3.1

Any thoughts?  I've probably left out relevant information.  If so, please ask for clarification.


Mike Zulauf
Meteorologist, Lead Senior
Asset Optimization
Iberdrola Renewables
1125 NW Couch, Suite 700
Portland, OR 97209
Office: 503-478-6304  Cell: 503-913-0403
