[torqueusers] problem with jobs sharing cores

Zulauf, Michael Michael.Zulauf at iberdrolaren.com
Thu Feb 9 11:30:09 MST 2012


Hi all. . .

 

I apologize if this message appears more than once - there was an issue
with my email address and list registration (which I hope is now fixed),
and so I'm having to resend this. . .

 

Anyway, where I work, we've had a problem for a while that we haven't
been able to resolve.  I'm not certain whether the cause is Torque,
Maui, or something else.  But here goes. . .

 

We've got a small cluster of 16 nodes, each with dual hex-core
processors: 12 cores per node, 192 cores total.  The problem is that if
I launch small jobs, where multiple jobs should be able to share a node
without sharing cores, I instead get cores that are running more than
one process while other cores sit idle.  The primary executable is WRF
(a weather prediction model), but the problem also occurs with other
parallel codes.  All of the codes are built as pure MPI (no OpenMP and
no hybrid MPI/OpenMP).

 

As an example, if I launch a series of jobs which request 4 cores each,
I get 3 jobs assigned to each node.  That should be fine, as each node
has 12 cores, and there should be no need to share cores.  Instead, I
get 4 "overloaded" cores (each running 3 processes) and 8 idle cores.
Obviously not an ideal situation.  If I submit only a single small job,
in which case it's alone on a node, then it runs great.  Similarly, if I
launch a large job which spans more than one node, it also works well -
as long as it's not sharing nodes with other jobs.  The problem only
occurs (and always occurs) when parallel jobs share a node.  BTW, the
qsub command does not explicitly request specific cores, or anything
like that.
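
To make the arithmetic concrete, here is a minimal Python sketch of the
two placements.  It assumes, purely as a guess, that each job puts its
ranks on the lowest-numbered cores of the node without regard to the
other jobs (for example through some form of core pinning); nothing
I've seen so far confirms that this is the actual cause.  The function
name core_load and its parameters are made up for illustration.

```python
def core_load(jobs, ranks_per_job, cores_per_node, overlap):
    """Return a list with the number of processes on each core of one node.

    overlap=True  : ASSUMPTION, every job starts placing its ranks at core 0.
    overlap=False : each job gets its own disjoint block of cores.
    """
    load = [0] * cores_per_node
    for job in range(jobs):
        # First core used by this job under the chosen placement model.
        first_core = 0 if overlap else job * ranks_per_job
        for rank in range(ranks_per_job):
            load[(first_core + rank) % cores_per_node] += 1
    return load


if __name__ == "__main__":
    # Three 4-core jobs on one 12-core node, as in the example above.
    print("overlapping:", core_load(3, 4, 12, overlap=True))
    print("disjoint:   ", core_load(3, 4, 12, overlap=False))
```

With overlap=True the result is 4 cores carrying 3 processes each and 8
idle cores, which matches what I observe.  With overlap=False every
core carries exactly one process, which is what I expected to get.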

 

I'm not the administrator - just the primary user.  The administrator
(who was not previously familiar with Torque/Maui) has been struggling
with this for a bit, and is rather busy with other duties, so I thought
I'd check in here to see if anybody had suggestions I could pass along.

 

Here are some specifics, as far as I know them:

      HP blade hardware
      dual Intel Xeon X5670 processors
      Infiniband interconnect (not an issue in this case?)
      the CentOS equivalent of Red Hat 4.1.2-48 (not sure of what that
        is exactly)
      Torque 3.0.2
      mvapich2-1.7rc1
      PGI 7.2-5 compilers
      WRF 3.3.1

 

Any thoughts?  I've probably left out relevant information.  If so,
please ask for clarification.

 

Thanks,

Mike

 

-- 

Mike Zulauf

Meteorologist, Lead Senior

Asset Optimization 

Iberdrola Renewables

1125 NW Couch, Suite 700

Portland, OR 97209

Office: 503-478-6304  Cell: 503-913-0403

 



