[torqueusers] problem with jobs sharing cores
Michael.Zulauf at iberdrolaren.com
Mon Feb 13 16:22:36 MST 2012
A big thanks to Ken Nielson, Jim Coyle, and Fotis Georgatos - based on
their help I made some significant progress today. I haven't completely
worked out all the details yet, but I've found that by switching to
OpenMPI, I do not get the same problematic behavior. So it seems most
likely that the source of the problem has something to do with a
configuration detail of our mvapich2 installation.
According to some earlier benchmarking I'd done, mvapich2 seems to offer
better performance across our InfiniBand interconnect, so I'd like to
see if I can get to the bottom of the issue with that. Alternatively, I
could use mvapich2 for the "large" jobs (which span multiple nodes), and
use OpenMPI for the "small" jobs (which will share a node with other
jobs). I'd prefer to avoid the "dual MPI" alternative, as then we'd
have to build all executables twice, and some of them are a bit tricky.
Still, I suppose it's an option.
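For what it's worth, the "dual MPI" alternative could be sketched roughly as below. This is only a minimal illustration, not something from our actual setup: the function names and the "mvapich2"/"openmpi" labels are hypothetical, and it assumes the job runs under TORQUE, which sets PBS_NODEFILE to a file listing one hostname per allocated core. A job script could call something like this to decide which build of an executable to launch.

```python
import os


def count_nodes(nodefile_path):
    """Return the number of distinct hosts listed in a TORQUE node file.

    The node file repeats a hostname once per allocated core, so the
    hosts are collected into a set before counting.
    """
    with open(nodefile_path) as f:
        hosts = {line.strip() for line in f if line.strip()}
    return len(hosts)


def choose_mpi(nodefile_path):
    """Hypothetical policy: mvapich2 for multi-node jobs, openmpi otherwise."""
    return "mvapich2" if count_nodes(nodefile_path) > 1 else "openmpi"


if __name__ == "__main__":
    # PBS_NODEFILE is set by TORQUE inside a running job.
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile:
        print(choose_mpi(nodefile))
```

The drawback mentioned above still applies: every executable would need to be built against both MPI installations.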
In any case, thanks again. Now maybe I can go haunt the mvapich2 lists,
or at least start trying to dig up the solution in that documentation.
Meteorologist, Lead Senior
1125 NW Couch, Suite 700
Portland, OR 97209
Office: 503-478-6304 Cell: 503-913-0403