[torqueusers] problem with jobs sharing cores

Ken Nielson knielson at adaptivecomputing.com
Thu Feb 9 13:27:59 MST 2012


----- Original Message -----
> From: "Michael Zulauf" <Michael.Zulauf at iberdrolaren.com>
> To: torqueusers at supercluster.org
> Sent: Thursday, February 9, 2012 11:30:09 AM
> Subject: [torqueusers] problem with jobs sharing cores
> 
> Hi all. . .
> 
> I apologize if this message appears more than once – there was an
> issue with my email address and list registration (which I hope is
> now fixed), and so I’m having to resend this. . .
> 
> Anyway, where I work, we’ve had a problem for a while that we haven’t
> been able to resolve. I’m not certain of the cause - if it’s related
> to Torque, or Maui, or something else. But here goes. . .
> 
> We’ve got a small cluster of 16 nodes, each with dual hex-core
> processors. 12 cores per node, 192 cores total. The problem is that
> if I launch small jobs, where multiple jobs should be able to share
> a node without sharing cores, I instead get cores that are running
> more than one process, while other cores are idle. The primary
> executable is WRF (weather prediction model), but the problem occurs
> for other parallel codes. The codes have been built to utilize MPI
> (not OpenMP, or MPI/OpenMP).
> 
> 

You do not really schedule individual cores with Maui/TORQUE, or with any other scheduler/resource manager. However, there are ways to make sure your job gets cores of its own. In TORQUE, use CPUSETs: http://www.adaptivecomputing.com/resources/docs/torque/2-5-9/3.5linuxcpusets.php

When you set the np count in the nodes file, it is not physically tied to the number of processors on the node. It is simply a count that says "I have this many execution slots available on this node." The vast majority of sites set np to the number of cores on the node. Even then, however, once jobs are scheduled their processes are managed by the OS, which will run them on whichever cores it sees fit. CPUSETs allow a job to reserve one or more cores exclusively. The job will not run outside its CPUSET, and no other processes can use that CPUSET either.
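To see the difference between the np slot count and the cores a process may actually run on, something like the following rough Python sketch could be run from inside a job. It is not part of TORQUE: the helper names parse_np and allowed_cores and the sample nodes-file entry "node01 np=12" are made up for illustration, and os.sched_getaffinity is only available on Linux.

```python
import os
import re


def parse_np(nodes_line):
    """Return the np value from one nodes-file line, or None if np is absent."""
    match = re.search(r"\bnp=(\d+)", nodes_line)
    return int(match.group(1)) if match else None


def allowed_cores():
    """Return the set of core IDs the OS may schedule this process on (Linux only)."""
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Hypothetical nodes-file entry; np is only a slot count, not a core binding.
    np_slots = parse_np("node01 np=12")
    print("execution slots (np):", np_slots)
    print("cores visible to the OS:", os.cpu_count())
    # Without any confinement this is normally every core on the node;
    # inside a cpuset it should shrink to the cores reserved for the job.
    print("cores this process may run on:", sorted(allowed_cores()))
```

If the last line lists all 12 cores for every job on a shared node, the jobs are not confined and the OS is free to stack their processes on the same cores.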

Ken

