[torqueusers] cpusets: Controlling the "mems" allocation for jobs

Michael Sternberg sternberg at anl.gov
Wed Mar 27 16:00:21 MDT 2013


Hello users,

Is there a way to control the "mems" component of cpusets in user-space by parameters to qsub?


Background: On occasion users here need to run jobs whose per-process memory demand exceeds a node's per-core share of memory. In other words, they trade CPU for memory by leaving a few cores idle on a node and dividing the node's entire memory among the fewer active processes. The following approach, leveraging Moab's "naccesspolicy" parameter, worked well (on 8-core nodes, say) before cpusets:

	-------------------------------------------------
	#PBS -l nodes=N:ppn=2
	#PBS -l naccesspolicy=SINGLEJOB
	...
	mpirun -machinefile $PBS_NODEFILE -np $PBS_NP \
		fooprog
	-------------------------------------------------

However, now with cpusets in use (which I like because of the memory pressure monitoring + kill feature), jobs that request ppn < n_cores will only get a *commensurate* amount of memory. Specifically, I have nodes with 2 memory banks and 2 CPUs of 4 cores each. Torque releases only one of these banks for jobs with ppn <= 4.

For instance, a job with

	#PBS -l nodes=1:ppn=2

   gets, per the execution node's syslog:

	pbs_mom: LOG_INFO::create_job_cpuset, creating cpuset for job 307606.sched1.carboncluster: 2 cpus (0,1), 1 mems (0)

Conversely, a job with

	#PBS -l nodes=1:ppn=5

   gets:

	pbs_mom: LOG_INFO::create_job_cpuset, creating cpuset for job 307607.sched1.carboncluster: 5 cpus (0-4), 2 mems (0,1)

Note 1 mems vs. 2 mems. If a job running on 1 mems needs more than 50% physical memory the node starts swapping - understandable but annoying.
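
For anyone who wants to check this on a live node: the cpuset that pbs_mom creates can be inspected directly. I believe the usual layout is a cpuset mount at /dev/cpuset with per-job directories under torque/, but treat the paths below as an assumption and adjust for your setup:

	-------------------------------------------------
	# On the execution node, for a running job (job id taken from the
	# syslog example above; paths assume /dev/cpuset/torque/<jobid>):
	JOBID=307606.sched1.carboncluster
	cat /dev/cpuset/torque/$JOBID/cpus   # cores the job is pinned to, e.g. 0,1
	cat /dev/cpuset/torque/$JOBID/mems   # memory banks it may allocate from, e.g. 0
	numactl --hardware                   # the node's NUMA banks and their sizes
	-------------------------------------------------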


At the moment I suggest that users simply request entire nodes at the TORQUE level with ppn=8 and use the "mpirun -npernode M" option (M = 1 to 7) to restrict usage under MPI:

	-------------------------------------------------
	#PBS -l nodes=N:ppn=8
	...
	my_ppn=3
	mpirun -machinefile $PBS_NODEFILE -npernode $my_ppn \
		fooprog
	-------------------------------------------------

The drawback is that this is specific to [Open]MPI. I'd love to see a solution that sets up a ready-made PBS_NODEFILE exactly as requested from qsub, including cases with multiple requests like "nodes=1:ppn=1+2:ppn=4". The "naccesspolicy" solution did that, but short-changed jobs on memory. I used to have a helper script, ppnpick, to "thin out" $PBS_NODEFILE, but that means extra steps in the job file, such as creating an alternate machinefile (a rough sketch of the idea follows).
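
Roughly, the thinning looks like this - not the actual ppnpick script, just a sketch that keeps the first $my_ppn entries per host from the node file (TORQUE lists each host once per allocated core):

	-------------------------------------------------
	#PBS -l nodes=N:ppn=8
	...
	my_ppn=3
	machinefile=machinefile.$PBS_JOBID
	# keep only the first $my_ppn lines per host from $PBS_NODEFILE
	awk -v m=$my_ppn '++seen[$1] <= m' $PBS_NODEFILE > $machinefile
	mpirun -machinefile $machinefile -np $(wc -l < $machinefile) \
		fooprog
	-------------------------------------------------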

I also tried the various [p][v]mem resources, but they did not seem to do the trick - I never had success working with them (per process, per node, or otherwise).
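
For reference, the sort of request I mean looks like the following (the size is purely illustrative):

	-------------------------------------------------
	#PBS -l nodes=1:ppn=2
	#PBS -l pvmem=12gb     # per-process virtual memory; size illustrative
	...
	-------------------------------------------------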


I'm using TORQUE-4.1.4 on CentOS-5.8.


With best regards,
Michael

