[torqueusers] cpusets: Controlling the "mems" allocation for jobs

David Beer dbeer at adaptivecomputing.com
Wed Mar 27 22:44:18 MDT 2013


Michael, 

Have you tried submitting with -n on your qsub line? That flag is meant to grant exclusive access to the node, and I believe it expands the cpuset for the job.
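
For example, something along these lines (job.sh here is just a placeholder for your script):

	qsub -n -l nodes=1:ppn=2 job.sh

If it behaves the way I expect, the job's cpuset should then cover all of the node's cores and both memory banks, while $PBS_NODEFILE still lists only the two slots requested.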

David

Michael Sternberg <sternberg at anl.gov> wrote:

>Hello users,
>
>Is there a way to control the "mems" component of cpusets in user-space by parameters to qsub?
>
>
>Background: On occasion, users here need to run jobs whose per-process memory requirement exceeds a node's per-core share. In other words, they trade CPU for memory by leaving a few cores idle on a node and dividing the entire memory among the fewer active cores. The following approach, leveraging Moab's "naccesspolicy" parameter, worked well (on 8-core nodes, say) before cpusets:
>
>	-------------------------------------------------
>	#PBS -l nodes=N:ppn=2
>	#PBS -l naccesspolicy=SINGLEJOB
>	...
>	mpirun -machinefile $PBS_NODEFILE -np $PBS_NP \
>		fooprog
>	-------------------------------------------------
>
>However, now with cpusets in use (which I like because of the memory pressure monitoring + kill feature), jobs that request ppn < n_cores will only get a *commensurate* amount of memory. Specifically, I have nodes with 2 memory banks and 2 CPUs of 4 cores each. Torque releases only one of these banks for jobs with ppn <= 4.
>
>For instance, a job with
>
>	#PBS -l nodes=1:ppn=2
>
>   produces this entry in the execution node's syslog:
>
>	pbs_mom: LOG_INFO::create_job_cpuset, creating cpuset for job 307606.sched1.carboncluster: 2 cpus (0,1), 1 mems (0)
>
>Conversely, a job with
>
>	#PBS -l nodes=1:ppn=5
>    produces:
>	pbs_mom: LOG_INFO::create_job_cpuset, creating cpuset for job 307607.sched1.carboncluster: 5 cpus (0-4), 2 mems (0,1)
>
>Note the 1 mems vs. 2 mems. If a job confined to 1 mems needs more than 50% of physical memory, the node starts swapping - understandable, but annoying.
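>
>For reference, the same information can be read directly from the cpuset on
>the execution node while the job runs (a quick check, assuming the cpuset
>filesystem is mounted at /dev/cpuset and TORQUE uses its usual
>torque/<jobid> directory; on newer kernels the files are named cpuset.cpus
>and cpuset.mems):
>
>	-------------------------------------------------
>	# on the execution node
>	cat /dev/cpuset/torque/307606.sched1.carboncluster/cpus    # -> 0-1
>	cat /dev/cpuset/torque/307606.sched1.carboncluster/mems    # -> 0
>	-------------------------------------------------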
>
>
>At the moment I suggest that users simply request entire nodes at the TORQUE level with ppn=8 and use the "mpirun -npernode M" option (M = 1 to 7) to restrict usage under MPI:
>
>	-------------------------------------------------
>	#PBS -l nodes=N:ppn=8
>	...
>	my_ppn=3
>	mpirun -machinefile $PBS_NODEFILE -npernode $my_ppn \
>		fooprog
>	-------------------------------------------------
>
>The drawback is that this is specific to [Open]MPI. I'd love to see a solution that sets up a ready-made PBS_NODEFILE exactly as requested from qsub, including cases with multiple requests like "nodes=1:ppn=1+2:ppn=4". The "naccesspolicy" solution did that, but short-changed the jobs on memory. I used to have a helper script, ppnpick, to "thin out" $PBS_NODEFILE, but that means extra steps in the job file, such as creating an alternate machinefile - roughly as sketched below.
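>
>A minimal, MPI-agnostic sketch of that thinning step (assuming the usual
>$PBS_NODEFILE layout of one line per allocated core; fooprog and the
>machinefile name are placeholders):
>
>	-------------------------------------------------
>	#PBS -l nodes=N:ppn=8
>	...
>	cd $PBS_O_WORKDIR
>	my_ppn=3
>	# list each allocated node my_ppn times instead of ppn=8 times
>	machinefile=machines.$PBS_JOBID
>	sort -u $PBS_NODEFILE | awk -v n=$my_ppn '{for (i=0; i<n; i++) print}' > $machinefile
>	mpirun -machinefile $machinefile -np $(wc -l < $machinefile) \
>		fooprog
>	-------------------------------------------------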
>
>I tried the various [p][v]mem resources, but they did not seem to help - I never had success working with those (per-process, per-node, or similar).
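>
>For the record, the kind of request I mean looks like this (numbers purely
>illustrative; as I understand TORQUE's accounting, pmem limits each process
>while mem/vmem limit the job as a whole):
>
>	-------------------------------------------------
>	#PBS -l nodes=1:ppn=2,pmem=12gb
>	-------------------------------------------------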
>
>
>I'm using TORQUE-4.1.4 on CentOS-5.8.
>
>
>With best regards,
>Michael
>_______________________________________________
>torqueusers mailing list
>torqueusers at supercluster.org
>http://www.supercluster.org/mailman/listinfo/torqueusers
