TORQUE Administrator's Manual - 3.5 Linux Cpuset Support

3.5 Linux Cpuset Support

3.5.1 Cpuset Overview

Linux kernel 2.6 cpusets are logical, hierarchical groupings of CPUs and units of memory. Once a cpuset has been created, individual processes can be placed within it; those processes are then allowed to run only on the cpuset's CPUs and to access only its memory. Cpusets are managed through a virtual filesystem mounted at /dev/cpuset. A new cpuset is created by making a new directory, and it gains CPUs and memory units when their unit numbers are written to files within that directory.
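The directory-and-file mechanics described above can be sketched with a few shell functions. This is a minimal illustration, not TORQUE code: the function names and the cpuset name "demo" are hypothetical, and the control-file names cpus, mems and tasks are those exposed by the legacy cpuset filesystem when it is mounted with -t cpuset.

```shell
#!/bin/sh
# Root of the mounted cpuset filesystem (override for experiments).
CPUSET_ROOT="${CPUSET_ROOT:-/dev/cpuset}"

# make_cpuset NAME CPUS MEMS
# Create a cpuset and give it CPUs and memory units,
# e.g. make_cpuset demo 0-1 0
make_cpuset() {
    name=$1
    cpus=$2
    mems=$3
    mkdir -p "$CPUSET_ROOT/$name" || return 1
    # Both files must be populated before a process can be attached.
    echo "$cpus" > "$CPUSET_ROOT/$name/cpus" || return 1
    echo "$mems" > "$CPUSET_ROOT/$name/mems" || return 1
}

# attach_pid NAME PID
# Place one process in the cpuset by writing its PID to the tasks file.
attach_pid() {
    echo "$2" > "$CPUSET_ROOT/$1/tasks"
}
```

On a real host these functions need root privileges and a mounted cpuset filesystem. For example, `make_cpuset demo 0-1 0` followed by `attach_pid demo $$` would confine the current shell to CPUs 0 and 1.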

TORQUE’s Cpuset support is new in 2.3.x and should be regarded as experimental and under development.

3.5.2 Cpuset Support

When pbs_mom starts, it creates an initial top-level cpuset at /dev/cpuset/torque. This cpuset contains all CPUs and memory of the host machine. If this “torqueset” already exists, it is left unchanged so that the administrator can override the default behaviour. All subsequent cpusets are created within the torqueset.

When a job is started, the “jobset” is created at /dev/cpuset/torque/$jobid and populated with the CPUs listed in the exec_host job attribute. An individual “taskset” is also created within the jobset for each of its CPUs. This is done on every node of the job, before the prologue runs, so the prologue can easily modify the cpusets.

The top-level batch script process is executed in the jobset. Tasks launched through the TM interface (pbsdsh and PW’s mpiexec) will be executed within the appropriate taskset.

On job exit, all tasksets and the jobset are deleted.
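The jobset lifecycle described in this section can be approximated in shell as follows. This is a sketch under stated assumptions, not the pbs_mom implementation: the function names are hypothetical, exec_host is assumed to have the form host/index+host/index, and naming each taskset directory after its CPU index is an assumption made for illustration.

```shell
#!/bin/sh
# Location of the torqueset (override for experiments).
TORQUESET="${TORQUESET:-/dev/cpuset/torque}"

# cpus_for_host EXEC_HOST HOST
# Print the CPU indices that EXEC_HOST assigns to HOST, one per line.
cpus_for_host() {
    printf '%s\n' "$1" | tr '+' '\n' |
        awk -F/ -v h="$2" '$1 == h { print $2 }'
}

# create_jobset JOBID EXEC_HOST HOST
# Create the jobset plus one taskset per CPU (taskset naming is an assumption).
create_jobset() {
    jobid=$1
    exec_host=$2
    host=$3
    jobset="$TORQUESET/$jobid"
    cpus=$(cpus_for_host "$exec_host" "$host" | sort -n | paste -sd, -)
    [ -n "$cpus" ] || return 1
    # All memory units of the torqueset are copied; memory is not partitioned.
    mems=$(cat "$TORQUESET/mems")
    mkdir -p "$jobset" || return 1
    echo "$cpus" > "$jobset/cpus"
    echo "$mems" > "$jobset/mems"
    for cpu in $(echo "$cpus" | tr ',' ' '); do
        mkdir -p "$jobset/$cpu"
        echo "$cpu" > "$jobset/$cpu/cpus"
        echo "$mems" > "$jobset/$cpu/mems"
    done
}

# remove_jobset JOBID
# Remove the tasksets first, then the jobset. On the cpuset filesystem,
# rmdir fails while a cpuset still contains tasks.
remove_jobset() {
    jobset="$TORQUESET/$1"
    for t in "$jobset"/*/; do
        [ -d "$t" ] && rmdir "$t"
    done
    rmdir "$jobset"
}
```

A prologue script could adjust the resulting files, for instance by writing a different CPU list to a taskset's cpus file, since the cpusets exist before the prologue runs.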

3.5.3 Cpuset Configuration

At the moment, there are no run-time configuration options. Cpuset support is disabled by default at build time; run configure with --enable-cpuset if you would like to test the code.

If cpuset support is built in but pbs_mom runs on a machine without kernel cpuset support, pbs_mom silently skips it.

A run-time pbs_mom boolean to enable or disable cpuset support does not exist yet and still needs to be added.

On the Linux host, the virtual filesystem must be mounted:

mount -t cpuset none /dev/cpuset
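The mount point must exist before the filesystem can be mounted, and it is useful to avoid mounting twice. The check below is a minimal sketch: the function name is hypothetical, and accepting the filesystem type cgroup as well as cpuset is an assumption that newer kernels may report the mount that way.

```shell
#!/bin/sh
# is_cpuset_mounted [MOUNTS_FILE] [MOUNTPOINT]
# Succeed if a cpuset filesystem is mounted at MOUNTPOINT.
is_cpuset_mounted() {
    mounts=${1:-/proc/mounts}
    mountpoint=${2:-/dev/cpuset}
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk -v mp="$mountpoint" '
        $2 == mp && ($3 == "cpuset" || $3 == "cgroup") { found = 1 }
        END { exit !found }
    ' "$mounts"
}

# Typical use, as root:
#   mkdir -p /dev/cpuset
#   is_cpuset_mounted || mount -t cpuset none /dev/cpuset
```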

3.5.4 Cpuset Advantages / Disadvantages

Presently, any job can request a single CPU and proceed to use everything available in the machine. This is occasionally done to circumvent policy, but most often is simply an error on the part of the user. Cpuset support confines a job's processes to the CPUs assigned to it, so they cannot interfere with other jobs.

Jobs on larger NUMA systems may see a performance boost if they can be intelligently assigned to specific CPUs. Jobs may perform better if striped across physical processors, or contained within the fewest number of memory controllers.

Each TM task is constrained to a single core, so a multi-threaded process launched as a TM task has all of its threads competing for that one core and could seriously suffer.
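To see how a given process is actually confined, the kernel's per-process files can be inspected. The helpers below are a small sketch with hypothetical function names. They assume a kernel that reports Cpus_allowed_list in /proc/PID/status and provides /proc/PID/cpuset, which may not hold on older 2.6 kernels.

```shell
#!/bin/sh
# allowed_cpus [STATUS_FILE]
# Print the list of CPUs a process may run on (default: the calling process).
allowed_cpus() {
    status=${1:-/proc/self/status}
    awk '/^Cpus_allowed_list:/ { print $2 }' "$status"
}

# current_cpuset [CPUSET_FILE]
# Print the path of the cpuset a process belongs to,
# relative to the cpuset filesystem root.
current_cpuset() {
    cat "${1:-/proc/self/cpuset}"
}
```

Run from inside a job script, `allowed_cpus` would be expected to show the CPUs of the jobset, while the same call in a task started through pbsdsh would be expected to show a single CPU.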

3.5.5 Cpuset TODO

  • The code is quite ugly and requires cleanup with correct error handling.
  • No attempt is made to be “smart” about the CPU assignment. We need a mechanism to expose the physical topology to the scheduler, and let the scheduler assign CPUs. Currently, pbs_server assigns “subnodes” to jobs, which are analogous to CPUs.

Proposal: pbs_mom “stringifies” the topology in some unambiguous format to a new node attribute. Moab will be able to read this info, and probably have its own way of supplying/overriding this information in its own configuration. Moab can set a new job attribute with a string that specifies the CPUs to assign. Then pbs_mom will ignore exec_host and use this job attribute instead.

  • Memory isn’t handled at all. All memory units are added to all jobsets and tasksets. So far, it is unclear how memory should be handled.
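As an illustration of the topology proposal above, the following sketch builds one possible string from the NUMA information that Linux exposes in sysfs. Everything here is hypothetical: the function name, the output format (node:cpulist entries joined by semicolons) and the idea of using /sys/devices/system/node as the source are assumptions, not a format that TORQUE or Moab defines.

```shell
#!/bin/sh
# stringify_topology [SYSFS_NODE_DIR]
# Print a line such as "node0:0-3;node1:4-7" (hypothetical format)
# built from the per-node cpulist files.
stringify_topology() {
    sysnode=${1:-/sys/devices/system/node}
    out=
    for d in "$sysnode"/node[0-9]*; do
        # Skip anything that is not a readable NUMA node entry.
        [ -r "$d/cpulist" ] || continue
        entry="${d##*/}:$(cat "$d/cpulist")"
        out="${out:+$out;}$entry"
    done
    printf '%s\n' "$out"
}
```

A string of this kind could be published as a node attribute and parsed by the scheduler, which would then return the chosen CPU list in a job attribute.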