[torqueusers] [PATCH 0/3] use cgroup to limit the cpu and memory usage of jobs

levin li levin108 at gmail.com
Thu Nov 29 20:00:02 MST 2012


Craig,

Using cgroups at the OS level is a different strategy: with it we can 
assign fixed resources to a user. That is not what we need. Our goal in 
using cgroups is to limit each job on a given node, so that it cannot 
consume so many resources that it affects other jobs running on the same 
node.
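
To make the per-job idea concrete, here is a rough sketch of what 
confining one job by hand could look like. It is not the code in 
mom_cgroup.c. It assumes the cgroup v1 hierarchy mounted at /dev/torque 
as described in the quoted mail below; the function names, the per-job 
directory name and the CGROUP_ROOT variable are only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the patchset.
# CGROUP_ROOT defaults to the mount point used in this thread.
CGROUP_ROOT="${CGROUP_ROOT:-/dev/torque}"

job_cgroup_create() {
    # usage: job_cgroup_create JOBID CPUS MEMS MEM_LIMIT_BYTES
    jobdir="$CGROUP_ROOT/$1"
    mkdir -p "$jobdir" || return 1
    # In a cgroup v1 cpuset, cpus and mems must be set before any
    # task can be attached to the group.
    echo "$2" > "$jobdir/cpuset.cpus" || return 1
    echo "$3" > "$jobdir/cpuset.mems" || return 1
    # Upper bound on the memory charged to all tasks in this group.
    echo "$4" > "$jobdir/memory.limit_in_bytes" || return 1
}

job_cgroup_attach() {
    # usage: job_cgroup_attach JOBID PID
    # Writing a PID to "tasks" moves that task into the group.
    echo "$2" > "$CGROUP_ROOT/$1/tasks"
}
```

For example, "job_cgroup_create 1234.server 0-1 0 $((60 * 1024 * 1024))" 
followed by "job_cgroup_attach 1234.server $$" would put the current 
shell into a group limited to CPUs 0-1 and 60 MB (the job id here is 
made up). Children forked afterwards stay in the same group, which is 
why limiting the job as a whole works.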

And yes, processes launched through ssh are not controlled by pbs_mom, 
so they are not currently confined by the cgroup either. But in a 
cluster whose resources are managed by torque, I think users should 
start their jobs under the control of pbs_mom.
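
If you want to check where a given process ended up, something like 
the following sketch should do. It reads /proc/PID/cgroup, whose lines 
have the form "hierarchy-ID:controller-list:path" on cgroup v1. The 
function name and the PROC_ROOT variable are my own additions for 
illustration, not part of the patchset.

```shell
#!/bin/sh
# Hypothetical helper: print the cgroup path of a process for the
# hierarchy that carries the given controller (e.g. "memory").
# PROC_ROOT exists only so the function can be pointed at test data.
PROC_ROOT="${PROC_ROOT:-/proc}"

pid_cgroup_path() {
    # usage: pid_cgroup_path PID CONTROLLER
    awk -F: -v want="$2" '
        {
            # Field 2 is a comma-separated controller list.
            n = split($2, ctrl, ",")
            for (i = 1; i <= n; i++) {
                if (ctrl[i] == want) {
                    # Rejoin fields 3..NF in case the path contains ":".
                    path = $3
                    for (j = 4; j <= NF; j++) path = path ":" $j
                    print path
                    exit
                }
            }
        }
    ' "$PROC_ROOT/$1/cgroup"
}
```

Running "pid_cgroup_path $$ memory" inside a job script would be 
expected to print the job's cgroup, while a process that came in 
through sshd inherits whatever cgroup sshd is in, typically "/".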


Thanks,

Levin

On 2012-11-30 02:23, Craig Tierney - NOAA Affiliate wrote:
> Levin,
>
> Would you mind explaining why I would want to do this?  We are starting
> to use cgroups at the OS level to make sure that all user processes
> (sum) cannot blow out physical memory (as the nodes are diskless and
> have no swap).  Do your patches just make sure that cgroups are used
> to restrict memory usage from processes launched from the pbs_mom?
>
> What if the process launched ssh, which ran a process?  Would the
> process still be constrained under cgroups?
>
> Thanks,
> Craig
>
>
> On Tue, Nov 20, 2012 at 2:02 AM, levin li <levin108 at gmail.com> wrote:
>
>     We have been using torque for a while, but it has no memory
>     isolation at present. Since cgroups are now widely used for
>     resource isolation, I wrote this patchset for torque to isolate
>     cpusets and memory between jobs. cgroups can also replace
>     libcpuset, which is more complicated to use.
>
>     With this patchset, we can use qmgr to enable/disable cgroup:
>
>     Enable:
>
>     qmgr -c "set server cgroup_enable = True"
>
>     Disable:
>
>     qmgr -c "set server cgroup_enable = False"
>
>     Before using cgroup in torque, we must mount the cgroup hierarchy:
>
>     mkdir /dev/torque
>     mount -t cgroup -o cpuset,memory torque /dev/torque
>
>     I wrote an MPI program to test this. The job script:
>
>     -------------------------------------
>     #!/bin/bash
>
>     #PBS -l nodes=11:ppn=1,pmem=60
>
>     mpirun --hostfile $PBS_NODEFILE ./mpi
>     ------------------------------------
>
>     We then submit two jobs like this. Each MPI process mallocs 100M
>     of memory and, to make sure it is backed by physical memory,
>     writes 100M of data to that chunk. Here are the test results:
>
>     Before cgroup is enabled:
>
>     [root@vkvm050]# ps -e -o args,psr,rss
>     orted -mca ess env -mca ort   0  2292
>     ./mpi                         1 106504
>     orted -mca ess env -mca ort   0  2292
>     ./mpi                         0 106500
>
>     After cgroup is enabled:
>
>     [root@vkvm050]# ps -e -o args,psr,rss
>     orted -mca ess env -mca ort   1  1860
>     ./mpi                         1 46448
>     orted -mca ess env -mca ort   0  1860
>     ./mpi                         0 59856
>
>
>     Thanks,
>
>     levin
>
>     levin li (3):
>        pbs_server: add cgroup_enable to server attribute
>        resmom: add mom_cgroup.[c|h] to repo
>        resmom: create cgroup for jobs to limit cpu and mem usage
>
>       src/include/pbs_ifl.h         |    2 +
>       src/include/pbs_job.h         |    1 +
>       src/include/qmgr_svr_public.h |    1 +
>       src/include/server.h          |    1 +
>       src/resmom/Makefile.am        |    4 +-
>       src/resmom/Makefile.in        |    9 +-
>       src/resmom/catch_child.c      |    4 +
>       src/resmom/mom_cgroup.c       |  445 +++++++++++++++++++++++++++++++++++++++++
>       src/resmom/mom_cgroup.h       |   14 ++
>       src/resmom/mom_comm.c         |   10 +-
>       src/resmom/mom_main.c         |   21 ++-
>       src/resmom/start_exec.c       |   19 ++-
>       src/server/job_attr_def.c     |   13 ++
>       src/server/req_quejob.c       |    7 +
>       src/server/svr_attr_def.c     |   13 ++
>       15 files changed, 554 insertions(+), 10 deletions(-)
>       create mode 100644 src/resmom/mom_cgroup.c
>       create mode 100644 src/resmom/mom_cgroup.h
>
>     --
>     1.7.6.1
>
>     _______________________________________________
>     torqueusers mailing list
>     torqueusers at supercluster.org
>     http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
>


