[torquedev] [Bug 93] New: Resource management semantics of Torque need to be well defined

bugzilla-daemon at supercluster.org bugzilla-daemon at supercluster.org
Mon Oct 25 06:39:18 MDT 2010


           Summary: Resource management semantics of Torque need to be
                    well defined
           Product: TORQUE
           Version: 3.0.0-alpha
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: critical
          Priority: P5
         Component: pbs_server
        AssignedTo: glen.beane at gmail.com
        ReportedBy: SimonT at mail.muni.cz
                CC: torquedev at supercluster.org
   Estimated Hours: 0.0

Currently Torque includes inconsistent resource management semantics. These
semantics need to be redefined.

* External Schedulers *

From what I have been told (I only work with plain Torque), external schedulers
(Moab, Maui) send either a very specific nodespec or an exechost list directly
in their run requests.

If this is not true, then we need to consider what semantics external
schedulers expect from Torque.

If this is true, then these schedulers can be safely ignored (as far as
resource semantics go).

* Process (ppn) semantics *

Process semantics should be dumped completely. The only thing they are
useful for right now is limiting vmem on a per-process basis.

The number of processes isn't limited by Torque (not 100% sure here), and with
the liberal approach towards forking in most Linux software, limiting it
wouldn't be a good idea either.

* Per-job, per-node, per-process resources *

Even when the process semantics are dumped, we still need to distinguish between
per-node and per-job resources.

For example mem should definitely be a per-node resource while number of matlab
licenses should definitely be a per-job resource.
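
To illustrate the distinction, here is a minimal Python sketch (not Torque
code). The scope table, the resource name "matlab" and the function
total_demand are hypothetical names made up for this example. It only shows
how the scope changes what a job consumes overall: a per-node amount is
consumed on every node of the job, a per-job amount is consumed once.

```python
from enum import Enum


class Scope(Enum):
    PER_NODE = "per-node"
    PER_JOB = "per-job"


# Hypothetical scope table; in a real design this would come from configuration.
RESOURCE_SCOPE = {
    "mem": Scope.PER_NODE,    # megabytes requested on each node
    "matlab": Scope.PER_JOB,  # licenses requested for the whole job
}


def total_demand(request, node_count):
    """Return the overall amount of each resource the job consumes."""
    totals = {}
    for name, amount in request.items():
        if RESOURCE_SCOPE[name] is Scope.PER_NODE:
            # Consumed separately on every node the job runs on.
            totals[name] = amount * node_count
        else:
            # Consumed once, no matter how many nodes the job spans.
            totals[name] = amount
    return totals
```

For a two-node job asking for mem=4096 and matlab=1, the memory demand doubles
while the license demand stays at one.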

* Configurable with pre-set defaults or strict *

I would definitely like a configurable approach. Setting flags in the resource
definition (as done in my bug 67) is probably not the best approach (so we need
to come up with something more sane). In both cases we need to define a set of
fully internally supported resources.

This is a list of resources I consider essential:
- ncpus
- mem
- vmem
- walltime
- cputime

Plus we need some generic resources that are checked (i.e. if a job requires 4
kitchen-sinks and the node only has 2 available, then the job cannot be run), but
don't have any special semantics.

Support without semantics:
- generic per-node counted resource (counted/enforced only on server)
- generic per-job counted resource (counted/enforced only on server)
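
The server-side check for such generic counted resources could look roughly
like the Python sketch below. This is an illustration only, not existing
Torque code: the function names node_fits and can_run and the plain
dictionaries used for bookkeeping are assumptions made for this example.

```python
def node_fits(per_node_request, node_free):
    """True if one node has enough of every requested per-node resource."""
    return all(
        node_free.get(name, 0) >= amount
        for name, amount in per_node_request.items()
    )


def can_run(per_node_request, per_job_request, nodes_free, pool_free):
    """Decide on the server whether a job may start.

    nodes_free: one dict of free counted resources per node chosen for the job
    pool_free:  dict of free server-wide (per-job) counted resources
    """
    # Every chosen node must satisfy the per-node part of the request.
    if not all(node_fits(per_node_request, free) for free in nodes_free):
        return False
    # Per-job resources are checked once against the server-wide pool.
    return all(
        pool_free.get(name, 0) >= amount
        for name, amount in per_job_request.items()
    )
```

With this sketch, a job asking for 4 kitchen-sinks per node is rejected on a
node that only has 2 free, and no semantics beyond counting are attached to
the resource.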

* Cgroups - Linux specific *

I have been digging through the cgroups docs, and the good thing is that we can
replace a lot of the Linux-specific code with cgroups, which should work reliably.

Stuff that cgroups can do:
- memory (mem, vmem, oom killer configuration)
- cpusets
- devices (limiting access)
  - should work well for GPUs or any generic HW requiring dedicated access
- frozen containers
- accounting
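
As a rough idea of how job limits might map onto cgroups, the Python sketch
below only computes which control files would be written and with what
values; it does not touch the filesystem. The control file names
(memory.limit_in_bytes, memory.memsw.limit_in_bytes, cpuset.cpus) are from
the cgroup v1 interface. Everything else is an assumption for illustration:
the function cgroup_settings, the "torque/<job id>" group layout, one
directory per controller, and mapping vmem to the memory+swap limit.

```python
def cgroup_settings(job_id, mem_bytes=None, vmem_bytes=None, cpus=None):
    """Return {control file path relative to the cgroup mount root: value}.

    Only limits that were actually requested produce an entry.
    """
    group = f"torque/{job_id}"  # hypothetical group layout
    settings = {}
    if mem_bytes is not None:
        # Limit on resident memory for the whole job on this node.
        settings[f"memory/{group}/memory.limit_in_bytes"] = str(mem_bytes)
    if vmem_bytes is not None:
        # Assumption: vmem is approximated by the memory+swap limit.
        settings[f"memory/{group}/memory.memsw.limit_in_bytes"] = str(vmem_bytes)
    if cpus:
        # cpuset.cpus takes a comma-separated list of CPU numbers.
        cpu_list = ",".join(str(c) for c in sorted(cpus))
        settings[f"cpuset/{group}/cpuset.cpus"] = cpu_list
    return settings
```

A mom daemon would then create the group directories and write these values,
which is left out here on purpose.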
