Diffstat (limited to 'kernel/Documentation/cgroups/unified-hierarchy.txt')
-rw-r--r-- kernel/Documentation/cgroups/unified-hierarchy.txt | 248
1 files changed, 217 insertions, 31 deletions
diff --git a/kernel/Documentation/cgroups/unified-hierarchy.txt b/kernel/Documentation/cgroups/unified-hierarchy.txt
index eb102fb72..781b1d475 100644
--- a/kernel/Documentation/cgroups/unified-hierarchy.txt
+++ b/kernel/Documentation/cgroups/unified-hierarchy.txt
@@ -17,15 +17,21 @@ CONTENTS
3. Structural Constraints
3-1. Top-down
3-2. No internal tasks
-4. Other Changes
- 4-1. [Un]populated Notification
- 4-2. Other Core Changes
- 4-3. Per-Controller Changes
- 4-3-1. blkio
- 4-3-2. cpuset
- 4-3-3. memory
-5. Planned Changes
- 5-1. CAP for resource control
+4. Delegation
+ 4-1. Model of delegation
+ 4-2. Common ancestor rule
+5. Other Changes
+ 5-1. [Un]populated Notification
+ 5-2. Other Core Changes
+ 5-3. Controller File Conventions
+ 5-3-1. Format
+ 5-3-2. Control Knobs
+ 5-4. Per-Controller Changes
+ 5-4-1. io
+ 5-4-2. cpuset
+ 5-4-3. memory
+6. Planned Changes
+ 6-1. CAP for resource control
1. Background
@@ -101,12 +107,6 @@ root of unified hierarchy can be bound to other hierarchies. This
allows mixing unified hierarchy with the traditional multiple
hierarchies in a fully backward compatible way.
-For development purposes, the following boot parameter makes all
-controllers to appear on the unified hierarchy whether supported or
-not.
-
- cgroup__DEVEL__legacy_files_on_dfl
-
A controller can be moved across hierarchies only after the controller
is no longer referenced in its current hierarchy. Because per-cgroup
controller states are destroyed asynchronously and controllers may
@@ -197,7 +197,7 @@ other issues. The mapping from nice level to weight isn't obvious or
universal, and there are various other knobs which simply aren't
available for tasks.
-The blkio controller implicitly creates a hidden leaf node for each
+The io controller implicitly creates a hidden leaf node for each
cgroup to host the tasks. The hidden leaf has its own copies of all
the knobs with "leaf_" prefixed. While this allows equivalent control
over internal tasks, it's with serious drawbacks. It always adds an
@@ -245,9 +245,72 @@ cgroup must create children and transfer all its tasks to the children
before enabling controllers in its "cgroup.subtree_control" file.
-4. Other Changes
+4. Delegation
+
+4-1. Model of delegation
+
+A cgroup can be delegated to a less privileged user by granting write
+access to the directory and its "cgroup.procs" file to the user. Note
+that the resource control knobs in a given directory concern the
+resources of the parent and thus must not be delegated along with the
+directory.
+
+Once delegated, the user can build a sub-hierarchy under the directory,
+organize processes as it sees fit and further distribute the resources
+it got from the parent. The limits and other settings of all resource
+controllers are hierarchical and, regardless of what happens in the
+delegated sub-hierarchy, nothing can escape the resource restrictions
+imposed by the parent.
+
+Currently, cgroup doesn't impose any restrictions on the number of
+cgroups in or nesting depth of a delegated sub-hierarchy; however,
+this may in the future be limited explicitly.
+
+
+4-2. Common ancestor rule
+
+On the unified hierarchy, to write to a "cgroup.procs" file, in
+addition to the usual write permission to the file and uid match, the
+writer must also have write access to the "cgroup.procs" file of the
+common ancestor of the source and destination cgroups. This prevents
+delegatees from smuggling processes across disjoint sub-hierarchies.
+
+Let's say cgroups C0 and C1 have been delegated to user U0 who created
+C00, C01 under C0 and C10 under C1 as follows.
+
+ ~~~~~~~~~~~~~ - C0 - C00
+ ~ cgroup ~ \ C01
+ ~ hierarchy ~
+ ~~~~~~~~~~~~~ - C1 - C10
+
+C0 and C1 are separate entities in terms of resource distribution
+regardless of their relative positions in the hierarchy. The
+resources the processes under C0 are entitled to are controlled by
+C0's ancestors and may be completely different from C1. It's clear
+that the intention of delegating C0 to U0 is allowing U0 to organize
+the processes under C0 and further control the distribution of C0's
+resources.
+
+On traditional hierarchies, if a task has write access to "tasks" or
+"cgroup.procs" file of a cgroup and its uid agrees with the target, it
+can move the target to the cgroup. In the above example, U0 will not
+only be able to move processes in each sub-hierarchy but also across
+the two sub-hierarchies, effectively allowing it to violate the
+organizational and resource restrictions implied by the hierarchical
+structure above C0 and C1.
+
+On the unified hierarchy, let's say U0 wants to write the pid of a
+process which has a matching uid and is currently in C10 into
+"C00/cgroup.procs". U0 obviously has write access to the file and
+migration permission on the process; however, the common ancestor of
+the source cgroup C10 and the destination cgroup C00 is above the
+points of delegation, so U0 does not have write access to its
+"cgroup.procs" and the write is denied with -EACCES.
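The common ancestor rule can be sketched in a few lines of Python. This is an illustration only, not the kernel's implementation: `common_ancestor`, `may_migrate` and the `writable` set (the cgroups whose "cgroup.procs" the writer may write) are hypothetical names, and the uid match on the target process is assumed to have passed already.

```python
import posixpath


def common_ancestor(src, dst):
    """Return the nearest cgroup that is an ancestor of (or equal to) both paths."""
    return posixpath.commonpath([src, dst])


def may_migrate(src, dst, writable):
    """Model the unified hierarchy check for writing a pid into dst/cgroup.procs.

    writable is the set of cgroups whose "cgroup.procs" the writer can write.
    The uid match on the process is assumed and not modelled here.
    """
    return dst in writable and common_ancestor(src, dst) in writable


# U0 was delegated C0 and C1 and created C00, C01 and C10 below them.
u0 = {"/C0", "/C0/C00", "/C0/C01", "/C1", "/C1/C10"}

print(may_migrate("/C0/C01", "/C0/C00", u0))  # common ancestor is /C0
print(may_migrate("/C1/C10", "/C0/C00", u0))  # common ancestor is /
```

The first call stays inside the C0 delegation and is allowed; the second crosses from C1 to C0, where the common ancestor lies above the points of delegation, so it is refused.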
-4-1. [Un]populated Notification
+
+5. Other Changes
+
+5-1. [Un]populated Notification
cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
@@ -272,11 +335,11 @@ is riddled with issues.
unnecessarily complicated and probably done this way because event
delivery itself was expensive.
-Unified hierarchy implements an interface file "cgroup.populated"
-which can be used to monitor whether the cgroup's subhierarchy has
-tasks in it or not. Its value is 0 if there is no task in the cgroup
-and its descendants; otherwise, 1. poll and [id]notify events are
-triggered when the value changes.
+Unified hierarchy implements a "populated" field in the "cgroup.events"
+interface file, which can be used to monitor whether the cgroup's
+subhierarchy has tasks in it or not. Its value is 0 if there is no
+task in the cgroup and its descendants; otherwise, 1. poll and
+[id]notify events are triggered when the value changes.
This is significantly lighter and simpler and trivially allows
delegating management of subhierarchy - subhierarchy monitoring can
@@ -289,7 +352,7 @@ supported and the interface files "release_agent" and
"notify_on_release" do not exist.
-4-2. Other Core Changes
+5-2. Other Core Changes
- None of the mount options is allowed.
@@ -305,15 +368,138 @@ supported and the interface files "release_agent" and
- The "cgroup.clone_children" file is removed.
+- /proc/PID/cgroup keeps reporting the cgroup that a zombie belonged
+ to before exiting. If the cgroup is removed before the zombie is
+ reaped, " (deleted)" is appended to the path.
+
+
+5-3. Controller File Conventions
+
+5-3-1. Format
+
+In general, all controller files should be in one of the following
+formats whenever possible.
+
+- Values only files
+
+ VAL0 VAL1...\n
+
+- Flat keyed files
+
+ KEY0 VAL0\n
+ KEY1 VAL1\n
+ ...
+
+- Nested keyed files
+
+ KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
+ KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
+ ...
+
+For a writeable file, the format for writing should generally match
+reading; however, controllers may allow omitting later fields or
+implement restricted shortcuts for most common use cases.
+
+For both flat and nested keyed files, only the values for a single key
+can be written at a time. For nested keyed files, the sub key pairs
+may be specified in any order and not all pairs have to be specified.
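As a rough userspace illustration of the flat keyed and nested keyed formats, the following Python sketch parses each into dictionaries. The function names and the sample file contents are made up for the example; values are kept as strings because their interpretation is controller specific.

```python
def parse_flat_keyed(text):
    """Parse "KEY VAL" lines into a dict mapping key to value string."""
    result = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split(None, 1)
            result[key] = value.strip()
    return result


def parse_nested_keyed(text):
    """Parse "KEY SUB_KEY=VAL ..." lines into a dict of dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Sub key pairs may appear in any order, so a dict fits naturally.
        result[fields[0]] = dict(f.split("=", 1) for f in fields[1:])
    return result


# Hypothetical file contents, for illustration only.
events = parse_flat_keyed("populated 1\n")
stats = parse_nested_keyed("8:0 rbytes=4096 wbytes=0\n8:16 wbytes=512 rbytes=0\n")

print(events["populated"])
print(stats["8:16"]["wbytes"])
```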
+
+
+5-3-2. Control Knobs
+
+- Settings for a single feature should generally be implemented in a
+ single file.
+
+- In general, the root cgroup should be exempt from resource control
+ and thus shouldn't have resource control knobs.
+
+- If a controller implements ratio based resource distribution, the
+ control knob should be named "weight" and have the range [1, 10000]
+ and 100 should be the default value. The values are chosen to allow
+ enough and symmetric bias in both directions while keeping it
+ intuitive (the default is 100%).
+
+- If a controller implements an absolute resource guarantee and/or
+ limit, the control knobs should be named "min" and "max"
+ respectively. If a controller implements best effort resource
+ guarantee and/or limit, the control knobs should be named "low" and
+ "high" respectively.
+
+ In the above four control files, the special token "max" should be
+ used to represent upward infinity for both reading and writing.
+
+- If a setting has a configurable default value and specific overrides,
+ the default settings should be keyed with "default" and appear as
+ the first entry in the file. Specific entries can use "default" as
+ their value to indicate inheritance of the default value.
+
+- For events which are not very high frequency, an interface file
+ "events" should be created which lists event key value pairs.
+ Whenever a notifiable event happens, a file modified event should be
+ generated on the file.
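The "weight" range and the "max" token conventions above could be handled in userspace along these lines. This is a sketch with hypothetical helper names, not an existing API; it only mirrors the range [1, 10000], the default of 100, and the use of "max" for upward infinity.

```python
import math

# Range and default taken from the "weight" convention above. The default
# sits a factor of 100 away from both ends, which gives the symmetric bias.
WEIGHT_MIN, WEIGHT_MAX, WEIGHT_DEFAULT = 1, 10000, 100


def parse_limit(token):
    """Read a "min"/"max"/"low"/"high" style value; "max" means no upper bound."""
    token = token.strip()
    return math.inf if token == "max" else int(token)


def format_limit(value):
    """Produce the string to write back, using "max" for infinity."""
    return "max" if value == math.inf else str(int(value))


def check_weight(weight):
    """Reject weights outside the conventional range."""
    if not WEIGHT_MIN <= weight <= WEIGHT_MAX:
        raise ValueError("weight must be within [1, 10000]")
    return weight


print(parse_limit("max\n"))
print(format_limit(parse_limit("1048576")))
```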
+
+
+5-4. Per-Controller Changes
+
+5-4-1. io
+
+- blkio is renamed to io. The interface is overhauled anyway. The
+ new name is more in line with the other two major controllers, cpu
+ and memory, and better suited given that it may be used for cgroup
+ writeback without involving block layer.
+
+- Everything including stat is always hierarchical making separate
+ recursive stat files pointless and, as no internal node can have
+ tasks, leaf weights are meaningless. The operation model is
+ simplified and the interface is overhauled accordingly.
+
+ io.stat
+
+ The stat file. The reported stats are from the point where
+ bio's are issued to request_queue. The stats are counted
+ independent of which policies are enabled. Each line in the
+ file follows the following format. More fields may later be
+ added at the end.
+
+ $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS
+
+ io.weight
+
+ The weight setting, currently only available and effective if
+ cfq-iosched is in use for the target device. The weight is
+ between 1 and 10000 and defaults to 100. The first line
+ always contains the default weight in the following format to
+ use when per-device setting is missing.
+
+ default $WEIGHT
+
+ Subsequent lines list per-device weights of the following
+ format.
+
+ $MAJ:$MIN $WEIGHT
+
+ Writing "$WEIGHT" or "default $WEIGHT" changes the default
+ setting. Writing "$MAJ:$MIN $WEIGHT" sets per-device weight
+ while "$MAJ:$MIN default" clears it.
+
+ This file is available only on non-root cgroups.
+
+ io.max
+
+ The maximum bandwidth and/or iops setting, only available if
+ blk-throttle is enabled. The file is of the following format.
-4-3. Per-Controller Changes
+ $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS
-4-3-1. blkio
+ ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are
+ read/write IOs per second. "max" indicates no limit. Writing
+ to the file follows the same format but the individual
+ settings may be omitted or specified in any order.
-- blk-throttle becomes properly hierarchical.
+ This file is available only on non-root cgroups.
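The write formats described for "io.weight" and "io.max" can be illustrated with two small string builders. This is a hedged sketch: `format_io_weight` and `format_io_max` are hypothetical helpers that only assemble the line to be written, and using `None` to stand for "max" (no limit) is a choice made for this example.

```python
def format_io_weight(weight, device=None):
    """Build a line for io.weight.

    device is a "MAJ:MIN" string, or None to change the default weight.
    weight is an integer in [1, 10000], or the string "default" to clear
    a per-device setting.
    """
    if weight == "default":
        if device is None:
            raise ValueError('"default" is only valid as a per-device value')
        return f"{device} default"
    if not 1 <= weight <= 10000:
        raise ValueError("weight must be within [1, 10000]")
    return f"default {weight}" if device is None else f"{device} {weight}"


def format_io_max(device, **limits):
    """Build a line for io.max; omitted settings are simply left out."""
    allowed = ("rbps", "wbps", "riops", "wiops")
    parts = [device]
    for key, value in limits.items():
        if key not in allowed:
            raise ValueError(f"unknown io.max key: {key}")
        # None stands for "max", i.e. no limit (an assumption of this sketch).
        parts.append(f"{key}={'max' if value is None else value}")
    return " ".join(parts)


print(format_io_weight(200))
print(format_io_weight("default", "8:0"))
print(format_io_max("8:0", wbps=1048576, riops=None))
```

The resulting string would then be written to the corresponding file of a non-root cgroup; which devices and limits make sense depends on the system.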
-4-3-2. cpuset
+5-4-2. cpuset
- Tasks are kept in empty cpusets after hotplug and take on the masks
of the nearest non-empty ancestor, instead of being moved to it.
@@ -322,7 +508,7 @@ supported and the interface files "release_agent" and
masks of the nearest non-empty ancestor.
-4-3-3. memory
+5-4-3. memory
- use_hierarchy is on by default and the cgroup file for the flag is
not created.
@@ -407,9 +593,9 @@ supported and the interface files "release_agent" and
memory.low, memory.high, and memory.max will use the string "max" to
indicate and set the highest possible value.
-5. Planned Changes
+6. Planned Changes
-5-1. CAP for resource control
+6-1. CAP for resource control
Unified hierarchy will require one of the capabilities(7), which is
yet to be decided, for all resource control related knobs. Process