diff options
Diffstat (limited to 'kernel/Documentation/cgroups/unified-hierarchy.txt')
-rw-r--r-- | kernel/Documentation/cgroups/unified-hierarchy.txt | 248 |
1 files changed, 217 insertions, 31 deletions
diff --git a/kernel/Documentation/cgroups/unified-hierarchy.txt b/kernel/Documentation/cgroups/unified-hierarchy.txt index eb102fb72..781b1d475 100644 --- a/kernel/Documentation/cgroups/unified-hierarchy.txt +++ b/kernel/Documentation/cgroups/unified-hierarchy.txt @@ -17,15 +17,21 @@ CONTENTS 3. Structural Constraints 3-1. Top-down 3-2. No internal tasks -4. Other Changes - 4-1. [Un]populated Notification - 4-2. Other Core Changes - 4-3. Per-Controller Changes - 4-3-1. blkio - 4-3-2. cpuset - 4-3-3. memory -5. Planned Changes - 5-1. CAP for resource control +4. Delegation + 4-1. Model of delegation + 4-2. Common ancestor rule +5. Other Changes + 5-1. [Un]populated Notification + 5-2. Other Core Changes + 5-3. Controller File Conventions + 5-3-1. Format + 5-3-2. Control Knobs + 5-4. Per-Controller Changes + 5-4-1. io + 5-4-2. cpuset + 5-4-3. memory +6. Planned Changes + 6-1. CAP for resource control 1. Background @@ -101,12 +107,6 @@ root of unified hierarchy can be bound to other hierarchies. This allows mixing unified hierarchy with the traditional multiple hierarchies in a fully backward compatible way. -For development purposes, the following boot parameter makes all -controllers to appear on the unified hierarchy whether supported or -not. - - cgroup__DEVEL__legacy_files_on_dfl - A controller can be moved across hierarchies only after the controller is no longer referenced in its current hierarchy. Because per-cgroup controller states are destroyed asynchronously and controllers may @@ -197,7 +197,7 @@ other issues. The mapping from nice level to weight isn't obvious or universal, and there are various other knobs which simply aren't available for tasks. -The blkio controller implicitly creates a hidden leaf node for each +The io controller implicitly creates a hidden leaf node for each cgroup to host the tasks. The hidden leaf has its own copies of all the knobs with "leaf_" prefixed. While this allows equivalent control over internal tasks, it's with serious drawbacks. It always adds an @@ -245,9 +245,72 @@ cgroup must create children and transfer all its tasks to the children before enabling controllers in its "cgroup.subtree_control" file. -4. Other Changes +4. Delegation + +4-1. Model of delegation + +A cgroup can be delegated to a less privileged user by granting write +access of the directory and its "cgroup.procs" file to the user. Note +that the resource control knobs in a given directory concern the +resources of the parent and thus must not be delegated along with the +directory. + +Once delegated, the user can build sub-hierarchy under the directory, +organize processes as it sees fit and further distribute the resources +it got from the parent. The limits and other settings of all resource +controllers are hierarchical and regardless of what happens in the +delegated sub-hierarchy, nothing can escape the resource restrictions +imposed by the parent. + +Currently, cgroup doesn't impose any restrictions on the number of +cgroups in or nesting depth of a delegated sub-hierarchy; however, +this may in the future be limited explicitly. + + +4-2. Common ancestor rule + +On the unified hierarchy, to write to a "cgroup.procs" file, in +addition to the usual write permission to the file and uid match, the +writer must also have write access to the "cgroup.procs" file of the +common ancestor of the source and destination cgroups. This prevents +delegatees from smuggling processes across disjoint sub-hierarchies. + +Let's say cgroups C0 and C1 have been delegated to user U0 who created +C00, C01 under C0 and C10 under C1 as follows. + + ~~~~~~~~~~~~~ - C0 - C00 + ~ cgroup ~ \ C01 + ~ hierarchy ~ + ~~~~~~~~~~~~~ - C1 - C10 + +C0 and C1 are separate entities in terms of resource distribution +regardless of their relative positions in the hierarchy. The +resources the processes under C0 are entitled to are controlled by +C0's ancestors and may be completely different from C1. It's clear +that the intention of delegating C0 to U0 is allowing U0 to organize +the processes under C0 and further control the distribution of C0's +resources. + +On traditional hierarchies, if a task has write access to "tasks" or +"cgroup.procs" file of a cgroup and its uid agrees with the target, it +can move the target to the cgroup. In the above example, U0 will not +only be able to move processes in each sub-hierarchy but also across +the two sub-hierarchies, effectively allowing it to violate the +organizational and resource restrictions implied by the hierarchical +structure above C0 and C1. + +On the unified hierarchy, let's say U0 wants to write the pid of a +process which has a matching uid and is currently in C10 into +"C00/cgroup.procs". U0 obviously has write access to the file and +migration permission on the process; however, the common ancestor of +the source cgroup C10 and the destination cgroup C00 is above the +points of delegation and U0 would not have write access to its +"cgroup.procs" and thus be denied with -EACCES. -4-1. [Un]populated Notification + +5. Other Changes + +5-1. [Un]populated Notification cgroup users often need a way to determine when a cgroup's subhierarchy becomes empty so that it can be cleaned up. cgroup @@ -272,11 +335,11 @@ is riddled with issues. unnecessarily complicated and probably done this way because event delivery itself was expensive. -Unified hierarchy implements an interface file "cgroup.populated" -which can be used to monitor whether the cgroup's subhierarchy has -tasks in it or not. Its value is 0 if there is no task in the cgroup -and its descendants; otherwise, 1. poll and [id]notify events are -triggered when the value changes. +Unified hierarchy implements "populated" field in "cgroup.events" +interface file which can be used to monitor whether the cgroup's +subhierarchy has tasks in it or not. Its value is 0 if there is no +task in the cgroup and its descendants; otherwise, 1. poll and +[id]notify events are triggered when the value changes. This is significantly lighter and simpler and trivially allows delegating management of subhierarchy - subhierarchy monitoring can @@ -289,7 +352,7 @@ supported and the interface files "release_agent" and "notify_on_release" do not exist. -4-2. Other Core Changes +5-2. Other Core Changes - None of the mount options is allowed. @@ -305,15 +368,138 @@ supported and the interface files "release_agent" and - The "cgroup.clone_children" file is removed. +- /proc/PID/cgroup keeps reporting the cgroup that a zombie belonged + to before exiting. If the cgroup is removed before the zombie is + reaped, " (deleted)" is appeneded to the path. + + +5-3. Controller File Conventions + +5-3-1. Format + +In general, all controller files should be in one of the following +formats whenever possible. + +- Values only files + + VAL0 VAL1...\n + +- Flat keyed files + + KEY0 VAL0\n + KEY1 VAL1\n + ... + +- Nested keyed files + + KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01... + KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11... + ... + +For a writeable file, the format for writing should generally match +reading; however, controllers may allow omitting later fields or +implement restricted shortcuts for most common use cases. + +For both flat and nested keyed files, only the values for a single key +can be written at a time. For nested keyed files, the sub key pairs +may be specified in any order and not all pairs have to be specified. + + +5-3-2. Control Knobs + +- Settings for a single feature should generally be implemented in a + single file. + +- In general, the root cgroup should be exempt from resource control + and thus shouldn't have resource control knobs. + +- If a controller implements ratio based resource distribution, the + control knob should be named "weight" and have the range [1, 10000] + and 100 should be the default value. The values are chosen to allow + enough and symmetric bias in both directions while keeping it + intuitive (the default is 100%). + +- If a controller implements an absolute resource guarantee and/or + limit, the control knobs should be named "min" and "max" + respectively. If a controller implements best effort resource + gurantee and/or limit, the control knobs should be named "low" and + "high" respectively. + + In the above four control files, the special token "max" should be + used to represent upward infinity for both reading and writing. + +- If a setting has configurable default value and specific overrides, + the default settings should be keyed with "default" and appear as + the first entry in the file. Specific entries can use "default" as + its value to indicate inheritance of the default value. + +- For events which are not very high frequency, an interface file + "events" should be created which lists event key value pairs. + Whenever a notifiable event happens, file modified event should be + generated on the file. + + +5-4. Per-Controller Changes + +5-4-1. io + +- blkio is renamed to io. The interface is overhauled anyway. The + new name is more in line with the other two major controllers, cpu + and memory, and better suited given that it may be used for cgroup + writeback without involving block layer. + +- Everything including stat is always hierarchical making separate + recursive stat files pointless and, as no internal node can have + tasks, leaf weights are meaningless. The operation model is + simplified and the interface is overhauled accordingly. + + io.stat + + The stat file. The reported stats are from the point where + bio's are issued to request_queue. The stats are counted + independent of which policies are enabled. Each line in the + file follows the following format. More fields may later be + added at the end. + + $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wrios=$WIOS + + io.weight + + The weight setting, currently only available and effective if + cfq-iosched is in use for the target device. The weight is + between 1 and 10000 and defaults to 100. The first line + always contains the default weight in the following format to + use when per-device setting is missing. + + default $WEIGHT + + Subsequent lines list per-device weights of the following + format. + + $MAJ:$MIN $WEIGHT + + Writing "$WEIGHT" or "default $WEIGHT" changes the default + setting. Writing "$MAJ:$MIN $WEIGHT" sets per-device weight + while "$MAJ:$MIN default" clears it. + + This file is available only on non-root cgroups. + + io.max + + The maximum bandwidth and/or iops setting, only available if + blk-throttle is enabled. The file is of the following format. -4-3. Per-Controller Changes + $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS -4-3-1. blkio + ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are + read/write IOs per second. "max" indicates no limit. Writing + to the file follows the same format but the individual + settings may be omitted or specified in any order. -- blk-throttle becomes properly hierarchical. + This file is available only on non-root cgroups. -4-3-2. cpuset +5-4-2. cpuset - Tasks are kept in empty cpusets after hotplug and take on the masks of the nearest non-empty ancestor, instead of being moved to it. @@ -322,7 +508,7 @@ supported and the interface files "release_agent" and masks of the nearest non-empty ancestor. -4-3-3. memory +5-4-3. memory - use_hierarchy is on by default and the cgroup file for the flag is not created. @@ -407,9 +593,9 @@ supported and the interface files "release_agent" and memory.low, memory.high, and memory.max will use the string "max" to indicate and set the highest possible value. -5. Planned Changes +6. Planned Changes -5-1. CAP for resource control +6-1. CAP for resource control Unified hierarchy will require one of the capabilities(7), which is yet to be decided, for all resource control related knobs. Process |