diff options
author | Yunhong Jiang <yunhong.jiang@intel.com> | 2015-08-04 12:17:53 -0700 |
---|---|---|
committer | Yunhong Jiang <yunhong.jiang@intel.com> | 2015-08-04 15:44:42 -0700 |
commit | 9ca8dbcc65cfc63d6f5ef3312a33184e1d726e00 (patch) | |
tree | 1c9cafbcd35f783a87880a10f85d1a060db1a563 /kernel/Documentation/scheduler/sched-domains.txt | |
parent | 98260f3884f4a202f9ca5eabed40b1354c489b29 (diff) |
Add the rt linux 4.1.3-rt3 as base
Import the rt linux 4.1.3-rt3 as OPNFV kvm base.
It's from git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git linux-4.1.y-rt and
the base is:
commit 0917f823c59692d751951bf5ea699a2d1e2f26a2
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Sat Jul 25 12:13:34 2015 +0200
Prepare v4.1.3-rt3
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
We lose all the git history this way and it's not good. We
should apply another opnfv project repo in future.
Change-Id: I87543d81c9df70d99c5001fbdf646b202c19f423
Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Diffstat (limited to 'kernel/Documentation/scheduler/sched-domains.txt')
-rw-r--r-- | kernel/Documentation/scheduler/sched-domains.txt | 77 |
1 files changed, 77 insertions, 0 deletions
diff --git a/kernel/Documentation/scheduler/sched-domains.txt b/kernel/Documentation/scheduler/sched-domains.txt new file mode 100644 index 000000000..4af80b1c0 --- /dev/null +++ b/kernel/Documentation/scheduler/sched-domains.txt @@ -0,0 +1,77 @@ +Each CPU has a "base" scheduling domain (struct sched_domain). The domain +hierarchy is built from these base domains via the ->parent pointer. ->parent +MUST be NULL terminated, and domain structures should be per-CPU as they are +locklessly updated. + +Each scheduling domain spans a number of CPUs (stored in the ->span field). +A domain's span MUST be a superset of it child's span (this restriction could +be relaxed if the need arises), and a base domain for CPU i MUST span at least +i. The top domain for each CPU will generally span all CPUs in the system +although strictly it doesn't have to, but this could lead to a case where some +CPUs will never be given tasks to run unless the CPUs allowed mask is +explicitly set. A sched domain's span means "balance process load among these +CPUs". + +Each scheduling domain must have one or more CPU groups (struct sched_group) +which are organised as a circular one way linked list from the ->groups +pointer. The union of cpumasks of these groups MUST be the same as the +domain's span. The intersection of cpumasks from any two of these groups +MUST be the empty set. The group pointed to by the ->groups pointer MUST +contain the CPU to which the domain belongs. Groups may be shared among +CPUs as they contain read only data after they have been set up. + +Balancing within a sched domain occurs between groups. That is, each group +is treated as one entity. The load of a group is defined as the sum of the +load of each of its member CPUs, and only when the load of a group becomes +out of balance are tasks moved between groups. + +In kernel/sched/core.c, trigger_load_balance() is run periodically on each CPU +through scheduler_tick(). It raises a softirq after the next regularly scheduled +rebalancing event for the current runqueue has arrived. The actual load +balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run +in softirq context (SCHED_SOFTIRQ). + +The latter function takes two arguments: the current CPU and whether it was idle +at the time the scheduler_tick() happened and iterates over all sched domains +our CPU is on, starting from its base domain and going up the ->parent chain. +While doing that, it checks to see if the current domain has exhausted its +rebalance interval. If so, it runs load_balance() on that domain. It then checks +the parent sched_domain (if it exists), and the parent of the parent and so +forth. + +Initially, load_balance() finds the busiest group in the current sched domain. +If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in +that group. If it manages to find such a runqueue, it locks both our initial +CPU's runqueue and the newly found busiest one and starts moving tasks from it +to our runqueue. The exact number of tasks amounts to an imbalance previously +computed while iterating over this sched domain's groups. + +*** Implementing sched domains *** +The "base" domain will "span" the first level of the hierarchy. In the case +of SMT, you'll span all siblings of the physical CPU, with each group being +a single virtual CPU. + +In SMP, the parent of the base domain will span all physical CPUs in the +node. Each group being a single physical CPU. Then with NUMA, the parent +of the SMP domain will span the entire machine, with each group having the +cpumask of a node. Or, you could do multi-level NUMA or Opteron, for example, +might have just one domain covering its one NUMA level. + +The implementor should read comments in include/linux/sched.h: +struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of +the specifics and what to tune. + +Architectures may retain the regular override the default SD_*_INIT flags +while using the generic domain builder in kernel/sched/core.c if they wish to +retain the traditional SMT->SMP->NUMA topology (or some subset of that). This +can be done by #define'ing ARCH_HASH_SCHED_TUNE. + +Alternatively, the architecture may completely override the generic domain +builder by #define'ing ARCH_HASH_SCHED_DOMAIN, and exporting your +arch_init_sched_domains function. This function will attach domains to all +CPUs using cpu_attach_domain. + +The sched-domains debugging infrastructure can be enabled by enabling +CONFIG_SCHED_DEBUG. This enables an error checking parse of the sched domains +which should catch most possible errors (described above). It also prints out +the domain structure in a visual format. |