From e09b41010ba33a20a87472ee821fa407a5b8da36 Mon Sep 17 00:00:00 2001 From: José Pekkarinen Date: Mon, 11 Apr 2016 10:41:07 +0300 Subject: These changes are the raw update to linux-4.4.6-rt14. Kernel sources are taken from kernel.org, and rt patch from the rt wiki download page. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit During the rebasing, the following patch collided: Force tick interrupt and get rid of softirq magic(I70131fb85). Collisions have been removed because its logic was found on the source already. Change-Id: I7f57a4081d9deaa0d9ccfc41a6c8daccdee3b769 Signed-off-by: José Pekkarinen --- kernel/Documentation/x86/boot.txt | 3 +- kernel/Documentation/x86/entry_64.txt | 12 +- kernel/Documentation/x86/kernel-stacks | 141 +++++++++++++++++++++++ kernel/Documentation/x86/mtrr.txt | 30 ++++- kernel/Documentation/x86/pat.txt | 48 +++++++- kernel/Documentation/x86/x86_64/boot-options.txt | 3 + kernel/Documentation/x86/x86_64/kernel-stacks | 101 ---------------- kernel/Documentation/x86/zero-page.txt | 3 +- 8 files changed, 223 insertions(+), 118 deletions(-) create mode 100644 kernel/Documentation/x86/kernel-stacks delete mode 100644 kernel/Documentation/x86/x86_64/kernel-stacks (limited to 'kernel/Documentation/x86') diff --git a/kernel/Documentation/x86/boot.txt b/kernel/Documentation/x86/boot.txt index 88b85899d..9da6f3512 100644 --- a/kernel/Documentation/x86/boot.txt +++ b/kernel/Documentation/x86/boot.txt @@ -406,7 +406,7 @@ Protocol: 2.00+ - If 0, the protected-mode code is loaded at 0x10000. - If 1, the protected-mode code is loaded at 0x100000. - Bit 1 (kernel internal): ALSR_FLAG + Bit 1 (kernel internal): KASLR_FLAG - Used internally by the compressed kernel to communicate KASLR status to kernel proper. If 1, KASLR enabled. @@ -1124,7 +1124,6 @@ The boot loader *must* fill out the following fields in bp, o hdr.code32_start o hdr.cmd_line_ptr - o hdr.cmdline_size o hdr.ramdisk_image (if applicable) o hdr.ramdisk_size (if applicable) diff --git a/kernel/Documentation/x86/entry_64.txt b/kernel/Documentation/x86/entry_64.txt index 9132b8617..c1df8eba9 100644 --- a/kernel/Documentation/x86/entry_64.txt +++ b/kernel/Documentation/x86/entry_64.txt @@ -1,14 +1,14 @@ This file documents some of the kernel entries in -arch/x86/kernel/entry_64.S. A lot of this explanation is adapted from +arch/x86/entry/entry_64.S. A lot of this explanation is adapted from an email from Ingo Molnar: http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu> The x86 architecture has quite a few different ways to jump into kernel code. Most of these entry points are registered in -arch/x86/kernel/traps.c and implemented in arch/x86/kernel/entry_64.S -for 64-bit, arch/x86/kernel/entry_32.S for 32-bit and finally -arch/x86/ia32/ia32entry.S which implements the 32-bit compatibility +arch/x86/kernel/traps.c and implemented in arch/x86/entry/entry_64.S +for 64-bit, arch/x86/entry/entry_32.S for 32-bit and finally +arch/x86/entry/entry_64_compat.S which implements the 32-bit compatibility syscall entry points and thus provides for 32-bit processes the ability to execute syscalls when running on 64-bit kernels. @@ -18,10 +18,10 @@ Some of these entries are: - system_call: syscall instruction from 64-bit code. - - ia32_syscall: int 0x80 from 32-bit or 64-bit code; compat syscall + - entry_INT80_compat: int 0x80 from 32-bit or 64-bit code; compat syscall either way. - - ia32_syscall, ia32_sysenter: syscall and sysenter from 32-bit + - entry_INT80_compat, ia32_sysenter: syscall and sysenter from 32-bit code - interrupt: An array of entries. Every IDT vector that doesn't diff --git a/kernel/Documentation/x86/kernel-stacks b/kernel/Documentation/x86/kernel-stacks new file mode 100644 index 000000000..9a0aa4d3a --- /dev/null +++ b/kernel/Documentation/x86/kernel-stacks @@ -0,0 +1,141 @@ +Kernel stacks on x86-64 bit +--------------------------- + +Most of the text from Keith Owens, hacked by AK + +x86_64 page size (PAGE_SIZE) is 4K. + +Like all other architectures, x86_64 has a kernel stack for every +active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. +These stacks contain useful data as long as a thread is alive or a +zombie. While the thread is in user space the kernel stack is empty +except for the thread_info structure at the bottom. + +In addition to the per thread stacks, there are specialized stacks +associated with each CPU. These stacks are only used while the kernel +is in control on that CPU; when a CPU returns to user space the +specialized stacks contain no useful data. The main CPU stacks are: + +* Interrupt stack. IRQ_STACK_SIZE + + Used for external hardware interrupts. If this is the first external + hardware interrupt (i.e. not a nested hardware interrupt) then the + kernel switches from the current task to the interrupt stack. Like + the split thread and interrupt stacks on i386, this gives more room + for kernel interrupt processing without having to increase the size + of every per thread stack. + + The interrupt stack is also used when processing a softirq. + +Switching to the kernel interrupt stack is done by software based on a +per CPU interrupt nest counter. This is needed because x86-64 "IST" +hardware stacks cannot nest without races. + +x86_64 also has a feature which is not available on i386, the ability +to automatically switch to a new stack for designated events such as +double fault or NMI, which makes it easier to handle these unusual +events on x86_64. This feature is called the Interrupt Stack Table +(IST). There can be up to 7 IST entries per CPU. The IST code is an +index into the Task State Segment (TSS). The IST entries in the TSS +point to dedicated stacks; each stack can be a different size. + +An IST is selected by a non-zero value in the IST field of an +interrupt-gate descriptor. When an interrupt occurs and the hardware +loads such a descriptor, the hardware automatically sets the new stack +pointer based on the IST value, then invokes the interrupt handler. If +the interrupt came from user mode, then the interrupt handler prologue +will switch back to the per-thread stack. If software wants to allow +nested IST interrupts then the handler must adjust the IST values on +entry to and exit from the interrupt handler. (This is occasionally +done, e.g. for debug exceptions.) + +Events with different IST codes (i.e. with different stacks) can be +nested. For example, a debug interrupt can safely be interrupted by an +NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack +pointers on entry to and exit from all IST events, in theory allowing +IST events with the same code to be nested. However in most cases, the +stack size allocated to an IST assumes no nesting for the same code. +If that assumption is ever broken then the stacks will become corrupt. + +The currently assigned IST stacks are :- + +* DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 8 - Double Fault Exception (#DF). + + Invoked when handling one exception causes another exception. Happens + when the kernel is very confused (e.g. kernel stack pointer corrupt). + Using a separate stack allows the kernel to recover from it well enough + in many cases to still output an oops. + +* NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for non-maskable interrupts (NMI). + + NMI can be delivered at any time, including when the kernel is in the + middle of switching stacks. Using IST for NMI events avoids making + assumptions about the previous state of the kernel stack. + +* DEBUG_STACK. DEBUG_STKSZ + + Used for hardware debug interrupts (interrupt 1) and for software + debug interrupts (INT3). + + When debugging a kernel, debug interrupts (both hardware and + software) can occur at any time. Using IST for these interrupts + avoids making assumptions about the previous state of the kernel + stack. + +* MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE). + + Used for interrupt 18 - Machine Check Exception (#MC). + + MCE can be delivered at any time, including when the kernel is in the + middle of switching stacks. Using IST for MCE events avoids making + assumptions about the previous state of the kernel stack. + +For more details see the Intel IA32 or AMD AMD64 architecture manuals. + + +Printing backtraces on x86 +-------------------------- + +The question about the '?' preceding function names in an x86 stacktrace +keeps popping up, here's an indepth explanation. It helps if the reader +stares at print_context_stack() and the whole machinery in and around +arch/x86/kernel/dumpstack.c. + +Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>: + +We always scan the full kernel stack for return addresses stored on +the kernel stack(s) [*], from stack top to stack bottom, and print out +anything that 'looks like' a kernel text address. + +If it fits into the frame pointer chain, we print it without a question +mark, knowing that it's part of the real backtrace. + +If the address does not fit into our expected frame pointer chain we +still print it, but we print a '?'. It can mean two things: + + - either the address is not part of the call chain: it's just stale + values on the kernel stack, from earlier function calls. This is + the common case. + + - or it is part of the call chain, but the frame pointer was not set + up properly within the function, so we don't recognize it. + +This way we will always print out the real call chain (plus a few more +entries), regardless of whether the frame pointer was set up correctly +or not - but in most cases we'll get the call chain right as well. The +entries printed are strictly in stack order, so you can deduce more +information from that as well. + +The most important property of this method is that we _never_ lose +information: we always strive to print _all_ addresses on the stack(s) +that look like kernel text addresses, so if debug information is wrong, +we still print out the real call chain as well - just with more question +marks than ideal. + +[*] For things like IRQ and IST stacks, we also scan those stacks, in + the right order, and try to cross from one stack into another + reconstructing the call chain. This works most of the time. diff --git a/kernel/Documentation/x86/mtrr.txt b/kernel/Documentation/x86/mtrr.txt index cc071dc33..dc3e70391 100644 --- a/kernel/Documentation/x86/mtrr.txt +++ b/kernel/Documentation/x86/mtrr.txt @@ -1,7 +1,31 @@ MTRR (Memory Type Range Register) control -3 Jun 1999 -Richard Gooch - + +Richard Gooch - 3 Jun 1999 +Luis R. Rodriguez - April 9, 2015 + +=============================================================================== +Phasing out MTRR use + +MTRR use is replaced on modern x86 hardware with PAT. Direct MTRR use by +drivers on Linux is now completely phased out, device drivers should use +arch_phys_wc_add() in combination with ioremap_wc() to make MTRR effective on +non-PAT systems while a no-op but equally effective on PAT enabled systems. + +Even if Linux does not use MTRRs directly, some x86 platform firmware may still +set up MTRRs early before booting the OS. They do this as some platform +firmware may still have implemented access to MTRRs which would be controlled +and handled by the platform firmware directly. An example of platform use of +MTRRs is through the use of SMI handlers, one case could be for fan control, +the platform code would need uncachable access to some of its fan control +registers. Such platform access does not need any Operating System MTRR code in +place other than mtrr_type_lookup() to ensure any OS specific mapping requests +are aligned with platform MTRR setup. If MTRRs are only set up by the platform +firmware code though and the OS does not make any specific MTRR mapping +requests mtrr_type_lookup() should always return MTRR_TYPE_INVALID. + +For details refer to Documentation/x86/pat.txt. + +=============================================================================== On Intel P6 family processors (Pentium Pro, Pentium II and later) the Memory Type Range Registers (MTRRs) may be used to control diff --git a/kernel/Documentation/x86/pat.txt b/kernel/Documentation/x86/pat.txt index cf08c9fff..54944c71b 100644 --- a/kernel/Documentation/x86/pat.txt +++ b/kernel/Documentation/x86/pat.txt @@ -12,7 +12,7 @@ virtual addresses. PAT allows for different types of memory attributes. The most commonly used ones that will be supported at this time are Write-back, Uncached, -Write-combined and Uncached Minus. +Write-combined, Write-through and Uncached Minus. PAT APIs @@ -34,16 +34,23 @@ ioremap | -- | UC- | UC- | | | | | ioremap_cache | -- | WB | WB | | | | | +ioremap_uc | -- | UC | UC | + | | | | ioremap_nocache | -- | UC- | UC- | | | | | ioremap_wc | -- | -- | WC | | | | | +ioremap_wt | -- | -- | WT | + | | | | set_memory_uc | UC- | -- | -- | set_memory_wb | | | | | | | | set_memory_wc | WC | -- | -- | set_memory_wb | | | | | | | | +set_memory_wt | WT | -- | -- | + set_memory_wb | | | | + | | | | pci sysfs resource | -- | -- | UC- | | | | | pci sysfs resource_wc | -- | -- | WC | @@ -102,7 +109,38 @@ wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc() as step 0 above and also track the usage of those pages and use set_memory_wb() before the page is freed to free pool. - +MTRR effects on PAT / non-PAT systems +------------------------------------- + +The following table provides the effects of using write-combining MTRRs when +using ioremap*() calls on x86 for both non-PAT and PAT systems. Ideally +mtrr_add() usage will be phased out in favor of arch_phys_wc_add() which will +be a no-op on PAT enabled systems. The region over which a arch_phys_wc_add() +is made, should already have been ioremapped with WC attributes or PAT entries, +this can be done by using ioremap_wc() / set_memory_wc(). Devices which +combine areas of IO memory desired to remain uncacheable with areas where +write-combining is desirable should consider use of ioremap_uc() followed by +set_memory_wc() to white-list effective write-combined areas. Such use is +nevertheless discouraged as the effective memory type is considered +implementation defined, yet this strategy can be used as last resort on devices +with size-constrained regions where otherwise MTRR write-combining would +otherwise not be effective. + +---------------------------------------------------------------------- +MTRR Non-PAT PAT Linux ioremap value Effective memory type +---------------------------------------------------------------------- + Non-PAT | PAT + PAT + |PCD + ||PWT + ||| +WC 000 WB _PAGE_CACHE_MODE_WB WC | WC +WC 001 WC _PAGE_CACHE_MODE_WC WC* | WC +WC 010 UC- _PAGE_CACHE_MODE_UC_MINUS WC* | UC +WC 011 UC _PAGE_CACHE_MODE_UC UC | UC +---------------------------------------------------------------------- + +(*) denotes implementation defined and is discouraged Notes: @@ -115,8 +153,8 @@ can be more restrictive, in case of any existing aliasing for that address. For example: If there is an existing uncached mapping, a new ioremap_wc can return uncached mapping in place of write-combine requested. -set_memory_[uc|wc] and set_memory_wb should be used in pairs, where driver will -first make a region uc or wc and switch it back to wb after use. +set_memory_[uc|wc|wt] and set_memory_wb should be used in pairs, where driver +will first make a region uc, wc or wt and switch it back to wb after use. Over time writes to /proc/mtrr will be deprecated in favor of using PAT based interfaces. Users writing to /proc/mtrr are suggested to use above interfaces. @@ -124,7 +162,7 @@ interfaces. Users writing to /proc/mtrr are suggested to use above interfaces. Drivers should use ioremap_[uc|wc] to access PCI BARs with [uc|wc] access types. -Drivers should use set_memory_[uc|wc] to set access type for RAM ranges. +Drivers should use set_memory_[uc|wc|wt] to set access type for RAM ranges. PAT debugging diff --git a/kernel/Documentation/x86/x86_64/boot-options.txt b/kernel/Documentation/x86/x86_64/boot-options.txt index 522347929..68ed3114c 100644 --- a/kernel/Documentation/x86/x86_64/boot-options.txt +++ b/kernel/Documentation/x86/x86_64/boot-options.txt @@ -31,6 +31,9 @@ Machine check (e.g. BIOS or hardware monitoring applications), conflicting with OS's error handling, and you cannot deactivate the agent, then this option will be a help. + mce=no_lmce + Do not opt-in to Local MCE delivery. Use legacy method + to broadcast MCEs. mce=bootlog Enable logging of machine checks left over from booting. Disabled by default on AMD because some BIOS leave bogus ones. diff --git a/kernel/Documentation/x86/x86_64/kernel-stacks b/kernel/Documentation/x86/x86_64/kernel-stacks deleted file mode 100644 index e3c8a49d1..000000000 --- a/kernel/Documentation/x86/x86_64/kernel-stacks +++ /dev/null @@ -1,101 +0,0 @@ -Most of the text from Keith Owens, hacked by AK - -x86_64 page size (PAGE_SIZE) is 4K. - -Like all other architectures, x86_64 has a kernel stack for every -active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. -These stacks contain useful data as long as a thread is alive or a -zombie. While the thread is in user space the kernel stack is empty -except for the thread_info structure at the bottom. - -In addition to the per thread stacks, there are specialized stacks -associated with each CPU. These stacks are only used while the kernel -is in control on that CPU; when a CPU returns to user space the -specialized stacks contain no useful data. The main CPU stacks are: - -* Interrupt stack. IRQSTACKSIZE - - Used for external hardware interrupts. If this is the first external - hardware interrupt (i.e. not a nested hardware interrupt) then the - kernel switches from the current task to the interrupt stack. Like - the split thread and interrupt stacks on i386, this gives more room - for kernel interrupt processing without having to increase the size - of every per thread stack. - - The interrupt stack is also used when processing a softirq. - -Switching to the kernel interrupt stack is done by software based on a -per CPU interrupt nest counter. This is needed because x86-64 "IST" -hardware stacks cannot nest without races. - -x86_64 also has a feature which is not available on i386, the ability -to automatically switch to a new stack for designated events such as -double fault or NMI, which makes it easier to handle these unusual -events on x86_64. This feature is called the Interrupt Stack Table -(IST). There can be up to 7 IST entries per CPU. The IST code is an -index into the Task State Segment (TSS). The IST entries in the TSS -point to dedicated stacks; each stack can be a different size. - -An IST is selected by a non-zero value in the IST field of an -interrupt-gate descriptor. When an interrupt occurs and the hardware -loads such a descriptor, the hardware automatically sets the new stack -pointer based on the IST value, then invokes the interrupt handler. If -the interrupt came from user mode, then the interrupt handler prologue -will switch back to the per-thread stack. If software wants to allow -nested IST interrupts then the handler must adjust the IST values on -entry to and exit from the interrupt handler. (This is occasionally -done, e.g. for debug exceptions.) - -Events with different IST codes (i.e. with different stacks) can be -nested. For example, a debug interrupt can safely be interrupted by an -NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack -pointers on entry to and exit from all IST events, in theory allowing -IST events with the same code to be nested. However in most cases, the -stack size allocated to an IST assumes no nesting for the same code. -If that assumption is ever broken then the stacks will become corrupt. - -The currently assigned IST stacks are :- - -* STACKFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). - - Used for interrupt 12 - Stack Fault Exception (#SS). - - This allows the CPU to recover from invalid stack segments. Rarely - happens. - -* DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). - - Used for interrupt 8 - Double Fault Exception (#DF). - - Invoked when handling one exception causes another exception. Happens - when the kernel is very confused (e.g. kernel stack pointer corrupt). - Using a separate stack allows the kernel to recover from it well enough - in many cases to still output an oops. - -* NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE). - - Used for non-maskable interrupts (NMI). - - NMI can be delivered at any time, including when the kernel is in the - middle of switching stacks. Using IST for NMI events avoids making - assumptions about the previous state of the kernel stack. - -* DEBUG_STACK. DEBUG_STKSZ - - Used for hardware debug interrupts (interrupt 1) and for software - debug interrupts (INT3). - - When debugging a kernel, debug interrupts (both hardware and - software) can occur at any time. Using IST for these interrupts - avoids making assumptions about the previous state of the kernel - stack. - -* MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE). - - Used for interrupt 18 - Machine Check Exception (#MC). - - MCE can be delivered at any time, including when the kernel is in the - middle of switching stacks. Using IST for MCE events avoids making - assumptions about the previous state of the kernel stack. - -For more details see the Intel IA32 or AMD AMD64 architecture manuals. diff --git a/kernel/Documentation/x86/zero-page.txt b/kernel/Documentation/x86/zero-page.txt index 82fbdbc1e..95a4d34af 100644 --- a/kernel/Documentation/x86/zero-page.txt +++ b/kernel/Documentation/x86/zero-page.txt @@ -17,7 +17,8 @@ Offset Proto Name Meaning (struct ist_info) 080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!! 090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!! -0A0/010 ALL sys_desc_table System description table (struct sys_desc_table) +0A0/010 ALL sys_desc_table System description table (struct sys_desc_table), + OBSOLETE!! 0B0/010 ALL olpc_ofw_header OLPC's OpenFirmware CIF and friends 0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits 0C4/004 ALL ext_ramdisk_size ramdisk_size high 32bits -- cgit 1.2.3-korg