summaryrefslogtreecommitdiffstats
path: root/kernel/Documentation/x86/x86_64/machinecheck
diff options
context:
space:
mode:
authorYunhong Jiang <yunhong.jiang@intel.com>2015-08-04 12:17:53 -0700
committerYunhong Jiang <yunhong.jiang@intel.com>2015-08-04 15:44:42 -0700
commit9ca8dbcc65cfc63d6f5ef3312a33184e1d726e00 (patch)
tree1c9cafbcd35f783a87880a10f85d1a060db1a563 /kernel/Documentation/x86/x86_64/machinecheck
parent98260f3884f4a202f9ca5eabed40b1354c489b29 (diff)
Add the rt linux 4.1.3-rt3 as base
Import the rt linux 4.1.3-rt3 as OPNFV kvm base. It's from git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git linux-4.1.y-rt and the base is: commit 0917f823c59692d751951bf5ea699a2d1e2f26a2 Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Sat Jul 25 12:13:34 2015 +0200 Prepare v4.1.3-rt3 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> We lose all the git history this way and it's not good. We should apply another opnfv project repo in future. Change-Id: I87543d81c9df70d99c5001fbdf646b202c19f423 Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Diffstat (limited to 'kernel/Documentation/x86/x86_64/machinecheck')
-rw-r--r--kernel/Documentation/x86/x86_64/machinecheck83
1 files changed, 83 insertions, 0 deletions
diff --git a/kernel/Documentation/x86/x86_64/machinecheck b/kernel/Documentation/x86/x86_64/machinecheck
new file mode 100644
index 000000000..b1fb30273
--- /dev/null
+++ b/kernel/Documentation/x86/x86_64/machinecheck
@@ -0,0 +1,83 @@
+
+Configurable sysfs parameters for the x86-64 machine check code.
+
+Machine checks report internal hardware error conditions detected
+by the CPU. Uncorrected errors typically cause a machine check
+(often with panic), corrected ones cause a machine check log entry.
+
+Machine checks are organized in banks (normally associated with
+a hardware subsystem) and subevents in a bank. The exact meaning
+of the banks and subevent is CPU specific.
+
+mcelog knows how to decode them.
+
+When you see the "Machine check errors logged" message in the system
+log then mcelog should run to collect and decode machine check entries
+from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
+
+Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
+(N = CPU number)
+
+The directory contains some configurable entries:
+
+Entries:
+
+bankNctl
+(N bank number)
+ 64bit Hex bitmask enabling/disabling specific subevents for bank N
+ When a bit in the bitmask is zero then the respective
+ subevent will not be reported.
+ By default all events are enabled.
+ Note that BIOS maintain another mask to disable specific events
+ per bank. This is not visible here
+
+The following entries appear for each CPU, but they are truly shared
+between all CPUs.
+
+check_interval
+ How often to poll for corrected machine check errors, in seconds
+ (Note output is hexademical). Default 5 minutes. When the poller
+ finds MCEs it triggers an exponential speedup (poll more often) on
+ the polling interval. When the poller stops finding MCEs, it
+ triggers an exponential backoff (poll less often) on the polling
+ interval. The check_interval variable is both the initial and
+ maximum polling interval. 0 means no polling for corrected machine
+ check errors (but some corrected errors might be still reported
+ in other ways)
+
+tolerant
+ Tolerance level. When a machine check exception occurs for a non
+ corrected machine check the kernel can take different actions.
+ Since machine check exceptions can happen any time it is sometimes
+ risky for the kernel to kill a process because it defies
+ normal kernel locking rules. The tolerance level configures
+ how hard the kernel tries to recover even at some risk of
+ deadlock. Higher tolerant values trade potentially better uptime
+ with the risk of a crash or even corruption (for tolerant >= 3).
+
+ 0: always panic on uncorrected errors, log corrected errors
+ 1: panic or SIGBUS on uncorrected errors, log corrected errors
+ 2: SIGBUS or log uncorrected errors, log corrected errors
+ 3: never panic or SIGBUS, log all errors (for testing only)
+
+ Default: 1
+
+ Note this only makes a difference if the CPU allows recovery
+ from a machine check exception. Current x86 CPUs generally do not.
+
+trigger
+ Program to run when a machine check event is detected.
+ This is an alternative to running mcelog regularly from cron
+ and allows to detect events faster.
+monarch_timeout
+ How long to wait for the other CPUs to machine check too on a
+ exception. 0 to disable waiting for other CPUs.
+ Unit: us
+
+TBD document entries for AMD threshold interrupt configuration
+
+For more details about the x86 machine check architecture
+see the Intel and AMD architecture manuals from their developer websites.
+
+For more details about the architecture see
+see http://one.firstfloor.org/~andi/mce.pdf