Merge "Update docs structure according to new guidelines in https://wiki.opnfv.org/display/DOC"

author: Ryota Mibu <r-mibu@cq.jp.nec.com> 2017-02-17 04:36:05 +0000
committer: Gerrit Code Review <gerrit@opnfv.org> 2017-02-17 04:36:05 +0000
commit: f3ab498aaddb27f6f598a84e2dbe0203ced6d666 (patch)
tree: fc7b2be2681db87adc1eb935e6fdcc93a8bc1645 /docs/development/requirements/07-annex.rst
parent: 5d9c24fd28bcc02243306a8c96d0c68809523343 (diff)
parent: d0b22e1d856cf8f78e152dfb6c150e001e03dd52 (diff)
1 files changed, 129 insertions, 0 deletions
diff --git a/docs/development/requirements/07-annex.rst b/docs/development/requirements/07-annex.rst
new file mode 100644
index 00000000..c3a7899d
--- /dev/null
+++ b/docs/development/requirements/07-annex.rst
@@ -0,0 +1,129 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+.. _nfvi_faults:
+
+Annex: NFVI Faults
+=================================================
+
+The Consumer needs to be notified immediately of faults in the listed elements
+so that it can take immediate action, such as a live migration or a switch to a
+hot standby entity. In addition, the Administrator of the host should trigger a
+maintenance action, e.g., rebooting the server or replacing a defective hardware
+element.
+
+Faults can be of different severity, i.e., critical, warning, or
+info. Critical faults require immediate action as a severe degradation of the
+system has happened or is expected. Warnings indicate that system
+performance is degrading: related actions include closer (e.g. more frequent)
+monitoring of that part of the system or preparation for a cold migration to a
+backup VM. Info messages do not require any action. We also consider a type
+"maintenance", which is not a real fault but may trigger maintenance actions
+such as a reboot of the server or replacement of faulty but redundant hardware.
+
+Faults can be gathered by, e.g., enabling SNMP and installing open source
+tools to catch SNMP traps and poll SNMP agents. With Zabbix, for example, one
+can also run an agent on the hosts to catch any other fault. In case of any
+failure, the Administrator should be notified. The following tables list
+high-level faults that are considered within the scope of the Doctor project
+and require immediate action by the Consumer.
+
+**Compute/Storage**
+
++-------------------+----------+------------+-----------------+------------------+
+| Fault             | Severity | How to     | Comment         | Immediate action |
+|                   |          | detect?    |                 | to recover       |
++===================+==========+============+=================+==================+
+| Processor/CPU     | Critical | Zabbix     |                 | Switch to hot    |
+| failure, CPU      |          |            |                 | standby          |
+| condition not ok  |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Memory failure/   | Critical | Zabbix     |                 | Switch to hot    |
+| Memory condition  |          | (IPMI)     |                 | standby          |
+| not ok            |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Network card      | Critical | Zabbix/    |                 | Switch to hot    |
+| failure, e.g.     |          | Ceilometer |                 | standby          |
+| network adapter   |          |            |                 |                  |
+| connectivity lost |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Disk crash        | Info     | RAID       | Network storage | Inform OAM       |
+|                   |          | monitoring | is very         |                  |
+|                   |          |            | redundant (e.g. |                  |
+|                   |          |            | RAID system)    |                  |
+|                   |          |            | and can         |                  |
+|                   |          |            | guarantee high  |                  |
+|                   |          |            | availability    |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Storage           | Critical | Zabbix     |                 | Live migration   |
+| controller        |          | (IPMI)     |                 | if storage       |
+|                   |          |            |                 | is still         |
+|                   |          |            |                 | accessible;      |
+|                   |          |            |                 | otherwise hot    |
+|                   |          |            |                 | standby          |
++-------------------+----------+------------+-----------------+------------------+
+| PDU/power         | Critical | Zabbix/    |                 | Switch to hot    |
+| failure, power    |          | Ceilometer |                 | standby          |
+| off, server reset |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Power degraded,   | Warning  | SNMP       |                 | Live migration   |
+| power redundancy  |          |            |                 |                  |
+| lost, power       |          |            |                 |                  |
+| threshold         |          |            |                 |                  |
+| exceeded          |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Chassis problem   | Warning  | SNMP       |                 | Live migration   |
+| (e.g. fan         |          |            |                 |                  |
+| degraded/failed,  |          |            |                 |                  |
+| chassis power     |          |            |                 |                  |
+| degraded), CPU    |          |            |                 |                  |
+| fan problem,      |          |            |                 |                  |
+| temperature/      |          |            |                 |                  |
+| thermal condition |          |            |                 |                  |
+| not ok            |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Mainboard failure | Critical | Zabbix     | e.g. PCIe, SAS  | Switch to hot    |
+|                   |          | (IPMI)     | link failure    | standby          |
++-------------------+----------+------------+-----------------+------------------+
+| OS crash (e.g.    | Critical | Zabbix     |                 | Switch to hot    |
+| kernel panic)     |          |            |                 | standby          |
++-------------------+----------+------------+-----------------+------------------+
+
+**Hypervisor**
+
++----------------+----------+------------+-------------+-------------------+
+| Fault          | Severity | How to     | Comment     | Immediate action  |
+|                |          | detect?    |             | to recover        |
++================+==========+============+=============+===================+
+| System has     | Critical | Zabbix     |             | Switch to hot     |
+| restarted      |          |            |             | standby           |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor     | Warning/ | Zabbix/    |             | Evacuation/switch |
+| failure        | Critical | Ceilometer |             | to hot standby    |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor     | Warning  | Alarming   | Zabbix/     | Rebuild VM        |
+| status not     |          | service    | Ceilometer  |                   |
+| retrievable    |          |            | unreachable |                   |
+| after certain  |          |            |             |                   |
+| period         |          |            |             |                   |
++----------------+----------+------------+-------------+-------------------+
+
+**Network**
+
++------------------+----------+------------+----------------+---------------------+
+| Fault            | Severity | How to     | Comment        | Immediate action to |
+|                  |          | detect?    |                | recover             |
++==================+==========+============+================+=====================+
+| SDN/OpenFlow     | Critical | Ceilometer |                | Switch to           |
+| switch,          |          |            |                | hot standby         |
+| controller       |          |            |                | or reconfigure      |
+| degraded/failed  |          |            |                | virtual network     |
+|                  |          |            |                | topology            |
++------------------+----------+------------+----------------+---------------------+
+| Hardware failure | Warning  | SNMP       | Redundancy of  | Live migration if   |
+| of physical      |          |            | physical       | possible, otherwise |
+| switch/router    |          |            | infrastructure | evacuation          |
+|                  |          |            | is reduced or  |                     |
+|                  |          |            | no longer      |                     |
+|                  |          |            | available      |                     |
++------------------+----------+------------+----------------+---------------------+
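
Reviewer note: the severity rules and fault tables in this annex could be encoded as data so that a monitoring integration can look up the documented action for a fault. The following is only a minimal sketch under stated assumptions: the names `Severity`, `Fault`, `FAULTS`, `requires_immediate_action`, and `action_for` are hypothetical and are not part of the Doctor project. The rows are copied from the tables above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Severity levels named in the annex text (hypothetical encoding)."""
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"
    MAINTENANCE = "maintenance"  # described in the annex as not a real fault


@dataclass(frozen=True)
class Fault:
    """One table row: fault, severity, detection source, recovery action."""
    name: str
    severity: Severity
    detected_by: str
    immediate_action: str


# A few rows copied from the annex tables; this list is illustrative only.
FAULTS = [
    Fault("Processor/CPU failure", Severity.CRITICAL, "Zabbix",
          "Switch to hot standby"),
    Fault("Disk crash", Severity.INFO, "RAID monitoring", "Inform OAM"),
    Fault("OS crash", Severity.CRITICAL, "Zabbix", "Switch to hot standby"),
    Fault("Hardware failure of physical switch/router", Severity.WARNING,
          "SNMP", "Live migration if possible, otherwise evacuation"),
]


def requires_immediate_action(severity: Severity) -> bool:
    """Per the annex prose, only critical faults require immediate action."""
    return severity is Severity.CRITICAL


def action_for(fault_name: str) -> str:
    """Return the documented recovery action for a known fault name."""
    for fault in FAULTS:
        if fault.name == fault_name:
            return fault.immediate_action
    raise KeyError(f"unknown fault: {fault_name}")
```

For example, `action_for("Disk crash")` returns the "Inform OAM" entry from the Compute/Storage table. A flat list keeps the sketch close to the table layout; a real integration would likely key faults by the identifiers that its monitoring tool emits.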