diff --git a/docs/development/requirements/07-annex.rst b/docs/development/requirements/07-annex.rst
new file mode 100644
index 00000000..c3a7899d
--- /dev/null
+++ b/docs/development/requirements/07-annex.rst
@@ -0,0 +1,129 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+.. _nfvi_faults:
+
+Annex: NFVI Faults
+=================================================
+
+The Consumer needs to be notified immediately of faults in the elements listed
+below so that it can react at once, e.g., by live migration or by switching to
+a hot standby entity. In addition, the Administrator of the host should
+trigger a maintenance action, e.g., rebooting the server or replacing a
+defective hardware element.
+
+Faults can be of different severity, i.e., critical, warning, or
+info. Critical faults require immediate action as a severe degradation of the
+system has happened or is expected. Warnings indicate that the system
+performance is degrading; related actions include closer (e.g., more frequent)
+monitoring of that part of the system or preparation for a cold migration to a
+backup VM. Info messages do not require any action. We also consider a type
+"maintenance", which is not a real fault but may trigger maintenance actions
+such as a reboot of the server or replacement of faulty but redundant hardware.
+
+Faults can be gathered by, e.g., enabling SNMP and installing open source
+tools to catch and poll SNMP data. With Zabbix, for example, an agent can
+also run on the hosts to catch any other faults. In case of any failure, the
+Administrator should be notified. The following tables list the high-level
+faults that are considered within the scope of the Doctor project and that
+require immediate action by the Consumer.
+
+**Compute/Storage**
+
++--------------------+----------+------------+-----------------+------------------+
+| Fault              | Severity | How to     | Comment         | Immediate action |
+|                    |          | detect?    |                 | to recover       |
++====================+==========+============+=================+==================+
+| Processor/CPU      | Critical | Zabbix     |                 | Switch to hot    |
+| failure, CPU       |          |            |                 | standby          |
+| condition not ok   |          |            |                 |                  |
++--------------------+----------+------------+-----------------+------------------+
+| Memory failure/    | Critical | Zabbix     |                 | Switch to hot    |
+| Memory condition   |          | (IPMI)     |                 | standby          |
+| not ok             |          |            |                 |                  |
++--------------------+----------+------------+-----------------+------------------+
+| Network card       | Critical | Zabbix/    |                 | Switch to hot    |
+| failure, e.g.      |          | Ceilometer |                 | standby          |
+| network adapter    |          |            |                 |                  |
+| connectivity lost  |          |            |                 |                  |
++--------------------+----------+------------+-----------------+------------------+
+| Disk crash         | Info     | RAID       | Network storage | Inform OAM       |
+|                    |          | monitoring | is very         |                  |
+|                    |          |            | redundant (e.g. |                  |
+|                    |          |            | RAID system)    |                  |
+|                    |          |            | and can         |                  |
+|                    |          |            | guarantee high  |                  |
+|                    |          |            | availability    |                  |
++--------------------+----------+------------+-----------------+------------------+
+| Storage            | Critical | Zabbix     |                 | Live migration   |
+| controller         |          | (IPMI)     |                 | if storage       |
+|                    |          |            |                 | is still         |
+|                    |          |            |                 | accessible;      |
+|                    |          |            |                 | otherwise hot    |
+|                    |          |            |                 | standby          |
++--------------------+----------+------------+-----------------+------------------+
+| PDU/power          | Critical | Zabbix/    |                 | Switch to hot    |
+| failure, power     |          | Ceilometer |                 | standby          |
+| off, server reset  |          |            |                 |                  |
++--------------------+----------+------------+-----------------+------------------+
+| Power              | Warning  | SNMP       |                 | Live migration   |
+| degradation, power |          |            |                 |                  |
+| redundancy lost,   |          |            |                 |                  |
+| power threshold    |          |            |                 |                  |
+| exceeded           |          |            |                 |                  |
++--------------------+----------+------------+-----------------+------------------+
+| Chassis problem    | Warning  | SNMP       |                 | Live migration   |
+| (e.g. fan          |          |            |                 |                  |
+| degraded/failed,   |          |            |                 |                  |
+| chassis power      |          |            |                 |                  |
+| degraded), CPU     |          |            |                 |                  |
+| fan problem,       |          |            |                 |                  |
+| temperature/       |          |            |                 |                  |
+| thermal condition  |          |            |                 |                  |
+| not ok             |          |            |                 |                  |
++--------------------+----------+------------+-----------------+------------------+
+| Mainboard failure  | Critical | Zabbix     | e.g. PCIe, SAS  | Switch to hot    |
+|                    |          | (IPMI)     | link failure    | standby          |
++--------------------+----------+------------+-----------------+------------------+
+| OS crash (e.g.     | Critical | Zabbix     |                 | Switch to hot    |
+| kernel panic)      |          |            |                 | standby          |
++--------------------+----------+------------+-----------------+------------------+
+
+**Hypervisor**
+
++----------------+----------+------------+-------------+-------------------+
+| Fault          | Severity | How to     | Comment     | Immediate action  |
+|                |          | detect?    |             | to recover        |
++================+==========+============+=============+===================+
+| System has     | Critical | Zabbix     |             | Switch to hot     |
+| restarted      |          |            |             | standby           |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor     | Warning/ | Zabbix/    |             | Evacuation/switch |
+| failure        | Critical | Ceilometer |             | to hot standby    |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor     | Warning  | Alarming   | Zabbix/     | Rebuild VM        |
+| status not     |          | service    | Ceilometer  |                   |
+| retrievable    |          |            | unreachable |                   |
+| after certain  |          |            |             |                   |
+| period         |          |            |             |                   |
++----------------+----------+------------+-------------+-------------------+
+
+**Network**
+
++------------------+----------+------------+----------------+---------------------+
+| Fault            | Severity | How to     | Comment        | Immediate action to |
+|                  |          | detect?    |                | recover             |
++==================+==========+============+================+=====================+
+| SDN/OpenFlow     | Critical | Ceilometer |                | Switch to           |
+| switch,          |          |            |                | hot standby         |
+| controller       |          |            |                | or reconfigure      |
+| degraded/failed  |          |            |                | virtual network     |
+|                  |          |            |                | topology            |
++------------------+----------+------------+----------------+---------------------+
+| Hardware failure | Warning  | SNMP       | Redundancy of  | Live migration if   |
+| of physical      |          |            | physical       | possible; otherwise |
+| switch/router    |          |            | infrastructure | evacuation          |
+|                  |          |            | is reduced or  |                     |
+|                  |          |            | no longer      |                     |
+|                  |          |            | available      |                     |
++------------------+----------+------------+----------------+---------------------+