Diffstat (limited to 'docs/development/requirements/07-annex.rst')
-rw-r--r-- | docs/development/requirements/07-annex.rst | 129 |
1 files changed, 129 insertions, 0 deletions
diff --git a/docs/development/requirements/07-annex.rst b/docs/development/requirements/07-annex.rst
new file mode 100644
index 00000000..c3a7899d
--- /dev/null
+++ b/docs/development/requirements/07-annex.rst
@@ -0,0 +1,129 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+.. _nfvi_faults:
+
+Annex: NFVI Faults
+=================================================
+
+Faults in the listed elements need to be immediately notified to the Consumer in
+order to perform an immediate action like live migration or switch to a hot
+standby entity. In addition, the Administrator of the host should trigger a
+maintenance action to, e.g., reboot the server or replace a defective hardware
+element.
+
+Faults can be of different severity, i.e., critical, warning, or
+info. Critical faults require immediate action as a severe degradation of the
+system has happened or is expected. Warnings indicate that the system
+performance is going down: related actions include closer (e.g. more frequent)
+monitoring of that part of the system or preparation for a cold migration to a
+backup VM. Info messages do not require any action. We also consider a type
+"maintenance", which is no real fault, but may trigger maintenance actions
+like a re-boot of the server or replacement of a faulty, but redundant HW.
+
+Faults can be gathered by, e.g., enabling SNMP and installing some open source
+tools to catch and poll SNMP. When using for example Zabbix one can also put an
+agent running on the hosts to catch any other fault. In any case of failure, the
+Administrator should be notified. The following tables provide a list of high
+level faults that are considered within the scope of the Doctor project
+requiring immediate action by the Consumer.
+
+**Compute/Storage**
+
++-------------------+----------+------------+-----------------+------------------+
+| Fault             | Severity | How to     | Comment         | Immediate action |
+|                   |          | detect?    |                 | to recover       |
++===================+==========+============+=================+==================+
+| Processor/CPU     | Critical | Zabbix     |                 | Switch to hot    |
+| failure, CPU      |          |            |                 | standby          |
+| condition not ok  |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Memory failure/   | Critical | Zabbix     |                 | Switch to hot    |
+| Memory condition  |          | (IPMI)     |                 | standby          |
+| not ok            |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Network card      | Critical | Zabbix/    |                 | Switch to hot    |
+| failure, e.g.     |          | Ceilometer |                 | standby          |
+| network adapter   |          |            |                 |                  |
+| connectivity lost |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Disk crash        | Info     | RAID       | Network storage | Inform OAM       |
+|                   |          | monitoring | is very         |                  |
+|                   |          |            | redundant (e.g. |                  |
+|                   |          |            | RAID system)    |                  |
+|                   |          |            | and can         |                  |
+|                   |          |            | guarantee high  |                  |
+|                   |          |            | availability    |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Storage           | Critical | Zabbix     |                 | Live migration   |
+| controller        |          | (IPMI)     |                 | if storage       |
+|                   |          |            |                 | is still         |
+|                   |          |            |                 | accessible;      |
+|                   |          |            |                 | otherwise hot    |
+|                   |          |            |                 | standby          |
++-------------------+----------+------------+-----------------+------------------+
+| PDU/power         | Critical | Zabbix/    |                 | Switch to hot    |
+| failure, power    |          | Ceilometer |                 | standby          |
+| off, server reset |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Power             | Warning  | SNMP       |                 | Live migration   |
+| degradation, power|          |            |                 |                  |
+| redundancy lost,  |          |            |                 |                  |
+| power threshold   |          |            |                 |                  |
+| exceeded          |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Chassis problem   | Warning  | SNMP       |                 | Live migration   |
+| (e.g. fan         |          |            |                 |                  |
+| degraded/failed,  |          |            |                 |                  |
+| chassis power     |          |            |                 |                  |
+| degraded), CPU    |          |            |                 |                  |
+| fan problem,      |          |            |                 |                  |
+| temperature/      |          |            |                 |                  |
+| thermal condition |          |            |                 |                  |
+| not ok            |          |            |                 |                  |
++-------------------+----------+------------+-----------------+------------------+
+| Mainboard failure | Critical | Zabbix     | e.g. PCIe, SAS  | Switch to hot    |
+|                   |          | (IPMI)     | link failure    | standby          |
++-------------------+----------+------------+-----------------+------------------+
+| OS crash (e.g.    | Critical | Zabbix     |                 | Switch to hot    |
+| kernel panic)     |          |            |                 | standby          |
++-------------------+----------+------------+-----------------+------------------+
+
+**Hypervisor**
+
++----------------+----------+------------+-------------+-------------------+
+| Fault          | Severity | How to     | Comment     | Immediate action  |
+|                |          | detect?    |             | to recover        |
++================+==========+============+=============+===================+
+| System has     | Critical | Zabbix     |             | Switch to hot     |
+| restarted      |          |            |             | standby           |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor     | Warning/ | Zabbix/    |             | Evacuation/switch |
+| failure        | Critical | Ceilometer |             | to hot standby    |
++----------------+----------+------------+-------------+-------------------+
+| Hypervisor     | Warning  | Alarming   | Zabbix/     | Rebuild VM        |
+| status not     |          | service    | Ceilometer  |                   |
+| retrievable    |          |            | unreachable |                   |
+| after certain  |          |            |             |                   |
+| period         |          |            |             |                   |
++----------------+----------+------------+-------------+-------------------+
+
+**Network**
+
++------------------+----------+---------+----------------+---------------------+
+| Fault            | Severity | How to  | Comment        | Immediate action to |
+|                  |          | detect? |                | recover             |
++==================+==========+=========+================+=====================+
+| SDN/OpenFlow     | Critical | Ceilo-  |                | Switch to           |
+| switch,          |          | meter   |                | hot standby         |
+| controller       |          |         |                | or reconfigure      |
+| degraded/failed  |          |         |                | virtual network     |
+|                  |          |         |                | topology            |
++------------------+----------+---------+----------------+---------------------+
+| Hardware failure | Warning  | SNMP    | Redundancy of  | Live migration if   |
+| of physical      |          |         | physical       | possible otherwise  |
+| switch/router    |          |         | infrastructure | evacuation          |
+|                  |          |         | is reduced or  |                     |
+|                  |          |         | no longer      |                     |
+|                  |          |         | available      |                     |
++------------------+----------+---------+----------------+---------------------+
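
The fault tables in this annex can be read as a lookup from a fault to its severity and immediate recovery action. The following is a minimal sketch of that reading in Python and is not part of this change. The names `Severity`, `FAULT_TABLE`, `recovery_action`, and `faults_with_severity` are hypothetical, the dictionary keys are identifiers invented for illustration, and only a few rows of the tables are transcribed.

```python
from enum import Enum


class Severity(Enum):
    """Severity levels named in the annex text."""
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# A few rows transcribed from the annex tables: fault -> (severity, action).
# The keys are hypothetical identifiers chosen for this sketch only.
FAULT_TABLE = {
    "cpu_failure": (Severity.CRITICAL, "switch to hot standby"),
    "memory_failure": (Severity.CRITICAL, "switch to hot standby"),
    "disk_crash": (Severity.INFO, "inform OAM"),
    "power_redundancy_lost": (Severity.WARNING, "live migration"),
    "hypervisor_status_not_retrievable": (Severity.WARNING, "rebuild VM"),
}


def recovery_action(fault: str) -> str:
    """Return the immediate recovery action listed for a known fault."""
    _severity, action = FAULT_TABLE[fault]
    return action


def faults_with_severity(severity: Severity) -> list[str]:
    """Return the sorted identifiers of all faults with the given severity."""
    return sorted(
        name for name, (sev, _action) in FAULT_TABLE.items() if sev is severity
    )
```

For instance, `recovery_action("disk_crash")` returns the action from the "Disk crash" row, and `faults_with_severity(Severity.CRITICAL)` lists the transcribed faults for which the annex expects an immediate switch to a hot standby.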