-rw-r--r-- requirements/03-architecture.rst 61
-rw-r--r-- requirements/04-gaps.rst 5
-rw-r--r-- requirements/07-annex.rst 64
-rw-r--r-- requirements/index.rst 1
4 files changed, 68 insertions, 63 deletions
diff --git a/requirements/03-architecture.rst b/requirements/03-architecture.rst
index 58fa7968..14055485 100644
--- a/requirements/03-architecture.rst
+++ b/requirements/03-architecture.rst
@@ -284,65 +284,6 @@ It consists of the following steps:
8. The Administrator is coordinating and executing the maintenance
operation/work on the NFVI. Note: this step is out of scope of Doctor.
-Faults
-------
-
-Faults in the listed elements need to be immediately notified to the Consumer in
-order to perform an immediate action like live migration or switch to a hot
-standby entity. In addition, the Administrator of the host should trigger a
-maintenance action to, e.g., reboot the server or replace a defective hardware
-element.
-
-Faults can be of different severity, i.e., critical, warning, or
-info. Critical faults require immediate action as a severe degradation of the
-system has happened or is expected. Warnings indicate that the system
-performance is going down: related actions include closer (e.g. more frequent)
-monitoring of that part of the system or preparation for a cold migration to a
-backup VM. Info messages do not require any action. We also consider a type
-"maintenance", which is no real fault, but may trigger maintenance actions
-like a re-boot of the server or replacement of a faulty, but redundant HW.
-
-Faults can be gathered by, e.g., enabling SNMP and installing some open source
-tools to catch and poll SNMP. When using for example Zabbix one can also put an
-agent running on the hosts to catch any other fault. In any case of failure, the
-Administrator should be notified. Table 1 provides a list of high level faults
-that are considered within the scope of the Doctor project requiring immediate
-action by the Consumer.
-
-
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| Service | Fault | Severity | How to detect? | Comment | Action to recover |
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| Compute Hardware | Processor/CPU failure, CPU condition not ok | Critical | Zabbix | | Switch to hot standby |
-+ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| | Memory failure/Memory condition not ok | Critical | Zabbix (IPMI) | | Switch to hot standby |
-+ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| | Network card failure, e.g. network adapter connectivity lost | Critical | Zabbix/Ceilometer | | Switch to hot standby |
-+ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| | Disk crash | Info | RAID monitoring | Network storage is very redundant (e.g. RAID system) and can guarantee high availability | Inform OAM |
-+ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| | Storage controller | Critical | Zabbix (IPMI) | | Live migration if storage is still accessible; otherwise hot standby |
-+ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| | PDU/power failure, power off, server reset | Critical | Zabbix/Ceilometer | | Switch to hot standby |
-+ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| | Power degradation, power redundancy lost, power threshold exceeded | Warning | SNMP | | Live migration |
-+ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| | Chassis problem (.e.g fan degraded/failed, chassis power degraded), CPU fan problem, temperature/thermal condition not ok | Warning | SNMP | | Live migration |
-+ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| | Mainboard failure | Critical | Zabbix (IPMI) | | Switch to hot standby |
-+ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| | OS crash (e.g. kernel panic) | Critical | Zabbix | | Switch to hot standby |
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| Hypervisor | System has restarted | Critical | Zabbix | | Switch to hot standby |
-+ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| | Hypervisor failure | Warning/Critical | Zabbix/Ceilometer | | Evacuation/switch to hot standby |
-+ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| | Zabbix/Ceilometer is unreachable | Warning | ? | | Live migration |
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| Network | SDN/OpenFlow switch, controller degraded/failed | Critical | ? | | Switch to hot standby or reconfigure virtual network topology |
-+ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-| | Hardware failure of physical switch/router | Warning | SNMP | Redundancy of physical infrastructure is reduced or no longer available | Live migration if possible, otherwise evacuation |
-+------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
-
..
vim: set tabstop=4 expandtab textwidth=80:
+
diff --git a/requirements/04-gaps.rst b/requirements/04-gaps.rst
index a5e37fd4..67ef86c4 100644
--- a/requirements/04-gaps.rst
+++ b/requirements/04-gaps.rst
@@ -176,7 +176,7 @@ ___________________________________
handle faults on them by Ceilometer.
- OpenStack may have monitoring functionality in itself and can be
integrated with third party monitoring tools.
- - OpenStack need to be able to detect the faults listed in Section 3.5.
+ - OpenStack needs to be able to detect the faults listed in the Annex.
+ As-is
@@ -188,8 +188,7 @@ ___________________________________
+ Gap
- - Ceilometer is not able to detect and handle all faults listed in Section
- 3.5.
+ - Ceilometer is not able to detect and handle all faults listed in the Annex.
* Related blueprints / workarounds
diff --git a/requirements/07-annex.rst b/requirements/07-annex.rst
new file mode 100644
index 00000000..dbe41bd1
--- /dev/null
+++ b/requirements/07-annex.rst
@@ -0,0 +1,64 @@
+Annex: NFVI Faults
+=================================================
+
+Faults in the listed elements need to be reported to the Consumer immediately
+so that it can take immediate action, such as live migration or switching to a
+hot standby entity. In addition, the Administrator of the host should trigger
+a maintenance action, e.g., rebooting the server or replacing a defective
+hardware element.
+
+Faults can be of different severity, i.e., critical, warning, or
+info. Critical faults require immediate action as a severe degradation of the
+system has happened or is expected. Warnings indicate that the system
+performance is degrading: related actions include closer (e.g. more frequent)
+monitoring of that part of the system or preparation for a cold migration to a
+backup VM. Info messages do not require any action. We also consider a type
+"maintenance", which is not a real fault, but may trigger maintenance actions
+such as a reboot of the server or replacement of faulty but redundant hardware.
+
+Faults can be gathered by, e.g., enabling SNMP and installing open source
+tools that receive SNMP traps and poll SNMP agents. With Zabbix, for example,
+an agent can also run on the hosts to catch other faults. In case of any
+failure, the Administrator should be notified. The following table lists the
+high-level faults that are considered within the scope of the Doctor project
+and require immediate action by the Consumer.
+
+
+
++------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| Service | Fault | Severity | How to detect? | Comment | Action to recover |
++------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| Compute Hardware | Processor/CPU failure, CPU condition not ok | Critical | Zabbix | | Switch to hot standby |
++ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| | Memory failure/Memory condition not ok | Critical | Zabbix (IPMI) | | Switch to hot standby |
++ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| | Network card failure, e.g. network adapter connectivity lost | Critical | Zabbix/Ceilometer | | Switch to hot standby |
++ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| | Disk crash | Info | RAID monitoring | Network storage is very redundant (e.g. RAID system) and can guarantee high availability | Inform OAM |
++ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| | Storage controller | Critical | Zabbix (IPMI) | | Live migration if storage is still accessible; otherwise hot standby |
++ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| | PDU/power failure, power off, server reset | Critical | Zabbix/Ceilometer | | Switch to hot standby |
++ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| | Power degradation, power redundancy lost, power threshold exceeded | Warning | SNMP | | Live migration |
++ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| | Chassis problem (e.g. fan degraded/failed, chassis power degraded), CPU fan problem, temperature/thermal condition not ok | Warning | SNMP | | Live migration |
++ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| | Mainboard failure | Critical | Zabbix (IPMI) | | Switch to hot standby |
++ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| | OS crash (e.g. kernel panic) | Critical | Zabbix | | Switch to hot standby |
++------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| Hypervisor | System has restarted | Critical | Zabbix | | Switch to hot standby |
++ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| | Hypervisor failure | Warning/Critical | Zabbix/Ceilometer | | Evacuation/switch to hot standby |
++ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| | Zabbix/Ceilometer is unreachable | Warning | ? | | Live migration |
++------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| Network | SDN/OpenFlow switch, controller degraded/failed | Critical | ? | | Switch to hot standby or reconfigure virtual network topology |
++ +---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+| | Hardware failure of physical switch/router | Warning | SNMP | Redundancy of physical infrastructure is reduced or no longer available | Live migration if possible, otherwise evacuation |
++------------------+---------------------------------------------------------------------------------------------------------------------------+------------------+-------------------+------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
+
+..
+ vim: set tabstop=4 expandtab textwidth=80:
+
diff --git a/requirements/index.rst b/requirements/index.rst
index 8495365d..61046c3f 100644
--- a/requirements/index.rst
+++ b/requirements/index.rst
@@ -59,6 +59,7 @@ Doctor: Fault Management and Maintenance
04-gaps.rst
05-implementation.rst
06-summary.rst
+ 07-annex.rst
.. include::
99-references.rst
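
As a reading aid for the annex introduced by this change, here is a minimal Python sketch of how the severity classes and a few table rows could be expressed as a lookup. It is not part of the Doctor project: the names `Severity`, `DEFAULT_HANDLING`, `FAULT_TABLE` and `recovery_action` are hypothetical, the fault keys are abbreviated from the table, and falling back to a per-severity default for unlisted faults is an assumption made for illustration.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"
    MAINTENANCE = "maintenance"  # not a real fault, per the annex text


# Generic handling per severity, paraphrased from the annex prose.
DEFAULT_HANDLING = {
    Severity.CRITICAL: "take immediate action",
    Severity.WARNING: "monitor more closely or prepare a cold migration",
    Severity.INFO: "no action required",
    Severity.MAINTENANCE: "schedule a maintenance action",
}

# A few rows taken from the annex table (fault names abbreviated):
# fault -> (severity, action to recover).
FAULT_TABLE = {
    "Mainboard failure": (Severity.CRITICAL, "Switch to hot standby"),
    "Power degradation": (Severity.WARNING, "Live migration"),
    "Disk crash": (Severity.INFO, "Inform OAM"),
}


def recovery_action(fault: str, severity: Severity) -> str:
    """Return the table's action for a listed fault.

    For faults not in the table, fall back to the generic handling for the
    given severity (an assumption of this sketch, not stated in the annex).
    """
    if fault in FAULT_TABLE:
        return FAULT_TABLE[fault][1]
    return DEFAULT_HANDLING[severity]
```

For example, `recovery_action("Mainboard failure", Severity.CRITICAL)` returns the table entry, while a fault that is not listed falls back to the generic handling for its severity.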