Diffstat (limited to 'docs/requirements')
-rw-r--r--  docs/requirements/02-use_cases.rst  2
-rw-r--r--  docs/requirements/03-architecture.rst  34
-rw-r--r--  docs/requirements/04-gaps.rst  55
-rw-r--r--  docs/requirements/05-implementation.rst  117
-rw-r--r--  docs/requirements/07-annex.rst  2
-rw-r--r--[-rwxr-xr-x]  docs/requirements/images/figure1.png  bin 977880 -> 79420 bytes
-rw-r--r--[-rwxr-xr-x]  docs/requirements/images/figure2.png  bin 1043699 -> 82010 bytes
7 files changed, 133 insertions, 77 deletions
diff --git a/docs/requirements/02-use_cases.rst b/docs/requirements/02-use_cases.rst
index 424a3c6e..0a1f6413 100644
--- a/docs/requirements/02-use_cases.rst
+++ b/docs/requirements/02-use_cases.rst
@@ -136,7 +136,7 @@ the same as in the "Fault management using ACT-STBY configuration" use case,
except in this case, the Consumer of a VM/VNF switches to STBY configuration
based on a predicted fault, rather than an occurred fault.
-NVFI Maintenance
+NFVI Maintenance
----------------
VM Retirement
diff --git a/docs/requirements/03-architecture.rst b/docs/requirements/03-architecture.rst
index 8ff5dacf..b7417691 100644
--- a/docs/requirements/03-architecture.rst
+++ b/docs/requirements/03-architecture.rst
@@ -191,11 +191,15 @@ fencing, but there has not been any progress. The general description is
available here:
https://wiki.openstack.org/wiki/Fencing_Instances_of_an_Unreachable_Host
-As OpenStack does not cover fencing it is in the responsibility of the Doctor
-project to make sure fencing is done by using tools like pacemaker and by
-calling OpenStack APIs. Only after fencing is done OpenStack resources can be
-marked as down. In case there are gaps in OpenStack projects to have all
-relevant resources marked as down, those gaps need to be identified and fixed.
+OpenStack provides some mechanisms that allow fencing of faulty resources. Some
+are automatically invoked by the platform itself (e.g. Nova disables the
+compute service when libvirtd stops running, preventing new VMs from being
+scheduled to that node), while other mechanisms are consumer-triggered actions
+(e.g. Neutron port admin-state-up). For other fencing actions not supported by
+OpenStack, the Doctor project may suggest ways to address the gap (e.g. by
+resorting to external tools and orchestration methods), or may document or
+implement them upstream.
+
The Doctor Inspector component will be responsible of marking resources down in
the OpenStack and back up if necessary.
@@ -206,18 +210,18 @@ In the basic :ref:`uc-fault1` use case, no automatic actions will be taken by
the VIM, but all recovery actions executed by the VIM and the NFVI will be
instructed and coordinated by the Consumer.
-In a more advanced use case, the VIM shall be able to recover the failed virtual
+In a more advanced use case, the VIM may be able to recover the failed virtual
resources according to a pre-defined behavior for that resource. In principle
this means that the owner of the resource (i.e., its consumer or administrator)
can define which recovery actions shall be taken by the VIM. Examples are a
-restart of the VM, migration/evacuation of the VM, or no action.
+restart of the VM or migration/evacuation of the VM.
High level northbound interface specification
---------------------------------------------
-Fault management
+Fault Management
^^^^^^^^^^^^^^^^
This interface allows the Consumer to subscribe to fault notification from the
@@ -261,7 +265,8 @@ physical resource from 'enabled' to 'going-to-maintenance' and a timeout [#timeo
After receiving the MaintenanceRequest,the VIM decides on the actions to be taken
based on maintenance policies predefined by the affected Consumer(s).
-.. [#timeout] Timeout is set by the Administrator and corresponds to the maximum time to empty the physical resources.
+.. [#timeout] Timeout is set by the Administrator and corresponds to the maximum time
+ to empty the physical resources.
.. figure:: images/figure5a.png
:name: figure5a
@@ -321,12 +326,13 @@ An example of a high level message flow to cover the failed NFVI maintenance cas
shown in :numref:`figure5c`.
It consists of the following steps:
-5. The Consumer C3 switches to standby configuration (STDBY).
-6. Instructions from Consumers C2/C3 are shared to VIM requesting certain actions to be performed (steps 6a, 6b).
- The VIM executes the requested actions and sends back a NACK to consumer C2 (step 6d) as the
- migration of the virtual resource(s) is not completed by the given timeout.
+5. The Consumer C3 switches to standby configuration (STBY).
+6. Instructions from Consumers C2/C3 are shared with the VIM requesting certain actions to be performed
+ (steps 6a, 6b). The VIM executes the requested actions and sends back a NACK to consumer C2
+ (step 6d) as the migration of the virtual resource(s) is not completed by the given timeout.
7. The VIM switches the physical resources to "enabled" state.
-8. MaintenanceResponse is sent from VIM to inform the Administrator that the maintenance action cannot start.
+8. MaintenanceNotification is sent from VIM to inform the Administrator that the maintenance action
+ cannot start.
..
diff --git a/docs/requirements/04-gaps.rst b/docs/requirements/04-gaps.rst
index 154f8e43..b8ff7f2e 100644
--- a/docs/requirements/04-gaps.rst
+++ b/docs/requirements/04-gaps.rst
@@ -61,6 +61,13 @@ Immediate Notification
- Fault notifications cannot be received immediately by Ceilometer.
+* Solved by
+
+ + Event Alarm Evaluator:
+ https://specs.openstack.org/openstack/ceilometer-specs/specs/liberty/event-alarm-evaluator.html
+ + New OpenStack alarms and notifications project AODH:
+ http://docs.openstack.org/developer/aodh/
+
Maintenance Notification
^^^^^^^^^^^^^^^^^^^^^^^^
@@ -98,7 +105,7 @@ Maintenance Notification
- VIM user cannot receive maintenance notifications.
-* Related blueprints
+* Solved by
+ https://blueprints.launchpad.net/nova/+spec/service-status-notification
@@ -126,6 +133,10 @@ Normalization of data collection models
- Normalized data format does not exist.
+* Solved by
+
+ + Specification in Section :ref:`southbound`.
+
OpenStack
---------
@@ -157,7 +168,7 @@ ________________________________
- Ceilometer seems to be unsuitable for monitoring medium and large scale
NFVI deployments.
-* Related blueprints
+* Solved by
+ Usage of Zabbix for fault aggregation [ZABB]_. Zabbix can support a much
higher number of fault events (up to 15 thousand events per second, but
@@ -189,13 +200,14 @@ ___________________________________
- OpenStack Ceilometer does not monitor hardware and software to capture
faults.
- + Gap
+ + Gap
- - Ceilometer is not able to detect and handle all faults listed in the Annex.
+ - Ceilometer is not able to detect and handle all faults listed in the Annex.
-* Related blueprints / workarounds
+* Solved by
- - Use other dedicated monitoring tools like Zabbix or Monasca
+ + Use of dedicated monitoring tools like Zabbix or Monasca.
+ See :ref:`nfvi_faults`.
Nova
^^^^
@@ -218,15 +230,14 @@ ________________________________________
+ To-be
- - There needs to be API to change VM power_State in case host has failed.
- - There needs to be API to change nova-compute state.
+ - The API shall support changing the VM power state in case the host has failed.
+ - The API shall support changing the nova-compute state.
- There could be single API to change different VM states for all VMs
- belonging to specific host.
- - As external system monitoring the infra calls these APIs change can be
- fast and reliable.
- - Correlation actions can be faster and automated as states are reliable.
- - User will be able to read states from OpenStack and trust they are
- correct.
+ belonging to a specific host.
+ - Support external systems that monitor the infrastructure and resources and
+ that are able to call the API quickly and reliably.
+ - Resource states are reliable such that correlation actions can be fast and automated.
+ - User shall be able to read states from OpenStack and trust they are correct.
+ As-is
@@ -240,12 +251,11 @@ ________________________________________
+ Gap
- OpenStack does not change its states fast and reliably enough.
- - There is API missing to have external system to change states and to
- trust the states are then reliable (external system has fenced failed
- host).
+ - The API does not allow an external system to change states and to trust
+ that the states are reliable (external system has fenced the failed host).
- User cannot read all the states from OpenStack nor trust they are right.
-* Related blueprints
+* Solved by
+ https://blueprints.launchpad.net/nova/+spec/mark-host-down
+ https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service
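As a minimal sketch of what the force-down capability referenced by these blueprints looks like from an external monitor, the Python below only builds the REST request and sends nothing. The path ``/os-services/force-down``, the ``forced_down`` field and microversion 2.11 reflect my understanding of the Nova compute API and should be checked against the deployed release. The helper name ``build_force_down_request`` and the host name are hypothetical, and authentication (e.g. a token header) is left out.

```python
import json
from typing import Dict, Tuple


def build_force_down_request(
    host: str, forced_down: bool = True, binary: str = "nova-compute"
) -> Tuple[str, str, Dict[str, str], str]:
    """Return (method, path, headers, body) for a Nova force-down call.

    Nothing is sent here; the caller hands the tuple to an HTTP client
    and adds authentication headers as required by the deployment.
    """
    headers = {
        "Content-Type": "application/json",
        # Assumption: forced_down needs compute API microversion 2.11 or later.
        "X-OpenStack-Nova-API-Version": "2.11",
    }
    body = json.dumps(
        {"host": host, "binary": binary, "forced_down": forced_down}
    )
    return "PUT", "/os-services/force-down", headers, body


# Hypothetical usage: mark the compute service on a fenced host as down.
method, path, headers, body = build_force_down_request("compute-1")
print(method, path, body)
```

The monitor would issue this call only after the host has been fenced, which is what allows the resulting state to be trusted.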
@@ -309,7 +319,7 @@ _________________
underlying root cause of failure. Knowing the root cause can help filter
out unnecessary and overwhelming alarms.
-* Related blueprints / workarounds
+* Status
+ Monasca as of now lacks this feature, although the community is aware and
working toward supporting it.
@@ -334,7 +344,7 @@ _________________
- Sensor monitoring is very important. It provides operators status
on the state of the physical infrastructure (e.g. temperature, fans).
-* Related blueprints / workarounds
+* Addressed by
+ Monasca can be configured to use third-party monitoring solutions (e.g.
Nagios, Cacti) for retrieving additional data.
@@ -370,7 +380,10 @@ _____________________________
+ Gap
- - Cause of the delay needs to be identified and fixed
+ - The delay is caused by periodic evaluation and notification. The period defaults to 30s
+ and can be reduced to 5s, but not below.
+ https://github.com/zabbix/zabbix/blob/trunk/conf/zabbix_server.conf#L329
+
..
vim: set tabstop=4 expandtab textwidth=80:
diff --git a/docs/requirements/05-implementation.rst b/docs/requirements/05-implementation.rst
index 4c89fdf5..84979772 100644
--- a/docs/requirements/05-implementation.rst
+++ b/docs/requirements/05-implementation.rst
@@ -672,47 +672,81 @@ and correlated alarms. Instead the AODH alarm class has attributes for actions,
rules and user and project id.
-+------------------------+------------------------+------------------------+
-| ETSI NFV Alarm Type    | OPNFV Doctor Req Spec  | AODH Alarm Type        |
-+========================+========================+========================+
-| AlarmId                | FaultId                | Alarm Id               |
-+------------------------+------------------------+------------------------+
-| managedObjectId        | virtualResourceId      | (N/A)                  |
-+------------------------+------------------------+------------------------+
-| \-                     | \-                     | User_Id, Project_Id    |
-+------------------------+------------------------+------------------------+
-| alarmRaisedTime        | \-                     | (N/A)                  |
-+------------------------+------------------------+------------------------+
-| alarmChangedTime       | \-                     | (N/A)                  |
-+------------------------+------------------------+------------------------+
-| alarmClearedTime       | \-                     | (N/A)                  |
-+------------------------+------------------------+------------------------+
-| alarmState:            | virtualResourceState   | State: ok, alarm,      |
-| New, Updated, Cleared  | (e.g. normal,          | insufficient data      |
-|                        | maintenance, down,     |                        |
-|                        | error)                 |                        |
-+------------------------+------------------------+------------------------+
-| vrPerceivedSeverity:   | Severity (Integer)     | Severity: low,         |
-| Critical, Major, Minor,|                        | moderate, critical     |
-| Warning, Indeterminate,|                        |                        |
-| Cleared                |                        |                        |
-+------------------------+------------------------+------------------------+
-| eventTime (unclear?)   | EventTime              | (N/A)                  |
-+------------------------+------------------------+------------------------+
-| faultType              | FaultType              | type                   |
-+------------------------+------------------------+------------------------+
-| probableCause          | ProbableCause          | description            |
-+------------------------+------------------------+------------------------+
-| isRootCause            | IsRootCause            | \-                     |
-+------------------------+------------------------+------------------------+
-| correlatedAlarmId      | CorrelatedFaultId      | \-                     |
-+------------------------+------------------------+------------------------+
-| faultDetails           | FaultDetails           | \-                     |
-+------------------------+------------------------+------------------------+
-| \-                     | \-                     | actions, rule, time    |
-|                        |                        | constraints            |
-+------------------------+------------------------+------------------------+
-
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| ETSI NFV Alarm Type    | OPNFV Doctor           | AODH Event Alarm    | Description / Comment                       | Recommendations                       |
+|                        | Requirement Specs      | Notification        |                                             |                                       |
++========================+========================+=====================+=============================================+=======================================+
+| alarmId                | FaultId                | alarm_id            | Identifier of an alarm.                     | \-                                    |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| \-                     | \-                     | alarm_name          | Human readable alarm name.                  | May be added in ETSI NFV Stage 3.     |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| managedObjectId        | VirtualResourceId      | (reason)            | Identifier of the affected virtual resource | \-                                    |
+|                        |                        |                     | is part of the AODH reason parameter.       |                                       |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| \-                     | \-                     | user_id, project_id | User and project identifiers.               | May be added in ETSI NFV Stage 3.     |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| alarmRaisedTime        | \-                     | \-                  | Timestamp when alarm was raised.            | To be added to Doctor and AODH. May   |
+|                        |                        |                     |                                             | be derived (e.g. in a shimlayer) from |
+|                        |                        |                     |                                             | the AODH alarm history.               |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| alarmChangedTime       | \-                     | \-                  | Timestamp when alarm was changed/updated.   | see above                             |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| alarmClearedTime       | \-                     | \-                  | Timestamp when alarm was cleared.           | see above                             |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| eventTime              | \-                     | \-                  | Timestamp when alarm was first observed by  | see above                             |
+|                        |                        |                     | the Monitor.                                |                                       |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| \-                     | EventTime              | generated           | Timestamp of the Notification.              | Update parameter name in Doctor spec. |
+|                        |                        |                     |                                             | May be added in ETSI NFV Stage 3.     |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| state:                 | VirtualResourceState:  | current: ok, alarm, | ETSI NFV IFA 005/006 lists example alarm    | Maintenance state is missing in AODH. |
+| E.g. Fired, Updated,   | E.g. normal, down,     | insufficient_data   | states.                                     | List of alarm states will be          |
+| Cleared                | maintenance, error     |                     |                                             | specified in ETSI NFV Stage 3.        |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| perceivedSeverity:     | Severity (Integer)     | Severity:           | ETSI NFV IFA 005/006 lists example          | List of alarm states will be          |
+| E.g. Critical, Major,  |                        | low (default),      | perceived severity values.                  | specified in ETSI NFV Stage 3.        |
+| Minor, Warning,        |                        | moderate, critical  |                                             |                                       |
+| Indeterminate, Cleared |                        |                     |                                             | **OPNFV: Severity (Integer)**:        |
+|                        |                        |                     |                                             | * update OPNFV Doctor specification   |
+|                        |                        |                     |                                             |   to *Enum*                           |
+|                        |                        |                     |                                             |                                       |
+|                        |                        |                     |                                             | **perceivedSeverity=Indeterminate**:  |
+|                        |                        |                     |                                             | * remove value *Indeterminate* in     |
+|                        |                        |                     |                                             |   IFA and map undefined values to     |
+|                        |                        |                     |                                             |   “minor” severity, or                |
+|                        |                        |                     |                                             | * add value *indeterminate* in AODH   |
+|                        |                        |                     |                                             |   and make it the default value.      |
+|                        |                        |                     |                                             |                                       |
+|                        |                        |                     |                                             | **perceivedSeverity=Cleared**:        |
+|                        |                        |                     |                                             | * remove value *Cleared* in IFA as    |
+|                        |                        |                     |                                             |   the information about a cleared     |
+|                        |                        |                     |                                             |   alarm can be derived from           |
+|                        |                        |                     |                                             |   the alarm state parameter, or       |
+|                        |                        |                     |                                             | * add value *cleared* in AODH and     |
+|                        |                        |                     |                                             |   set a rule that the severity is     |
+|                        |                        |                     |                                             |   “cleared” when the state is *ok*.   |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| faultType              | FaultType              | event_type in       | Type of the fault, e.g. “CPU failure” of a  | OpenStack Alarming (Aodh) can use a   |
+|                        |                        | reason_data         | compute resource, in machine interpretable  | fuzzy matching with wildcard string,  |
+|                        |                        |                     | format.                                     | "compute.cpu.failure".                |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| N/A                    | N/A                    | type = "event"      | Type of the notification. For fault         | \-                                    |
+|                        |                        |                     | notifications the type in AODH is “event”.  |                                       |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| probableCause          | ProbableCause          | \-                  | Probable cause of the alarm.                | May be provided (e.g. in a shimlayer) |
+|                        |                        |                     |                                             | based on Vitrage topology awareness / |
+|                        |                        |                     |                                             | root-cause-analysis.                  |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| isRootCause            | IsRootCause            | \-                  | Boolean indicating whether the fault is the | see above                             |
+|                        |                        |                     | root cause of other faults.                 |                                       |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| correlatedAlarmId      | CorrelatedFaultId      | \-                  | List of IDs of correlated faults.           | see above                             |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| faultDetails           | FaultDetails           | \-                  | Additional details about the fault/alarm.   | FaultDetails information element will |
+|                        |                        |                     |                                             | be specified in ETSI NFV Stage 3.     |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| \-                     | \-                     | action, previous    | Additional AODH alarm related parameters.   | \-                                    |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
Table: Comparison of alarm attributes
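
The attribute comparison in the new table could be applied in a shim layer roughly as follows. This is only a sketch: it assumes the AODH event alarm notification arrives as a flat dictionary keyed by the names in the "AODH Event Alarm Notification" column. The severity and state mappings are illustrative assumptions, since the table lists the value sets but does not fix a mapping. The function name ``to_etsi_alarm`` is hypothetical.

```python
from typing import Any, Dict, Optional

# Assumed mapping; the table only lists the values available on each side.
AODH_TO_ETSI_SEVERITY = {
    "low": "Minor",
    "moderate": "Major",
    "critical": "Critical",
}

# Assumed mapping; "insufficient_data" has no obvious ETSI counterpart.
AODH_TO_ETSI_STATE = {
    "alarm": "Fired",
    "ok": "Cleared",
}


def to_etsi_alarm(notification: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Translate an AODH-style notification dict into ETSI-style attributes."""
    reason_data = notification.get("reason_data") or {}
    # AODH uses "low" as its default severity according to the table.
    severity = notification.get("severity", "low")
    return {
        "alarmId": notification["alarm_id"],
        "state": AODH_TO_ETSI_STATE.get(notification.get("current")),
        "perceivedSeverity": AODH_TO_ETSI_SEVERITY.get(severity, "Indeterminate"),
        "faultType": reason_data.get("event_type"),
        # No AODH counterpart in the table; left empty for a later stage
        # (e.g. derived from alarm history or root-cause analysis).
        "alarmRaisedTime": None,
        "probableCause": None,
        "isRootCause": None,
        "correlatedAlarmId": None,
        "faultDetails": None,
    }


example = {
    "alarm_id": "a1",
    "alarm_name": "cpu-failure",
    "severity": "critical",
    "previous": "ok",
    "current": "alarm",
    "reason_data": {"event_type": "compute.cpu.failure"},
}
print(to_etsi_alarm(example))
```

Attributes without an AODH source are returned as ``None`` so that the gaps named in the "Recommendations" column stay visible to the caller.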
@@ -728,6 +762,7 @@ Other areas that need alignment is the so called alarm state in NFV. Here we mus
however consider what can be attributes of the notification vs. what should be a
property of the alarm instance. This will be analyzed later.
+.. _southbound:
Detailed southbound interface specification
-------------------------------------------
diff --git a/docs/requirements/07-annex.rst b/docs/requirements/07-annex.rst
index 8cb19612..2ebba0d8 100644
--- a/docs/requirements/07-annex.rst
+++ b/docs/requirements/07-annex.rst
@@ -1,6 +1,8 @@
.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0
+.. _nfvi_faults:
+
Annex: NFVI Faults
=================================================
diff --git a/docs/requirements/images/figure1.png b/docs/requirements/images/figure1.png
index dacf0dd4..267ddddc 100755..100644
--- a/docs/requirements/images/figure1.png
+++ b/docs/requirements/images/figure1.png
Binary files differ
diff --git a/docs/requirements/images/figure2.png b/docs/requirements/images/figure2.png
index 3c8a2bf1..9a3b166d 100755..100644
--- a/docs/requirements/images/figure2.png
+++ b/docs/requirements/images/figure2.png
Binary files differ