Diffstat (limited to 'docs')
-rw-r--r-- docs/design/index.rst | 2
-rw-r--r-- docs/design/inspector-design-guideline.rst | 46
-rw-r--r-- docs/design/performance-profiler.rst | 118
-rw-r--r-- docs/requirements/02-use_cases.rst | 2
-rw-r--r-- docs/requirements/03-architecture.rst | 34
-rw-r--r-- docs/requirements/04-gaps.rst | 55
-rw-r--r-- docs/requirements/05-implementation.rst | 117
-rw-r--r-- docs/requirements/07-annex.rst | 2
-rw-r--r--[-rwxr-xr-x] docs/requirements/images/figure1.png | bin 977880 -> 79420 bytes
-rw-r--r--[-rwxr-xr-x] docs/requirements/images/figure2.png | bin 1043699 -> 82010 bytes
10 files changed, 299 insertions, 77 deletions
diff --git a/docs/design/index.rst b/docs/design/index.rst
index 4efbef17..963002a0 100644
--- a/docs/design/index.rst
+++ b/docs/design/index.rst
@@ -23,3 +23,5 @@ See also https://wiki.opnfv.org/requirements_projects .
notification-alarm-evaluator.rst
rfe-port-status-update.rst
port-data-plane-status.rst
+ inspector-design-guideline.rst
+ performance-profiler.rst
diff --git a/docs/design/inspector-design-guideline.rst b/docs/design/inspector-design-guideline.rst
new file mode 100644
index 00000000..4add8c0f
--- /dev/null
+++ b/docs/design/inspector-design-guideline.rst
@@ -0,0 +1,46 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+==========================
+Inspector Design Guideline
+==========================
+
+.. NOTE::
+ This is a draft spec of the design guideline for the inspector component.
+ The JIRA ticket that tracks updates and collects comments is `DOCTOR-73`_.
+
+This document summarizes best practices for designing a high-performance
+inspector that meets the requirements of the `OPNFV Doctor project`_.
+
+Problem Description
+===================
+
+Some pitfalls were detected during development of the sample inspector, e.g.
+we suffered a significant `performance degrading in listing VMs in a host`_.
+
+A `patch set for caching the list`_ has been committed to solve the issue. When
+a new inspector is integrated, it would be useful to evaluate its existing
+design and give recommendations for improvements.
+
+This document can be treated as a source of related blueprints in inspector
+projects.
+
+Guidelines
+==========
+
+Host specific VMs list
+----------------------
+
+TBD, see `DOCTOR-76`_.
+
+Parallel execution
+------------------
+
+TBD, see `discussion in mailing list`_.
+
+.. _DOCTOR-73: https://jira.opnfv.org/browse/DOCTOR-73
+.. _OPNFV Doctor project: https://wiki.opnfv.org/doctor
+.. _performance degrading in listing VMs in a host: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-September/012591.html
+.. _patch set for caching the list: https://gerrit.opnfv.org/gerrit/#/c/20877/
+.. _DOCTOR-76: https://jira.opnfv.org/browse/DOCTOR-76
+.. _discussion in mailing list: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-October/013036.html
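The caching idea mentioned in the Problem Description of inspector-design-guideline.rst can be sketched as follows. This is only an illustration and does not reproduce the referenced patch set: the ``HostVMCache`` class, the ``list_servers`` callable (assumed to return ``(vm_id, hostname)`` pairs) and the TTL value are all hypothetical.

```python
import time
from collections import defaultdict


class HostVMCache:
    """Caches a host -> VM-id mapping so a host lookup avoids a full listing."""

    def __init__(self, list_servers, ttl_seconds=30.0, clock=time.monotonic):
        # list_servers: hypothetical callable returning (vm_id, hostname) pairs.
        self._list_servers = list_servers
        self._ttl = ttl_seconds
        self._clock = clock
        self._by_host = None
        self._loaded_at = 0.0

    def _refresh(self, now):
        # One full listing is grouped by host and reused until it goes stale.
        by_host = defaultdict(list)
        for vm_id, hostname in self._list_servers():
            by_host[hostname].append(vm_id)
        self._by_host = dict(by_host)
        self._loaded_at = now

    def vms_on_host(self, hostname):
        """Return the VM ids on hostname, refreshing the cache when stale."""
        now = self._clock()
        if self._by_host is None or now - self._loaded_at >= self._ttl:
            self._refresh(now)
        return list(self._by_host.get(hostname, []))
```

With such a cache, the inspector would pay for the full VM listing at most once per TTL window instead of once per host failure; the trade-off is that VMs created or moved within the window may be missed until the next refresh.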
diff --git a/docs/design/performance-profiler.rst b/docs/design/performance-profiler.rst
new file mode 100644
index 00000000..f834a915
--- /dev/null
+++ b/docs/design/performance-profiler.rst
@@ -0,0 +1,118 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+
+====================
+Performance Profiler
+====================
+
+https://goo.gl/98Osig
+
+This blueprint proposes to create a performance profiler for doctor scenarios.
+
+Problem Description
+===================
+
+In the verification job for notification time, we have encountered some
+performance issues, such as:
+
+1. An environment deployed by APEX meets the criteria, while in one deployed by
+   Fuel the performance is much poorer.
+2. Significant performance degradation was spotted when we increased the total
+   number of VMs.
+
+It takes time to dig through the logs and analyse the cause. People have to
+collect timestamps at each checkpoint manually to find the bottleneck. A
+performance profiler will automate this process.
+
+Proposed Change
+===============
+
+The current Doctor scenario covers the inspector and notifier in the whole fault
+management cycle::
+
+    start                                          end
+      +       +         +        +       +          +
+      |       |         |        |       |          |
+      |monitor|inspector|notifier|manager|controller|
+      +------>+         |        |       |          |
+    occurred  +-------->+        |       |          |
+      |    detected     +------->+       |          |
+      |       |     identified   +-------+          |
+      |       |               notified   +--------->+
+      |       |                  |   processed  resolved
+      |       |                  |                  |
+      |       +<-----doctor----->+                  |
+      |                                             |
+      |                                             |
+      +<---------------fault management------------>+
+
+The notification time can be split into several parts and visualized as a
+timeline::
+
+    start                                         end
+      0----5---10---15---20---25---30---35---40---45--> (x 10ms)
+      +    +   +   +   +    +      +   +   +   +   +
+    0-hostdown |   |   |    |      |   |   |   |   |
+      +--->+   |   |   |    |      |   |   |   |   |
+      |  1-raw failure |    |      |   |   |   |   |
+      |    +-->+   |   |    |      |   |   |   |   |
+      |    | 2-found affected      |   |   |   |   |
+      |    |   +-->+   |    |      |   |   |   |   |
+      |    |     3-marked host down|   |   |   |   |
+      |    |       +-->+    |      |   |   |   |   |
+      |    |         4-set VM error|   |   |   |   |
+      |    |           +--->+      |   |   |   |   |
+      |    |           |  5-notified VM error  |   |
+      |    |           |    +----->|   |   |   |   |
+      |    |           |    |    6-transformed event
+      |    |           |    |      +-->+   |   |   |
+      |    |           |    |      | 7-evaluated event
+      |    |           |    |      |   +-->+   |   |
+      |    |           |    |      |     8-fired alarm
+      |    |           |    |      |       +-->+   |
+      |    |           |    |      |         9-received alarm
+      |    |           |    |      |           +-->+
+    sample | sample    |    |      |           |10-handled alarm
+    monitor| inspector |nova| c/m  |    aodh   |
+      |                                        |
+      +<-----------------doctor--------------->+
+
+Note: c/m = ceilometer
+
+And a table of components, sorted by time cost from highest to lowest:
+
++----------+---------+----------+
+|Component |Time Cost|Percentage|
++==========+=========+==========+
+|inspector |160ms | 40% |
++----------+---------+----------+
+|aodh |110ms | 30% |
++----------+---------+----------+
+|monitor |50ms | 14% |
++----------+---------+----------+
+|... | | |
++----------+---------+----------+
+|... | | |
++----------+---------+----------+
+
+Note: the data in the table is for demonstration only, not an actual measurement.
+
+Timestamps can be collected from various sources:
+
+1. log files
+2. trace points in code
+
+The performance profiler will be integrated into the verification job to provide
+detailed test results. It can also be deployed independently to diagnose
+performance issues in a specific environment.
+
+Working Items
+=============
+
+1. PoC with limited checkpoints
+2. Integration with verification job
+3. Collect timestamps at all checkpoints
+4. Display the profiling result in the console
+5. Report the profiling result to the test database
+6. Independent package which can be installed in a specific environment
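The per-component breakdown proposed in performance-profiler.rst above could be computed along the following lines. This is a minimal sketch, not part of the commit: the ``Checkpoint`` type, the ``time_cost_by_component`` function, the attribution of checkpoints to components and all timestamp values are hypothetical, chosen only to mirror the timeline and table in the blueprint. Each interval between two consecutive checkpoints is charged to the component that reached the later checkpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """A named point on the notification timeline (times in milliseconds)."""
    name: str
    component: str
    timestamp_ms: float


def time_cost_by_component(start_ms, checkpoints):
    """Return (component, cost_ms, percentage) rows, most expensive first."""
    ordered = sorted(checkpoints, key=lambda c: c.timestamp_ms)
    costs = {}
    previous = start_ms
    for cp in ordered:
        # Charge the elapsed interval to the component that hit this checkpoint.
        costs[cp.component] = costs.get(cp.component, 0.0) + (cp.timestamp_ms - previous)
        previous = cp.timestamp_ms
    total = previous - start_ms
    rows = [
        (component, cost, 100.0 * cost / total if total else 0.0)
        for component, cost in costs.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Illustrative values only, not measurements.
SAMPLE_CHECKPOINTS = [
    Checkpoint("hostdown", "monitor", 50.0),
    Checkpoint("raw failure", "inspector", 90.0),
    Checkpoint("found affected", "inspector", 130.0),
    Checkpoint("marked host down", "inspector", 170.0),
    Checkpoint("set VM error", "nova", 220.0),
    Checkpoint("notified VM error", "c/m", 290.0),
    Checkpoint("transformed event", "aodh", 330.0),
    Checkpoint("evaluated event", "aodh", 370.0),
    Checkpoint("fired alarm", "aodh", 410.0),
]

if __name__ == "__main__":
    for component, cost, share in time_cost_by_component(0.0, SAMPLE_CHECKPOINTS):
        print(f"{component:<10} {cost:6.0f}ms {share:5.1f}%")
```

The timestamps themselves would come from the sources the blueprint names (log files or trace points); how they are parsed is left open here.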
diff --git a/docs/requirements/02-use_cases.rst b/docs/requirements/02-use_cases.rst
index 424a3c6e..0a1f6413 100644
--- a/docs/requirements/02-use_cases.rst
+++ b/docs/requirements/02-use_cases.rst
@@ -136,7 +136,7 @@ the same as in the "Fault management using ACT-STBY configuration" use case,
except in this case, the Consumer of a VM/VNF switches to STBY configuration
based on a predicted fault, rather than an occurred fault.
-NVFI Maintenance
+NFVI Maintenance
----------------
VM Retirement
diff --git a/docs/requirements/03-architecture.rst b/docs/requirements/03-architecture.rst
index 8ff5dacf..b7417691 100644
--- a/docs/requirements/03-architecture.rst
+++ b/docs/requirements/03-architecture.rst
@@ -191,11 +191,15 @@ fencing, but there has not been any progress. The general description is
available here:
https://wiki.openstack.org/wiki/Fencing_Instances_of_an_Unreachable_Host
-As OpenStack does not cover fencing it is in the responsibility of the Doctor
-project to make sure fencing is done by using tools like pacemaker and by
-calling OpenStack APIs. Only after fencing is done OpenStack resources can be
-marked as down. In case there are gaps in OpenStack projects to have all
-relevant resources marked as down, those gaps need to be identified and fixed.
+OpenStack provides some mechanisms that allow fencing of faulty resources. Some
+are automatically invoked by the platform itself (e.g. Nova disables the
+compute service when libvirtd stops running, preventing new VMs from being
+scheduled to that node), while other mechanisms are consumer-triggered actions
+(e.g. Neutron port admin-state-up). For other fencing actions not supported by
+OpenStack, the Doctor project may suggest ways to address the gap (e.g. by
+resorting to external tools and orchestration methods), or by documenting or
+implementing them upstream.
+
The Doctor Inspector component will be responsible of marking resources down in
the OpenStack and back up if necessary.
@@ -206,18 +210,18 @@ In the basic :ref:`uc-fault1` use case, no automatic actions will be taken by
the VIM, but all recovery actions executed by the VIM and the NFVI will be
instructed and coordinated by the Consumer.
-In a more advanced use case, the VIM shall be able to recover the failed virtual
+In a more advanced use case, the VIM may be able to recover the failed virtual
resources according to a pre-defined behavior for that resource. In principle
this means that the owner of the resource (i.e., its consumer or administrator)
can define which recovery actions shall be taken by the VIM. Examples are a
-restart of the VM, migration/evacuation of the VM, or no action.
+restart of the VM or migration/evacuation of the VM.
High level northbound interface specification
---------------------------------------------
-Fault management
+Fault Management
^^^^^^^^^^^^^^^^
This interface allows the Consumer to subscribe to fault notification from the
@@ -261,7 +265,8 @@ physical resource from 'enabled' to 'going-to-maintenance' and a timeout [#timeo
After receiving the MaintenanceRequest,the VIM decides on the actions to be taken
based on maintenance policies predefined by the affected Consumer(s).
-.. [#timeout] Timeout is set by the Administrator and corresponds to the maximum time to empty the physical resources.
+.. [#timeout] Timeout is set by the Administrator and corresponds to the maximum time
+ to empty the physical resources.
.. figure:: images/figure5a.png
:name: figure5a
@@ -321,12 +326,13 @@ An example of a high level message flow to cover the failed NFVI maintenance cas
shown in :numref:`figure5c`.
It consists of the following steps:
-5. The Consumer C3 switches to standby configuration (STDBY).
-6. Instructions from Consumers C2/C3 are shared to VIM requesting certain actions to be performed (steps 6a, 6b).
- The VIM executes the requested actions and sends back a NACK to consumer C2 (step 6d) as the
- migration of the virtual resource(s) is not completed by the given timeout.
+5. The Consumer C3 switches to standby configuration (STBY).
+6. Instructions from Consumers C2/C3 are shared to VIM requesting certain actions to be performed
+ (steps 6a, 6b). The VIM executes the requested actions and sends back a NACK to consumer C2
+ (step 6d) as the migration of the virtual resource(s) is not completed by the given timeout.
7. The VIM switches the physical resources to "enabled" state.
-8. MaintenanceResponse is sent from VIM to inform the Administrator that the maintenance action cannot start.
+8. MaintenanceNotification is sent from VIM to inform the Administrator that the maintenance action
+ cannot start.
..
diff --git a/docs/requirements/04-gaps.rst b/docs/requirements/04-gaps.rst
index 154f8e43..b8ff7f2e 100644
--- a/docs/requirements/04-gaps.rst
+++ b/docs/requirements/04-gaps.rst
@@ -61,6 +61,13 @@ Immediate Notification
- Fault notifications cannot be received immediately by Ceilometer.
+* Solved by
+
+ + Event Alarm Evaluator:
+ https://specs.openstack.org/openstack/ceilometer-specs/specs/liberty/event-alarm-evaluator.html
+ + New OpenStack alarms and notifications project AODH:
+ http://docs.openstack.org/developer/aodh/
+
Maintenance Notification
^^^^^^^^^^^^^^^^^^^^^^^^
@@ -98,7 +105,7 @@ Maintenance Notification
- VIM user cannot receive maintenance notifications.
-* Related blueprints
+* Solved by
+ https://blueprints.launchpad.net/nova/+spec/service-status-notification
@@ -126,6 +133,10 @@ Normalization of data collection models
- Normalized data format does not exist.
+* Solved by
+
+ + Specification in Section :ref:`southbound`.
+
OpenStack
---------
@@ -157,7 +168,7 @@ ________________________________
- Ceilometer seems to be unsuitable for monitoring medium and large scale
NFVI deployments.
-* Related blueprints
+* Solved by
+ Usage of Zabbix for fault aggregation [ZABB]_. Zabbix can support a much
higher number of fault events (up to 15 thousand events per second, but
@@ -189,13 +200,14 @@ ___________________________________
- OpenStack Ceilometer does not monitor hardware and software to capture
faults.
- + Gap
+ + Gap
- - Ceilometer is not able to detect and handle all faults listed in the Annex.
+ - Ceilometer is not able to detect and handle all faults listed in the Annex.
-* Related blueprints / workarounds
+* Solved by
- - Use other dedicated monitoring tools like Zabbix or Monasca
+ + Use of dedicated monitoring tools like Zabbix or Monasca.
+ See :ref:`nfvi_faults`.
Nova
^^^^
@@ -218,15 +230,14 @@ ________________________________________
+ To-be
- - There needs to be API to change VM power_State in case host has failed.
- - There needs to be API to change nova-compute state.
+ - The API shall support changing the VM power state in case the host has failed.
+ - The API shall support changing the nova-compute state.
- There could be single API to change different VM states for all VMs
- belonging to specific host.
- - As external system monitoring the infra calls these APIs change can be
- fast and reliable.
- - Correlation actions can be faster and automated as states are reliable.
- - User will be able to read states from OpenStack and trust they are
- correct.
+ belonging to a specific host.
+ - External systems that monitor the infrastructure and resources shall be able
+ to call the API so that state changes are fast and reliable.
+ - Resource states are reliable such that correlation actions can be fast and automated.
+ - User shall be able to read states from OpenStack and trust they are correct.
+ As-is
@@ -240,12 +251,11 @@ ________________________________________
+ Gap
- OpenStack does not change its states fast and reliably enough.
- - There is API missing to have external system to change states and to
- trust the states are then reliable (external system has fenced failed
- host).
+ - The API does not allow an external system to change states and to trust
+ that the states are then reliable (external system has fenced failed host).
- User cannot read all the states from OpenStack nor trust they are right.
-* Related blueprints
+* Solved by
+ https://blueprints.launchpad.net/nova/+spec/mark-host-down
+ https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service
@@ -309,7 +319,7 @@ _________________
underlying root cause of failure. Knowing the root cause can help filter
out unnecessary and overwhelming alarms.
-* Related blueprints / workarounds
+* Status
+ Monasca as of now lacks this feature, although the community is aware and
working toward supporting it.
@@ -334,7 +344,7 @@ _________________
- Sensor monitoring is very important. It provides operators status
on the state of the physical infrastructure (e.g. temperature, fans).
-* Related blueprints / workarounds
+* Addressed by
+ Monasca can be configured to use third-party monitoring solutions (e.g.
Nagios, Cacti) for retrieving additional data.
@@ -370,7 +380,10 @@ _____________________________
+ Gap
- - Cause of the delay needs to be identified and fixed
+ - The delay is caused by periodic evaluation and notification. The period defaults
+ to 30s and can be reduced to 5s, but not below.
+ https://github.com/zabbix/zabbix/blob/trunk/conf/zabbix_server.conf#L329
+
..
vim: set tabstop=4 expandtab textwidth=80:
diff --git a/docs/requirements/05-implementation.rst b/docs/requirements/05-implementation.rst
index 4c89fdf5..84979772 100644
--- a/docs/requirements/05-implementation.rst
+++ b/docs/requirements/05-implementation.rst
@@ -672,47 +672,81 @@ and correlated alarms. Instead the AODH alarm class has attributes for actions,
rules and user and project id.
-+------------------------+------------------------+------------------------+
-| ETSI NFV Alarm Type | OPNFV Doctor Req Spec | AODH Alarm Type |
-+========================+========================+========================+
-| AlarmId | FaultId | Alarm Id |
-+------------------------+------------------------+------------------------+
-| managedObjectId | virtualResourceId | (N/A) |
-+------------------------+------------------------+------------------------+
-| \- | \- | User_Id, Project_Id |
-+------------------------+------------------------+------------------------+
-| alarmRaisedTime | \- | (N/A) |
-+------------------------+------------------------+------------------------+
-| alarmChangedTime | \- | (N/A) |
-+------------------------+------------------------+------------------------+
-| alarmClearedTime | \- | (N/A) |
-+------------------------+------------------------+------------------------+
-| alarmState: | virtualResourceState | State: ok, alarm, |
-| New, Updated, Cleared | (e.g. normal, | insufficient data |
-| | maintenance, down, | |
-| | error) | |
-+------------------------+------------------------+------------------------+
-| vrPerceivedSeverity: | Severity (Integer) | Severity: low, |
-| Critical, Major, Minor,| | moderate, critical |
-| Warning, Indeterminate,| | |
-| Cleared | | |
-+------------------------+------------------------+------------------------+
-| eventTime (unclear?) | EventTime | (N/A) |
-+------------------------+------------------------+------------------------+
-| faultType | FaultType | type |
-+------------------------+------------------------+------------------------+
-| probableCause | ProbableCause | description |
-+------------------------+------------------------+------------------------+
-| isRootCause | IsRootCause | \- |
-+------------------------+------------------------+------------------------+
-| correlatedAlarmId | CorrelatedFaultId | \- |
-+------------------------+------------------------+------------------------+
-| faultDetails | FaultDetails | \- |
-+------------------------+------------------------+------------------------+
-| \- | \- | actions, rule, time |
-| | | constraints |
-+------------------------+------------------------+------------------------+
-
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| ETSI NFV Alarm Type | OPNFV Doctor | AODH Event Alarm | Description / Comment | Recommendations |
+| | Requirement Specs | Notification | | |
++========================+========================+=====================+=============================================+=======================================+
+| alarmId | FaultId | alarm_id | Identifier of an alarm. | \- |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| \- | \- | alarm_name | Human readable alarm name. | May be added in ETSI NFV Stage 3. |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| managedObjectId | VirtualResourceId | (reason) | Identifier of the affected virtual resource | \- |
+| | | | is part of the AODH reason parameter. | |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| \- | \- | user_id, project_id | User and project identifiers. | May be added in ETSI NFV Stage 3. |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| alarmRaisedTime | \- | \- | Timestamp when alarm was raised. | To be added to Doctor and AODH. May |
+| | | | | be derived (e.g. in a shimlayer) from |
+| | | | | the AODH alarm history. |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| alarmChangedTime | \- | \- | Timestamp when alarm was changed/updated. | see above |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| alarmClearedTime | \- | \- | Timestamp when alarm was cleared. | see above |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| eventTime | \- | \- | Timestamp when alarm was first observed by | see above |
+| | | | the Monitor. | |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| \- | EventTime | generated | Timestamp of the Notification. | Update parameter name in Doctor spec. |
+| | | | | May be added in ETSI NFV Stage 3. |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| state: | VirtualResourceState: | current: ok, alarm, | ETSI NFV IFA 005/006 lists example alarm | Maintenance state is missing in AODH. |
+| E.g. Fired, Updated,   | E.g. normal, down,     | insufficient_data   | states.                                     | List of alarm states will be          |
+| Cleared | maintenance, error | | | specified in ETSI NFV Stage 3. |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| perceivedSeverity: | Severity (Integer) | Severity: | ETSI NFV IFA 005/006 lists example | List of alarm states will be |
+| E.g. Critical, Major, | | low (default), | perceived severity values. | specified in ETSI NFV Stage 3. |
+| Minor, Warning, | | moderate, critical | | |
+| Indeterminate, Cleared | | | | **OPNFV: Severity (Integer)**: |
+| | | | | * update OPNFV Doctor specification |
+| | | | | to *Enum* |
+| | | | | |
+|                        |                        |                     |                                             | **perceivedSeverity=Indeterminate**:  |
+|                        |                        |                     |                                             | * remove value *Indeterminate* in     |
+|                        |                        |                     |                                             |   IFA and map undefined values to     |
+|                        |                        |                     |                                             |   “minor” severity, or                |
+|                        |                        |                     |                                             | * add value *indeterminate* in AODH   |
+| | | | | and make it the default value. |
+| | | | | |
+| | | | | **perceivedSeverity=Cleared**: |
+| | | | | * remove value *Cleared* in IFA as |
+| | | | | the information about a cleared |
+|                        |                        |                     |                                             |   alarm can be derived from           |
+| | | | | the alarm state parameter, or |
+| | | | | * add value *cleared* in AODH and |
+| | | | | set a rule that the severity is |
+| | | | | “cleared” when the state is *ok*. |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| faultType | FaultType | event_type in | Type of the fault, e.g. “CPU failure” of a | OpenStack Alarming (Aodh) can use a |
+| | | reason_data | compute resource, in machine interpretable | fuzzy matching with wildcard string, |
+| | | | format. | "compute.cpu.failure". |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| N/A | N/A | type = "event" | Type of the notification. For fault | \- |
+| | | | notifications the type in AODH is “event”. | |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| probableCause | ProbableCause | \- | Probable cause of the alarm. | May be provided (e.g. in a shimlayer) |
+| | | | | based on Vitrage topology awareness / |
+| | | | | root-cause-analysis. |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| isRootCause | IsRootCause | \- | Boolean indicating whether the fault is the | see above |
+| | | | root cause of other faults. | |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| correlatedAlarmId | CorrelatedFaultId | \- | List of IDs of correlated faults. | see above |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| faultDetails | FaultDetails | \- | Additional details about the fault/alarm. | FaultDetails information element will |
+| | | | | be specified in ETSI NFV Stage 3. |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
+| \- | \- | action, previous | Additional AODH alarm related parameters. | \- |
++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+
Table: Comparison of alarm attributes
@@ -728,6 +762,7 @@ Other areas that need alignment is the so called alarm state in NFV. Here we mus
however consider what can be attributes of the notification vs. what should be a
property of the alarm instance. This will be analyzed later.
+.. _southbound:
Detailed southbound interface specification
-------------------------------------------
diff --git a/docs/requirements/07-annex.rst b/docs/requirements/07-annex.rst
index 8cb19612..2ebba0d8 100644
--- a/docs/requirements/07-annex.rst
+++ b/docs/requirements/07-annex.rst
@@ -1,6 +1,8 @@
.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0
+.. _nfvi_faults:
+
Annex: NFVI Faults
=================================================
diff --git a/docs/requirements/images/figure1.png b/docs/requirements/images/figure1.png
index dacf0dd4..267ddddc 100755..100644
--- a/docs/requirements/images/figure1.png
+++ b/docs/requirements/images/figure1.png
Binary files differ
diff --git a/docs/requirements/images/figure2.png b/docs/requirements/images/figure2.png
index 3c8a2bf1..9a3b166d 100755..100644
--- a/docs/requirements/images/figure2.png
+++ b/docs/requirements/images/figure2.png
Binary files differ