From d0b22e1d856cf8f78e152dfb6c150e001e03dd52 Mon Sep 17 00:00:00 2001
From: Gerald Kunzmann
Date: Tue, 14 Feb 2017 15:38:29 +0000
Subject: Update docs structure according to new guidelines in https://wiki.opnfv.org/display/DOC

Change-Id: I1c8c20cf85aa46269c5bc369f17ab0020862ddc5
Signed-off-by: Gerald Kunzmann
---
 docs/design/index.rst | 27 -
 docs/design/inspector-design-guideline.rst | 46 -
 docs/design/notification-alarm-evaluator.rst | 248 -----
 docs/design/performance-profiler.rst | 118 ---
 docs/design/port-data-plane-status.rst | 180 ----
 ...st-fault-to-update-server-state-immediately.rst | 248 -----
 docs/design/rfe-port-status-update.rst | 32 -
 docs/development/design/index.rst | 27 +
 .../design/inspector-design-guideline.rst | 46 +
 .../design/notification-alarm-evaluator.rst | 248 +++++
 docs/development/design/performance-profiler.rst | 118 +++
 docs/development/design/port-data-plane-status.rst | 180 ++++
 ...st-fault-to-update-server-state-immediately.rst | 248 +++++
 docs/development/design/rfe-port-status-update.rst | 32 +
 docs/development/index.rst | 21 +
 .../development/manuals/get-valid-server-state.rst | 125 +++
 docs/development/manuals/index.rst | 13 +
 docs/development/manuals/mark-host-down_manual.rst | 122 +++
 .../doctor-scenario-in-functest.rst | 126 +++
 .../overview/functest_scenario/images/LICENSE | 14 +
 .../functest_scenario/images/figure-p1.png | Bin 0 -> 60756 bytes
 docs/development/requirements/01-intro.rst | 51 +
 docs/development/requirements/02-use_cases.rst | 195 ++++
 docs/development/requirements/03-architecture.rst | 340 +++++++
 docs/development/requirements/04-gaps.rst | 389 ++++++++
 .../development/requirements/05-implementation.rst | 1050 ++++++++++++++++++++
 docs/development/requirements/06-summary.rst | 24 +
 docs/development/requirements/07-annex.rst | 129 +++
 docs/development/requirements/99-references.rst | 32 +
 docs/development/requirements/glossary.rst | 89 ++
 docs/development/requirements/images/LICENSE | 14 +
 docs/development/requirements/images/figure1.png | Bin 0 -> 79420 bytes
 docs/development/requirements/images/figure10.png | Bin 0 -> 422212 bytes
 docs/development/requirements/images/figure11.png | Bin 0 -> 355225 bytes
 docs/development/requirements/images/figure12.png | Bin 0 -> 2144916 bytes
 docs/development/requirements/images/figure13.png | Bin 0 -> 646427 bytes
 docs/development/requirements/images/figure14.png | Bin 0 -> 578986 bytes
 docs/development/requirements/images/figure2.png | Bin 0 -> 82010 bytes
 docs/development/requirements/images/figure3.png | Bin 0 -> 308234 bytes
 docs/development/requirements/images/figure4.png | Bin 0 -> 186805 bytes
 docs/development/requirements/images/figure5a.png | Bin 0 -> 43787 bytes
 docs/development/requirements/images/figure5b.png | Bin 0 -> 45067 bytes
 docs/development/requirements/images/figure5c.png | Bin 0 -> 44400 bytes
 docs/development/requirements/images/figure6.png | Bin 0 -> 425612 bytes
 docs/development/requirements/images/figure7.png | Bin 0 -> 315053 bytes
 docs/development/requirements/images/figure8.png | Bin 0 -> 464430 bytes
 docs/development/requirements/images/figure9.png | Bin 0 -> 459353 bytes
 docs/development/requirements/index.rst | 62 ++
 docs/index.rst | 24 -
 .../feature.configuration.rst | 104 --
 docs/installationprocedure/index.rst | 12 -
 docs/manuals/get-valid-server-state.rst | 125 ---
 docs/manuals/index.rst | 13 -
 docs/manuals/mark-host-down_manual.rst | 122 ---
 docs/release/configguide/feature.configuration.rst | 104 ++
 docs/release/configguide/index.rst | 12 +
 docs/release/index.rst | 19 +
 docs/release/installation/index.rst | 12 +
 docs/release/installation/releasenotes.rst | 113 +++
 .../release/installation/releasenotes_colorado.rst | 170 ++++
 docs/release/userguide/feature.userguide.rst | 44 +
 docs/release/userguide/index.rst | 12 +
 docs/releasenotes/index.rst | 10 -
 docs/releasenotes/releasenotes.rst | 113 ---
 docs/releasenotes/releasenotes_colorado.rst | 170 ----
 docs/requirements/01-intro.rst | 51 -
 docs/requirements/02-use_cases.rst | 195 ----
 docs/requirements/03-architecture.rst | 340 -------
 docs/requirements/04-gaps.rst | 389 --------
 docs/requirements/05-implementation.rst | 1050 --------------------
 docs/requirements/06-summary.rst | 24 -
 docs/requirements/07-annex.rst | 129 ---
 docs/requirements/99-references.rst | 32 -
 docs/requirements/glossary.rst | 89 --
 docs/requirements/images/LICENSE | 14 -
 docs/requirements/images/figure1.png | Bin 79420 -> 0 bytes
 docs/requirements/images/figure10.png | Bin 422212 -> 0 bytes
 docs/requirements/images/figure11.png | Bin 355225 -> 0 bytes
 docs/requirements/images/figure12.png | Bin 2144916 -> 0 bytes
 docs/requirements/images/figure13.png | Bin 646427 -> 0 bytes
 docs/requirements/images/figure14.png | Bin 578986 -> 0 bytes
 docs/requirements/images/figure2.png | Bin 82010 -> 0 bytes
 docs/requirements/images/figure3.png | Bin 308234 -> 0 bytes
 docs/requirements/images/figure4.png | Bin 186805 -> 0 bytes
 docs/requirements/images/figure5a.png | Bin 43787 -> 0 bytes
 docs/requirements/images/figure5b.png | Bin 45067 -> 0 bytes
 docs/requirements/images/figure5c.png | Bin 44400 -> 0 bytes
 docs/requirements/images/figure6.png | Bin 425612 -> 0 bytes
 docs/requirements/images/figure7.png | Bin 315053 -> 0 bytes
 docs/requirements/images/figure8.png | Bin 464430 -> 0 bytes
 docs/requirements/images/figure9.png | Bin 459353 -> 0 bytes
 docs/requirements/index.rst | 62 --
 .../functest/doctor-scenario-in-functest.rst | 126 ---
 docs/scenarios/functest/images/LICENSE | 14 -
 docs/scenarios/functest/images/figure-p1.png | Bin 60756 -> 0 bytes
 docs/scenarios/index.rst | 12 -
 docs/userguide/feature.userguide.rst | 44 -
 docs/userguide/index.rst | 12 -
 98 files changed, 4181 insertions(+), 4175 deletions(-)
 delete mode 100644 docs/design/index.rst
 delete mode 100644 docs/design/inspector-design-guideline.rst
 delete mode 100644 docs/design/notification-alarm-evaluator.rst
 delete mode 100644 docs/design/performance-profiler.rst
 delete mode 100644 docs/design/port-data-plane-status.rst
 delete mode 100644 docs/design/report-host-fault-to-update-server-state-immediately.rst
 delete mode 100644 docs/design/rfe-port-status-update.rst
 create mode 100644 docs/development/design/index.rst
 create mode 100644 docs/development/design/inspector-design-guideline.rst
 create mode 100644 docs/development/design/notification-alarm-evaluator.rst
 create mode 100644 docs/development/design/performance-profiler.rst
 create mode 100644 docs/development/design/port-data-plane-status.rst
 create mode 100644 docs/development/design/report-host-fault-to-update-server-state-immediately.rst
 create mode 100644 docs/development/design/rfe-port-status-update.rst
 create mode 100644 docs/development/index.rst
 create mode 100644 docs/development/manuals/get-valid-server-state.rst
 create mode 100644 docs/development/manuals/index.rst
 create mode 100644 docs/development/manuals/mark-host-down_manual.rst
 create mode 100644 docs/development/overview/functest_scenario/doctor-scenario-in-functest.rst
 create mode 100644 docs/development/overview/functest_scenario/images/LICENSE
 create mode 100755 docs/development/overview/functest_scenario/images/figure-p1.png
 create mode 100644 docs/development/requirements/01-intro.rst
 create mode 100644 docs/development/requirements/02-use_cases.rst
 create mode 100644 docs/development/requirements/03-architecture.rst
 create mode 100644 docs/development/requirements/04-gaps.rst
 create mode 100644 docs/development/requirements/05-implementation.rst
 create mode 100644 docs/development/requirements/06-summary.rst
 create mode 100644 docs/development/requirements/07-annex.rst
 create mode 100644 docs/development/requirements/99-references.rst
 create mode 100644 docs/development/requirements/glossary.rst
 create mode 100644 docs/development/requirements/images/LICENSE
 create mode 100644 docs/development/requirements/images/figure1.png
 create mode 100755 docs/development/requirements/images/figure10.png
 create mode 100755 docs/development/requirements/images/figure11.png
 create mode 100755 docs/development/requirements/images/figure12.png
 create mode 100755 docs/development/requirements/images/figure13.png
 create mode 100755 docs/development/requirements/images/figure14.png
 create mode 100644 docs/development/requirements/images/figure2.png
 create mode 100755 docs/development/requirements/images/figure3.png
 create mode 100755 docs/development/requirements/images/figure4.png
 create mode 100755 docs/development/requirements/images/figure5a.png
 create mode 100755 docs/development/requirements/images/figure5b.png
 create mode 100755 docs/development/requirements/images/figure5c.png
 create mode 100755 docs/development/requirements/images/figure6.png
 create mode 100755 docs/development/requirements/images/figure7.png
 create mode 100755 docs/development/requirements/images/figure8.png
 create mode 100755 docs/development/requirements/images/figure9.png
 create mode 100644 docs/development/requirements/index.rst
 delete mode 100755 docs/index.rst
 delete mode 100644 docs/installationprocedure/feature.configuration.rst
 delete mode 100644 docs/installationprocedure/index.rst
 delete mode 100644 docs/manuals/get-valid-server-state.rst
 delete mode 100644 docs/manuals/index.rst
 delete mode 100644 docs/manuals/mark-host-down_manual.rst
 create mode 100644 docs/release/configguide/feature.configuration.rst
 create mode 100644 docs/release/configguide/index.rst
 create mode 100644 docs/release/index.rst
 create mode 100644 docs/release/installation/index.rst
 create mode 100644 docs/release/installation/releasenotes.rst
 create mode 100644 docs/release/installation/releasenotes_colorado.rst
 create mode 100644 docs/release/userguide/feature.userguide.rst
 create mode 100644 docs/release/userguide/index.rst
 delete mode 100644 docs/releasenotes/index.rst
 delete mode 100644 docs/releasenotes/releasenotes.rst
 delete mode 100644 docs/releasenotes/releasenotes_colorado.rst
 delete mode 100644 docs/requirements/01-intro.rst
 delete mode 100644 docs/requirements/02-use_cases.rst
 delete mode 100644 docs/requirements/03-architecture.rst
 delete mode 100644 docs/requirements/04-gaps.rst
 delete mode 100644 docs/requirements/05-implementation.rst
 delete mode 100644 docs/requirements/06-summary.rst
 delete mode 100644 docs/requirements/07-annex.rst
 delete mode 100644 docs/requirements/99-references.rst
 delete mode 100644 docs/requirements/glossary.rst
 delete mode 100644 docs/requirements/images/LICENSE
 delete mode 100644 docs/requirements/images/figure1.png
 delete mode 100755 docs/requirements/images/figure10.png
 delete mode 100755 docs/requirements/images/figure11.png
 delete mode 100755 docs/requirements/images/figure12.png
 delete mode 100755 docs/requirements/images/figure13.png
 delete mode 100755 docs/requirements/images/figure14.png
 delete mode 100644 docs/requirements/images/figure2.png
 delete mode 100755 docs/requirements/images/figure3.png
 delete mode 100755 docs/requirements/images/figure4.png
 delete mode 100755 docs/requirements/images/figure5a.png
 delete mode 100755 docs/requirements/images/figure5b.png
 delete mode 100755 docs/requirements/images/figure5c.png
 delete mode 100755 docs/requirements/images/figure6.png
 delete mode 100755 docs/requirements/images/figure7.png
 delete mode 100755 docs/requirements/images/figure8.png
 delete mode 100755 docs/requirements/images/figure9.png
 delete mode 100644 docs/requirements/index.rst
 delete mode 100644 docs/scenarios/functest/doctor-scenario-in-functest.rst
 delete mode 100644 docs/scenarios/functest/images/LICENSE
 delete mode 100755 docs/scenarios/functest/images/figure-p1.png
 delete mode 100644 docs/scenarios/index.rst
 delete mode 100644 docs/userguide/feature.userguide.rst
 delete mode 100644 docs/userguide/index.rst

(limited to 'docs')

diff --git a/docs/design/index.rst b/docs/design/index.rst
deleted file mode 100644
index 963002a0..00000000
--- a/docs/design/index.rst
+++ /dev/null
@@ -1,27 +0,0 @@
-.. This work is licensed under a Creative Commons Attribution 4.0 International License.
-.. http://creativecommons.org/licenses/by/4.0
-
-****************
-Design Documents
-****************
-
-This is the directory to store design documents which may include draft
-versions of blueprints written before proposing to upstream OSS communities
-such as OpenStack, in order to keep the original blueprint as reviewed in
-OPNFV. That means there could be out-dated blueprints as result of further
-refinements in the upstream OSS community. Please refer to the link in each
-document to find the latest version of the blueprint and status of development
-in the relevant OSS community.
-
-See also https://wiki.opnfv.org/requirements_projects .
-
-.. toctree::
-   :numbered:
-   :maxdepth: 4
-
-   report-host-fault-to-update-server-state-immediately.rst
-   notification-alarm-evaluator.rst
-   rfe-port-status-update.rst
-   port-data-plane-status.rst
-   inspector-design-guideline.rst
-   performance-profiler.rst
diff --git a/docs/design/inspector-design-guideline.rst b/docs/design/inspector-design-guideline.rst
deleted file mode 100644
index 4add8c0f..00000000
--- a/docs/design/inspector-design-guideline.rst
+++ /dev/null
@@ -1,46 +0,0 @@
-.. This work is licensed under a Creative Commons Attribution 4.0 International License.
-.. http://creativecommons.org/licenses/by/4.0
-
-==========================
-Inspector Design Guideline
-==========================
-
-.. NOTE::
-   This is spec draft of design guideline for inspector component.
-   JIRA ticket to track the update and collect comments: `DOCTOR-73`_.
-
-This document summarize the best practise in designing a high performance
-inspector to meet the requirements in `OPNFV Doctor project`_.
-
-Problem Description
-===================
-
-Some pitfalls has be detected during the development of sample inspector, e.g.
-we suffered a significant `performance degrading in listing VMs in a host`_.
-
-A `patch set for caching the list`_ has been committed to solve issue. When a
-new inspector is integrated, it would be nice to have an evaluation of existing
-design and give recommendations for improvements.
-
-This document can be treated as a source of related blueprints in inspector
-projects.
-
-Guidelines
-==========
-
-Host specific VMs list
-----------------------
-
-TBD, see `DOCTOR-76`_.
-
-Parallel execution
-------------------
-
-TBD, see `discussion in mailing list`_.
-
-.. _DOCTOR-73: https://jira.opnfv.org/browse/DOCTOR-73
-.. _OPNFV Doctor project: https://wiki.opnfv.org/doctor
-.. _performance degrading in listing VMs in a host: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-September/012591.html
-.. _patch set for caching the list: https://gerrit.opnfv.org/gerrit/#/c/20877/
-.. _DOCTOR-76: https://jira.opnfv.org/browse/DOCTOR-76
-.. _discussion in mailing list: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-October/013036.html
diff --git a/docs/design/notification-alarm-evaluator.rst b/docs/design/notification-alarm-evaluator.rst
deleted file mode 100644
index d1bf787a..00000000
--- a/docs/design/notification-alarm-evaluator.rst
+++ /dev/null
@@ -1,248 +0,0 @@
-.. This work is licensed under a Creative Commons Attribution 4.0 International License.
-.. http://creativecommons.org/licenses/by/4.0
-
-============================
-Notification Alarm Evaluator
-============================
-
-.. NOTE::
-   This is spec draft of blueprint for OpenStack Ceilomter Liberty.
-   To see current version: https://review.openstack.org/172893
-   To track development activity:
-   https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator
-
-https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator
-
-This blueprint proposes to add a new alarm evaluator for handling alarms on
-events passed from other OpenStack services, that provides event-driven alarm
-evaluation which makes new sequence in Ceilometer instead of the polling-based
-approach of the existing Alarm Evaluator, and realizes immediate alarm
-notification to end users.
-
-Problem description
-===================
-
-As an end user, I need to receive alarm notification immediately once
-Ceilometer captured an event which would make alarm fired, so that I can
-perform recovery actions promptly to shorten downtime of my service.
-The typical use case is that an end user set alarm on "compute.instance.update"
-in order to trigger recovery actions once the instance status has changed to
-'shutdown' or 'error'. It should be nice that an end user can receive
-notification within 1 second after fault observed as the same as other helth-
-check mechanisms can do in some cases.
-
-The existing Alarm Evaluator is periodically querying/polling the databases
-in order to check all alarms independently from other processes. This is good
-approach for evaluating an alarm on samples stored in a certain period.
-However, this is not efficient to evaluate an alarm on events which are emitted
-by other OpenStack servers once in a while.
-
-The periodical evaluation leads delay on sending alarm notification to users.
-The default period of evaluation cycle is 60 seconds. It is recommended that
-an operator set longer interval than configured pipeline interval for
-underlying metrics, and also longer enough to evaluate all defined alarms
-in certain period while taking into account the number of resources, users and
-alarms.
-
-Proposed change
-===============
-
-The proposal is to add a new event-driven alarm evaluator which receives
-messages from Notification Agent and finds related Alarms, then evaluates each
-alarms;
-
-* New alarm evaluator could receive event notification from Notification Agent
-  by which adding a dedicated notifier as a publisher in pipeline.yaml
-  (e.g. notifier://?topic=event_eval).
-
-* When new alarm evaluator received event notification, it queries alarm
-  database by Project ID and Resource ID written in the event notification.
-
-* Found alarms are evaluated by referring event notification.
-
-* Depending on the result of evaluation, those alarms would be fired through
-  Alarm Notifier as the same as existing Alarm Evaluator does.
-
-This proposal also adds new alarm type "notification" and "notification_rule".
-This enables users to create alarms on events. The separation from other alarm
-types (such as "threshold" type) is intended to show different timing of
-evaluation and different format of condition, since the new evaluator will
-check each event notification once it received whereas "threshold" alarm can
-evaluate average of values in certain period calculated from multiple samples.
-
-The new alarm evaluator handles Notification type alarms, so we have to change
-existing alarm evaluator to exclude "notification" type alarms from evaluation
-targets.
-
-Alternatives
-------------
-
-There was similar blueprint proposal "Alarm type based on notification", but
-the approach is different. The old proposal was to adding new step (alarm
-evaluations) in Notification Agent every time it received event from other
-OpenStack services, whereas this proposal intends to execute alarm evaluation
-in another component which can minimize impact to existing pipeline processing.
-
-Another approach is enhancement of existing alarm evaluator by adding
-notification listener. However, there are two issues; 1) this approach could
-cause stall of periodical evaluations when it receives bulk of notifications,
-and 2) this could break the alarm portioning i.e. when alarm evaluator received
-notification, it might have to evaluate some alarms which are not assign to it.
-
-Data model impact
------------------
-
-Resource ID will be added to Alarm model as an optional attribute.
-This would help the new alarm evaluator to filter out non-related alarms
-while querying alarms, otherwise it have to evaluate all alarms in the project.
-
-REST API impact
----------------
-
-Alarm API will be extended as follows;
-
-* Add "notification" type into alarm type list
-* Add "resource_id" to "alarm"
-* Add "notification_rule" to "alarm"
-
-Sample data of Notification-type alarm::
-
-    {
-        "alarm_actions": [
-            "http://site:8000/alarm"
-        ],
-        "alarm_id": null,
-        "description": "An alarm",
-        "enabled": true,
-        "insufficient_data_actions": [
-            "http://site:8000/nodata"
-        ],
-        "name": "InstanceStatusAlarm",
-        "notification_rule": {
-            "event_type": "compute.instance.update",
-            "query" : [
-                {
-                    "field" : "traits.state",
-                    "type" : "string",
-                    "value" : "error",
-                    "op" : "eq",
-                },
-            ]
-        },
-        "ok_actions": [],
-        "project_id": "c96c887c216949acbdfbd8b494863567",
-        "repeat_actions": false,
-        "resource_id": "153462d0-a9b8-4b5b-8175-9e4b05e9b856",
-        "severity": "moderate",
-        "state": "ok",
-        "state_timestamp": "2015-04-03T17:49:38.406845",
-        "timestamp": "2015-04-03T17:49:38.406839",
-        "type": "notification",
-        "user_id": "c96c887c216949acbdfbd8b494863567"
-    }
-
-"resource_id" will be refered to query alarm and will not be check permission
-and belonging of project.
-
-Security impact
----------------
-
-None
-
-Pipeline impact
----------------
-
-None
-
-Other end user impact
----------------------
-
-None
-
-Performance/Scalability Impacts
--------------------------------
-
-When Ceilomter received a number of events from other OpenStack services in
-short period, this alarm evaluator can keep working since events are queued in
-a messaging queue system, but it can cause delay of alarm notification to users
-and increase the number of read and write access to alarm database.
-
-"resource_id" can be optional, but restricting it to mandatory could be reduce
-performance impact. If user create "notification" alarm without "resource_id",
-those alarms will be evaluated every time event occurred in the project.
-That may lead new evaluator heavy.
-
-Other deployer impact
----------------------
-
-New service process have to be run.
-
-Developer impact
-----------------
-
-Developers should be aware that events could be notified to end users and avoid
-passing raw infra information to end users, while defining events and traits.
-
-Implementation
-==============
-
-Assignee(s)
------------
-
-Primary assignee:
-  r-mibu
-
-Other contributors:
-  None
-
-Ongoing maintainer:
-  None
-
-Work Items
-----------
-
-* New event-driven alarm evaluator
-
-* Add new alarm type "notification" as well as AlarmNotificationRule
-
-* Add "resource_id" to Alarm model
-
-* Modify existing alarm evaluator to filter out "notification" alarms
-
-* Add new config parameter for alarm request check whether accepting alarms
-  without specifying "resource_id" or not
-
-Future lifecycle
-================
-
-This proposal is key feature to provide information of cloud resources to end
-users in real-time that enables efficient integration with user-side manager
-or Orchestrator, whereas currently those information are considered to be
-consumed by admin side tool or service.
-Based on this change, we will seek orchestrating scenarios including fault
-recovery and add useful event definition as well as additional traits.
-
-Dependencies
-============
-
-None
-
-Testing
-=======
-
-New unit/scenario tests are required for this change.
-
-Documentation Impact
-====================
-
-* Proposed evaluator will be described in the developer document.
-
-* New alarm type and how to use will be explained in user guide.
-
-References
-==========
-
-* OPNFV Doctor project: https://wiki.opnfv.org/doctor
-
-* Blueprint "Alarm type based on notification":
-  https://blueprints.launchpad.net/ceilometer/+spec/alarm-on-notification
diff --git a/docs/design/performance-profiler.rst b/docs/design/performance-profiler.rst
deleted file mode 100644
index f834a915..00000000
--- a/docs/design/performance-profiler.rst
+++ /dev/null
@@ -1,118 +0,0 @@
-.. This work is licensed under a Creative Commons Attribution 4.0 International License.
-.. http://creativecommons.org/licenses/by/4.0
-
-
-====================
-Performance Profiler
-====================
-
-https://goo.gl/98Osig
-
-This blueprint proposes to create a performance profiler for doctor scenarios.
-
-Problem Description
-===================
-
-In the verification job for notification time, we have encountered some
-performance issues, such as
-
-1. In environment deployed by APEX, it meets the criteria while in the one by
-Fuel, the performance is much more poor.
-2. Signification performance degradation was spotted when we increase the total
-number of VMs
-
-It takes time to dig the log and analyse the reason. People have to collect
-timestamp at each checkpoints manually to find out the bottleneck. A performance
-profiler will make this process automatic.
-
-Proposed Change
-===============
-
-Current Doctor scenario covers the inspector and notifier in the whole fault
-management cycle::
-
-  start                                          end
-    +       +         +        +       +          +
-    |       |         |        |       |          |
-    |monitor|inspector|notifier|manager|controller|
-    +------>+         |        |       |          |
-  occurred  +-------->+        |       |          |
-    |    detected     +------->+       |          |
-    |       |    identified    +-------+          |
-    |       |              notified    +--------->+
-    |       |                  |   processed  resolved
-    |       |                  |                  |
-    |       +<-----doctor----->+                  |
-    |                                             |
-    |                                             |
-    +<---------------fault management------------>+
-
-The notification time can be split into several parts and visualized as a
-timeline::
-
-    start                                          end
-    0----5---10---15---20---25---30---35---40---45--> (x 10ms)
-    +    +   +   +   +    +      +   +   +   +   +
-  0-hostdown |   |   |    |      |   |   |   |   |
-    +--->+   |   |   |    |      |   |   |   |   |
-    |  1-raw failure |    |      |   |   |   |   |
-    |    +-->+   |   |    |      |   |   |   |   |
-    |    | 2-found affected      |   |   |   |   |
-    |    |   +-->+   |    |      |   |   |   |   |
-    |    |     3-marked host down|   |   |   |   |
-    |    |       +-->+    |      |   |   |   |   |
-    |    |         4-set VM error|   |   |   |   |
-    |    |           +--->+      |   |   |   |   |
-    |    |           | 5-notified VM error   |   |
-    |    |           |    +----->|   |   |   |   |
-    |    |           |    |    6-transformed event
-    |    |           |    |      +-->+   |   |   |
-    |    |           |    |      | 7-evaluated event
-    |    |           |    |      |   +-->+   |   |
-    |    |           |    |      |     8-fired alarm
-    |    |           |    |      |       +-->+   |
-    |    |           |    |      |         9-received alarm
-    |    |           |    |      |           +-->+
-  sample |  sample   |    |      |       |10-handled alarm
-  monitor| inspector |nova| c/m  | aodh  |
-    |                                        |
-    +<-----------------doctor--------------->+
-
-Note: c/m = ceilometer
-
-And a table of components sorted by time cost from most to least
-
-+----------+---------+----------+
-|Component |Time Cost|Percentage|
-+==========+=========+==========+
-|inspector |160ms    | 40%      |
-+----------+---------+----------+
-|aodh      |110ms    | 30%      |
-+----------+---------+----------+
-|monitor   |50ms     | 14%      |
-+----------+---------+----------+
-|...       |         |          |
-+----------+---------+----------+
-|...       |         |          |
-+----------+---------+----------+
-
-Note: data in the table is for demonstration only, not actual measurement
-
-Timestamps can be collected from various sources
-
-1. log files
-2. trace point in code
-
-The performance profiler will be integrated into the verification job to provide
-detail result of the test. It can also be deployed independently to diagnose
-performance issue in specified environment.
-
-Working Items
-=============
-
-1. PoC with limited checkpoints
-2. Integration with verification job
-3. Collect timestamp at all checkpoints
-4. Display the profiling result in console
-5. Report the profiling result to test database
-6. Independent package which can be installed to specified environment
diff --git a/docs/design/port-data-plane-status.rst b/docs/design/port-data-plane-status.rst
deleted file mode 100644
index 06cfc3c6..00000000
--- a/docs/design/port-data-plane-status.rst
+++ /dev/null
@@ -1,180 +0,0 @@
-.. This work is licensed under a Creative Commons Attribution 4.0 International License.
-.. http://creativecommons.org/licenses/by/4.0
-
-====================================
-Port data plane status
-====================================
-
-https://bugs.launchpad.net/neutron/+bug/1598081
-
-Neutron does not detect data plane failures affecting its logical resources.
-This spec addresses that issue by means of allowing external tools to report to
-Neutron about faults in the data plane that are affecting the ports. A new REST
-API field is proposed to that end.
-
-
-Problem Description
-===================
-
-An initial description of the problem was introduced in bug #159801 [1_]. This
-spec focuses on capturing one (main) part of the problem there described, i.e.
-extending Neutron's REST API to cover the scenario of allowing external tools
-to report network failures to Neutron. Out of scope of this spec are works to
-enable port status changes to be received and managed by mechanism drivers.
-
-This spec also tries to address bug #1575146 [2_]. Specifically, and argued by
-the Neutron driver team in [3_]:
-
- * Neutron should not shut down the port completly upon detection of physnet
-   failure; connectivity between instances on the same node may still be
-   reachable. Externals tools may or may not want to trigger a status change on
-   the port based on their own logic and orchestration.
-
- * Port down is not detected when an uplink of a switch is down;
-
- * The physnet bridge may have multiple physical interfaces plugged; shutting
-   down the logical port may not be needed in case network redundancy is in
-   place.
-
-
-Proposed Change
-===============
-
-A couple of possible approaches were proposed in [1_] (comment #3). This spec
-proposes tackling the problema via a new extension API to the port resource.
-The extension adds a new attribute 'dp-down' (data plane down) to represent the
-status of the data plane. The field should be read-only by tenants and
-read-write by admins.
-
-Neutron should send out an event to the message bus upon toggling the data
-plane status value. The event is relevant for e.g. auditing.
-
-
-Data Model Impact
------------------
-
-A new attribute as extension will be added to the 'ports' table.
-
-+------------+-------+----------+---------+--------------------+--------------+
-|Attribute   |Type   |Access    |Default  |Validation/         |Description   |
-|Name        |       |          |Value    |Conversion          |              |
-+============+=======+==========+=========+====================+==============+
-|dp_down     |boolean|RO, tenant|False    |True/False          |              |
-|            |       |RW, admin |         |                    |              |
-+------------+-------+----------+---------+--------------------+--------------+
-
-
-REST API Impact
----------------
-
-A new API extension to the ports resource is going to be introduced.
-
-.. code-block:: python
-
-    EXTENDED_ATTRIBUTES_2_0 = {
-        'ports': {
-            'dp_down': {'allow_post': False, 'allow_put': True,
-                        'default': False, 'convert_to': convert_to_boolean,
-                        'is_visible': True},
-        },
-    }
-
-
-Examples
-~~~~~~~~
-
-Updating port data plane status to down:
-
-.. code-block:: json
-
-    PUT /v2.0/ports/
-    Accept: application/json
-    {
-        "port": {
-            "dp_down": true
-        }
-    }
-
-
-
-Command Line Client Impact
---------------------------
-
-::
-
-    neutron port-update [--dp-down ]
-    openstack port set [--dp-down ]
-
-Argument --dp-down is optional. Defaults to False.
-
-
-Security Impact
----------------
-
-None
-
-Notifications Impact
---------------------
-
-A notification (event) upon toggling the data plane status (i.e. 'dp-down'
-attribute) value should be sent to the message bus. Such events do not happen
-with high frequency and thus no negative impact on the notification bus is
-expected.
-
-Performance Impact
-------------------
-
-None
-
-IPv6 Impact
------------
-
-None
-
-Other Deployer Impact
----------------------
-
-None
-
-Developer Impact
-----------------
-
-None
-
-Implementation
-==============
-
-Assignee(s)
------------
-
- * cgoncalves
-
-Work Items
-----------
-
- * New 'dp-down' attribute in 'ports' database table
- * API extension to introduce new field to port
- * Client changes to allow for data plane status (i.e. 'dp-down' attribute')
-   being set
- * Policy (tenants read-only; admins read-write)
-
-
-Documentation Impact
-====================
-
-Documentation for both administrators and end users will have to be
-contemplated. Administrators will need to know how to set/unset the data plane
-status field.
-
-
-References
-==========
-
-.. [1] RFE: Port status update,
-   https://bugs.launchpad.net/neutron/+bug/1598081
-
-.. [2] RFE: ovs port status should the same as physnet
-   https://bugs.launchpad.net/neutron/+bug/1575146
-
-.. [3] Neutron Drivers meeting, July 21, 2016
-   http://eavesdrop.openstack.org/meetings/neutron_drivers/2016/neutron_drivers.2016-07-21-22.00.html
diff --git a/docs/design/report-host-fault-to-update-server-state-immediately.rst b/docs/design/report-host-fault-to-update-server-state-immediately.rst
deleted file mode 100644
index 2f6ce145..00000000
--- a/docs/design/report-host-fault-to-update-server-state-immediately.rst
+++ /dev/null
@@ -1,248 +0,0 @@
-.. This work is licensed under a Creative Commons Attribution 4.0 International License.
-.. http://creativecommons.org/licenses/by/4.0
-
-.. NOTE::
-   This is a specification draft of a blueprint proposed for OpenStack Nova
-   Liberty. It was written by project member(s) and agreed within the project
-   before submitting it upstream. No further changes to its content will be
-   made here anymore; please follow it upstream:
-
-   * Current version upstream: https://review.openstack.org/#/c/169836/
-   * Development activity:
-     https://blueprints.launchpad.net/nova/+spec/mark-host-down
-
-   **Original draft is as follow:**
-
-====================================================
-Report host fault to update server state immediately
-====================================================
-
-https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately
-
-A new API is needed to report a host fault to change the state of the
-instances and compute node immediately. This allows usage of evacuate API
-without a delay. The new API provides the possibility for external monitoring
-system to detect any kind of host failure fast and reliably and inform
-OpenStack about it. Nova updates the compute node state and states of the
-instances. This way the states in the Nova DB will be in sync with the
-real state of the system.
-
-Problem description
-===================
-* Nova state change for failed or unreachable host is slow and does not
-  reliably state compute node is down or not. This might cause same instance
-  to run twice if action taken to evacuate instance to another host.
-* Nova state for instances on failed compute node will not change,
-  but remains active and running. This gives user a false information about
-  instance state. Currently one would need to call "nova reset-state" for each
-  instance to have them in error state.
-* OpenStack user cannot make HA actions fast and reliably by trusting instance
-  state and compute node state.
-* As compute node state changes slowly one cannot evacuate instances.
-
-Use Cases
----------
-Use case in general is that in case there is a host fault one should change
-compute node state fast and reliably when using DB servicegroup backend.
-On top of this here is the use cases that are not covered currently to have
-instance states changed correctly:
-* Management network connectivity lost between controller and compute node.
-* Host HW failed.
-
-Generic use case flow:
-
-* The external monitoring system detects a host fault.
-* The external monitoring system fences the host if not down already.
-* The external system calls the new Nova API to force the failed compute node
-  into down state as well as instances running on it.
-* Nova updates the compute node state and state of the effected instances to
-  Nova DB.
-
-Currently nova-compute state will be changing "down", but it takes a long
-time. Server state keeps as "vm_state: active" and "power_state:
-running", which is not correct. By having external tool to detect host faults
-fast, fence host by powering down and then report host down to OpenStack, all
-these states would reflect to actual situation. Also if OpenStack will not
-implement automatic actions for fault correlation, external tool can do that.
-This could be configured for example in server instance METADATA easily and be
-read by external tool.
-
-Project Priority
------------------
-Liberty priorities have not yet been defined.
- -Proposed change -=============== -There needs to be a new API for Admin to state host is down. This API is used -to mark compute node and instances running on it down to reflect the real -situation. - -Example on compute node is: - -* When compute node is up and running: - vm_state: active and power_state: running - nova-compute state: up status: enabled -* When compute node goes down and new API is called to state host is down: - vm_state: stopped power_state: shutdown - nova-compute state: down status: enabled - -vm_state values: soft-delete, deleted, resized and error -should not be touched. -task_state effect needs to be worked out if needs to be touched. - -Alternatives ------------- -There is no attractive alternatives to detect all different host faults than -to have a external tool to detect different host faults. For this kind of tool -to exist there needs to be new API in Nova to report fault. Currently there -must have been some kind of workarounds implemented as cannot trust or get the -states from OpenStack fast enough. - -Data model impact ------------------ -None - -REST API impact ---------------- -* Update CLI to report host is down - - nova host-update command - - usage: nova host-update [--status ] - [--maintenance ] - [--report-host-down] - - - Update host settings. - - Positional arguments - - - Name of host. - - Optional arguments - - --status - Either enable or disable a host. - - --maintenance - Either put or resume host to/from maintenance. - - --down - Report host down to update instance and compute node state in db. - -* Update Compute API to report host is down: - - /v2.1/{tenant_id}/os-hosts/{host_name} - - Normal response codes: 200 - Request parameters - - Parameter Style Type Description - host_name URI xsd:string The name of the host of interest to you. 
- - { - "host": { - "status": "enable", - "maintenance_mode": "enable" - "host_down_reported": "true" - - } - - } - - { - "host": { - "host": "65c5d5b7e3bd44308e67fc50f362aee6", - "maintenance_mode": "enabled", - "status": "enabled" - "host_down_reported": "true" - - } - - } - -* New method to nova.compute.api module HostAPI class to have a - to mark host related instances and compute node down: - set_host_down(context, host_name) - -* class novaclient.v2.hosts.HostManager(api) method update(host, values) - Needs to handle reporting host down. - -* Schema does not need changes as in db only service and server states are to - be changed. - -Security impact ---------------- -API call needs admin privileges (in the default policy configuration). - -Notifications impact --------------------- -None - -Other end user impact ---------------------- -None - -Performance Impact ------------------- -Only impact is that user can get information faster about instance and -compute node state. This also gives possibility to evacuate faster. -No impact that would slow down. Host down should be rare occurrence. - -Other deployer impact ---------------------- -Developer can make use of any external tool to detect host fault and report it -to OpenStack. - -Developer impact ----------------- -None - -Implementation -============== -Assignee(s) ------------ -Primary assignee: Tomi Juvonen -Other contributors: Ryota Mibu - -Work Items ----------- -* Test cases. -* API changes. -* Documentation. - -Dependencies -============ -None - -Testing -======= -Test cases that exists for enabling or putting host to maintenance should be -altered or similar new cases made test new functionality. - -Documentation Impact -==================== - -New API needs to be documented: - -* Compute API extensions documentation. - http://developer.openstack.org/api-ref-compute-v2.1.html -* Nova commands documentation. 
- http://docs.openstack.org/user-guide-admin/content/novaclient_commands.html -* Compute command-line client documentation. - http://docs.openstack.org/cli-reference/content/novaclient_commands.html -* nova.compute.api documentation. - http://docs.openstack.org/developer/nova/api/nova.compute.api.html -* High Availability guide might have page to tell external tool could provide - ability to provide faster HA as able to update states by new API. - http://docs.openstack.org/high-availability-guide/content/index.html - -References -========== -* OPNFV Doctor project: https://wiki.opnfv.org/doctor -* OpenStack Instance HA Proposal: - http://blog.russellbryant.net/2014/10/15/openstack-instance-ha-proposal/ -* The Different Facets of OpenStack HA: - http://blog.russellbryant.net/2015/03/10/ - the-different-facets-of-openstack-ha/ diff --git a/docs/design/rfe-port-status-update.rst b/docs/design/rfe-port-status-update.rst deleted file mode 100644 index d87d7d7b..00000000 --- a/docs/design/rfe-port-status-update.rst +++ /dev/null @@ -1,32 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -========================== -Neutron Port Status Update -========================== - -.. NOTE:: - This document represents a Neutron RFE reviewed in the Doctor project before submitting upstream to Launchpad Neutron - space. The document is not intended to follow a blueprint format or to be an extensive document. - For more information, please visit http://docs.openstack.org/developer/neutron/policies/blueprints.html - - The RFE was submitted to Neutron. You can follow the discussions in https://bugs.launchpad.net/neutron/+bug/1598081 - -Neutron port status field represents the current status of a port in the cloud infrastructure. The field can take one of -the following values: 'ACTIVE', 'DOWN', 'BUILD' and 'ERROR'. - -At present, if a network event occurs in the data-plane (e.g. 
virtual or physical switch fails or one of its ports, -cable gets pulled unintentionally, infrastructure topology changes, etc.), connectivity to logical ports may be affected -and tenants' services interrupted. When tenants/cloud administrators are looking up their resources' status (e.g. Nova -instances and services running in them, network ports, etc.), they will wrongly see everything looks fine. The problem -is that Neutron will continue reporting port 'status' as 'ACTIVE'. - -Many SDN Controllers managing network elements have the ability to detect and report network events to upper layers. -This allows SDN Controllers' users to be notified of changes and react accordingly. Such information could be consumed -by Neutron so that Neutron could update the 'status' field of those logical ports, and additionally generate a -notification message to the message bus. - -However, Neutron misses a way to be able to receive such information through e.g. ML2 driver or the REST API ('status' -field is read-only). There are pros and cons on both of these approaches as well as other possible approaches. This RFE -intends to trigger a discussion on how Neutron could be improved to receive fault/change events from SDN Controllers or -even also from 3rd parties not in charge of controlling the network (e.g. monitoring systems, human admins). diff --git a/docs/development/design/index.rst b/docs/development/design/index.rst new file mode 100644 index 00000000..963002a0 --- /dev/null +++ b/docs/development/design/index.rst @@ -0,0 +1,27 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +**************** +Design Documents +**************** + +This is the directory to store design documents which may include draft +versions of blueprints written before proposing to upstream OSS communities +such as OpenStack, in order to keep the original blueprint as reviewed in +OPNFV. 
That means there could be out-dated blueprints as result of further +refinements in the upstream OSS community. Please refer to the link in each +document to find the latest version of the blueprint and status of development +in the relevant OSS community. + +See also https://wiki.opnfv.org/requirements_projects . + +.. toctree:: + :numbered: + :maxdepth: 4 + + report-host-fault-to-update-server-state-immediately.rst + notification-alarm-evaluator.rst + rfe-port-status-update.rst + port-data-plane-status.rst + inspector-design-guideline.rst + performance-profiler.rst diff --git a/docs/development/design/inspector-design-guideline.rst b/docs/development/design/inspector-design-guideline.rst new file mode 100644 index 00000000..4add8c0f --- /dev/null +++ b/docs/development/design/inspector-design-guideline.rst @@ -0,0 +1,46 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +========================== +Inspector Design Guideline +========================== + +.. NOTE:: + This is spec draft of design guideline for inspector component. + JIRA ticket to track the update and collect comments: `DOCTOR-73`_. + +This document summarize the best practise in designing a high performance +inspector to meet the requirements in `OPNFV Doctor project`_. + +Problem Description +=================== + +Some pitfalls has be detected during the development of sample inspector, e.g. +we suffered a significant `performance degrading in listing VMs in a host`_. + +A `patch set for caching the list`_ has been committed to solve issue. When a +new inspector is integrated, it would be nice to have an evaluation of existing +design and give recommendations for improvements. + +This document can be treated as a source of related blueprints in inspector +projects. + +Guidelines +========== + +Host specific VMs list +---------------------- + +TBD, see `DOCTOR-76`_. 
+ +Parallel execution +------------------ + +TBD, see `discussion in mailing list`_. + +.. _DOCTOR-73: https://jira.opnfv.org/browse/DOCTOR-73 +.. _OPNFV Doctor project: https://wiki.opnfv.org/doctor +.. _performance degrading in listing VMs in a host: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-September/012591.html +.. _patch set for caching the list: https://gerrit.opnfv.org/gerrit/#/c/20877/ +.. _DOCTOR-76: https://jira.opnfv.org/browse/DOCTOR-76 +.. _discussion in mailing list: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-October/013036.html diff --git a/docs/development/design/notification-alarm-evaluator.rst b/docs/development/design/notification-alarm-evaluator.rst new file mode 100644 index 00000000..d1bf787a --- /dev/null +++ b/docs/development/design/notification-alarm-evaluator.rst @@ -0,0 +1,248 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +============================ +Notification Alarm Evaluator +============================ + +.. NOTE:: + This is spec draft of blueprint for OpenStack Ceilomter Liberty. + To see current version: https://review.openstack.org/172893 + To track development activity: + https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator + +https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator + +This blueprint proposes to add a new alarm evaluator for handling alarms on +events passed from other OpenStack services, that provides event-driven alarm +evaluation which makes new sequence in Ceilometer instead of the polling-based +approach of the existing Alarm Evaluator, and realizes immediate alarm +notification to end users. 
+ +Problem description +=================== + +As an end user, I need to receive alarm notification immediately once +Ceilometer captured an event which would make alarm fired, so that I can +perform recovery actions promptly to shorten downtime of my service. +The typical use case is that an end user set alarm on "compute.instance.update" +in order to trigger recovery actions once the instance status has changed to +'shutdown' or 'error'. It should be nice that an end user can receive +notification within 1 second after fault observed as the same as other helth- +check mechanisms can do in some cases. + +The existing Alarm Evaluator is periodically querying/polling the databases +in order to check all alarms independently from other processes. This is good +approach for evaluating an alarm on samples stored in a certain period. +However, this is not efficient to evaluate an alarm on events which are emitted +by other OpenStack servers once in a while. + +The periodical evaluation leads delay on sending alarm notification to users. +The default period of evaluation cycle is 60 seconds. It is recommended that +an operator set longer interval than configured pipeline interval for +underlying metrics, and also longer enough to evaluate all defined alarms +in certain period while taking into account the number of resources, users and +alarms. + +Proposed change +=============== + +The proposal is to add a new event-driven alarm evaluator which receives +messages from Notification Agent and finds related Alarms, then evaluates each +alarms; + +* New alarm evaluator could receive event notification from Notification Agent + by which adding a dedicated notifier as a publisher in pipeline.yaml + (e.g. notifier://?topic=event_eval). + +* When new alarm evaluator received event notification, it queries alarm + database by Project ID and Resource ID written in the event notification. + +* Found alarms are evaluated by referring event notification. 
+ +* Depending on the result of evaluation, those alarms would be fired through + Alarm Notifier as the same as existing Alarm Evaluator does. + +This proposal also adds new alarm type "notification" and "notification_rule". +This enables users to create alarms on events. The separation from other alarm +types (such as "threshold" type) is intended to show different timing of +evaluation and different format of condition, since the new evaluator will +check each event notification once it received whereas "threshold" alarm can +evaluate average of values in certain period calculated from multiple samples. + +The new alarm evaluator handles Notification type alarms, so we have to change +existing alarm evaluator to exclude "notification" type alarms from evaluation +targets. + +Alternatives +------------ + +There was similar blueprint proposal "Alarm type based on notification", but +the approach is different. The old proposal was to adding new step (alarm +evaluations) in Notification Agent every time it received event from other +OpenStack services, whereas this proposal intends to execute alarm evaluation +in another component which can minimize impact to existing pipeline processing. + +Another approach is enhancement of existing alarm evaluator by adding +notification listener. However, there are two issues; 1) this approach could +cause stall of periodical evaluations when it receives bulk of notifications, +and 2) this could break the alarm portioning i.e. when alarm evaluator received +notification, it might have to evaluate some alarms which are not assign to it. + +Data model impact +----------------- + +Resource ID will be added to Alarm model as an optional attribute. +This would help the new alarm evaluator to filter out non-related alarms +while querying alarms, otherwise it have to evaluate all alarms in the project. 
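The evaluation flow proposed above (receive an event, look up alarms by Project ID and Resource ID, then check each alarm's rule) could be sketched roughly as follows. This is only an illustration, not Ceilometer code: the function names, the flat event dictionary, and every operator other than `eq` are assumptions, and the rule layout (an `event_type` plus `field`/`op`/`value` conditions) follows the sample alarm shown in this spec.

```python
from operator import eq, ge, gt, le, lt, ne

# Hypothetical operator table; only "eq" appears in the spec's sample alarm.
OPERATORS = {"eq": eq, "ne": ne, "lt": lt, "le": le, "gt": gt, "ge": ge}


def get_field(event, path):
    """Resolve a dotted path such as "traits.state" inside a nested dict."""
    value = event
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def find_alarms(alarms, event):
    """Select "notification" alarms that belong to the event's project.

    An alarm without a resource_id applies to every resource in the project,
    which is why omitting it makes evaluation more expensive.
    """
    return [
        alarm for alarm in alarms
        if alarm["type"] == "notification"
        and alarm["project_id"] == event["project_id"]
        and alarm.get("resource_id") in (None, event["resource_id"])
    ]


def rule_matches(rule, event):
    """Return True if the event satisfies every condition of the rule."""
    if rule["event_type"] != event["event_type"]:
        return False
    for cond in rule.get("query", []):
        actual = get_field(event, cond["field"])
        if actual is None or not OPERATORS[cond["op"]](actual, cond["value"]):
            return False
    return True


def evaluate(alarms, event):
    """Return the names of the alarms that this single event would fire."""
    return [
        alarm["name"] for alarm in find_alarms(alarms, event)
        if rule_matches(alarm["notification_rule"], event)
    ]


# Illustrative data only; the IDs are made up.
ALARMS = [{
    "name": "InstanceStatusAlarm",
    "type": "notification",
    "project_id": "project-1",
    "resource_id": "vm-1",
    "notification_rule": {
        "event_type": "compute.instance.update",
        "query": [{"field": "traits.state", "type": "string",
                   "value": "error", "op": "eq"}],
    },
}]
EVENT = {
    "event_type": "compute.instance.update",
    "project_id": "project-1",
    "resource_id": "vm-1",
    "traits": {"state": "error"},
}
FIRED = evaluate(ALARMS, EVENT)
```

Because each event is checked once on arrival, no polling interval is involved; the cost per event depends on how many alarms `find_alarms` returns, which is what the optional `resource_id` filter is meant to keep small.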
+ +REST API impact +--------------- + +Alarm API will be extended as follows; + +* Add "notification" type into alarm type list +* Add "resource_id" to "alarm" +* Add "notification_rule" to "alarm" + +Sample data of Notification-type alarm:: + + { + "alarm_actions": [ + "http://site:8000/alarm" + ], + "alarm_id": null, + "description": "An alarm", + "enabled": true, + "insufficient_data_actions": [ + "http://site:8000/nodata" + ], + "name": "InstanceStatusAlarm", + "notification_rule": { + "event_type": "compute.instance.update", + "query" : [ + { + "field" : "traits.state", + "type" : "string", + "value" : "error", + "op" : "eq", + }, + ] + }, + "ok_actions": [], + "project_id": "c96c887c216949acbdfbd8b494863567", + "repeat_actions": false, + "resource_id": "153462d0-a9b8-4b5b-8175-9e4b05e9b856", + "severity": "moderate", + "state": "ok", + "state_timestamp": "2015-04-03T17:49:38.406845", + "timestamp": "2015-04-03T17:49:38.406839", + "type": "notification", + "user_id": "c96c887c216949acbdfbd8b494863567" + } + +"resource_id" will be refered to query alarm and will not be check permission +and belonging of project. + +Security impact +--------------- + +None + +Pipeline impact +--------------- + +None + +Other end user impact +--------------------- + +None + +Performance/Scalability Impacts +------------------------------- + +When Ceilomter received a number of events from other OpenStack services in +short period, this alarm evaluator can keep working since events are queued in +a messaging queue system, but it can cause delay of alarm notification to users +and increase the number of read and write access to alarm database. + +"resource_id" can be optional, but restricting it to mandatory could be reduce +performance impact. If user create "notification" alarm without "resource_id", +those alarms will be evaluated every time event occurred in the project. +That may lead new evaluator heavy. 
+ +Other deployer impact +--------------------- + +New service process have to be run. + +Developer impact +---------------- + +Developers should be aware that events could be notified to end users and avoid +passing raw infra information to end users, while defining events and traits. + +Implementation +============== + +Assignee(s) +----------- + +Primary assignee: + r-mibu + +Other contributors: + None + +Ongoing maintainer: + None + +Work Items +---------- + +* New event-driven alarm evaluator + +* Add new alarm type "notification" as well as AlarmNotificationRule + +* Add "resource_id" to Alarm model + +* Modify existing alarm evaluator to filter out "notification" alarms + +* Add new config parameter for alarm request check whether accepting alarms + without specifying "resource_id" or not + +Future lifecycle +================ + +This proposal is key feature to provide information of cloud resources to end +users in real-time that enables efficient integration with user-side manager +or Orchestrator, whereas currently those information are considered to be +consumed by admin side tool or service. +Based on this change, we will seek orchestrating scenarios including fault +recovery and add useful event definition as well as additional traits. + +Dependencies +============ + +None + +Testing +======= + +New unit/scenario tests are required for this change. + +Documentation Impact +==================== + +* Proposed evaluator will be described in the developer document. + +* New alarm type and how to use will be explained in user guide. 
+ +References +========== + +* OPNFV Doctor project: https://wiki.opnfv.org/doctor + +* Blueprint "Alarm type based on notification": + https://blueprints.launchpad.net/ceilometer/+spec/alarm-on-notification diff --git a/docs/development/design/performance-profiler.rst b/docs/development/design/performance-profiler.rst new file mode 100644 index 00000000..f834a915 --- /dev/null +++ b/docs/development/design/performance-profiler.rst @@ -0,0 +1,118 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + + +==================== +Performance Profiler +==================== + +https://goo.gl/98Osig + +This blueprint proposes to create a performance profiler for doctor scenarios. + +Problem Description +=================== + +In the verification job for notification time, we have encountered some +performance issues, such as + +1. In environment deployed by APEX, it meets the criteria while in the one by +Fuel, the performance is much more poor. +2. Signification performance degradation was spotted when we increase the total +number of VMs + +It takes time to dig the log and analyse the reason. People have to collect +timestamp at each checkpoints manually to find out the bottleneck. A performance +profiler will make this process automatic. 
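As a rough illustration of what automating the manual timestamp collection might look like, the sketch below attributes the time between consecutive checkpoints to the component that reached the later checkpoint, then ranks components by total cost. The tuple layout, the function name `profile`, and the sample numbers are all assumptions made for this example; they are not measurements or an existing Doctor tool.

```python
from collections import defaultdict


def profile(checkpoints):
    """Rank components by the time spent between consecutive checkpoints.

    checkpoints: iterable of (checkpoint_name, component, timestamp_ms).
    The earliest checkpoint only marks the start and carries no cost.
    Returns a list of (component, cost_ms, percentage), most expensive first.
    """
    ordered = sorted(checkpoints, key=lambda cp: cp[2])
    costs = defaultdict(int)
    for (_, _, previous_ts), (_, component, ts) in zip(ordered, ordered[1:]):
        costs[component] += ts - previous_ts
    total = sum(costs.values())
    rows = [
        (component, cost, 100.0 * cost / total if total else 0.0)
        for component, cost in costs.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up timestamps in milliseconds, for demonstration only.
SAMPLE = [
    ("hostdown", "fault", 0),
    ("raw failure", "monitor", 50),
    ("found affected", "inspector", 120),
    ("marked host down", "inspector", 210),
    ("notified VM error", "nova", 260),
]
ROWS = profile(SAMPLE)

for component, cost_ms, percentage in ROWS:
    print(f"{component:<10} {cost_ms:>5}ms {percentage:5.1f}%")
```

Sorting by cost puts the likely bottleneck on the first row, which is the question a person would otherwise answer by reading logs by hand.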
+ +Proposed Change +=============== + +Current Doctor scenario covers the inspector and notifier in the whole fault +management cycle:: + + start end + + + + + + + + | | | | | | + |monitor|inspector|notifier|manager|controller| + +------>+ | | | | + occurred +-------->+ | | | + | detected +------->+ | | + | | identified +-------+ | + | | notified +--------->+ + | | | processed resolved + | | | | + | +<-----doctor----->+ | + | | + | | + +<---------------fault management------------>+ + +The notification time can be split into several parts and visualized as a +timeline:: + + start end + 0----5---10---15---20---25---30---35---40---45--> (x 10ms) + + + + + + + + + + + + + 0-hostdown | | | | | | | | | + +--->+ | | | | | | | | | + | 1-raw failure | | | | | | | + | +-->+ | | | | | | | | + | | 2-found affected | | | | | + | | +-->+ | | | | | | | + | | 3-marked host down| | | | | + | | +-->+ | | | | | | + | | 4-set VM error| | | | | + | | +--->+ | | | | | + | | | 5-notified VM error | | + | | | +----->| | | | | + | | | | 6-transformed event + | | | | +-->+ | | | + | | | | | 7-evaluated event + | | | | | +-->+ | | + | | | | | 8-fired alarm + | | | | | +-->+ | + | | | | | 9-received alarm + | | | | | +-->+ + sample | sample | | | |10-handled alarm + monitor| inspector |nova| c/m | aodh | + | | + +<-----------------doctor--------------->+ + +Note: c/m = ceilometer + +And a table of components sorted by time cost from most to least + ++----------+---------+----------+ +|Component |Time Cost|Percentage| ++==========+=========+==========+ +|inspector |160ms | 40% | ++----------+---------+----------+ +|aodh |110ms | 30% | ++----------+---------+----------+ +|monitor |50ms | 14% | ++----------+---------+----------+ +|... | | | ++----------+---------+----------+ +|... | | | ++----------+---------+----------+ + +Note: data in the table is for demonstration only, not actual measurement + +Timestamps can be collected from various sources + +1. log files +2. 
trace point in code + +The performance profiler will be integrated into the verification job to provide +detail result of the test. It can also be deployed independently to diagnose +performance issue in specified environment. + +Working Items +============= + +1. PoC with limited checkpoints +2. Integration with verification job +3. Collect timestamp at all checkpoints +4. Display the profiling result in console +5. Report the profiling result to test database +6. Independent package which can be installed to specified environment diff --git a/docs/development/design/port-data-plane-status.rst b/docs/development/design/port-data-plane-status.rst new file mode 100644 index 00000000..06cfc3c6 --- /dev/null +++ b/docs/development/design/port-data-plane-status.rst @@ -0,0 +1,180 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +==================================== +Port data plane status +==================================== + +https://bugs.launchpad.net/neutron/+bug/1598081 + +Neutron does not detect data plane failures affecting its logical resources. +This spec addresses that issue by means of allowing external tools to report to +Neutron about faults in the data plane that are affecting the ports. A new REST +API field is proposed to that end. + + +Problem Description +=================== + +An initial description of the problem was introduced in bug #159801 [1_]. This +spec focuses on capturing one (main) part of the problem there described, i.e. +extending Neutron's REST API to cover the scenario of allowing external tools +to report network failures to Neutron. Out of scope of this spec are works to +enable port status changes to be received and managed by mechanism drivers. + +This spec also tries to address bug #1575146 [2_]. 
Specifically, and argued by +the Neutron driver team in [3_]: + + * Neutron should not shut down the port completly upon detection of physnet + failure; connectivity between instances on the same node may still be + reachable. Externals tools may or may not want to trigger a status change on + the port based on their own logic and orchestration. + + * Port down is not detected when an uplink of a switch is down; + + * The physnet bridge may have multiple physical interfaces plugged; shutting + down the logical port may not be needed in case network redundancy is in + place. + + +Proposed Change +=============== + +A couple of possible approaches were proposed in [1_] (comment #3). This spec +proposes tackling the problema via a new extension API to the port resource. +The extension adds a new attribute 'dp-down' (data plane down) to represent the +status of the data plane. The field should be read-only by tenants and +read-write by admins. + +Neutron should send out an event to the message bus upon toggling the data +plane status value. The event is relevant for e.g. auditing. + + +Data Model Impact +----------------- + +A new attribute as extension will be added to the 'ports' table. + ++------------+-------+----------+---------+--------------------+--------------+ +|Attribute |Type |Access |Default |Validation/ |Description | +|Name | | |Value |Conversion | | ++============+=======+==========+=========+====================+==============+ +|dp_down |boolean|RO, tenant|False |True/False | | +| | |RW, admin | | | | ++------------+-------+----------+---------+--------------------+--------------+ + + +REST API Impact +--------------- + +A new API extension to the ports resource is going to be introduced. + +.. 
code-block:: python + + EXTENDED_ATTRIBUTES_2_0 = { + 'ports': { + 'dp_down': {'allow_post': False, 'allow_put': True, + 'default': False, 'convert_to': convert_to_boolean, + 'is_visible': True}, + }, + } + + +Examples +~~~~~~~~ + +Updating port data plane status to down: + +.. code-block:: json + + PUT /v2.0/ports/ + Accept: application/json + { + "port": { + "dp_down": true + } + } + + + +Command Line Client Impact +-------------------------- + +:: + + neutron port-update [--dp-down ] + openstack port set [--dp-down ] + +Argument --dp-down is optional. Defaults to False. + + +Security Impact +--------------- + +None + +Notifications Impact +-------------------- + +A notification (event) upon toggling the data plane status (i.e. 'dp-down' +attribute) value should be sent to the message bus. Such events do not happen +with high frequency and thus no negative impact on the notification bus is +expected. + +Performance Impact +------------------ + +None + +IPv6 Impact +----------- + +None + +Other Deployer Impact +--------------------- + +None + +Developer Impact +---------------- + +None + +Implementation +============== + +Assignee(s) +----------- + + * cgoncalves + +Work Items +---------- + + * New 'dp-down' attribute in 'ports' database table + * API extension to introduce new field to port + * Client changes to allow for data plane status (i.e. 'dp-down' attribute') + being set + * Policy (tenants read-only; admins read-write) + + +Documentation Impact +==================== + +Documentation for both administrators and end users will have to be +contemplated. Administrators will need to know how to set/unset the data plane +status field. + + +References +========== + +.. [1] RFE: Port status update, + https://bugs.launchpad.net/neutron/+bug/1598081 + +.. [2] RFE: ovs port status should the same as physnet + https://bugs.launchpad.net/neutron/+bug/1575146 + +.. 
[3] Neutron Drivers meeting, July 21, 2016 + http://eavesdrop.openstack.org/meetings/neutron_drivers/2016/neutron_drivers.2016-07-21-22.00.html diff --git a/docs/development/design/report-host-fault-to-update-server-state-immediately.rst b/docs/development/design/report-host-fault-to-update-server-state-immediately.rst new file mode 100644 index 00000000..2f6ce145 --- /dev/null +++ b/docs/development/design/report-host-fault-to-update-server-state-immediately.rst @@ -0,0 +1,248 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +.. NOTE:: + This is a specification draft of a blueprint proposed for OpenStack Nova + Liberty. It was written by project member(s) and agreed within the project + before submitting it upstream. No further changes to its content will be + made here anymore; please follow it upstream: + + * Current version upstream: https://review.openstack.org/#/c/169836/ + * Development activity: + https://blueprints.launchpad.net/nova/+spec/mark-host-down + + **Original draft is as follow:** + +==================================================== +Report host fault to update server state immediately +==================================================== + +https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately + +A new API is needed to report a host fault to change the state of the +instances and compute node immediately. This allows usage of evacuate API +without a delay. The new API provides the possibility for external monitoring +system to detect any kind of host failure fast and reliably and inform +OpenStack about it. Nova updates the compute node state and states of the +instances. This way the states in the Nova DB will be in sync with the +real state of the system. + +Problem description +=================== +* Nova state change for failed or unreachable host is slow and does not + reliably state compute node is down or not. 
This might cause the same instance + to run twice if an action is taken to evacuate the instance to another host. +* Nova state for instances on a failed compute node will not change, + but remains active and running. This gives the user false information about + instance state. Currently one would need to call "nova reset-state" for each + instance to have them in error state. +* An OpenStack user cannot take HA actions fast and reliably by trusting instance + state and compute node state. +* As compute node state changes slowly, one cannot evacuate instances. + +Use Cases +--------- +The general use case is that when there is a host fault, one should change +compute node state fast and reliably when using DB servicegroup backend. +On top of this, here are the use cases that are not covered currently to have +instance states changed correctly: +* Management network connectivity lost between controller and compute node. +* Host HW failed. + +Generic use case flow: + +* The external monitoring system detects a host fault. +* The external monitoring system fences the host if not down already. +* The external system calls the new Nova API to force the failed compute node + into down state as well as instances running on it. +* Nova updates the compute node state and state of the affected instances to + Nova DB. + +Currently nova-compute state will change to "down", but it takes a long +time. Server state stays as "vm_state: active" and "power_state: +running", which is not correct. By having an external tool detect host faults +fast, fence the host by powering it down and then report the host down to OpenStack, all +these states would reflect the actual situation. Also, if OpenStack does not +implement automatic actions for fault correlation, an external tool can do that. +This could be configured, for example, in server instance METADATA easily and be +read by the external tool. + +Project Priority +----------------- +Liberty priorities have not yet been defined. 
+ +Proposed change +=============== +There needs to be a new API for the Admin to state that a host is down. This API is used +to mark the compute node and the instances running on it as down to reflect the real +situation. + +Example for a compute node: + +* When compute node is up and running: + vm_state: active and power_state: running + nova-compute state: up status: enabled +* When compute node goes down and new API is called to state host is down: + vm_state: stopped power_state: shutdown + nova-compute state: down status: enabled + +vm_state values: soft-delete, deleted, resized and error +should not be touched. +Whether task_state needs to be touched still has to be worked out. + +Alternatives +------------ +There are no attractive alternatives for detecting all the different host faults other than +having an external tool detect them. For this kind of tool +to exist there needs to be a new API in Nova to report a fault. Currently there +must have been some kind of workarounds implemented, as one cannot trust or get the +states from OpenStack fast enough. + +Data model impact +----------------- +None + +REST API impact +--------------- +* Update CLI to report host is down + + nova host-update command + + usage: nova host-update [--status ] + [--maintenance ] + [--report-host-down] + + + Update host settings. + + Positional arguments + + + Name of host. + + Optional arguments + + --status + Either enable or disable a host. + + --maintenance + Either put or resume host to/from maintenance. + + --down + Report host down to update instance and compute node state in db. + +* Update Compute API to report host is down: + + /v2.1/{tenant_id}/os-hosts/{host_name} + + Normal response codes: 200 + Request parameters + + Parameter Style Type Description + host_name URI xsd:string The name of the host of interest to you. 
+ + { + "host": { + "status": "enable", + "maintenance_mode": "enable", + "host_down_reported": "true" + + } + + } + + { + "host": { + "host": "65c5d5b7e3bd44308e67fc50f362aee6", + "maintenance_mode": "enabled", + "status": "enabled", + "host_down_reported": "true" + + } + + } + +* New method in nova.compute.api module HostAPI class + to mark host related instances and compute node down: + set_host_down(context, host_name) + +* class novaclient.v2.hosts.HostManager(api) method update(host, values) + Needs to handle reporting host down. + +* Schema does not need changes as in db only service and server states are to + be changed. + +Security impact +--------------- +API call needs admin privileges (in the default policy configuration). + +Notifications impact +-------------------- +None + +Other end user impact +--------------------- +None + +Performance Impact +------------------ +The only impact is that the user can get information faster about instance and +compute node state. This also makes faster evacuation possible. +There is no impact that would slow anything down. Host down should be a rare occurrence. + +Other deployer impact +--------------------- +Developer can make use of any external tool to detect host fault and report it +to OpenStack. + +Developer impact +---------------- +None + +Implementation +============== +Assignee(s) +----------- +Primary assignee: Tomi Juvonen +Other contributors: Ryota Mibu + +Work Items +---------- +* Test cases. +* API changes. +* Documentation. + +Dependencies +============ +None + +Testing +======= +Test cases that exist for enabling a host or putting it into maintenance should be +altered, or similar new cases made, to test the new functionality. + +Documentation Impact +==================== + +New API needs to be documented: + +* Compute API extensions documentation. + http://developer.openstack.org/api-ref-compute-v2.1.html +* Nova commands documentation. 
+ http://docs.openstack.org/user-guide-admin/content/novaclient_commands.html +* Compute command-line client documentation. + http://docs.openstack.org/cli-reference/content/novaclient_commands.html +* nova.compute.api documentation. + http://docs.openstack.org/developer/nova/api/nova.compute.api.html +* High Availability guide might have page to tell external tool could provide + ability to provide faster HA as able to update states by new API. + http://docs.openstack.org/high-availability-guide/content/index.html + +References +========== +* OPNFV Doctor project: https://wiki.opnfv.org/doctor +* OpenStack Instance HA Proposal: + http://blog.russellbryant.net/2014/10/15/openstack-instance-ha-proposal/ +* The Different Facets of OpenStack HA: + http://blog.russellbryant.net/2015/03/10/ + the-different-facets-of-openstack-ha/ diff --git a/docs/development/design/rfe-port-status-update.rst b/docs/development/design/rfe-port-status-update.rst new file mode 100644 index 00000000..d87d7d7b --- /dev/null +++ b/docs/development/design/rfe-port-status-update.rst @@ -0,0 +1,32 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +========================== +Neutron Port Status Update +========================== + +.. NOTE:: + This document represents a Neutron RFE reviewed in the Doctor project before submitting upstream to Launchpad Neutron + space. The document is not intended to follow a blueprint format or to be an extensive document. + For more information, please visit http://docs.openstack.org/developer/neutron/policies/blueprints.html + + The RFE was submitted to Neutron. You can follow the discussions in https://bugs.launchpad.net/neutron/+bug/1598081 + +Neutron port status field represents the current status of a port in the cloud infrastructure. The field can take one of +the following values: 'ACTIVE', 'DOWN', 'BUILD' and 'ERROR'. 
+ +At present, if a network event occurs in the data-plane (e.g. virtual or physical switch fails or one of its ports, +cable gets pulled unintentionally, infrastructure topology changes, etc.), connectivity to logical ports may be affected +and tenants' services interrupted. When tenants/cloud administrators are looking up their resources' status (e.g. Nova +instances and services running in them, network ports, etc.), they will wrongly see everything looks fine. The problem +is that Neutron will continue reporting port 'status' as 'ACTIVE'. + +Many SDN Controllers managing network elements have the ability to detect and report network events to upper layers. +This allows SDN Controllers' users to be notified of changes and react accordingly. Such information could be consumed +by Neutron so that Neutron could update the 'status' field of those logical ports, and additionally generate a +notification message to the message bus. + +However, Neutron misses a way to be able to receive such information through e.g. ML2 driver or the REST API ('status' +field is read-only). There are pros and cons on both of these approaches as well as other possible approaches. This RFE +intends to trigger a discussion on how Neutron could be improved to receive fault/change events from SDN Controllers or +even also from 3rd parties not in charge of controlling the network (e.g. monitoring systems, human admins). diff --git a/docs/development/index.rst b/docs/development/index.rst new file mode 100644 index 00000000..71974fe6 --- /dev/null +++ b/docs/development/index.rst @@ -0,0 +1,21 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 +.. (c) 2016 OPNFV. + + +====== +Doctor +====== + +.. 
toctree:: + :maxdepth: 2 + :numbered: + + ./design/index.rst + ./requirements/index.rst + ./manuals/index.rst + ./overview/functest_scenario/index.rst + +Indices +======= +* :ref:`search` diff --git a/docs/development/manuals/get-valid-server-state.rst b/docs/development/manuals/get-valid-server-state.rst new file mode 100644 index 00000000..824ea3c2 --- /dev/null +++ b/docs/development/manuals/get-valid-server-state.rst @@ -0,0 +1,125 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +====================== +Get valid server state +====================== + +Related Blueprints: +=================== + +https://blueprints.launchpad.net/nova/+spec/get-valid-server-state + +Problem description +=================== + +Previously when the owner of a VM has queried his VMs, he has not received +enough state information, states have not changed fast enough in the VIM and +they have not been accurate in some scenarios. With this change this gap is now +closed. + +A typical case is that, in case of a fault of a host, the user of a high +availability service running on top of that host, needs to make an immediate +switch over from the faulty host to an active standby host. Now, if the compute +host is forced down [1] as a result of that fault, the user has to be notified +about this state change such that the user can react accordingly. Similarly, +a change of the host state to "maintenance" should also be notified to the +users. + +What is changed +=============== + +A new ``host_status`` parameter is added to the ``/servers/{server_id}`` and +``/servers/detail`` endpoints in microversion 2.16. By this new parameter +user can get additional state information about the host. + +``host_status`` possible values where next value in list can override the +previous: + +- ``UP`` if nova-compute is up. 
+- ``UNKNOWN`` if nova-compute status was not reported by servicegroup driver + within configured time period. Default is within 60 seconds, + but can be changed with ``service_down_time`` in nova.conf. +- ``DOWN`` if nova-compute was forced down. +- ``MAINTENANCE`` if nova-compute was disabled. MAINTENANCE in API directly + means nova-compute service is disabled. Different wording is used to avoid + the impression that the whole host is down, as only scheduling of new VMs + is disabled. +- Empty string indicates there is no host for server. + +``host_status`` is returned in the response in case the policy permits. By +default the policy is for admin only in Nova policy.json:: + + "os_compute_api:servers:show:host_status": "rule:admin_api" + +For an NFV use case this has to also be enabled for the owner of the VM:: + + "os_compute_api:servers:show:host_status": "rule:admin_or_owner" + +REST API examples: +================== + +Case where nova-compute is enabled and reporting normally:: + + GET /v2.1/{tenant_id}/servers/{server_id} + + 200 OK + { + "server": { + "host_status": "UP", + ... + } + } + +Case where nova-compute is enabled, but not reporting normally:: + + GET /v2.1/{tenant_id}/servers/{server_id} + + 200 OK + { + "server": { + "host_status": "UNKNOWN", + ... + } + } + +Case where nova-compute is enabled, but forced_down:: + + GET /v2.1/{tenant_id}/servers/{server_id} + + 200 OK + { + "server": { + "host_status": "DOWN", + ... + } + } + +Case where nova-compute is disabled:: + + GET /v2.1/{tenant_id}/servers/{server_id} + + 200 OK + { + "server": { + "host_status": "MAINTENANCE", + ... + } + } + +Host Status is also visible in python-novaclient:: + + +-------+------+--------+------------+-------------+----------+-------------+ + | ID | Name | Status | Task State | Power State | Networks | Host Status | + +-------+------+--------+------------+-------------+----------+-------------+ + | 9a... | vm1 | ACTIVE | - | RUNNING | xnet=... 
| UP | + +-------+------+--------+------------+-------------+----------+-------------+ + +Links: +====== + +[1] Manual for OpenStack NOVA API for marking host down +http://artifacts.opnfv.org/doctor/docs/manuals/mark-host-down_manual.html + +[2] OpenStack compute manual page +http://developer.openstack.org/api-ref-compute-v2.1.html#compute-v2.1 diff --git a/docs/development/manuals/index.rst b/docs/development/manuals/index.rst new file mode 100644 index 00000000..05831b2b --- /dev/null +++ b/docs/development/manuals/index.rst @@ -0,0 +1,13 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +******* +Manuals +******* + +.. toctree:: + :numbered: + :maxdepth: 2 + +.. include:: mark-host-down_manual.rst +.. include:: get-valid-server-state.rst diff --git a/docs/development/manuals/mark-host-down_manual.rst b/docs/development/manuals/mark-host-down_manual.rst new file mode 100644 index 00000000..3815205d --- /dev/null +++ b/docs/development/manuals/mark-host-down_manual.rst @@ -0,0 +1,122 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +========================================= +OpenStack NOVA API for marking host down. +========================================= + +Related Blueprints: +=================== + + https://blueprints.launchpad.net/nova/+spec/mark-host-down + https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service + +What the API is for +=================== + + This API will give external fault monitoring system a possibility of telling + OpenStack Nova fast that compute host is down. This will immediately enable + calling of evacuation of any VM on host and further enabling faster HA + actions. 
+ +What this API does +================== + + In OpenStack the nova-compute service state can represent the compute host + state and this new API is used to force this service down. It is assumed + that the one calling this API has made sure the host is also fenced or + powered down. This is important, so that there is no chance the same VM instance will + appear twice if it is evacuated to a new compute host. When the host is recovered + by any means, the external system is responsible for calling the API again to + disable the forced_down flag and let the host's nova-compute service report the + host as up again. If a network fenced host comes up again, it should not boot the VMs + it had once it figures out they were evacuated to another compute host. The + decision of deleting or booting VMs that used to be on the host should be + enhanced later to be more reliable by Nova blueprint: + https://blueprints.launchpad.net/nova/+spec/robustify-evacuate + +REST API for forcing down: +========================== + + Parameter explanations: + tenant_id: Identifier of the tenant. + binary: Compute service binary name. + host: Compute host name. + forced_down: Compute service forced down flag. + token: Token received after successful authentication. + service_host_ip: Serving controller node ip. 
+ + request: + PUT /v2.1/{tenant_id}/os-services/force-down + { + "binary": "nova-compute", + "host": "compute1", + "forced_down": true + } + + response: + 200 OK + { + "service": { + "host": "compute1", + "binary": "nova-compute", + "forced_down": true + } + } + + Example: + curl -g -i -X PUT http://{service_host_ip}:8774/v2.1/{tenant_id}/os-services + /force-down -H "Content-Type: application/json" -H "Accept: application/json + " -H "X-OpenStack-Nova-API-Version: 2.11" -H "X-Auth-Token: {token}" -d '{"b + inary": "nova-compute", "host": "compute1", "forced_down": true}' + +CLI for forcing down: +===================== + + nova service-force-down nova-compute + + Example: + nova service-force-down compute1 nova-compute + +REST API for disabling forced down: +=================================== + + Parameter explanations: + tenant_id: Identifier of the tenant. + binary: Compute service binary name. + host: Compute host name. + forced_down: Compute service forced down flag. + token: Token received after successful authentication. + service_host_ip: Serving controller node ip. 
+ + request: + PUT /v2.1/{tenant_id}/os-services/force-down + { + "binary": "nova-compute", + "host": "compute1", + "forced_down": false + } + + response: + 200 OK + { + "service": { + "host": "compute1", + "binary": "nova-compute", + "forced_down": false + } + } + + Example: + curl -g -i -X PUT http://{service_host_ip}:8774/v2.1/{tenant_id}/os-services + /force-down -H "Content-Type: application/json" -H "Accept: application/json + " -H "X-OpenStack-Nova-API-Version: 2.11" -H "X-Auth-Token: {token}" -d '{"b + inary": "nova-compute", "host": "compute1", "forced_down": false}' + +CLI for disabling forced down: +============================== + + nova service-force-down --unset nova-compute + + Example: + nova service-force-down --unset compute1 nova-compute diff --git a/docs/development/overview/functest_scenario/doctor-scenario-in-functest.rst b/docs/development/overview/functest_scenario/doctor-scenario-in-functest.rst new file mode 100644 index 00000000..b3d73d5c --- /dev/null +++ b/docs/development/overview/functest_scenario/doctor-scenario-in-functest.rst @@ -0,0 +1,126 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + + + +Platform overview +""""""""""""""""" + +Doctor platform provides these features in `Danube Release `_: + +* Immediate Notification +* Consistent resource state awareness for compute host down +* Valid compute host status given to VM owner + +These features enable high availability of Network Services on top of +the virtualized infrastructure. Immediate notification allows VNF managers +(VNFM) to process recovery actions promptly once a failure has occurred. + +Consistency of resource state is necessary to execute recovery actions +properly in the VIM. + +Ability to query host status gives VM owner the possibility to get +consistent state information through an API in case of a compute host +fault. 
+ +The Doctor platform consists of the following components: + +* OpenStack Compute (Nova) +* OpenStack Telemetry (Ceilometer) +* OpenStack Alarming (Aodh) +* Doctor Inspector +* Doctor Monitor + +.. note:: + Doctor Inspector and Monitor are sample implementations for reference. + +You can see an overview of the Doctor platform and how components interact in +:numref:`figure-p1`. + +.. figure:: ./images/figure-p1.png + :name: figure-p1 + :width: 100% + + Doctor platform and typical sequence + +Detailed information on the Doctor architecture can be found in the Doctor +requirements documentation: +http://artifacts.opnfv.org/doctor/docs/requirements/05-implementation.html + +Use case +"""""""" + +* A consumer of the NFVI wants to receive immediate notifications about faults + in the NFVI affecting the proper functioning of the virtual resources. + Therefore, such faults have to be detected as quickly as possible, and, when + a critical error is observed, the affected consumer is immediately informed + about the fault and can switch over to the STBY configuration. + +The faults to be monitored (and at which detection rate) will be configured by +the consumer. Once a fault is detected, the Inspector in the Doctor +architecture will check the resource map maintained by the Controller, to find +out which virtual resources are affected and then update the resources state. +The Notifier will receive the failure event requests sent from the Controller, +and notify the consumer(s) of the affected resources according to the alarm +configuration. 
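The lookup described above (failed host, then affected virtual resources, then the consumers to notify) can be sketched as follows. This is only a minimal illustration, not the Doctor Inspector implementation: ``ResourceMap``, ``handle_fault`` and all host, VM and consumer names are hypothetical, and the resource map is modelled as two plain dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceMap:
    """Hypothetical stand-in for the resource map kept by the Controller."""
    vms_by_host: dict = field(default_factory=dict)  # host name -> list of VM ids
    owner_by_vm: dict = field(default_factory=dict)  # VM id -> consumer id


def handle_fault(resource_map, vm_states, failed_host):
    """Mark every VM on the failed host as down and group them per consumer."""
    notifications = {}
    for vm in resource_map.vms_by_host.get(failed_host, []):
        vm_states[vm] = "down"  # update the resource state
        consumer = resource_map.owner_by_vm[vm]
        notifications.setdefault(consumer, []).append(vm)
    return notifications


# Assumed example data: two compute hosts, three VMs, two consumers.
resource_map = ResourceMap(
    vms_by_host={"compute1": ["vm-1", "vm-2"], "compute2": ["vm-3"]},
    owner_by_vm={"vm-1": "vnfm-a", "vm-2": "vnfm-b", "vm-3": "vnfm-a"},
)
vm_states = {"vm-1": "active", "vm-2": "active", "vm-3": "active"}
notifications = handle_fault(resource_map, vm_states, "compute1")
```

In this sketch only the VMs on the failed host change state, and each consumer is told only about the VMs it owns; in the real platform the notification itself goes through Ceilometer and Aodh according to the alarm configuration.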
+ +Detailed workflow information is as follows: + +* Consumer(VNFM): (step 0) creates resources (network, server/instance) and an + event alarm on state down notification of that server/instance + +* Monitor: (step 1) periodically checks nodes, such as ping from/to each + dplane nic to/from gw of node, (step 2) once a check fails, sends out an event + with "raw" fault event information to the Inspector + +* Inspector: when it receives an event, it will (step 3) mark the host down + ("mark-host-down"), (step 4) map the PM to VM, and change the VM status to + down + +* Controller: (step 5) sends out instance update event to Ceilometer + +* Notifier: (step 6) Ceilometer transforms and passes the event to Aodh, + (step 7) Aodh will evaluate the event against the registered alarm definitions, + then (step 8) it will fire the alarm to the "consumer" who owns the + instance + +* Consumer(VNFM): (step 9) receives the event and (step 10) recreates a new + instance + +Test case +""""""""" + +Functest will call the "run.sh" script in Doctor to run the test job. + +Currently, only the 'Apex' and 'local' installers are supported. The test can also +run successfully with the 'fuel' installer if some OpenStack +configurations are modified in the script. However, the 'fuel' installer still needs +to support these configurations. + +The "run.sh" script will execute the following steps. + +Firstly, get the installer ip according to the installer type. Then ssh to +the installer node to get the private key for accessing the cloud. For the +'fuel' installer, ssh to the controller node to modify nova and ceilometer +configurations. + +Secondly, prepare an image for booting the VM, then create a test project and test +user (both default to doctor) for the Doctor tests. + +Thirdly, boot a VM under the doctor project and check the VM status to verify +that the VM is launched completely. Then get the compute host info where the VM +is launched to verify connectivity to the target compute host. 
Get the consumer +ip according to the route to compute ip and create an alarm event in Ceilometer +using the consumer ip. + +Fourthly, the Doctor components are started, and, based on the above preparation, +a failure is injected to the system, i.e. the network of compute host is +disabled for 3 minutes. To ensure the host is down, the status of the host +will be checked. + +Finally, the notification time, i.e. the time between the execution of step 2 +(Monitor detects failure) and step 9 (Consumer receives failure notification) +is calculated. + +According to the Doctor requirements, the Doctor test is successful if the +notification time is below 1 second. diff --git a/docs/development/overview/functest_scenario/images/LICENSE b/docs/development/overview/functest_scenario/images/LICENSE new file mode 100644 index 00000000..21a2d03d --- /dev/null +++ b/docs/development/overview/functest_scenario/images/LICENSE @@ -0,0 +1,14 @@ +Copyright 2017 Open Platform for NFV Project, Inc. and its contributors + +Open Platform for NFV Project Documentation License +=================================================== +Any documentation developed by the "Open Platform for NFV Project" +is licensed under a Creative Commons Attribution 4.0 International License. +You should have received a copy of the license along with this. If not, +see . + +Unless required by applicable law or agreed to in writing, documentation +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
diff --git a/docs/development/overview/functest_scenario/images/figure-p1.png b/docs/development/overview/functest_scenario/images/figure-p1.png new file mode 100755 index 00000000..e963d8bd Binary files /dev/null and b/docs/development/overview/functest_scenario/images/figure-p1.png differ diff --git a/docs/development/requirements/01-intro.rst b/docs/development/requirements/01-intro.rst new file mode 100644 index 00000000..ed666cd1 --- /dev/null +++ b/docs/development/requirements/01-intro.rst @@ -0,0 +1,51 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +Introduction +============ + +The goal of this project is to build an NFVI fault management and maintenance +framework supporting high availability of the Network Services on top of the +virtualized infrastructure. The key feature is immediate notification of +unavailability of virtualized resources from VIM, to support failure recovery, +or failure avoidance of VNFs running on them. Requirement survey and development +of missing features in NFVI and VIM are in scope of this project in order to +fulfil requirements for fault management and maintenance in NFV. + +The purpose of this requirement project is to clarify the necessary features of +NFVI fault management, and maintenance, identify missing features in the current +OpenSource implementations, provide a potential implementation architecture and +plan, provide implementation guidelines in relevant upstream projects to realize +those missing features, and define the VIM northbound interfaces necessary to +perform the task of NFVI fault management, and maintenance in alignment with +ETSI NFV [ENFV]_. + +Problem description +------------------- + +A Virtualized Infrastructure Manager (VIM), e.g. OpenStack [OPSK]_, cannot +detect certain Network Functions Virtualization Infrastructure (NFVI) faults. 
+This feature is necessary to detect the faults and notify the Consumer in order +to ensure the proper functioning of EPC VNFs like MME and S/P-GW. + +* EPC VNFs are often in active standby (ACT-STBY) configuration and need to + switch from STBY mode to ACT mode as soon as relevant faults are detected in + the active (ACT) VNF. + +* NFVI encompasses all elements building up the environment in which VNFs are + deployed, e.g., Physical Machines, Hypervisors, Storage, and Network elements. + +In addition, VIM, e.g. OpenStack, needs to receive maintenance instructions from +the Consumer, i.e. the operator/administrator of the VNF. + +* Change the state of certain Physical Machines (PMs), e.g. empty the PM, so + that maintenance work can be performed at these machines. + +Note: Although fault management and maintenance are different operations in NFV, +both are considered as part of this project as -- except for the trigger -- they +share a very similar work and message flow. Hence, from implementation +perspective, these two are kept together in the Doctor project because of this +high degree of similarity. + +.. + vim: set tabstop=4 expandtab textwidth=80: diff --git a/docs/development/requirements/02-use_cases.rst b/docs/development/requirements/02-use_cases.rst new file mode 100644 index 00000000..0a1f6413 --- /dev/null +++ b/docs/development/requirements/02-use_cases.rst @@ -0,0 +1,195 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +Use cases and scenarios +======================= + +Telecom services often have very high requirements on service performance. As a +consequence they often utilize redundancy and high availability (HA) mechanisms +for both the service and the platform. The HA support may be built-in or +provided by the platform. In any case, the HA support typically has a very fast +detection and reaction time to minimize service impact. 
The main changes +proposed in this document are about making a clear distinction between fault +management and recovery a) within the VIM/NFVI and b) High Availability support +for VNFs on the other, claiming that HA support within a VNF or as a service +from the platform is outside the scope of Doctor and is discussed in the High +Availability for OPNFV project. Doctor should focus on detecting and remediating +faults in the NFVI. This will ensure that applications come back to a fully +redundant configuration faster than before. + +As an example, Telecom services can come with an Active-Standby (ACT-STBY) +configuration which is a (1+1) redundancy scheme. ACT and STBY nodes (aka +Physical Network Function (PNF) in ETSI NFV terminology) are in a hot standby +configuration. If an ACT node is unable to function properly due to fault or any +other reason, the STBY node is instantly made ACT, and affected services can be +provided without any service interruption. + +The ACT-STBY configuration needs to be maintained. This means, when a STBY node +is made ACT, either the previously ACT node, after recovery, shall be made STBY, +or, a new STBY node needs to be configured. The actual operations to +instantiate/configure a new STBY are similar to instantiating a new VNF and +therefore are outside the scope of this project. + +The NFVI fault management and maintenance requirements aim at providing fast +failure detection of physical and virtualized resources and remediation of the +virtualized resources provided to Consumers according to their predefined +request to enable applications to recover to a fully redundant mode of +operation. + +1. Fault management/recovery using ACT-STBY configuration (Triggered by critical + error) +2. Preventive actions based on fault prediction (Preventing service stop by + handling warnings) +3. VM Retirement (Managing service during NFVI maintenance, i.e. H/W, + Hypervisor, Host OS, maintenance) + +Faults +------ + +.. 
_uc-fault1: + +Fault management using ACT-STBY configuration +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In :numref:`figure1`, a system-wide view of relevant functional blocks is +presented. OpenStack is considered as the VIM implementation (aka Controller) +which has interfaces with the NFVI and the Consumers. The VNF implementation is +represented as different virtual resources marked by different colors. Consumers +(VNFM or NFVO in ETSI NFV terminology) own/manage the respective virtual +resources (VMs in this example) shown with the same colors. + +The first requirement in this use case is that the Controller needs to detect +faults in the NFVI ("1. Fault Notification" in :numref:`figure1`) affecting +the proper functioning of the virtual resources (labelled as VM-x) running on +top of it. It should be possible to configure which relevant fault items should +be detected. The VIM (e.g. OpenStack) itself could be extended to detect such +faults. Alternatively, a third party fault monitoring tool could be used which +then informs the VIM about such faults; this third party fault monitoring +element can be considered as a component of VIM from an architectural point of +view. + +Once such fault is detected, the VIM shall find out which virtual resources are +affected by this fault. In the example in :numref:`figure1`, VM-4 is +affected by a fault in the Hardware Server-3. Such mapping shall be maintained +in the VIM, depicted as the "Server-VM info" table inside the VIM. + +Once the VIM has identified which virtual resources are affected by the fault, +it needs to find out who is the Consumer (i.e. the owner/manager) of the +affected virtual resources (Step 2). In the example shown in :numref:`figure1`, +the VIM knows that for the red VM-4, the manager is the red Consumer +through an Ownership info table. 
The VIM then notifies (Step 3 "Fault +Notification") the red Consumer about this fault, preferably with sufficient +abstraction rather than detailed physical fault information. + +.. figure:: images/figure1.png + :name: figure1 + :width: 100% + + Fault management/recovery use case + +The Consumer then switches to STBY configuration by switching the STBY node to +ACT state (Step 4). It further initiates a process to instantiate/configure a +new STBY. However, switching to STBY mode and creating a new STBY machine is a +VNFM/NFVO level operation and therefore outside the scope of this project. +Doctor project does not create interfaces for such VNFM level configuration +operations. Yet, since the total failover time of a consumer service depends on +both the delay of such processes as well as the reaction time of Doctor +components, minimizing Doctor's reaction time is a necessary basic ingredient to +fast failover times in general. + +Once the Consumer has switched to STBY configuration, it notifies (Step 5 +"Instruction" in :numref:`figure1`) the VIM. The VIM can then take +necessary (e.g. pre-determined by the involved network operator) actions on how +to clean up the fault affected VMs (Step 6 "Execute Instruction"). + +The key issue in this use case is that a VIM (OpenStack in this context) shall +not take a standalone fault recovery action (e.g. migration of the affected VMs) +before the ACT-STBY switching is complete, as that might violate the ACT-STBY +configuration and render the node out of service. + +As an extension of the 1+1 ACT-STBY resilience pattern, a STBY instance can act as +backup to N ACT nodes (N+1). In this case, the basic information flow remains +the same, i.e., the consumer is informed of a failure in order to activate the +STBY node. However, in this case it might be useful for the failure notification +to cover a number of failed instances due to the same fault (e.g., more than one +instance might be affected by a switch failure). 
The reaction of the consumer +might depend on whether only one active instance has failed (similar to the +ACT-STBY case), or if more active instances are needed as well. + +Preventive actions based on fault prediction +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The fault management scenario explained in :ref:`uc-fault1` can also be +performed based on fault prediction. In such cases, the VIM contains an +intelligent fault prediction module which, based on its NFVI monitoring +information, can predict an imminent fault in the elements of the NFVI. +A simple example is the rising temperature of a Hardware Server, which might +trigger a pre-emptive recovery action. The requirements of such fault +prediction in the VIM are investigated in the OPNFV project "Data Collection +for Failure Prediction" [PRED]_. + +This use case is very similar to :ref:`uc-fault1`. Instead of a fault +detection (Step 1 "Fault Notification" in :numref:`figure1`), the trigger +comes from a fault prediction module in the VIM, or from a third party module +which notifies the VIM about an imminent fault. For Steps 2~5, the workflow is +the same as in the "Fault management using ACT-STBY configuration" use case, +except that in this case the Consumer of a VM/VNF switches to STBY configuration +based on a predicted fault rather than on a fault that has already occurred. + +NFVI Maintenance +---------------- + +VM Retirement +^^^^^^^^^^^^^ + +All network operators perform maintenance of their network infrastructure, both +regularly and irregularly. Besides the hardware, virtualization is expected to +increase the number of elements subject to such maintenance as NFVI holds new +elements like the hypervisor and host OS. Maintenance of a particular resource +element (e.g. hardware, hypervisor) may render a particular hardware server +unusable until the maintenance procedure is complete. + +However, the Consumer of VMs needs to know that such resources will be +unavailable because of NFVI maintenance.
The following use case is again to +ensure that the ACT-STBY configuration is not violated. A stand-alone action +(e.g. live migration) from VIM/OpenStack to empty a physical machine so that +consequent maintenance procedure could be performed may not only violate the +ACT-STBY configuration, but also have impact on real-time processing scenarios +where dedicated resources to virtual resources (e.g. VMs) are necessary and a +pause in operation (e.g. vCPU) is not allowed. The Consumer is in a position to +safely perform the switch between ACT and STBY nodes, or switch to an +alternative VNF forwarding graph so the hardware servers hosting the ACT nodes +can be emptied for the upcoming maintenance operation. Once the target hardware +servers are emptied (i.e. no virtual resources are running on top), the VIM can +mark them with an appropriate flag (i.e. "maintenance" state) such that these +servers are not considered for hosting of virtual machines until the maintenance +flag is cleared (i.e. nodes are back in "normal" status). + +A high-level view of the maintenance procedure is presented in :numref:`figure2`. +VIM/OpenStack, through its northbound interface, receives a maintenance notification +(Step 1 "Maintenance Request") from the Administrator (e.g. a network operator) +including information about which hardware is subject to maintenance. +Maintenance operations include replacement/upgrade of hardware, +update/upgrade of the hypervisor/host OS, etc. + +The consequent steps to enable the Consumer to perform ACT-STBY switching are +very similar to the fault management scenario. From VIM/OpenStack's internal +database, it finds out which virtual resources (VM-x) are running on those +particular Hardware Servers and who are the managers of those virtual resources +(Step 2). The VIM then informs the respective Consumer (VNFMs or NFVO) in Step 3 +"Maintenance Notification". Based on this, the Consumer takes necessary actions +(Step 4, e.g. 
switch to STBY configuration or switch VNF forwarding graphs) and +then notifies (Step 5 "Instruction") the VIM. Upon receiving such notification, +the VIM takes necessary actions (Step 6 "Execute Instruction") to empty the +Hardware Servers so that consequent maintenance operations can be performed. +Because Steps 2~6 are similar, the maintenance procedure and the fault +management procedure are investigated in the same project. + +.. figure:: images/figure2.png + :name: figure2 + :width: 100% + + Maintenance use case + +.. + vim: set tabstop=4 expandtab textwidth=80: diff --git a/docs/development/requirements/03-architecture.rst b/docs/development/requirements/03-architecture.rst new file mode 100644 index 00000000..b7417691 --- /dev/null +++ b/docs/development/requirements/03-architecture.rst @@ -0,0 +1,340 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +High level architecture and general features +============================================ + +Functional overview +------------------- + +The Doctor project centers on two distinct use cases: 1) management of +failures of virtualized resources and 2) planned maintenance, e.g. migration, of +virtualized resources. Both of them may affect a VNF/application and the network +service it provides, but there is a difference in frequency and how they can be +handled. + +Failures are spontaneous events that may or may not have an impact on the +virtual resources. The Consumer should react to the failure as soon as possible, +e.g., by switching to the STBY node. The Consumer will then instruct the VIM on +how to clean up or repair the lost virtual resources, i.e. restore the VM, VLAN +or virtualized storage. How much the applications are affected varies. +Applications with built-in HA support might experience a short decrease in +retainability (e.g.
an ongoing session might be lost) while keeping availability +(establishment or re-establishment of sessions are not affected), whereas the +impact on applications without built-in HA may be more serious. How much the +network service is impacted depends on how the service is implemented. With +sufficient network redundancy the service may be unaffected even when a specific +resource fails. + +On the other hand, planned maintenance impacting virtualized resources are events +that are known in advance. This group includes e.g. migration due to software +upgrades of OS and hypervisor on a compute host. Some of these might have been +requested by the application or its management solution, but there is also a +need for coordination on the actual operations on the virtual resources. There +may be an impact on the applications and the service, but since they are not +spontaneous events there is room for planning and coordination between the +application management organization and the infrastructure management +organization, including performing whatever actions that would be required to +minimize the problems. + +Failure prediction is the process of pro-actively identifying situations that +may lead to a failure in the future unless acted on by means of maintenance +activities. From applications' point of view, failure prediction may impact them +in two ways: either the warning time is so short that the application or its +management solution does not have time to react, in which case it is equal to +the failure scenario, or there is sufficient time to avoid the consequences by +means of maintenance activities, in which case it is similar to planned +maintenance. + +Architecture Overview +--------------------- + +NFV and the Cloud platform provide virtual resources and related control +functionality to users and administrators. :numref:`figure3` shows the high +level architecture of NFV focusing on the NFVI, i.e., the virtualized +infrastructure. 
The NFVI provides virtual resources, such as virtual machines +(VM) and virtual networks. Those virtual resources are used to run applications, +i.e. VNFs, which could be components of a network service which is managed by +the consumer of the NFVI. The VIM provides functionalities of controlling and +viewing virtual resources on hardware (physical) resources to the consumers, +i.e., users and administrators. OpenStack is a prominent candidate for this VIM. +The administrator may also directly control the NFVI without using the VIM. + +Although OpenStack is the target upstream project where the new functional +elements (Controller, Notifier, Monitor, and Inspector) are expected to be +implemented, a particular implementation method is not assumed. Some of these +elements may sit outside of OpenStack and offer a northbound interface to +OpenStack. + +General Features and Requirements +--------------------------------- + +The following features are required for the VIM to achieve high availability of +applications (e.g., MME, S/P-GW) and the Network Services: + +1. Monitoring: Monitor physical and virtual resources. +2. Detection: Detect unavailability of physical resources. +3. Correlation and Cognition: Correlate faults and identify affected virtual + resources. +4. Notification: Notify unavailable virtual resources to their Consumer(s). +5. Fencing: Shut down or isolate a faulty resource. +6. Recovery action: Execute actions to process fault recovery and maintenance. + +The time interval between the instant that an event is detected by the +monitoring system and the Consumer notification of unavailable resources shall +be < 1 second (e.g., Step 1 to Step 4 in :numref:`figure4`). + +.. figure:: images/figure3.png + :name: figure3 + :width: 100% + + High level architecture + +Monitoring +^^^^^^^^^^ + +The VIM shall monitor physical and virtual resources for unavailability and +suspicious behavior. 
+ +Detection +^^^^^^^^^ + +The VIM shall detect unavailability and failures of physical resources that +might cause errors/faults in virtual resources running on top of them. +Unavailability of physical resource is detected by various monitoring and +managing tools for hardware and software components. This may include also +predicting upcoming faults. Note, fault prediction is out of scope of this +project and is investigated in the OPNFV "Data Collection for Failure +Prediction" project [PRED]_. + +The fault items/events to be detected shall be configurable. + +The configuration shall enable Failure Selection and Aggregation. Failure +aggregation means the VIM determines unavailability of physical resource from +more than two non-critical failures related to the same resource. + +There are two types of unavailability - immediate and future: + +* Immediate unavailability can be detected by setting traps of raw failures on + hardware monitoring tools. +* Future unavailability can be found by receiving maintenance instructions + issued by the administrator of the NFVI or by failure prediction mechanisms. + +Correlation and Cognition +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The VIM shall correlate each fault to the impacted virtual resource, i.e., the +VIM shall identify unavailability of virtualized resources that are or will be +affected by failures on the physical resources under them. Unavailability of a +virtualized resource is determined by referring to the mapping of physical and +virtualized resources. + +VIM shall allow configuration of fault correlation between physical and +virtual resources. 
VIM shall support correlating faults: + +* between a physical resource and another physical resource +* between a physical resource and a virtual resource +* between a virtual resource and another virtual resource + +Failure aggregation is also required in this feature, e.g., a user may request +to be notified only if failures have occurred on more than two standby VMs in +an (N+M) deployment model. + +Notification +^^^^^^^^^^^^ + +The VIM shall notify the alarm, i.e., unavailability of virtual resource(s), to +the Consumer owning it over the northbound interface, such that the Consumers +impacted by the failure can take appropriate actions to recover from the +failure. + +The VIM shall also notify the unavailability of physical resources to its +Administrator. + +All notifications shall be transferred immediately in order to minimize the +stalling time of the network service and to avoid over-assignment caused by +delay of capability updates. + +There may be multiple consumers, so the VIM has to find out the owner of a +faulty resource. Moreover, there may be a large number of virtual and physical +resources in a real deployment, so polling the VIM for the state of all resources +would lead to heavy signaling traffic. Thus, a publication/subscription +messaging model is better suited for these notifications, as notifications are +only sent to subscribed consumers. + +Notifications will be sent out according to the configuration set by the consumer. +The configuration includes endpoint(s) in which the consumers can specify +multiple targets for the notification subscription, so that various and +multiple receiver functions can consume the notification message. +Also, the conditions for notifications shall be configurable, such that +the consumer can set corresponding policies, e.g. whether it wants to receive +fault notifications or not. + +Note: the VIM should only accept notification subscriptions for each resource +from its owner or administrator.
+Notifications to the Consumer about the unavailability of virtualized +resources will include a description of the fault, preferably with sufficient +abstraction rather than detailed physical fault information. + +.. _fencing: + +Fencing +^^^^^^^ +Recovery actions, e.g. safe VM evacuation, have to be preceded by fencing the +failed host. Fencing here means isolating or shutting down a faulty resource. +Without fencing -- when the perceived disconnection is due to some transient +or partial failure -- the evacuation might lead to two identical instances +running at the same time and conflicting dangerously. + +There is a cross-project definition in OpenStack of how to implement +fencing, but there has not been any progress. The general description is +available here: +https://wiki.openstack.org/wiki/Fencing_Instances_of_an_Unreachable_Host + +OpenStack provides some mechanisms that allow fencing of faulty resources. Some +are automatically invoked by the platform itself (e.g. Nova disables the +compute service when libvirtd stops running, preventing new VMs from being scheduled +to that node), while other mechanisms are consumer-triggered actions (e.g. +Neutron port admin-state-up). For other fencing actions not supported by +OpenStack, the Doctor project may suggest ways to address the gap (e.g. by +resorting to external tools and orchestration methods), or +document or implement them upstream. + +The Doctor Inspector component will be responsible for marking resources down in +OpenStack and back up if necessary. + +Recovery Action +^^^^^^^^^^^^^^^ + +In the basic :ref:`uc-fault1` use case, no automatic actions will be taken by +the VIM, but all recovery actions executed by the VIM and the NFVI will be +instructed and coordinated by the Consumer. + +In a more advanced use case, the VIM may be able to recover the failed virtual +resources according to a pre-defined behavior for that resource.
In principle +this means that the owner of the resource (i.e., its consumer or administrator) +can define which recovery actions shall be taken by the VIM. Examples are a +restart of the VM or migration/evacuation of the VM. + + + +High level northbound interface specification +--------------------------------------------- + +Fault Management +^^^^^^^^^^^^^^^^ + +This interface allows the Consumer to subscribe to fault notification from the +VIM. Using a filter, the Consumer can narrow down which faults should be +notified. A fault notification may trigger the Consumer to switch from ACT to +STBY configuration and initiate fault recovery actions. A fault query +request/response message exchange allows the Consumer to find out about active +alarms at the VIM. A filter can be used to narrow down the alarms returned in +the response message. + +.. figure:: images/figure4.png + :name: figure4 + :width: 100% + + High-level message flow for fault management + +The high level message flow for the fault management use case is shown in +:numref:`figure4`. +It consists of the following steps: + +1. The VIM monitors the physical and virtual resources and the fault management + workflow is triggered by a monitored fault event. +2. Event correlation, fault detection and aggregation in VIM. Note: this may + also happen after Step 3. +3. Database lookup to find the virtual resources affected by the detected fault. +4. Fault notification to Consumer. +5. The Consumer switches to standby configuration (STBY). +6. Instructions to VIM requesting certain actions to be performed on the + affected resources, for example migrate/update/terminate specific + resource(s). After reception of such instructions, the VIM is executing the + requested action, e.g., it will migrate or terminate a virtual resource. + +NFVI Maintenance +^^^^^^^^^^^^^^^^ + +The NFVI maintenance interface allows the Administrator to notify the VIM about +a planned maintenance operation on the NFVI. 
A maintenance operation may for +example be an update of the server firmware or the hypervisor. The +MaintenanceRequest message contains instructions to change the state of the +physical resource from 'enabled' to 'going-to-maintenance' and a timeout [#timeout]_. +After receiving the MaintenanceRequest, the VIM decides on the actions to be taken +based on maintenance policies predefined by the affected Consumer(s). + +.. [#timeout] Timeout is set by the Administrator and corresponds to the maximum time + to empty the physical resources. + +.. figure:: images/figure5a.png + :name: figure5a + :width: 100% + + High-level message flow for maintenance policy enforcement + +The high level message flow for the NFVI maintenance policy enforcement is shown +in :numref:`figure5a`. It consists of the following steps: + +1. Maintenance trigger received from Administrator. +2. VIM switches the affected physical resources to "going-to-maintenance" state, e.g. so that no new + VM will be scheduled on the physical servers. +3. Database lookup to find the Consumer(s) and virtual resources affected by the maintenance + operation. +4. Maintenance policies are enforced in the VIM, e.g. affected VM(s) are shut down + on the physical server(s), or affected Consumer(s) are notified about the planned + maintenance operation (steps 4a/4b). + + +Once the affected Consumer(s) have been notified, they take specific actions (e.g. switch to standby +(STBY) configuration, request to terminate the virtual resource(s)) to allow the maintenance +action to be executed. After the physical resources have been emptied, the VIM puts the physical +resources in "in-maintenance" state and sends a MaintenanceResponse back to the Administrator. + +.. figure:: images/figure5b.png + :name: figure5b + :width: 100% + + Successful NFVI maintenance + +The high level message flow for a successful NFVI maintenance is shown in :numref:`figure5b`. +It consists of the following steps: + +5.
The Consumer C3 switches to standby configuration (STBY). +6. Instructions from Consumers C2/C3 are sent to the VIM, requesting certain actions to be performed + (steps 6a, 6b). After receiving such instructions, the VIM executes the requested + action in order to empty the physical resources (step 6c) and informs the + Consumer about the result of the actions (steps 6d, 6e). +7. The VIM switches the physical resources to "in-maintenance" state. +8. Maintenance response is sent from VIM to inform the Administrator that the physical + servers have been emptied. +9. The Administrator coordinates and executes the maintenance + operation/work on the NFVI. Note: this step is out of scope of the Doctor project. + +The requested actions to empty the physical resources may not be successful (e.g. migration fails +or takes too long) and in such a case, the VIM puts the physical resources back to 'enabled' and +informs the Administrator about the problem. + +.. figure:: images/figure5c.png + :name: figure5c + :width: 100% + + Example of failed NFVI maintenance + +An example of a high level message flow to cover the failed NFVI maintenance case is +shown in :numref:`figure5c`. +It consists of the following steps: + +5. The Consumer C3 switches to standby configuration (STBY). +6. Instructions from Consumers C2/C3 are sent to the VIM, requesting certain actions to be performed + (steps 6a, 6b). The VIM executes the requested actions and sends back a NACK to consumer C2 + (step 6d) as the migration of the virtual resource(s) is not completed by the given timeout. +7. The VIM switches the physical resources to "enabled" state. +8. MaintenanceNotification is sent from VIM to inform the Administrator that the maintenance action + cannot start. + + +.. 
+ vim: set tabstop=4 expandtab textwidth=80: + diff --git a/docs/development/requirements/04-gaps.rst b/docs/development/requirements/04-gaps.rst new file mode 100644 index 00000000..b8ff7f2e --- /dev/null +++ b/docs/development/requirements/04-gaps.rst @@ -0,0 +1,389 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +Gap analysis in upstream projects +================================= + +This section presents the findings of gaps on existing VIM platforms. The focus +was to identify gaps based on the features and requirements specified in Section +3.3. The analysis work determined gaps that are presented here. + +VIM Northbound Interface +------------------------ + +Immediate Notification +^^^^^^^^^^^^^^^^^^^^^^ + +* Type: 'deficiency in performance' +* Description + + + To-be + + - VIM has to notify unavailability of virtual resource (fault) to VIM user + immediately. + - Notification should be passed in '1 second' after fault detected/notified + by VIM. + - Also, the following conditions/requirement have to be met: + + - Only the owning user can receive notification of fault related to owned + virtual resource(s). + + + As-is + + - OpenStack Metering 'Ceilometer' can notify unavailability of virtual + resource (fault) to the owner of virtual resource based on alarm + configuration by the user. + + - Ceilometer Alarm API: + http://docs.openstack.org/developer/ceilometer/webapi/v2.html#alarms + + - Alarm notifications are triggered by alarm evaluator instead of + notification agents that might receive faults + + - Ceilometer Architecture: + http://docs.openstack.org/developer/ceilometer/architecture.html#id1 + + - Evaluation interval should be equal to or larger than configured pipeline + interval for collection of underlying metrics. 
+ + - https://github.com/openstack/ceilometer/blob/stable/juno/ceilometer/alarm/service.py#L38-42 + + - The interval for collection has to be set large enough which depends on + the size of the deployment and the number of metrics to be collected. + - The interval may not be less than one second in even small deployments. + The default value is 60 seconds. + - Alternative: OpenStack has a message bus to publish system events. + The operator can allow the user to connect this, but there are no + functions to filter out other events that should not be passed to the user + or which were not requested by the user. + + + Gap + + - Fault notifications cannot be received immediately by Ceilometer. + +* Solved by + + + Event Alarm Evaluator: + https://specs.openstack.org/openstack/ceilometer-specs/specs/liberty/event-alarm-evaluator.html + + New OpenStack alarms and notifications project AODH: + http://docs.openstack.org/developer/aodh/ + +Maintenance Notification +^^^^^^^^^^^^^^^^^^^^^^^^ + +* Type: 'missing' +* Description + + + To-be + + - VIM has to notify unavailability of virtual resource triggered by NFVI + maintenance to VIM user. + - Also, the following conditions/requirements have to be met: + + - VIM should accept maintenance message from administrator and mark target + physical resource "in maintenance". + - Only the owner of virtual resource hosted by target physical resource + can receive the notification that can trigger some process for + applications which are running on the virtual resource (e.g. cut off + VM). + + + As-is + + - OpenStack: None + - AWS (just for study) + + - AWS provides API and CLI to view status of resource (VM) and to create + instance status and system status alarms to notify you when an instance + has a failed status check. + http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html + - AWS provides API and CLI to view scheduled events, such as a reboot or + retirement, for your instances. 
Also, those events will be notified + via e-mail. + http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html + + + Gap + + - VIM user cannot receive maintenance notifications. + +* Solved by + + + https://blueprints.launchpad.net/nova/+spec/service-status-notification + +VIM Southbound interface +------------------------ + +Normalization of data collection models +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +* Type: 'missing' +* Description + + + To-be + + - A normalized data format needs to be created to cope with the many data + models from different monitoring solutions. + + + As-is + + - Data can be collected from many places (e.g. Zabbix, Nagios, Cacti, + Zenoss). Although each solution establishes its own data models, no common + data abstraction models exist in OpenStack. + + + Gap + + - Normalized data format does not exist. + +* Solved by + + + Specification in Section :ref:`southbound`. + +OpenStack +--------- + +Ceilometer +^^^^^^^^^^ + +OpenStack offers a telemetry service, Ceilometer, for collecting measurements of +the utilization of physical and virtual resources [CEIL]_. Ceilometer can +collect a number of metrics across multiple OpenStack components and watch for +variations and trigger alarms based upon the collected data. + +Scalability of fault aggregation +________________________________ + +* Type: 'scalability issue' +* Description + + + To-be + + - Be able to scale to a large deployment, where thousands of monitoring + events per second need to be analyzed. + + + As-is + + - Performance issue when scaling to medium-sized deployments. + + + Gap + + - Ceilometer seems to be unsuitable for monitoring medium and large scale + NFVI deployments. + +* Solved by + + + Usage of Zabbix for fault aggregation [ZABB]_. 
Zabbix can support a much + higher number of fault events (up to 15 thousand events per second), but + obviously also has some upper bound: + http://blog.zabbix.com/scalable-zabbix-lessons-on-hitting-9400-nvps/2615/ + + + Decentralized/hierarchical deployment with multiple instances, where one + instance is only responsible for a small NFVI. + +Monitoring of hardware and software +___________________________________ + +* Type: 'missing (lack of functionality)' +* Description + + + To-be + + - OpenStack (as VIM) should monitor various hardware and software in NFVI to + handle faults on them by Ceilometer. + - OpenStack may have monitoring functionality in itself and can be + integrated with third party monitoring tools. + - OpenStack needs to be able to detect the faults listed in the Annex. + + + As-is + + - For each deployment of OpenStack, the operator is responsible for + configuring monitoring tools with relevant scripts or plugins in order to + monitor hardware and software. + - OpenStack Ceilometer does not monitor hardware and software to capture + faults. + + + Gap + + - Ceilometer is not able to detect and handle all faults listed in the Annex. + +* Solved by + + + Use of dedicated monitoring tools like Zabbix or Monasca. + See :ref:`nfvi_faults`. + +Nova +^^^^ + +OpenStack Nova [NOVA]_ is a mature and widely known and used component in +OpenStack cloud deployments. It is the main part of an +"infrastructure-as-a-service" system providing a cloud computing fabric +controller, supporting a wide variety of virtualization and container +technologies. + +Nova has proven throughout these past years to be highly available and +fault-tolerant. Featuring its own API, it also provides a compatibility API with +Amazon EC2 APIs. + +Correct states when compute host is down +________________________________________ + +* Type: 'missing (lack of functionality)' +* Description + + + To-be + + - The API shall support changing the VM power state in case the host has failed.
+ - The API shall support to change nova-compute state. + - There could be single API to change different VM states for all VMs + belonging to a specific host. + - Support external systems that are monitoring the infrastructure and resources + that are able to call the API fast and reliable. + - Resource states are reliable such that correlation actions can be fast and automated. + - User shall be able to read states from OpenStack and trust they are correct. + + + As-is + + - When a VM goes down due to a host HW, host OS or hypervisor failure, + nothing happens in OpenStack. The VMs of a crashed host/hypervisor are + reported to be live and OK through the OpenStack API. + - nova-compute state might change too slowly or the state is not reliable + if expecting also VMs to be down. This leads to ability to schedule VMs + to a failed host and slowness blocks evacuation. + + + Gap + + - OpenStack does not change its states fast and reliably enough. + - The API does not support to have an external system to change states and to + trust the states are reliable (external system has fenced failed host). + - User cannot read all the states from OpenStack nor trust they are right. + +* Solved by + + + https://blueprints.launchpad.net/nova/+spec/mark-host-down + + https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service + +Evacuate VMs in Maintenance mode +________________________________ + +* Type: 'missing' +* Description + + + To-be + + - When maintenance mode for a compute host is set, trigger VM evacuation to + available compute nodes before bringing the host down for maintenance. + + + As-is + + - If setting a compute node to a maintenance mode, OpenStack only schedules + evacuation of all VMs to available compute nodes if in-maintenance compute + node runs the XenAPI and VMware ESX hypervisors. Other hypervisors (e.g. 
+ KVM) are not supported and, hence, guest VMs will likely stop running due + to maintenance actions administrator may perform (e.g. hardware upgrades, + OS updates). + + + Gap + + - Nova libvirt hypervisor driver does not implement automatic guest VMs + evacuation when compute nodes are set to maintenance mode (``$ nova + host-update --maintenance enable ``). + +Monasca +^^^^^^^ + +Monasca is an open-source monitoring-as-a-service (MONaaS) solution that +integrates with OpenStack. Even though it is still in its early days, it is the +interest of the community that the platform be multi-tenant, highly scalable, +performant and fault-tolerant. It provides a streaming alarm engine, a +notification engine, and a northbound REST API users can use to interact with +Monasca. Hundreds of thousands of metrics per second can be processed +[MONA]_. + +Anomaly detection +_________________ + + +* Type: 'missing (lack of functionality)' +* Description + + + To-be + + - Detect the failure and perform a root cause analysis to filter out other + alarms that may be triggered due to their cascading relation. + + + As-is + + - A mechanism to detect root causes of failures is not available. + + + Gap + + - Certain failures can trigger many alarms due to their dependency on the + underlying root cause of failure. Knowing the root cause can help filter + out unnecessary and overwhelming alarms. + +* Status + + + Monasca as of now lacks this feature, although the community is aware and + working toward supporting it. + +Sensor monitoring +_________________ + +* Type: 'missing (lack of functionality)' +* Description + + + To-be + + - It should support monitoring sensor data retrieval, for instance, from + IPMI. + + + As-is + + - Monasca does not monitor sensor data + + + Gap + + - Sensor monitoring is very important. It provides operators status + on the state of the physical infrastructure (e.g. temperature, fans). 
+ +* Addressed by + + + Monasca can be configured to use third-party monitoring solutions (e.g. + Nagios, Cacti) for retrieving additional data. + +Hardware monitoring tools +------------------------- + +Zabbix +^^^^^^ + +Zabbix is an open-source solution for monitoring availability and performance of +infrastructure components (i.e. servers and network devices), as well as +applications [ZABB]_. It can be customized for use with OpenStack. It is a +mature tool and has been proven to be able to scale to large systems with +100,000s of devices. + +Delay in execution of actions +_____________________________ + + +* Type: 'deficiency in performance' +* Description + + + To-be + + - After detecting a fault, the monitoring tool should immediately execute + the appropriate action, e.g. inform the manager through the NB I/F + + + As-is + + - A delay of around 10 seconds was measured in two independent testbed + deployments + + + Gap + + - Cause of the delay is a periodic evaluation and notification. Periodicity is configured + as 30s default value and can be reduced to 5s but not below. + https://github.com/zabbix/zabbix/blob/trunk/conf/zabbix_server.conf#L329 + + +.. + vim: set tabstop=4 expandtab textwidth=80: diff --git a/docs/development/requirements/05-implementation.rst b/docs/development/requirements/05-implementation.rst new file mode 100644 index 00000000..84979772 --- /dev/null +++ b/docs/development/requirements/05-implementation.rst @@ -0,0 +1,1050 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +Detailed architecture and interface specification +================================================= + +This section describes a detailed implementation plan, which is based on the +high level architecture introduced in Section 3. Section 5.1 describes the +functional blocks of the Doctor architecture, which is followed by a high level +message flow in Section 5.2. 
Section 5.3 provides a mapping of selected existing +open source components to the building blocks of the Doctor architecture. +Thereby, the selection of components is based on their maturity and the gap +analysis executed in Section 4. Sections 5.4 and 5.5 detail the specification of +the related northbound interface and the related information elements. Finally, +Section 5.6 provides a first set of blueprints to address selected gaps required +for the realization of the functionalities of the Doctor project. + +.. _impl_fb: + +Functional Blocks +----------------- + +This section introduces the functional blocks that form the VIM. OpenStack was +selected as the candidate for implementation. Inside the VIM, 4 different +building blocks are defined (see :numref:`figure6`). + +.. figure:: images/figure6.png + :name: figure6 + :width: 100% + + Functional blocks + +Monitor +^^^^^^^ + +The Monitor module has the responsibility for monitoring the virtualized +infrastructure. There are already many existing tools and services (e.g. Zabbix) +to monitor different aspects of hardware and software resources which can be +used for this purpose. + +Inspector +^^^^^^^^^ + +The Inspector module has the ability a) to receive various failure notifications +regarding physical resource(s) from Monitor module(s), b) to find the affected +virtual resource(s) by querying the resource map in the Controller, and c) to +update the state of the virtual resource (and physical resource). + +The Inspector has drivers for different types of events and resources to +integrate any type of Monitor and Controller modules. It also uses a failure +policy database to decide on the failure selection and aggregation from raw +events. This failure policy database is configured by the Administrator. + +The reason for separating the Inspector and Controller modules is to keep the +Controller focused on simple operations by avoiding a tight integration of various +health check mechanisms into the Controller.
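
The three Inspector responsibilities a) to c) described above can be illustrated with a minimal Python sketch. It is not part of the Doctor specification: the class names, the in-memory resource map, the event type strings and the chosen state values are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Controller:
    """Hypothetical stand-in for the Controller and its resource map."""
    # physical resource ID -> IDs of the virtual resources hosted on it
    resource_map: Dict[str, List[str]]
    # resource ID -> last state written by the Inspector
    states: Dict[str, str] = field(default_factory=dict)

    def affected_virtual_resources(self, physical_id: str) -> List[str]:
        return list(self.resource_map.get(physical_id, []))

    def update_state(self, resource_id: str, state: str) -> None:
        self.states[resource_id] = state


@dataclass
class Inspector:
    controller: Controller
    # Assumed failure policy: raw event types that count as faults.
    fault_event_types: Set[str]

    def handle_event(self, physical_id: str, event_type: str) -> List[str]:
        # a) receive a raw failure notification and apply the failure policy
        if event_type not in self.fault_event_types:
            return []
        # b) find the affected virtual resources via the resource map
        affected = self.controller.affected_virtual_resources(physical_id)
        # c) update the state of the physical and virtual resources
        #    ("down" and "error" are illustrative state values)
        self.controller.update_state(physical_id, "down")
        for virtual_id in affected:
            self.controller.update_state(virtual_id, "error")
        return affected
```

In this sketch the Controller only stores and returns data, while the policy decision stays in the Inspector, which mirrors the separation of concerns argued for above.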
+ +Controller +^^^^^^^^^^ + +The Controller is responsible for maintaining the resource map (i.e. the mapping +from physical resources to virtual resources), accepting update requests for the +resource state(s) (exposed as a provider API), and sending all failure events +regarding virtual resources to the Notifier. Optionally, the Controller has the +ability to force the state of a given physical resource to "down" in the resource +mapping when it receives failure notifications from the Inspector for that +given physical resource. +The Controller also re-calculates the capacity of the NFVI when receiving a +failure notification for a physical resource. + +In a real-world deployment, the VIM may have several controllers, one for each +resource type, such as Nova, Neutron and Cinder in OpenStack. Each controller +maintains a database of virtual and physical resources which shall be the master +source for resource information inside the VIM. + +Notifier +^^^^^^^^ + +The focus of the Notifier is on selecting and aggregating failure events +received from the Controller based on policies mandated by the Consumer. +Therefore, it allows the Consumer to subscribe to alarms regarding virtual +resources using a method such as an API endpoint. After receiving a fault +event from a Controller, it will notify the Consumer of the fault by referring +to the alarm configuration which was defined by the Consumer earlier on. + +To reduce complexity of the Controller, it is a good approach for the +Controllers to emit all notifications without any filtering mechanism and have +another service (i.e. Notifier) handle those notifications properly. This is the +general philosophy of notifications in OpenStack.
Note that a fault message +consumed by the Notifier is different from the fault message received by the +Inspector; the former message is related to virtual resources which are visible +to users with relevant ownership, whereas the latter is related to raw devices +or small entities which should be handled with an administrator privilege. + +The northbound interface between the Notifier and the Consumer/Administrator is +specified in :ref:`impl_nbi`. + +Sequence +-------- + +Fault Management +^^^^^^^^^^^^^^^^ + +The detailed work flow for fault management is as follows (see also :numref:`figure7`): + +1. Request to subscribe to monitor specific virtual resources. A query filter + can be used to narrow down the alarms the Consumer wants to be informed + about. +2. Each subscription request is acknowledged with a subscribe response message. + The response message contains information about the subscribed virtual + resources, in particular if a subscribed virtual resource is in "alarm" + state. +3. The NFVI sends monitoring events for resources the VIM has been subscribed + to. Note: this subscription message exchange between the VIM and NFVI is not + shown in this message flow. +4. Event correlation, fault detection and aggregation in VIM. +5. Database lookup to find the virtual resources affected by the detected fault. +6. Fault notification to Consumer. +7. The Consumer switches to standby configuration (STBY) +8. Instructions to VIM requesting certain actions to be performed on the + affected resources, for example migrate/update/terminate specific + resource(s). After reception of such instructions, the VIM is executing the + requested action, e.g. it will migrate or terminate a virtual resource. +a. Query request from Consumer to VIM to get information about the current + status of a resource. +b. Response to the query request with information about the current status of + the queried resource. 
In case the resource is in "fault" state, information + about the related fault(s) is returned. + +In order to allow for quick reaction to failures, the time interval between +fault detection in step 3 and the corresponding recovery actions in steps 7 and 8 +shall be less than 1 second. + +.. figure:: images/figure7.png + :name: figure7 + :width: 100% + + Fault management work flow + +.. figure:: images/figure8.png + :name: figure8 + :width: 100% + + Fault management scenario + +:numref:`figure8` shows a more detailed message flow (Steps 4 to 6) between +the 4 building blocks introduced in :ref:`impl_fb`. + +4. The Monitor observes a fault in the NFVI and reports the raw fault to the + Inspector. + The Inspector filters and aggregates the faults using pre-configured + failure policies. + +5. + a) The Inspector queries the Resource Map to find the virtual resources + affected by the raw fault in the NFVI. + b) The Inspector updates the state of the affected virtual resources in the + Resource Map. + c) The Controller observes a change of the virtual resource state and informs + the Notifier about the state change and the related alarm(s). + Alternatively, the Inspector may directly inform the Notifier about it. + +6. The Notifier performs another filtering and aggregation of the changes + and alarms based on the pre-configured alarm configuration. Finally, a fault + notification is sent northbound to the Consumer. + +NFVI Maintenance +^^^^^^^^^^^^^^^^ +.. figure:: images/figure9.png + :name: figure9 + :width: 100% + + NFVI maintenance work flow + +The detailed work flow for NFVI maintenance is shown in :numref:`figure9` +and has the following steps. Note that steps 1, 2, and 5 to 8a in the NFVI +maintenance work flow are very similar to the steps in the fault management work +flow and share a similar implementation plan in Release 1. + +1. Subscribe to fault/maintenance notifications. +2. Response to subscribe request. +3.
Maintenance trigger received from the Administrator. +4. VIM switches NFVI resources to "maintenance" state. This means, e.g., that they + should not be used for further allocation/migration requests. +5. Database lookup to find the virtual resources affected by the detected + maintenance operation. +6. Maintenance notification to Consumer. +7. The Consumer switches to standby configuration (STBY). +8. Instructions from Consumer to VIM requesting certain recovery actions to be + performed (step 8a). After reception of such instructions, the VIM + executes the requested action in order to empty the physical resources (step + 8b). +9. Maintenance response from VIM to inform the Administrator that the physical + machines have been emptied (or the operation resulted in an error state). +10. Administrator coordinates and executes the maintenance operation/work + on the NFVI. +a) Query request from Administrator to VIM to get information about the + current state of a resource. +b) Response to the query request with information about the current state of + the queried resource(s). In case the resource is in "maintenance" state, + information about the related maintenance operation is returned. + +.. figure:: images/figure10.png + :name: figure10 + :width: 100% + + NFVI Maintenance implementation plan + +:numref:`figure10` shows a more detailed message flow (Steps 3 to 6 and 9) +between the 4 building blocks introduced in Section 5.1. + +3. The Administrator sends a StateChange request to the Controller residing + in the VIM. +4. The Controller queries the Resource Map to find the virtual resources + affected by the planned maintenance operation. +5. + + a) The Controller updates the state of the affected virtual resources in the + Resource Map database. + + b) The Controller informs the Notifier about the virtual resources that will + be affected by the maintenance operation. + +6. A maintenance notification is sent northbound to the Consumer. + +... + +9.
The Controller informs the Administrator after the physical resources have + been freed. + + + +Implementation plan for OPNFV Release 1 +--------------------------------------- + +Fault management +^^^^^^^^^^^^^^^^ + +:numref:`figure11` shows the implementation plan based on OpenStack and +related components as planned for Release 1. Hereby, the Monitor can be realized +by Zabbix. The Controller is realized by OpenStack Nova [NOVA]_, Neutron +[NEUT]_, and Cinder [CIND]_ for compute, network, and storage, +respectively. The Inspector can be realized by Monasca [MONA]_ or a simple +script querying Nova in order to map between physical and virtual resources. The +Notifier will be realized by Ceilometer [CEIL]_ receiving failure events +on its notification bus. + +:numref:`figure12` shows the inner-workings of Ceilometer. After receiving +an "event" on its notification bus, first a notification agent will grab the +event and send a "notification" to the Collector. The collector writes the +notifications received to the Ceilometer databases. + +In the existing Ceilometer implementation, an alarm evaluator is periodically +polling those databases through the APIs provided. If it finds new alarms, it +will evaluate them based on the pre-defined alarm configuration, and depending +on the configuration, it will hand a message to the Alarm Notifier, which in +turn will send the alarm message northbound to the Consumer. :numref:`figure12` +also shows an optimized work flow for Ceilometer with the goal to +reduce the delay for fault notifications to the Consumer. The approach is to +implement a new notification agent (called "publisher" in Ceilometer +terminology) which is directly sending the alarm through the "Notification Bus" +to a new "Notification-driven Alarm Evaluator (NAE)" (see Sections 5.6.2 and +5.6.3), thereby bypassing the Collector and avoiding the additional delay of the +existing polling-based alarm evaluator. 
The NAE is similar to the OpenStack +"Alarm Evaluator", but is triggered by incoming notifications instead of +periodically polling the OpenStack "Alarms" database for new alarms. The +Ceilometer "Alarms" database can hold three states: "normal", "insufficient +data", and "fired". It is representing a persistent alarm database. In order to +realize the Doctor requirements, we need to define new "meters" in the database +(see Section 5.6.1). + +.. figure:: images/figure11.png + :name: figure11 + :width: 100% + + Implementation plan in OpenStack (OPNFV Release 1 ”Arno”) + + +.. figure:: images/figure12.png + :name: figure12 + :width: 100% + + Implementation plan in Ceilometer architecture + + +NFVI Maintenance +^^^^^^^^^^^^^^^^ + +For NFVI Maintenance, a quite similar implementation plan exists. Instead of a +raw fault being observed by the Monitor, the Administrator is sending a +Maintenance Request through the northbound interface towards the Controller +residing in the VIM. Similar to the Fault Management use case, the Controller +(in our case OpenStack Nova) will send a maintenance event to the Notifier (i.e. +Ceilometer in our implementation). Within Ceilometer, the same workflow as +described in the previous section applies. In addition, the Controller(s) will +take appropriate actions to evacuate the physical machines in order to prepare +them for the planned maintenance operation. After the physical machines are +emptied, the Controller will inform the Administrator that it can initiate the +maintenance. Alternatively the VMs can just be shut down and boot up on the +same host after maintenance is over. There needs to be policy for administrator +to know the plan for VMs in maintenance. + +Information elements +-------------------- + +This section introduces all attributes and information elements used in the +messages exchange on the northbound interfaces between the VIM and the VNFO and +VNFM. 
+ +Note: The information elements will be aligned with current work in ETSI NFV IFA +working group. + + +Simple information elements: + +* SubscriptionID (Identifier): identifies a subscription to receive fault or maintenance + notifications. +* NotificationID (Identifier): identifies a fault or maintenance notification. +* VirtualResourceID (Identifier): identifies a virtual resource affected by a + fault or a maintenance action of the underlying physical resource. +* PhysicalResourceID (Identifier): identifies a physical resource affected by a + fault or maintenance action. +* VirtualResourceState (String): state of a virtual resource, e.g. "normal", + "maintenance", "down", "error". +* PhysicalResourceState (String): state of a physical resource, e.g. "normal", + "maintenance", "down", "error". +* VirtualResourceType (String): type of the virtual resource, e.g. "virtual + machine", "virtual memory", "virtual storage", "virtual CPU", or "virtual + NIC". +* FaultID (Identifier): identifies the related fault in the underlying physical + resource. This can be used to correlate different fault notifications caused + by the same fault in the physical resource. +* FaultType (String): Type of the fault. The allowed values for this parameter + depend on the type of the related physical resource. For example, a resource + of type "compute hardware" may have faults of type "CPU failure", "memory + failure", "network card failure", etc. +* Severity (Integer): value expressing the severity of the fault. The higher the + value, the more severe the fault. +* MinSeverity (Integer): value used in filter information elements. Only faults + with a severity higher than the MinSeverity value will be notified to the + Consumer. +* EventTime (Datetime): Time when the fault was observed. +* EventStartTime and EventEndTime (Datetime): Datetime range that can be used in + a FaultQueryFilter to narrow down the faults to be queried. 
+* ProbableCause (String): information about the probable cause of the fault. +* CorrelatedFaultID (Integer): list of other faults correlated to this fault. +* isRootCause (Boolean): Parameter indicating if this fault is the root for + other correlated faults. If TRUE, then the faults listed in the parameter + CorrelatedFaultID are caused by this fault. +* FaultDetails (Key-value pair): provides additional information about the + fault, e.g. information about the threshold, monitored attributes, indication + of the trend of the monitored parameter. +* FirmwareVersion (String): current version of the firmware of a physical + resource. +* HypervisorVersion (String): current version of a hypervisor. +* ZoneID (Identifier): Identifier of the resource zone. A resource zone is the + logical separation of physical and software resources in an NFVI deployment + for physical isolation, redundancy, or administrative designation. +* Metadata (Key-value pair): provides additional information of a physical + resource in maintenance/error state. + +Complex information elements (see also UML diagrams in :numref:`figure13` +and :numref:`figure14`): + +* VirtualResourceInfoClass: + + + VirtualResourceID [1] (Identifier) + + VirtualResourceState [1] (String) + + Faults [0..*] (FaultClass): For each resource, all faults + including detailed information about the faults are provided. + +* FaultClass: The parameters of the FaultClass are partially based on ETSI TS + 132 111-2 (V12.1.0) [*]_, which is specifying fault management in 3GPP, in + particular describing the information elements used for alarm notifications. + + - FaultID [1] (Identifier) + - FaultType [1] (String) + - Severity [1] (Integer) + - EventTime [1] (Datetime) + - ProbableCause [1] (String) + - CorrelatedFaultID [0..*] (Identifier) + - FaultDetails [0..*] (Key-value pair) + +.. 
[*] http://www.etsi.org/deliver/etsi_ts/132100_132199/13211102/12.01.00_60/ts_13211102v120100p.pdf + +* SubscribeFilterClass + + - VirtualResourceType [0..*] (String) + - VirtualResourceID [0..*] (Identifier) + - FaultType [0..*] (String) + - MinSeverity [0..1] (Integer) + +* FaultQueryFilterClass: narrows down the FaultQueryRequest, for example it + limits the query to certain physical resources, a certain zone, a given fault + type/severity/cause, or a specific FaultID. + + - VirtualResourceType [0..*] (String) + - VirtualResourceID [0..*] (Identifier) + - FaultType [0..*] (String) + - MinSeverity [0..1] (Integer) + - EventStartTime [0..1] (Datetime) + - EventEndTime [0..1] (Datetime) + +* PhysicalResourceStateClass: + + - PhysicalResourceID [1] (Identifier) + - PhysicalResourceState [1] (String): mandates the new state of the physical + resource. + - Metadata [0..*] (Key-value pair) + +* PhysicalResourceInfoClass: + + - PhysicalResourceID [1] (Identifier) + - PhysicalResourceState [1] (String) + - FirmwareVersion [0..1] (String) + - HypervisorVersion [0..1] (String) + - ZoneID [0..1] (Identifier) + - Metadata [0..*] (Key-value pair) + +* StateQueryFilterClass: narrows down a StateQueryRequest, for example it limits + the query to certain physical resources, a certain zone, or a given resource + state (e.g., only resources in "maintenance" state). + + - PhysicalResourceID [1] (Identifier) + - PhysicalResourceState [1] (String) + - ZoneID [0..1] (Identifier) + +.. _impl_nbi: + +Detailed northbound interface specification +------------------------------------------- + +This section specifies the northbound interfaces for fault management and +NFVI maintenance between the VIM on one end and the Consumer and the +Administrator on the other. For each interface, all messages and related +information elements are provided. + +Note: The interface definition will be aligned with current work in ETSI NFV IFA +working group.
+ +All of the interfaces described below are produced by the VIM and consumed by +the Consumer or Administrator. + +Fault management interface +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This interface allows the VIM to notify the Consumer about a virtual resource +that is affected by a fault, either within the virtual resource itself or by the +underlying virtualization infrastructure. The messages on this interface are +shown in :numref:`figure13` and explained in detail in the following +subsections. + +Note: The information elements used in this section are described in detail in +Section 5.4. + +.. figure:: images/figure13.png + :name: figure13 + :width: 100% + + Fault management NB I/F messages + + +SubscribeRequest (Consumer -> VIM) +__________________________________ + +Subscription from Consumer to VIM to be notified about faults of specific +resources. The faults to be notified about can be narrowed down using a +subscribe filter. + +Parameters: + +- SubscribeFilter [1] (SubscribeFilterClass): Optional information to narrow + down the faults that shall be notified to the Consumer, for example limit to + specific VirtualResourceID(s), severity, or cause of the alarm. + +SubscribeResponse (VIM -> Consumer) +___________________________________ + +Response to a subscribe request message including information about the +subscribed resources, in particular if they are in "fault/error" state. + +Parameters: + +* SubscriptionID [1] (Identifier): Unique identifier for the subscription. It + can be used to delete or update the subscription. +* VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional + information about the subscribed resources, i.e., a list of the related + resources, the current state of the resources, etc. 
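
As an illustration of the SubscribeRequest/SubscribeResponse exchange above, the following Python sketch models a subscribe filter and a VIM-side handler. It is only a sketch under stated assumptions: the Python names, the rule that an empty list means "no restriction", the use of a UUID as SubscriptionID, and the in-memory ``known_states`` mapping are hypothetical and not defined by this specification.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SubscribeFilter:
    """Subset of SubscribeFilterClass; empty lists mean 'no restriction' (assumption)."""
    virtual_resource_ids: List[str] = field(default_factory=list)
    fault_types: List[str] = field(default_factory=list)
    min_severity: Optional[int] = None

    def matches(self, resource_id: str, fault_type: str, severity: int) -> bool:
        if self.virtual_resource_ids and resource_id not in self.virtual_resource_ids:
            return False
        if self.fault_types and fault_type not in self.fault_types:
            return False
        # Only faults with a severity higher than MinSeverity are notified.
        if self.min_severity is not None and severity <= self.min_severity:
            return False
        return True


@dataclass
class SubscribeResponse:
    subscription_id: str
    virtual_resource_info: List[Dict[str, str]]


def subscribe(flt: SubscribeFilter, known_states: Dict[str, str]) -> SubscribeResponse:
    """Hypothetical VIM-side handler: returns an ID plus the current resource states."""
    selected = flt.virtual_resource_ids or sorted(known_states)
    info = [
        {"VirtualResourceID": rid, "VirtualResourceState": known_states[rid]}
        for rid in selected
        if rid in known_states
    ]
    return SubscribeResponse(subscription_id=str(uuid.uuid4()), virtual_resource_info=info)
```

A Consumer would keep the returned ``subscription_id`` to update or delete the subscription later, and can inspect ``virtual_resource_info`` to see whether a subscribed resource is already in an error state.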
+ +FaultNotification (VIM -> Consumer) +___________________________________ + +Notification about a virtual resource that is affected by a fault, either within +the virtual resource itself or by the underlying virtualization infrastructure. +After reception of this notification, the Consumer will decide on the optimal +action to resolve the fault. This includes actions like switching to a hot +standby virtual resource, migration of the faulty virtual resource to another +physical machine, termination of the faulty virtual resource and instantiation +of a new virtual resource in order to provide a new hot standby resource. In +some use cases the Consumer can leave virtual resources on the failed host to be +booted up again after the fault is recovered. Existing resource management +interfaces and messages between the Consumer and the VIM can be used for those +actions, and there is no need to define additional actions on the Fault +Management Interface. + +Parameters: + +* NotificationID [1] (Identifier): Unique identifier for the notification. +* VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of faulty + resources with detailed information about the faults. + +FaultQueryRequest (Consumer -> VIM) +___________________________________ + +Request to find out about active alarms at the VIM. A FaultQueryFilter can be +used to narrow down the alarms returned in the response message. + +Parameters: + +* FaultQueryFilter [1] (FaultQueryFilterClass): narrows down the + FaultQueryRequest, for example it limits the query to certain physical + resources, a certain zone, a given fault type/severity/cause, or a specific + FaultID. + +FaultQueryResponse (VIM -> Consumer) +____________________________________ + +List of active alarms at the VIM matching the FaultQueryFilter specified in the +FaultQueryRequest. + +Parameters: + +* VirtualResourceInfo [0..*] (VirtualResourceInfoClass): List of faulty + resources.
For each resource, all faults including detailed information about + the faults are provided. + +NFVI maintenance +^^^^^^^^^^^^^^^^ + +The NFVI maintenance interface Consumer-VIM allows the Consumer to subscribe to +maintenance notifications provided by the VIM. The related maintenance interface +Administrator-VIM allows the Administrator to issue maintenance requests to the +VIM, i.e. requesting the VIM to take appropriate actions to empty physical +machine(s) in order to execute maintenance operations on them. The interface +also allows the Administrator to query the state of physical machines, e.g., in +order to get details on the current status of a maintenance operation like a +firmware update. + +The messages defined in these northbound interfaces are shown in :numref:`figure14` +and described in detail in the following subsections. + +.. figure:: images/figure14.png + :name: figure14 + :width: 100% + + NFVI maintenance NB I/F messages + +SubscribeRequest (Consumer -> VIM) +__________________________________ + +Subscription from Consumer to VIM to be notified about maintenance operations +for specific virtual resources. The resources to be informed about can be +narrowed down using a subscribe filter. + +Parameters: + +* SubscribeFilter [1] (SubscribeFilterClass): Information to narrow down the + faults that shall be notified to the Consumer, for example limit to specific + virtual resource type(s). + +SubscribeResponse (VIM -> Consumer) +___________________________________ + +Response to a subscribe request message, including information about the +subscribed virtual resources, in particular if they are in "maintenance" state. + +Parameters: + +* SubscriptionID [1] (Identifier): Unique identifier for the subscription. It + can be used to delete or update the subscription.
+* VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional + information about the subscribed virtual resource(s), e.g., the ID, type and + current state of the resource(s). + +MaintenanceNotification (VIM -> Consumer) +_________________________________________ + +Notification about a physical resource switched to "maintenance" state. After +reception of this notification, the Consumer will decide on the optimal action to +address it, e.g., to switch to the standby (STBY) configuration. + +Parameters: + +* VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of virtual + resources where the state has been changed to maintenance. + +StateChangeRequest (Administrator -> VIM) +_________________________________________ + +Request to change the state of a list of physical resources, e.g. to +"maintenance" state, in order to prepare them for a planned maintenance +operation. + +Parameters: + +* PhysicalResourceState [1..*] (PhysicalResourceStateClass) + +StateChangeResponse (VIM -> Administrator) +__________________________________________ + +Response message to inform the Administrator that the requested resources are +now in maintenance state (or the operation resulted in an error) and the +maintenance operation(s) can be executed. + +Parameters: + +* PhysicalResourceInfo [1..*] (PhysicalResourceInfoClass) + +StateQueryRequest (Administrator -> VIM) +________________________________________ + +With this request, the Administrator retrieves information about +physical machine(s), e.g. their state ("normal", "maintenance"), firmware +version, hypervisor version, update status of firmware and hypervisor, etc. It +can be used to check the progress during a firmware update and to confirm the +result after the update. A filter can be used to narrow down the resources returned in the +response message.
+ +Parameters: + +* StateQueryFilter [1] (StateQueryFilterClass): narrows down the + StateQueryRequest, for example it limits the query to certain physical + resources, a certain zone, or a given resource state. + +StateQueryResponse (VIM -> Administrator) +_________________________________________ + +List of physical resources matching the filter specified in the +StateQueryRequest. + +Parameters: + +* PhysicalResourceInfo [0..*] (PhysicalResourceInfoClass): List of physical + resources. For each resource, information about the current state, the + firmware version, etc. is provided. + +NFV IFA, OPNFV Doctor and AODH alarms +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This section compares the alarm interfaces of ETSI NFV IFA with the specifications +of this document and the alarm class of AODH. + +ETSI NFV specifies an interface for alarms from virtualised resources in ETSI GS +NFV-IFA 005 [ENFV]_. The interface specifies an Alarm class and two notifications plus +operations to query alarm instances and to subscribe to the alarm notifications. + +The specification in this document has a structure that is very similar to the +ETSI NFV specifications. The notifications differ in that an alarm notification +in the NFV interface defines a single fault for a single resource while the +notification specified in this document can contain multiple faults for +multiple resources. The Doctor specification lacks the detailed time stamps +of the NFV specification, which are essential for synchronization of the alarm list +using the query operation. The detailed time stamps are also of value in the event +and alarm history DBs. + +AODH defines a base class for alarms, not the notifications. This means that +some of the dynamic attributes of the ETSI NFV alarm type, like alarmRaisedTime, +are not applicable to the AODH alarm class but are attributes of the actual +notifications. (Description of these attributes will be added later.)
The AODH alarm +class is lacking some attributes present in the NFV specification, fault details +and correlated alarms. Instead the AODH alarm class has attributes for actions, +rules and user and project id. + + ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| ETSI NFV Alarm Type | OPNFV Doctor | AODH Event Alarm | Description / Comment | Recommendations | +| | Requirement Specs | Notification | | | ++========================+========================+=====================+=============================================+=======================================+ +| alarmId | FaultId | alarm_id | Identifier of an alarm. | \- | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| \- | \- | alarm_name | Human readable alarm name. | May be added in ETSI NFV Stage 3. | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| managedObjectId | VirtualResourceId | (reason) | Identifier of the affected virtual resource | \- | +| | | | is part of the AODH reason parameter. | | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| \- | \- | user_id, project_id | User and project identifiers. | May be added in ETSI NFV Stage 3. | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| alarmRaisedTime | \- | \- | Timestamp when alarm was raised. | To be added to Doctor and AODH. May | +| | | | | be derived (e.g. in a shimlayer) from | +| | | | | the AODH alarm history. 
| ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| alarmChangedTime | \- | \- | Timestamp when alarm was changed/updated. | see above | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| alarmClearedTime | \- | \- | Timestamp when alarm was cleared. | see above | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| eventTime | \- | \- | Timestamp when alarm was first observed by | see above | +| | | | the Monitor. | | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| \- | EventTime | generated | Timestamp of the Notification. | Update parameter name in Doctor spec. | +| | | | | May be added in ETSI NFV Stage 3. | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| state: | VirtualResourceState: | current: ok, alarm, | ETSI NFV IFA 005/006 lists example alarm | Maintenance state is missing in AODH. | +| E.g. Fired, Updated | E.g. normal, down | insufficient_data | states. | List of alarm states will be | +| Cleared | maintenance, error | | | specified in ETSI NFV Stage 3. | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| perceivedSeverity: | Severity (Integer) | Severity: | ETSI NFV IFA 005/006 lists example | List of alarm states will be | +| E.g. Critical, Major, | | low (default), | perceived severity values. | specified in ETSI NFV Stage 3. 
| +| Minor, Warning, | | moderate, critical | | | +| Indeterminate, Cleared | | | | **OPNFV: Severity (Integer)**: | +| | | | | * update OPNFV Doctor specification | +| | | | | to *Enum* | +| | | | | | +| | | | | **perceivedSeverity=Indeterminate**: | +| | | | | * remove value *Indeterminate* in | +| | | | | IFA and map undefined values to | +| | | | | “minor” severity, or | +| | | | | * add value *indeterminate* in AODH | +| | | | | and make it the default value. | +| | | | | | +| | | | | **perceivedSeverity=Cleared**: | +| | | | | * remove value *Cleared* in IFA as | +| | | | | the information about a cleared | +| | | | | alarm can be derived from | +| | | | | the alarm state parameter, or | +| | | | | * add value *cleared* in AODH and | +| | | | | set a rule that the severity is | +| | | | | “cleared” when the state is *ok*. | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| faultType | FaultType | event_type in | Type of the fault, e.g. “CPU failure” of a | OpenStack Alarming (Aodh) can use a | +| | | reason_data | compute resource, in machine interpretable | fuzzy matching with wildcard string, | +| | | | format. | "compute.cpu.failure". | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| N/A | N/A | type = "event" | Type of the notification. For fault | \- | +| | | | notifications the type in AODH is “event”. | | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| probableCause | ProbableCause | \- | Probable cause of the alarm. | May be provided (e.g. in a shimlayer) | +| | | | | based on Vitrage topology awareness / | +| | | | | root-cause-analysis.
| ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| isRootCause | IsRootCause | \- | Boolean indicating whether the fault is the | see above | +| | | | root cause of other faults. | | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| correlatedAlarmId | CorrelatedFaultId | \- | List of IDs of correlated faults. | see above | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| faultDetails | FaultDetails | \- | Additional details about the fault/alarm. | FaultDetails information element will | +| | | | | be specified in ETSI NFV Stage 3. | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ +| \- | \- | action, previous | Additional AODH alarm related parameters. | \- | ++------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ + +Table: Comparison of alarm attributes + +The primary area of improvement should be alignment of the perceived severity. This +is important for a quick and accurate evaluation of the alarm. AODH should therefore +also support the X.733 values Critical, Major, Minor, Warning and Indeterminate. + +The detailed time stamps (raised, changed, cleared), which are essential for +synchronizing the alarm list using a query operation, should be added to the +Doctor specification. + +Another area that needs alignment is the so-called alarm state in NFV. Here we must, +however, consider what can be attributes of the notification vs. what should be a +property of the alarm instance. This will be analyzed later. + +..
_southbound: + +Detailed southbound interface specification +------------------------------------------- + +This section specifies the southbound interfaces for fault management +between the Monitors and the Inspector. +Although southbound interfaces should be flexible to handle various events from +different types of Monitors, we define a unified event API in order to improve +interoperability between the Monitors and the Inspector. +This does not limit the implementation of Monitors and the Inspector, as these could be +extended in order to support failures from intelligent inspection like prediction. + +Note: The interface definition will be aligned with current work in ETSI NFV IFA +working group. + +Fault event interface +^^^^^^^^^^^^^^^^^^^^^ + +This interface allows the Monitors to notify the Inspector about an event which +was captured by the Monitor and may affect resources managed in the VIM. + +EventNotification +_________________ + + +Event notification including fault description. +The entity of this notification is an event, and not specifically a fault or an error. +This allows us to use a generic event format or framework built outside of the Doctor project. +The parameters below shall be mandatory, but keys in 'Details' can be optional. + +Parameters: + +* Time [1]: Datetime when the fault was observed in the Monitor. +* Type [1]: Type of event that will be used to process correlation in Inspector. +* Details [0..1]: Details containing additional information as key-value pairs. + Keys shall be defined depending on the Type of the event. + +E.g.: + +.. code-block:: json + + { + "event": { + "time": "2016-04-12T08:00:00", + "type": "compute.host.down", + "details": { + "hostname": "compute-1", + "source": "sample_monitor", + "cause": "link-down", + "severity": "critical", + "status": "down", + "monitor_id": "monitor-1", + "monitor_event_id": "123" + } + } + } + +Optional parameters in 'Details': + +* Hostname: the hostname on which the event occurred.
+* Source: the display name of the reporter of this event. This is not limited to a monitor; another entity such as 'KVM' can be specified. +* Cause: description of the cause of this event which could be different from the type of this event. +* Severity: the severity of this event set by the monitor. +* Status: the status of the target object in which the error occurred. +* MonitorID: the ID of the monitor sending this event. +* MonitorEventID: the ID of the event in the monitor. This can be used by the operator when tracking the monitor log. +* RelatedTo: the array of IDs which are related to this event. + +Also, we can have a bulk API to receive multiple events in a single HTTP POST +message by using the 'events' wrapper as follows: + +.. code-block:: json + + { + "events": [ + { + "event": { + "time": "2016-04-12T08:00:00", + "type": "compute.host.down", + "details": {} + } + }, + { + "event": { + "time": "2016-04-12T08:00:00", + "type": "compute.host.nic.error", + "details": {} + } + } + ] + } + + + + +Blueprints +---------- + +This section lists a first set of blueprints that have been proposed by the +Doctor project to the open source community. Further blueprints addressing other +gaps identified in Section 4 will be submitted at a later stage of the OPNFV. In +this section the following definitions are used: + +* "Event" is a message emitted by other OpenStack services such as Nova and + Neutron and is consumed by the "Notification Agents" in Ceilometer. +* "Notification" is a message generated by a "Notification Agent" in Ceilometer + based on an "event" and is delivered to the "Collectors" in Ceilometer that + store those notifications (as "sample") to the Ceilometer "Databases". + +Instance State Notification (Ceilometer) [*]_ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The Doctor project is planning to handle "events" and "notifications" regarding +Resource Status: Instance State, Port State, Host State, etc.
Currently, +Ceilometer already receives "events" to identify the state of those resources, +but it does not handle and store them yet. This is why we also need a new event +definition to capture those resource states from "events" created by other +services. + +This BP proposes to add a new compute notification state to handle events from +an instance (server) from Nova. It also creates a new meter "instance.state" in +OpenStack. + +.. [*] https://etherpad.opnfv.org/p/doctor_bps + +Event Publisher for Alarm (Ceilometer) [*]_ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +**Problem statement:** + + The existing "Alarm Evaluator" in OpenStack Ceilometer periodically + queries/polls the databases in order to check all alarms independently from + other processes. This adds delay to the fault notification + sent to the Consumer, whereas one requirement of Doctor is to react to faults + as fast as possible. + + The existing message flow is shown in :numref:`figure12`: after receiving + an "event", a "notification agent" (i.e. "event publisher") will send a + "notification" to a "Collector". The "collector" collects the + notifications and updates the Ceilometer "Meter" database that stores + information about the "sample" which is captured from the original "event". The + "Alarm Evaluator" periodically polls these databases, then queries the "Meter" + database based on each alarm configuration. + + In the current Ceilometer implementation, there is no possibility to directly + trigger the "Alarm Evaluator" when a new "event" is received; the "Alarm + Evaluator" only finds out that it needs to fire a new notification to the + Consumer when polling the database. + +**Change/feature request:** + + This BP proposes to add a new "event publisher for alarm", which bypasses + several steps in Ceilometer in order to avoid the polling-based approach of + the existing Alarm Evaluator that makes notification slow to users.
+ + After receiving an "(alarm) event" by listening on the Ceilometer message + queue ("notification bus"), the new "event publisher for alarm" immediately + hands a "notification" about this event to a new Ceilometer component + "Notification-driven alarm evaluator" proposed in the other BP (see Section + 5.6.3). + + Note, the term "publisher" refers to an entity in the Ceilometer architecture + (it is a "notification agent"). It offers the capability to provide + notifications to other services outside of Ceilometer, but it is also used to + deliver notifications to other Ceilometer components (e.g. the "Collectors") + via the Ceilometer "notification bus". + +**Implementation detail** + + * "Event publisher for alarm" is part of Ceilometer + * The standard AMQP message queue is used with a new topic string. + * No new interfaces have to be added to Ceilometer. + * "Event publisher for Alarm" can be configured by the Administrator of + Ceilometer to be used as "Notification Agent" in addition to the existing + "Notifier" + * Existing alarm mechanisms of Ceilometer can be used allowing users to + configure how to distribute the "notifications" transformed from "events", + e.g. there is an option whether an ongoing alarm is re-issued or not + ("repeat_actions"). + +.. [*] https://etherpad.opnfv.org/p/doctor_bps + +Notification-driven alarm evaluator (Ceilometer) [*]_ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +**Problem statement:** + +The existing "Alarm Evaluator" in OpenStack Ceilometer periodically +queries/polls the databases in order to check all alarms independently from +other processes. This adds delay to the fault notification sent +to the Consumer, whereas one requirement of Doctor is to react to faults as fast +as possible.
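The difference from the polling approach described above can be illustrated with a small sketch of an evaluator that reacts to each incoming notification. This is only an illustration under stated assumptions: the class names, the in-memory alarm list, and the notification keys `resource_id` and `event_type` are hypothetical and are not part of the Ceilometer code base or API.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    """Hypothetical alarm configuration keyed by resource ID and event type."""
    alarm_id: str
    resource_id: str
    event_type: str
    state: str = "ok"


@dataclass
class NotificationDrivenEvaluator:
    """Evaluates alarms when a notification arrives, without a polling loop."""
    alarms: list = field(default_factory=list)
    fired: list = field(default_factory=list)

    def on_notification(self, notification: dict) -> None:
        # Select only the alarms that may relate to this notification
        # (here: same resource ID), then evaluate each one immediately.
        for alarm in self.alarms:
            if alarm.resource_id != notification["resource_id"]:
                continue
            matches = alarm.event_type == notification["event_type"]
            if matches and alarm.state != "alarm":
                alarm.state = "alarm"
                # Stand-in for handing the alarm to a notifier.
                self.fired.append(alarm.alarm_id)


evaluator = NotificationDrivenEvaluator(
    alarms=[
        Alarm("a1", "compute-1", "compute.host.down"),
        Alarm("a2", "compute-2", "compute.host.down"),
    ]
)
evaluator.on_notification(
    {"resource_id": "compute-1", "event_type": "compute.host.down"}
)
print(evaluator.fired)
```

In this sketch the delay between event and alarm depends only on message delivery, not on a polling interval, which is the property the blueprint is after.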
+ +**Change/feature request:** + +This BP proposes to add an alternative "Notification-driven Alarm Evaluator" +for Ceilometer that receives "notifications" sent by the "Event Publisher +for Alarm" described in the other BP. Once this new "Notification-driven Alarm +Evaluator" has received a "notification", it finds the "alarm" configurations which +may relate to the "notification" by querying the "alarm" database with some keys, +i.e. resource ID, then it evaluates each alarm with the information in that +"notification". + +After the alarm evaluation, it behaves the same way as the existing "alarm +evaluator" does when firing an alarm notification to the Consumer. Similar to the +existing Alarm Evaluator, this new "Notification-driven Alarm Evaluator" +aggregates and correlates different alarms which are then provided northbound +to the Consumer via the OpenStack "Alarm Notifier". The user/administrator can +register the alarm configuration via the existing Ceilometer API [*]_ and can +thereby configure whether to set an alarm or not and where to send the alarms. + +**Implementation detail** + +* The new "Notification-driven Alarm Evaluator" is part of Ceilometer. +* Most of the existing source code of the "Alarm Evaluator" can be re-used to + implement this BP +* No additional application logic is needed +* It will access the Ceilometer Databases just like the existing "Alarm + evaluator" +* Only the polling-based approach will be replaced by a listener for + "notifications" provided by the "Event Publisher for Alarm" on the Ceilometer + "notification bus". +* No new interfaces have to be added to Ceilometer. + + +.. [*] https://etherpad.opnfv.org/p/doctor_bps +..
[*] https://wiki.openstack.org/wiki/Ceilometer/Alerting + +Report host fault to update server state immediately (Nova) [*]_ +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +**Problem statement:** + +* The Nova state change for a failed or unreachable host is slow and does not reliably + state whether the host is down or not. This might cause the same server instance to run twice + if action is taken to evacuate the instance to another host. +* The Nova state for server(s) on a failed host will not change, but remains active + and running. This gives the user false information about server state. +* VIM northbound interface notification of host faults towards VNFM and NFVO + should be in line with OpenStack state. This fault notification is a Telco + requirement defined in ETSI and will be implemented by the OPNFV Doctor project. +* An OpenStack user cannot take HA actions quickly and reliably by trusting server + state and host state. + +**Proposed change:** + +There needs to be a new API for the Admin to state that a host is down. This API is used to +mark services running on the host as down to reflect the real situation. + +An example on a compute node: + +* When compute node is up and running::: + + vm_state: active and power_state: running + nova-compute state: up status: enabled + +* When compute node goes down and new API is called to state host is down::: + + vm_state: stopped power_state: shutdown + nova-compute state: down status: enabled + +**Alternatives:** + +There is no attractive alternative to detect all different host faults than to +have an external tool to detect different host faults. For this kind of tool to +exist there needs to be a new API in Nova to report faults. Currently, some kind of +workaround must be implemented, as one cannot trust or get the states from +OpenStack fast enough. + +..
[*] https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately + +Other related BPs +^^^^^^^^^^^^^^^^^ + +This section lists some BPs related to Doctor, but proposed by drafters outside +the OPNFV community. + +pacemaker-servicegroup-driver [*]_ +__________________________________ + +This BP will detect and report host down quite fast to OpenStack. This however +might not work properly for example when management network has some problem and +host reported faulty while VM still running there. This might lead to launching +same VM instance twice causing problems. Also NB IF message needs fault reason +and for that the source needs to be a tool that detects different kind of faults +as Doctor will be doing. Also this BP might need enhancement to change server +and service states correctly. + +.. [*] https://blueprints.launchpad.net/nova/+spec/pacemaker-servicegroup-driver diff --git a/docs/development/requirements/06-summary.rst b/docs/development/requirements/06-summary.rst new file mode 100644 index 00000000..61bf3f47 --- /dev/null +++ b/docs/development/requirements/06-summary.rst @@ -0,0 +1,24 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +Summary and conclusion +====================== + +The Doctor project aimed at detailing NFVI fault management and NFVI maintenance +requirements. These are indispensable operations for an Operator, and extremely +necessary to realize telco-grade high availability. High availability is a large +topic; the objective of Doctor is not to realize a complete high availability +architecture and implementation. Instead, Doctor limited itself to addressing +the fault events in NFVI, and proposes enhancements necessary in VIM, e.g. +OpenStack, to ensure VNFs availability in such fault events, taking a Telco VNFs +application level management system into account. 
+ +The Doctor project performed a robust analysis of the requirements from NFVI +fault management and NFVI maintenance operation, concretely found out gaps in +between such requirements and the current implementation of OpenStack, and +proposed potential development plans to fill out such gaps in OpenStack. +Blueprints are already under investigation and the next step is to fill out +those gaps in OpenStack by code development in the coming releases. + +.. + vim: set tabstop=4 expandtab textwidth=80: diff --git a/docs/development/requirements/07-annex.rst b/docs/development/requirements/07-annex.rst new file mode 100644 index 00000000..c3a7899d --- /dev/null +++ b/docs/development/requirements/07-annex.rst @@ -0,0 +1,129 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +.. _nfvi_faults: + +Annex: NFVI Faults +================================================= + +Faults in the listed elements need to be immediately notified to the Consumer in +order to perform an immediate action like live migration or switch to a hot +standby entity. In addition, the Administrator of the host should trigger a +maintenance action to, e.g., reboot the server or replace a defective hardware +element. + +Faults can be of different severity, i.e., critical, warning, or +info. Critical faults require immediate action as a severe degradation of the +system has happened or is expected. Warnings indicate that the system +performance is going down: related actions include closer (e.g. more frequent) +monitoring of that part of the system or preparation for a cold migration to a +backup VM. Info messages do not require any action. We also consider a type +"maintenance", which is no real fault, but may trigger maintenance actions +like a re-boot of the server or replacement of a faulty, but redundant HW. 
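The severity levels described above can be summarised as a small lookup, shown here as a minimal sketch. The enum, the mapping, and the function name are hypothetical illustrations that paraphrase the preceding paragraphs; they are not part of any Doctor or OpenStack interface.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"
    MAINTENANCE = "maintenance"


# Hypothetical mapping that paraphrases the severity descriptions above.
RESPONSE = {
    Severity.CRITICAL: "act immediately (e.g. live migration or switch to hot standby)",
    Severity.WARNING: "monitor more closely or prepare a cold migration to a backup VM",
    Severity.INFO: "no action required",
    Severity.MAINTENANCE: "trigger maintenance (e.g. reboot or replace redundant HW)",
}


def response_for(severity: str) -> str:
    """Return the suggested response for a severity label (case-insensitive)."""
    return RESPONSE[Severity(severity.lower())]


print(response_for("Warning"))
```

An unknown label raises `ValueError` from the enum lookup, which keeps unexpected severities from being silently ignored.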
+ +Faults can be gathered by, e.g., enabling SNMP and installing some open source +tools to catch and poll SNMP. When using for example Zabbix one can also put an +agent running on the hosts to catch any other fault. In any case of failure, the +Administrator should be notified. The following tables provide a list of high +level faults that are considered within the scope of the Doctor project +requiring immediate action by the Consumer. + +**Compute/Storage** + ++-------------------+----------+------------+-----------------+------------------+ +| Fault | Severity | How to | Comment | Immediate action | +| | | detect? | | to recover | ++===================+==========+============+=================+==================+ +| Processor/CPU | Critical | Zabbix | | Switch to hot | +| failure, CPU | | | | standby | +| condition not ok | | | | | ++-------------------+----------+------------+-----------------+------------------+ +| Memory failure/ | Critical | Zabbix | | Switch to hot | +| Memory condition | | (IPMI) | | standby | +| not ok | | | | | ++-------------------+----------+------------+-----------------+------------------+ +| Network card | Critical | Zabbix/ | | Switch to hot | +| failure, e.g. | | Ceilometer | | standby | +| network adapter | | | | | +| connectivity lost | | | | | ++-------------------+----------+------------+-----------------+------------------+ +| Disk crash | Info | RAID | Network storage | Inform OAM | +| | | monitoring | is very | | +| | | | redundant (e.g. 
| | +| | | | RAID system) | | +| | | | and can | | +| | | | guarantee high | | +| | | | availability | | ++-------------------+----------+------------+-----------------+------------------+ +| Storage | Critical | Zabbix | | Live migration | +| controller | | (IPMI) | | if storage | +| | | | | is still | +| | | | | accessible; | +| | | | | otherwise hot | +| | | | | standby | ++-------------------+----------+------------+-----------------+------------------+ +| PDU/power | Critical | Zabbix/ | | Switch to hot | +| failure, power | | Ceilometer | | standby | +| off, server reset | | | | | ++-------------------+----------+------------+-----------------+------------------+ +| Power | Warning | SNMP | | Live migration | +| degradation, | | | | | +| power redundancy | | | | | +| lost, power | | | | | +| threshold exceeded| | | | | ++-------------------+----------+------------+-----------------+------------------+ +| Chassis problem | Warning | SNMP | | Live migration | +| (e.g. fan | | | | | +| degraded/failed, | | | | | +| chassis power | | | | | +| degraded), CPU | | | | | +| fan problem, | | | | | +| temperature/ | | | | | +| thermal condition | | | | | +| not ok | | | | | ++-------------------+----------+------------+-----------------+------------------+ +| Mainboard failure | Critical | Zabbix | e.g. PCIe, SAS | Switch to hot | +| | | (IPMI) | link failure | standby | ++-------------------+----------+------------+-----------------+------------------+ +| OS crash (e.g. | Critical | Zabbix | | Switch to hot | +| kernel panic) | | | | standby | ++-------------------+----------+------------+-----------------+------------------+ + +**Hypervisor** + ++----------------+----------+------------+-------------+-------------------+ +| Fault | Severity | How to | Comment | Immediate action | +| | | detect?
| | to recover | ++================+==========+============+=============+===================+ +| System has | Critical | Zabbix | | Switch to hot | +| restarted | | | | standby | ++----------------+----------+------------+-------------+-------------------+ +| Hypervisor | Warning/ | Zabbix/ | | Evacuation/switch | +| failure | Critical | Ceilometer | | to hot standby | ++----------------+----------+------------+-------------+-------------------+ +| Hypervisor | Warning | Alarming | Zabbix/ | Rebuild VM | +| status not | | service | Ceilometer | | +| retrievable | | | unreachable | | +| after certain | | | | | +| period | | | | | ++----------------+----------+------------+-------------+-------------------+ + +**Network** + ++------------------+----------+---------+----------------+---------------------+ +| Fault | Severity | How to | Comment | Immediate action to | +| | | detect? | | recover | ++==================+==========+=========+================+=====================+ +| SDN/OpenFlow | Critical | Ceilo- | | Switch to | +| switch, | | meter | | hot standby | +| controller | | | | or reconfigure | +| degraded/failed | | | | virtual network | +| | | | | topology | ++------------------+----------+---------+----------------+---------------------+ +| Hardware failure | Warning | SNMP | Redundancy of | Live migration if | +| of physical | | | physical | possible otherwise | +| switch/router | | | infrastructure | evacuation | +| | | | is reduced or | | +| | | | no longer | | +| | | | available | | ++------------------+----------+---------+----------------+---------------------+ diff --git a/docs/development/requirements/99-references.rst b/docs/development/requirements/99-references.rst new file mode 100644 index 00000000..0fd3a36a --- /dev/null +++ b/docs/development/requirements/99-references.rst @@ -0,0 +1,32 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. 
http://creativecommons.org/licenses/by/4.0 + +References and bibliography +=========================== + +.. [DOCT] OPNFV, "Doctor" requirements project, [Online]. Available at + https://wiki.opnfv.org/doctor +.. [PRED] OPNFV, "Data Collection for Failure Prediction" requirements project + [Online]. Available at https://wiki.opnfv.org/prediction +.. [OPSK] OpenStack, [Online]. Available at https://www.openstack.org/ +.. [CEIL] OpenStack Telemetry (Ceilometer), [Online]. Available at + https://wiki.openstack.org/wiki/Ceilometer +.. [NOVA] OpenStack Nova, [Online]. Available at + https://wiki.openstack.org/wiki/Nova +.. [NEUT] OpenStack Neutron, [Online]. Available at + https://wiki.openstack.org/wiki/Neutron +.. [CIND] OpenStack Cinder, [Online]. Available at + https://wiki.openstack.org/wiki/Cinder +.. [MONA] OpenStack Monasca, [Online], Available at + https://wiki.openstack.org/wiki/Monasca +.. [OSAG] OpenStack Cloud Administrator Guide, [Online]. Available at + http://docs.openstack.org/admin-guide-cloud/content/ +.. [ZABB] ZABBIX, the Enterprise-class Monitoring Solution for Everyone, + [Online]. Available at http://www.zabbix.com/ +.. [ENFV] ETSI NFV, [Online]. Available at + http://www.etsi.org/technologies-clusters/technologies/nfv + + + +.. + vim: set tabstop=4 expandtab textwidth=80: diff --git a/docs/development/requirements/glossary.rst b/docs/development/requirements/glossary.rst new file mode 100644 index 00000000..2c82b37f --- /dev/null +++ b/docs/development/requirements/glossary.rst @@ -0,0 +1,89 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +**Definition of terms** + +Different SDOs and communities use different terminology related to +NFV/Cloud/SDN. This list tries to define an OPNFV terminology, +mapping/translating the OPNFV terms to terminology used in other contexts. + + +.. 
glossary:: + + ACT-STBY configuration + Failover configuration common in Telco deployments. It enables the + operator to use a standby (STBY) instance to take over the functionality + of a failed active (ACT) instance. + + Administrator + Administrator of the system, e.g. OAM in Telco context. + + Consumer + User-side Manager; consumer of the interfaces produced by the VIM; VNFM, + NFVO, or Orchestrator in ETSI NFV [ENFV]_ terminology. + + EPC + Evolved Packet Core, the main component of the core network architecture + of 3GPP's LTE communication standard. + + MME + Mobility Management Entity, an entity in the EPC dedicated to mobility + management. + + NFV + Network Function Virtualization + + NFVI + Network Function Virtualization Infrastructure; totality of all hardware + and software components which build up the environment in which VNFs are + deployed. + + S/P-GW + Serving/PDN-Gateway, two entities in the EPC dedicated to routing user + data packets and providing connectivity from the UE to external packet + data networks (PDN), respectively. + + Physical resource + Actual resources in NFVI; not visible to Consumer. + + VNFM + Virtualized Network Function Manager; functional block that is + responsible for the lifecycle management of VNF. + + NFVO + Network Functions Virtualization Orchestrator; functional block that + manages the Network Service (NS) lifecycle and coordinates the + management of NS lifecycle, VNF lifecycle (supported by the VNFM) and + NFVI resources (supported by the VIM) to ensure an optimized allocation + of the necessary resources and connectivity. + + VIM + Virtualized Infrastructure Manager; functional block that is responsible + for controlling and managing the NFVI compute, storage and network + resources, usually within one operator's Infrastructure Domain, e.g. + NFVI Point of Presence (NFVI-PoP). + + Virtual Machine (VM) + Virtualized computation environment that behaves very much like a + physical computer/server. 
+ + Virtual network + Virtual network routes information among the network interfaces of VM + instances and physical network interfaces, providing the necessary + connectivity. + + Virtual resource + A Virtual Machine (VM), a virtual network, or virtualized storage; + resources offered to the Consumer as a result of infrastructure + virtualization; visible to Consumer. + + Virtual Storage + Virtualized non-volatile storage allocated to a VM. + + VNF + Virtualized Network Function. Implementation of a Network Function that + can be deployed on a Network Function Virtualization Infrastructure + (NFVI). + +.. + vim: set tabstop=4 expandtab textwidth=80: diff --git a/docs/development/requirements/images/LICENSE b/docs/development/requirements/images/LICENSE new file mode 100644 index 00000000..21a2d03d --- /dev/null +++ b/docs/development/requirements/images/LICENSE @@ -0,0 +1,14 @@ +Copyright 2017 Open Platform for NFV Project, Inc. and its contributors + +Open Platform for NFV Project Documentation License +=================================================== +Any documentation developed by the "Open Platform for NFV Project" +is licensed under a Creative Commons Attribution 4.0 International License. +You should have received a copy of the license along with this. If not, +see . + +Unless required by applicable law or agreed to in writing, documentation +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
diff --git a/docs/development/requirements/images/figure1.png b/docs/development/requirements/images/figure1.png new file mode 100644 index 00000000..267ddddc Binary files /dev/null and b/docs/development/requirements/images/figure1.png differ diff --git a/docs/development/requirements/images/figure10.png b/docs/development/requirements/images/figure10.png new file mode 100755 index 00000000..d3268018 Binary files /dev/null and b/docs/development/requirements/images/figure10.png differ diff --git a/docs/development/requirements/images/figure11.png b/docs/development/requirements/images/figure11.png new file mode 100755 index 00000000..b5fe0f8c Binary files /dev/null and b/docs/development/requirements/images/figure11.png differ diff --git a/docs/development/requirements/images/figure12.png b/docs/development/requirements/images/figure12.png new file mode 100755 index 00000000..2d394629 Binary files /dev/null and b/docs/development/requirements/images/figure12.png differ diff --git a/docs/development/requirements/images/figure13.png b/docs/development/requirements/images/figure13.png new file mode 100755 index 00000000..5f8227a5 Binary files /dev/null and b/docs/development/requirements/images/figure13.png differ diff --git a/docs/development/requirements/images/figure14.png b/docs/development/requirements/images/figure14.png new file mode 100755 index 00000000..b65ca9ae Binary files /dev/null and b/docs/development/requirements/images/figure14.png differ diff --git a/docs/development/requirements/images/figure2.png b/docs/development/requirements/images/figure2.png new file mode 100644 index 00000000..9a3b166d Binary files /dev/null and b/docs/development/requirements/images/figure2.png differ diff --git a/docs/development/requirements/images/figure3.png b/docs/development/requirements/images/figure3.png new file mode 100755 index 00000000..ee04dfae Binary files /dev/null and b/docs/development/requirements/images/figure3.png differ diff --git 
a/docs/development/requirements/images/figure4.png b/docs/development/requirements/images/figure4.png new file mode 100755 index 00000000..9eff177a Binary files /dev/null and b/docs/development/requirements/images/figure4.png differ diff --git a/docs/development/requirements/images/figure5a.png b/docs/development/requirements/images/figure5a.png new file mode 100755 index 00000000..d347b412 Binary files /dev/null and b/docs/development/requirements/images/figure5a.png differ diff --git a/docs/development/requirements/images/figure5b.png b/docs/development/requirements/images/figure5b.png new file mode 100755 index 00000000..75a43669 Binary files /dev/null and b/docs/development/requirements/images/figure5b.png differ diff --git a/docs/development/requirements/images/figure5c.png b/docs/development/requirements/images/figure5c.png new file mode 100755 index 00000000..4fb2ba03 Binary files /dev/null and b/docs/development/requirements/images/figure5c.png differ diff --git a/docs/development/requirements/images/figure6.png b/docs/development/requirements/images/figure6.png new file mode 100755 index 00000000..cf0d2be9 Binary files /dev/null and b/docs/development/requirements/images/figure6.png differ diff --git a/docs/development/requirements/images/figure7.png b/docs/development/requirements/images/figure7.png new file mode 100755 index 00000000..b88a2e65 Binary files /dev/null and b/docs/development/requirements/images/figure7.png differ diff --git a/docs/development/requirements/images/figure8.png b/docs/development/requirements/images/figure8.png new file mode 100755 index 00000000..907a0b30 Binary files /dev/null and b/docs/development/requirements/images/figure8.png differ diff --git a/docs/development/requirements/images/figure9.png b/docs/development/requirements/images/figure9.png new file mode 100755 index 00000000..61501c4d Binary files /dev/null and b/docs/development/requirements/images/figure9.png differ diff --git 
a/docs/development/requirements/index.rst b/docs/development/requirements/index.rst new file mode 100644 index 00000000..fcbfb88e --- /dev/null +++ b/docs/development/requirements/index.rst @@ -0,0 +1,62 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +**************************************** +Doctor: Fault Management and Maintenance +**************************************** + +:Project: Doctor, https://wiki.opnfv.org/doctor +:Editors: Ashiq Khan (NTT DOCOMO), Gerald Kunzmann (NTT DOCOMO) +:Authors: Ryota Mibu (NEC), Carlos Goncalves (NEC), Tomi Juvonen (Nokia), + Tommy Lindgren (Ericsson), Bertrand Souville (NTT DOCOMO), + Balazs Gibizer (Ericsson), Ildiko Vancsa (Ericsson) and others. + +:Abstract: Doctor is an OPNFV requirement project [DOCT]_. Its scope is NFVI + fault management and maintenance, and it aims at developing and + realizing the consequent implementation for the OPNFV reference + platform. + + This deliverable introduces the use cases and operational + scenarios for Fault Management considered in the Doctor project. + From the general features, a high-level architecture describing + logical building blocks and interfaces is derived. Finally, + a detailed implementation is introduced, based on available open + source components, and a related gap analysis is done as part of + this project. The implementation plan finally discusses an initial + realization for an NFVI fault management and maintenance solution in + open source software. 
+ +:History: + + ========== ===================================================== + Date Description + ========== ===================================================== + 02.12.2014 Project creation + 14.04.2015 Initial version of the deliverable uploaded to Gerrit + 18.05.2015 Stable version of the Doctor deliverable + 25.02.2016 Updated version for the Brahmaputra release + 26.09.2016 Updated version for the Colorado release + xx.xx.2017 Updated version for the Danube release + ========== ===================================================== + +.. raw:: latex + + \newpage + +.. include:: + glossary.rst + +.. toctree:: + :maxdepth: 4 + :numbered: + + 01-intro.rst + 02-use_cases.rst + 03-architecture.rst + 04-gaps.rst + 05-implementation.rst + 06-summary.rst + 07-annex.rst + +.. include:: + 99-references.rst diff --git a/docs/index.rst b/docs/index.rst deleted file mode 100755 index a8349970..00000000 --- a/docs/index.rst +++ /dev/null @@ -1,24 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 -.. (c) 2016 OPNFV. - - -====== -Doctor -====== - -.. toctree:: - :maxdepth: 2 - :numbered: - - ./installationprocedure/index.rst - ./design/index.rst - ./manuals/index.rst - ./requirements/index.rst - ./scenarios/index.rst - ./userguide/index.rst - -Indices -======= -* :ref:`search` - diff --git a/docs/installationprocedure/feature.configuration.rst b/docs/installationprocedure/feature.configuration.rst deleted file mode 100644 index 3ddc409c..00000000 --- a/docs/installationprocedure/feature.configuration.rst +++ /dev/null @@ -1,104 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. 
http://creativecommons.org/licenses/by/4.0 - -Doctor Configuration -==================== - -OPNFV installers install most components of Doctor framework including -OpenStack Nova, Neutron and Cinder (Doctor Controller) and OpenStack -Ceilometer and Aodh (Doctor Notifier) except Doctor Monitor. - -After major components of OPNFV are deployed, you can setup Doctor functions -by following instructions in this section. You can also learn detailed -steps in setup_installer() under `doctor/tests`_. - -.. _doctor/tests: https://gerrit.opnfv.org/gerrit/gitweb?p=doctor.git;a=tree;f=tests; - -Doctor Inspector ----------------- - -You need to configure one of Doctor Inspector below. - -**Doctor Sample Inspector** - -Sample Inspector is intended to show minimum functions of Doctor Inspector. - -Doctor Sample Inspector suggested to be placed in one of the controller nodes, -but it can be put on any host where Doctor Monitor can reach and access -the OpenStack Controller (Nova). - -Make sure OpenStack env parameters are set properly, so that Doctor Inspector -can issue admin actions such as compute host force-down and state update of VM. - -Then, you can configure Doctor Inspector as follows: - -.. code-block:: bash - - git clone https://gerrit.opnfv.org/gerrit/doctor -b stable/danube - cd doctor/tests - INSPECTOR_PORT=12345 - python inspector.py $INSPECTOR_PORT > inspector.log 2>&1 & - -**Congress** - -OpenStack `Congress`_ is a Governance as a Service (previously Policy as a -Service). Congress implements Doctor Inspector as it can inspect a fault -situation and propagate errors onto other entities. - -.. _Congress: https://wiki.openstack.org/wiki/Congress - -Congress is deployed by OPNFV installers. You need to enable doctor -datasource driver and set policy rules. By the example configuration below, -Congress will force down nova compute service when it received a fault event -of that compute host. 
Also, Congress will set the state of all VMs running on -that host from ACTIVE to ERROR state. - -.. code-block:: bash - - openstack congress datasource create doctor doctor - - openstack congress policy rule create \ - --name host_down classification \ - 'host_down(host) :- - doctor:events(hostname=host, type="compute.host.down", status="down")' - - openstack congress policy rule create \ - --name active_instance_in_host classification \ - 'active_instance_in_host(vmid, host) :- - nova:servers(id=vmid, host_name=host, status="ACTIVE")' - - openstack congress policy rule create \ - --name host_force_down classification \ - 'execute[nova:services.force_down(host, "nova-compute", "True")] :- - host_down(host)' - - openstack congress policy rule create \ - --name error_vm_states classification \ - 'execute[nova:servers.reset_state(vmid, "error")] :- - host_down(host), - active_instance_in_host(vmid, host)' - -Doctor Monitor --------------- - -**Doctor Sample Monitor** - -Doctor Monitors are suggested to be placed in one of the controller nodes, -but those can be put on any host which is reachable to target compute host and -accessible by the Doctor Inspector. -You need to configure Monitors for all compute hosts one by one. - -Make sure OpenStack env parameters are set properly, so that Doctor Inspector -can issue admin actions such as compute host force-down and state update of VM. - -Then, you can configure the Doctor Monitor as follows (Example for Apex deployment): - -.. 
code-block:: bash - - git clone https://gerrit.opnfv.org/gerrit/doctor -b stable/danube - cd doctor/tests - INSPECTOR_PORT=12345 - COMPUTE_HOST='overcloud-novacompute-1.localdomain.com' - COMPUTE_IP=192.30.9.5 - sudo python monitor.py "$COMPUTE_HOST" "$COMPUTE_IP" \ - "http://127.0.0.1:$INSPECTOR_PORT/events" > monitor.log 2>&1 & diff --git a/docs/installationprocedure/index.rst b/docs/installationprocedure/index.rst deleted file mode 100644 index b74b91f8..00000000 --- a/docs/installationprocedure/index.rst +++ /dev/null @@ -1,12 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -************************** -Doctor Configuration Guide -************************** - -.. toctree:: - :maxdepth: 2 - :numbered: - - feature.configuration.rst diff --git a/docs/manuals/get-valid-server-state.rst b/docs/manuals/get-valid-server-state.rst deleted file mode 100644 index 824ea3c2..00000000 --- a/docs/manuals/get-valid-server-state.rst +++ /dev/null @@ -1,125 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -====================== -Get valid server state -====================== - -Related Blueprints: -=================== - -https://blueprints.launchpad.net/nova/+spec/get-valid-server-state - -Problem description -=================== - -Previously when the owner of a VM has queried his VMs, he has not received -enough state information, states have not changed fast enough in the VIM and -they have not been accurate in some scenarios. With this change this gap is now -closed. - -A typical case is that, in case of a fault of a host, the user of a high -availability service running on top of that host, needs to make an immediate -switch over from the faulty host to an active standby host. 
Now, if the compute -host is forced down [1] as a result of that fault, the user has to be notified -about this state change such that the user can react accordingly. Similarly, -a change of the host state to "maintenance" should also be notified to the -users. - -What is changed -=============== - -A new ``host_status`` parameter is added to the ``/servers/{server_id}`` and -``/servers/detail`` endpoints in microversion 2.16. By this new parameter -user can get additional state information about the host. - -``host_status`` possible values where next value in list can override the -previous: - -- ``UP`` if nova-compute is up. -- ``UNKNOWN`` if nova-compute status was not reported by servicegroup driver - within configured time period. Default is within 60 seconds, - but can be changed with ``service_down_time`` in nova.conf. -- ``DOWN`` if nova-compute was forced down. -- ``MAINTENANCE`` if nova-compute was disabled. MAINTENANCE in API directly - means nova-compute service is disabled. Different wording is used to avoid - the impression that the whole host is down, as only scheduling of new VMs - is disabled. -- Empty string indicates there is no host for server. - -``host_status`` is returned in the response in case the policy permits. By -default the policy is for admin only in Nova policy.json:: - - "os_compute_api:servers:show:host_status": "rule:admin_api" - -For an NFV use case this has to also be enabled for the owner of the VM:: - - "os_compute_api:servers:show:host_status": "rule:admin_or_owner" - -REST API examples: -================== - -Case where nova-compute is enabled and reporting normally:: - - GET /v2.1/{tenant_id}/servers/{server_id} - - 200 OK - { - "server": { - "host_status": "UP", - ... - } - } - -Case where nova-compute is enabled, but not reporting normally:: - - GET /v2.1/{tenant_id}/servers/{server_id} - - 200 OK - { - "server": { - "host_status": "UNKNOWN", - ... 
- } - } - -Case where nova-compute is enabled, but forced_down:: - - GET /v2.1/{tenant_id}/servers/{server_id} - - 200 OK - { - "server": { - "host_status": "DOWN", - ... - } - } - -Case where nova-compute is disabled:: - - GET /v2.1/{tenant_id}/servers/{server_id} - - 200 OK - { - "server": { - "host_status": "MAINTENANCE", - ... - } - } - -Host Status is also visible in python-novaclient:: - - +-------+------+--------+------------+-------------+----------+-------------+ - | ID | Name | Status | Task State | Power State | Networks | Host Status | - +-------+------+--------+------------+-------------+----------+-------------+ - | 9a... | vm1 | ACTIVE | - | RUNNING | xnet=... | UP | - +-------+------+--------+------------+-------------+----------+-------------+ - -Links: -====== - -[1] Manual for OpenStack NOVA API for marking host down -http://artifacts.opnfv.org/doctor/docs/manuals/mark-host-down_manual.html - -[2] OpenStack compute manual page -http://developer.openstack.org/api-ref-compute-v2.1.html#compute-v2.1 diff --git a/docs/manuals/index.rst b/docs/manuals/index.rst deleted file mode 100644 index 05831b2b..00000000 --- a/docs/manuals/index.rst +++ /dev/null @@ -1,13 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -******* -Manuals -******* - -.. toctree:: - :numbered: - :maxdepth: 2 - -.. include:: mark-host-down_manual.rst -.. include:: get-valid-server-state.rst diff --git a/docs/manuals/mark-host-down_manual.rst b/docs/manuals/mark-host-down_manual.rst deleted file mode 100644 index 3815205d..00000000 --- a/docs/manuals/mark-host-down_manual.rst +++ /dev/null @@ -1,122 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -========================================= -OpenStack NOVA API for marking host down. 
-========================================= - -Related Blueprints: -=================== - - https://blueprints.launchpad.net/nova/+spec/mark-host-down - https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service - -What the API is for -=================== - - This API will give external fault monitoring system a possibility of telling - OpenStack Nova fast that compute host is down. This will immediately enable - calling of evacuation of any VM on host and further enabling faster HA - actions. - -What this API does -================== - - In OpenStack the nova-compute service state can represent the compute host - state and this new API is used to force this service down. It is assumed - that the one calling this API has made sure the host is also fenced or - powered down. This is important, so there is no chance same VM instance will - appear twice in case evacuated to new compute host. When host is recovered - by any means, the external system is responsible of calling the API again to - disable forced_down flag and let the host nova-compute service report again - host being up. If network fenced host come up again it should not boot VMs - it had if figuring out they are evacuated to other compute host. The - decision of deleting or booting VMs there used to be on host should be - enhanced later to be more reliable by Nova blueprint: - https://blueprints.launchpad.net/nova/+spec/robustify-evacuate - -REST API for forcing down: -========================== - - Parameter explanations: - tenant_id: Identifier of the tenant. - binary: Compute service binary name. - host: Compute host name. - forced_down: Compute service forced down flag. - token: Token received after successful authentication. - service_host_ip: Serving controller node ip. 
- - request: - PUT /v2.1/{tenant_id}/os-services/force-down - { - "binary": "nova-compute", - "host": "compute1", - "forced_down": true - } - - response: - 200 OK - { - "service": { - "host": "compute1", - "binary": "nova-compute", - "forced_down": true - } - } - - Example: - curl -g -i -X PUT http://{service_host_ip}:8774/v2.1/{tenant_id}/os-services - /force-down -H "Content-Type: application/json" -H "Accept: application/json - " -H "X-OpenStack-Nova-API-Version: 2.11" -H "X-Auth-Token: {token}" -d '{"b - inary": "nova-compute", "host": "compute1", "forced_down": true}' - -CLI for forcing down: -===================== - - nova service-force-down nova-compute - - Example: - nova service-force-down compute1 nova-compute - -REST API for disabling forced down: -=================================== - - Parameter explanations: - tenant_id: Identifier of the tenant. - binary: Compute service binary name. - host: Compute host name. - forced_down: Compute service forced down flag. - token: Token received after successful authentication. - service_host_ip: Serving controller node ip. 
- - request: - PUT /v2.1/{tenant_id}/os-services/force-down - { - "binary": "nova-compute", - "host": "compute1", - "forced_down": false - } - - response: - 200 OK - { - "service": { - "host": "compute1", - "binary": "nova-compute", - "forced_down": false - } - } - - Example: - curl -g -i -X PUT http://{service_host_ip}:8774/v2.1/{tenant_id}/os-services - /force-down -H "Content-Type: application/json" -H "Accept: application/json - " -H "X-OpenStack-Nova-API-Version: 2.11" -H "X-Auth-Token: {token}" -d '{"b - inary": "nova-compute", "host": "compute1", "forced_down": false}' - -CLI for disabling forced down: -============================== - - nova service-force-down --unset nova-compute - - Example: - nova service-force-down --unset compute1 nova-compute diff --git a/docs/release/configguide/feature.configuration.rst b/docs/release/configguide/feature.configuration.rst new file mode 100644 index 00000000..3ddc409c --- /dev/null +++ b/docs/release/configguide/feature.configuration.rst @@ -0,0 +1,104 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +Doctor Configuration +==================== + +OPNFV installers install most components of Doctor framework including +OpenStack Nova, Neutron and Cinder (Doctor Controller) and OpenStack +Ceilometer and Aodh (Doctor Notifier) except Doctor Monitor. + +After major components of OPNFV are deployed, you can setup Doctor functions +by following instructions in this section. You can also learn detailed +steps in setup_installer() under `doctor/tests`_. + +.. _doctor/tests: https://gerrit.opnfv.org/gerrit/gitweb?p=doctor.git;a=tree;f=tests; + +Doctor Inspector +---------------- + +You need to configure one of Doctor Inspector below. + +**Doctor Sample Inspector** + +Sample Inspector is intended to show minimum functions of Doctor Inspector. 
+ +Doctor Sample Inspector is suggested to be placed in one of the controller nodes, +but it can be put on any host that Doctor Monitor can reach and that can access +the OpenStack Controller (Nova). + +Make sure OpenStack env parameters are set properly, so that Doctor Inspector +can issue admin actions such as compute host force-down and state update of VM. + +Then, you can configure Doctor Inspector as follows: + +.. code-block:: bash + + git clone https://gerrit.opnfv.org/gerrit/doctor -b stable/danube + cd doctor/tests + INSPECTOR_PORT=12345 + python inspector.py $INSPECTOR_PORT > inspector.log 2>&1 & + +**Congress** + +OpenStack `Congress`_ is a Governance as a Service (previously Policy as a +Service). Congress implements Doctor Inspector as it can inspect a fault +situation and propagate errors onto other entities. + +.. _Congress: https://wiki.openstack.org/wiki/Congress + +Congress is deployed by OPNFV installers. You need to enable the doctor +datasource driver and set policy rules. With the example configuration below, +Congress will force down the nova-compute service when it receives a fault event +for that compute host. Also, Congress will change the state of all VMs running on +that host from ACTIVE to ERROR. + +.. 
code-block:: bash + + openstack congress datasource create doctor doctor + + openstack congress policy rule create \ + --name host_down classification \ + 'host_down(host) :- + doctor:events(hostname=host, type="compute.host.down", status="down")' + + openstack congress policy rule create \ + --name active_instance_in_host classification \ + 'active_instance_in_host(vmid, host) :- + nova:servers(id=vmid, host_name=host, status="ACTIVE")' + + openstack congress policy rule create \ + --name host_force_down classification \ + 'execute[nova:services.force_down(host, "nova-compute", "True")] :- + host_down(host)' + + openstack congress policy rule create \ + --name error_vm_states classification \ + 'execute[nova:servers.reset_state(vmid, "error")] :- + host_down(host), + active_instance_in_host(vmid, host)' + +Doctor Monitor +-------------- + +**Doctor Sample Monitor** + +Doctor Monitors are suggested to be placed in one of the controller nodes, +but they can be put on any host that can reach the target compute host and is +accessible by the Doctor Inspector. +You need to configure Monitors for all compute hosts one by one. + +Make sure OpenStack env parameters are set properly, so that Doctor Inspector +can issue admin actions such as compute host force-down and state update of VM. + +Then, you can configure the Doctor Monitor as follows (example for an Apex deployment): + +.. code-block:: bash + + git clone https://gerrit.opnfv.org/gerrit/doctor -b stable/danube + cd doctor/tests + INSPECTOR_PORT=12345 + COMPUTE_HOST='overcloud-novacompute-1.localdomain.com' + COMPUTE_IP=192.30.9.5 + sudo python monitor.py "$COMPUTE_HOST" "$COMPUTE_IP" \ + "http://127.0.0.1:$INSPECTOR_PORT/events" > monitor.log 2>&1 & diff --git a/docs/release/configguide/index.rst b/docs/release/configguide/index.rst new file mode 100644 index 00000000..ad89030b --- /dev/null +++ b/docs/release/configguide/index.rst @@ -0,0 +1,12 @@ +.. 
This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +************************* +Doctor Installation Guide +************************* + +.. toctree:: + :maxdepth: 2 + :numbered: + + feature.configuration.rst diff --git a/docs/release/index.rst b/docs/release/index.rst new file mode 100644 index 00000000..6b0b43c5 --- /dev/null +++ b/docs/release/index.rst @@ -0,0 +1,19 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 +.. (c) 2017 OPNFV. + + +====== +Doctor +====== + +.. toctree:: + :maxdepth: 2 + :numbered: + + ./installation/index.rst + ./userguide/index.rst + +Indices +======= +* :ref:`search` diff --git a/docs/release/installation/index.rst b/docs/release/installation/index.rst new file mode 100644 index 00000000..eff14e58 --- /dev/null +++ b/docs/release/installation/index.rst @@ -0,0 +1,12 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +******************** +Doctor Release Notes +******************** + +.. toctree:: + :maxdepth: 2 + :numbered: + + releasenotes.rst diff --git a/docs/release/installation/releasenotes.rst b/docs/release/installation/releasenotes.rst new file mode 100644 index 00000000..efb7b08c --- /dev/null +++ b/docs/release/installation/releasenotes.rst @@ -0,0 +1,113 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +===================================== +OPNFV Doctor release notes (Danube) +===================================== + +Version history +=============== + ++------------+--------------+------------+-------------+ +| **Date** | **Ver.** | **Author** | **Comment** | ++============+==============+============+=============+ +| 2016-XX-XX | Danube 1.0 | ... 
| | ++------------+--------------+------------+-------------+ + +Important notes +=============== + +OPNFV Doctor project started as a requirement project and identified gaps +between "as-is" open source software (OSS) and an "ideal" platform for NFV. +Based on this analysis, the Doctor project proposed missing features to +upstream OSS projects. After those features were implemented, OPNFV installer +projects integrated the features to the OPNFV platform and the OPNFV +infra/testing projects verified the functionalities in the OPNFV Labs. + +This document provides an overview of the Doctor project in the OPNFV Danube +release, including new features, known issues and documentation updates. + +New features +============ + +* **FEATURE 1** + + TODO: add description including pointer to `feature1`_ and explain what it is about. + +.. _feature1: https://review.openstack.org/#/c/....../ + +Installer support and verification status +========================================= + +Integrated features +------------------- + +Minimal Doctor functionality of VIM is available in the OPNFV platform from +the Brahmaputra release. The basic Doctor framework in VIM consists of a +Controller (Nova) and a Notifier (Ceilometer+Aodh) along with a sample +Inspector and Monitor developed by the Doctor team. + +From the Danube release, key integrated features are: + +* ... + +* ... + +OPNFV installer support matrix +------------------------------ + +In the Brahmaputra release, only one installer (Apex) supported the deployment +of the basic doctor framework by configuring Doctor features. In the Danube +release, integration of Doctor features progressed in other OPNFV installers. + +TODO: TABLE TO BE UPDATED! 
+ ++-----------+-------------------+--------------+-----------------+------------------+ +| Installer | Aodh | Nova: Force | Nova: Get valid | Congress | +| | integration | compute down | service status | integration | ++===========+===================+==============+=================+==================+ +| Apex | Available | Available | Available | Available | +| | | | (`DOCTOR-67`_), | (`APEX-135`_, | +| | | | Verified only | `APEX-158`_), | +| | | | for admin users | Not Verified | ++-----------+-------------------+--------------+-----------------+------------------+ +| Fuel | Available | Available | Available, | N/A | +| | (`DOCTOR-58`_), | | Verified only | (`FUEL-119`_) | +| | Not verified | | for admin users | | ++-----------+-------------------+--------------+-----------------+------------------+ +| Joid | Available | TBC | TBC | TBC | +| | (`JOID-76`_), | | | (`JOID-73`_) | +| | Not verified | | | | ++-----------+-------------------+--------------+-----------------+------------------+ +| Compass | Available | TBC | TBC | N/A | +| | (`COMPASS-357`_), | | | (`COMPASS-367`_) | +| | Not verified | | | | ++-----------+-------------------+--------------+-----------------+------------------+ + +.. _DOCTOR-67: https://jira.opnfv.org/browse/DOCTOR-67 +.. _APEX-135: https://jira.opnfv.org/browse/APEX-135 +.. _APEX-158: https://jira.opnfv.org/browse/APEX-158 +.. _DOCTOR-58: https://jira.opnfv.org/browse/DOCTOR-58 +.. _FUEL-119: https://jira.opnfv.org/browse/FUEL-119 +.. _JOID-76: https://jira.opnfv.org/browse/JOID-76 +.. _JOID-73: https://jira.opnfv.org/browse/JOID-73 +.. _COMPASS-357: https://jira.opnfv.org/browse/COMPASS-357 +.. _COMPASS-367: https://jira.opnfv.org/browse/COMPASS-367 + +Note: 'Not verified' means that we didn't verify the functionality by having +our own test scenario running in OPNFV CI pipeline yet. + +Documentation updates +===================== + +* **Update 1** + + Description including pointer to JIRA ticket (`DOCTOR-46`_). + +.. 
_DOCTOR-46: https://jira.opnfv.org/browse/DOCTOR-46 + + +Known issues +============ + +* ... diff --git a/docs/release/installation/releasenotes_colorado.rst b/docs/release/installation/releasenotes_colorado.rst new file mode 100644 index 00000000..505fbdb5 --- /dev/null +++ b/docs/release/installation/releasenotes_colorado.rst @@ -0,0 +1,170 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +===================================== +OPNFV Doctor release notes (Colorado) +===================================== + +Version history +=============== + ++------------+--------------+------------+-------------+ +| **Date** | **Ver.** | **Author** | **Comment** | ++============+==============+============+=============+ +| 2016-09-19 | Colorado 1.0 | Ryota Mibu | | ++------------+--------------+------------+-------------+ + +Important notes +=============== + +OPNFV Doctor project started as a requirement project and identified gaps +between "as-is" open source software (OSS) and an "ideal" platform for NFV. +Based on this analysis, the Doctor project proposed missing features to +upstream OSS projects. After those features were implemented, OPNFV installer +projects integrated the features to the OPNFV platform and the OPNFV +infra/testing projects verified the functionalities in the OPNFV Labs. + +This document provides an overview of the Doctor project in the OPNFV Colorado +release, including new features, known issues and documentation updates. + +New features +============ + +* **Congress as a Doctor Inspector** + + Since `Doctor driver`_ in OpenStack Congress has been implemented in Mitaka, + OpenStack Congress can now take the role of the Doctor Inspector to correlate + an error in a physical resource to the affected virtual resource(s) + immediately. + +.. 
_Doctor driver: https://review.openstack.org/#/c/314915/ + +Installer support and verification status +========================================= + +Integrated features +------------------- + +Minimal Doctor functionality of VIM is available in the OPNFV platform from +the Brahmaputra release. The basic Doctor framework in VIM consists of a +Controller (Nova) and a Notifier (Ceilometer+Aodh) along with a sample +Inspector and Monitor developed by the Doctor team. +From the Colorado release, key integrated features are: + +* Immediate notification upon state update of virtual resource enabled by + Ceilometer and Aodh (Aodh integration) + +* Consistent state awareness improved by having nova API to mark nova-compute + service down (Nova: Force compute down) + +* Consistent state awareness improved by exposing host status in server (VM) + information via Nova API (Nova: Get valid service status) + +* OpenStack Congress enabling policy-based flexible failure correlation + (Congress integration) + +OPNFV installer support matrix +------------------------------ + +In the Brahmaputra release, only one installer (Apex) supported the deployment +of the basic doctor framework by configuring Doctor features. In the Colorado +release, integration of Doctor features progressed in other OPNFV installers. 
+ ++-----------+-------------------+--------------+-----------------+------------------+ +| Installer | Aodh | Nova: Force | Nova: Get valid | Congress | +| | integration | compute down | service status | integration | ++===========+===================+==============+=================+==================+ +| Apex | Available | Available | Available | Available | +| | | | (`DOCTOR-67`_), | (`APEX-135`_, | +| | | | Verified only | `APEX-158`_), | +| | | | for admin users | Not Verified | ++-----------+-------------------+--------------+-----------------+------------------+ +| Fuel | Available | Available | Available, | N/A | +| | (`DOCTOR-58`_), | | Verified only | (`FUEL-119`_) | +| | Not verified | | for admin users | | ++-----------+-------------------+--------------+-----------------+------------------+ +| Joid | Available | TBC | TBC | TBC | +| | (`JOID-76`_), | | | (`JOID-73`_) | +| | Not verified | | | | ++-----------+-------------------+--------------+-----------------+------------------+ +| Compass | Available | TBC | TBC | N/A | +| | (`COMPASS-357`_), | | | (`COMPASS-367`_) | +| | Not verified | | | | ++-----------+-------------------+--------------+-----------------+------------------+ + +.. _DOCTOR-67: https://jira.opnfv.org/browse/DOCTOR-67 +.. _APEX-135: https://jira.opnfv.org/browse/APEX-135 +.. _APEX-158: https://jira.opnfv.org/browse/APEX-158 +.. _DOCTOR-58: https://jira.opnfv.org/browse/DOCTOR-58 +.. _FUEL-119: https://jira.opnfv.org/browse/FUEL-119 +.. _JOID-76: https://jira.opnfv.org/browse/JOID-76 +.. _JOID-73: https://jira.opnfv.org/browse/JOID-73 +.. _COMPASS-357: https://jira.opnfv.org/browse/COMPASS-357 +.. _COMPASS-367: https://jira.opnfv.org/browse/COMPASS-367 + +Note: 'Not verified' means that we didn't verify the functionality by having +our own test scenario running in OPNFV CI pipeline yet. 
+ +Documentation updates +===================== + +* **Alarm comparison** + + A report on the gap analysis across alarm specifications in ETSI NFV IFA, + OPNFV Doctor and OpenStack Aodh has been added, along with some proposals + on how to improve the alignment between SDO specification and OSS + implementation as future work (`DOCTOR-46`_). + +.. _DOCTOR-46: https://jira.opnfv.org/browse/DOCTOR-46 + +* **Description of test scenario** + + The description of the Doctor scenario, which is running as one of the + feature verification scenarios in Functest, has been updated (`DOCTOR-53`_). + +.. _DOCTOR-53: https://jira.opnfv.org/browse/DOCTOR-53 + +* **Neutron port status update** + + Design documentation for port status update has been added, intending to + propose new features to OpenStack Neutron. + +* **SB I/F specification** + + The initial specification of the Doctor southbound interface, which is for + the Inspector to receive event messages from Monitors, has been added + (`DOCTOR-17`_). + +.. _DOCTOR-17: https://jira.opnfv.org/browse/DOCTOR-17 + +Known issues +============ + +* **Aodh 'event-alarm' is not available by default (Fuel)** + + In Fuel 9.0, Aodh integration for 'event-alarm' is not complete. + Ceilometer and Nova are misconfigured and cannot pass event + notifications to Aodh. + You can use `fuel-plugin-doctor`_ to correct Ceilometer and Nova + configuration as a workaround. See `DOCTOR-62`_. + +.. _fuel-plugin-doctor: https://github.com/openzero-zte/fuel-plugin-doctor +.. _DOCTOR-62: https://jira.opnfv.org/browse/DOCTOR-62 + +* **Security notice** + + A security notice has been raised in [*]_. Please ensure that the debug option + of Flask is set to False before running in production. + +.. 
[*] http://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-September/012610.html + +* **Performance issue in correcting resource status (Fuel)** + + Although the Doctor project is aiming to ensure that the time interval + between detection and notification to the user is less than 1 second, we + observed that it takes more than 2 seconds in the default OPNFV deployment + using the Fuel installer [*]_. + This issue will be solved by checking the OpenStack configuration and + improving the Doctor testing scenario. + +.. [*] http://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-September/012542.html diff --git a/docs/release/userguide/feature.userguide.rst b/docs/release/userguide/feature.userguide.rst new file mode 100644 index 00000000..4ae521bd --- /dev/null +++ b/docs/release/userguide/feature.userguide.rst @@ -0,0 +1,44 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +Doctor capabilities and usage +============================= +Immediate Notification +---------------------- + +Immediate notification can be used by creating an 'event' type alarm via the +OpenStack Alarming (Aodh) API, provided the relevant internal components support it. + +See the upstream spec document: +http://specs.openstack.org/openstack/ceilometer-specs/specs/liberty/event-alarm-evaluator.html + +An example of a consumer of this notification can be found in the Doctor +repository. It can be executed as follows: + +.. code-block:: bash + + git clone https://gerrit.opnfv.org/gerrit/doctor -b stable/danube + cd doctor/tests + CONSUMER_PORT=12346 + python consumer.py "$CONSUMER_PORT" > consumer.log 2>&1 & + +Consistent resource state awareness +----------------------------------- + +The resource state of a compute host can be changed/updated according to a trigger +from a monitor running outside of OpenStack Compute (Nova) by using +the force-down API. 
+ +See +http://artifacts.opnfv.org/doctor/danube/manuals/mark-host-down_manual.html +for more detail. + +Valid compute host status given to VM owner +------------------------------------------- + +The resource state of a compute host can be retrieved by a user with the +OpenStack Compute (Nova) servers API. + +See +http://artifacts.opnfv.org/doctor/danube/manuals/get-valid-server-state.html +for more detail. diff --git a/docs/release/userguide/index.rst b/docs/release/userguide/index.rst new file mode 100644 index 00000000..c6830fd1 --- /dev/null +++ b/docs/release/userguide/index.rst @@ -0,0 +1,12 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + +***************** +Doctor User Guide +***************** + +.. toctree:: + :maxdepth: 2 + :numbered: + + feature.userguide.rst diff --git a/docs/releasenotes/index.rst b/docs/releasenotes/index.rst deleted file mode 100644 index 6fa37f6a..00000000 --- a/docs/releasenotes/index.rst +++ /dev/null @@ -1,10 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -************************** -OPNFV Doctor Release Notes -************************** - -.. toctree:: - - releasenotes.rst diff --git a/docs/releasenotes/releasenotes.rst b/docs/releasenotes/releasenotes.rst deleted file mode 100644 index efb7b08c..00000000 --- a/docs/releasenotes/releasenotes.rst +++ /dev/null @@ -1,113 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. 
http://creativecommons.org/licenses/by/4.0 - -===================================== -OPNFV Doctor release notes (Danube) -===================================== - -Version history -=============== - -+------------+--------------+------------+-------------+ -| **Date** | **Ver.** | **Author** | **Comment** | -+============+==============+============+=============+ -| 2016-XX-XX | Danube 1.0 | ... | | -+------------+--------------+------------+-------------+ - -Important notes -=============== - -OPNFV Doctor project started as a requirement project and identified gaps -between "as-is" open source software (OSS) and an "ideal" platform for NFV. -Based on this analysis, the Doctor project proposed missing features to -upstream OSS projects. After those features were implemented, OPNFV installer -projects integrated the features to the OPNFV platform and the OPNFV -infra/testing projects verified the functionalities in the OPNFV Labs. - -This document provides an overview of the Doctor project in the OPNFV Danube -release, including new features, known issues and documentation updates. - -New features -============ - -* **FEATURE 1** - - TODO: add description including pointer to `feature1`_ and explain what it is about. - -.. _feature1: https://review.openstack.org/#/c/....../ - -Installer support and verification status -========================================= - -Integrated features -------------------- - -Minimal Doctor functionality of VIM is available in the OPNFV platform from -the Brahmaputra release. The basic Doctor framework in VIM consists of a -Controller (Nova) and a Notifier (Ceilometer+Aodh) along with a sample -Inspector and Monitor developed by the Doctor team. - -From the Danube release, key integrated features are: - -* ... - -* ... - -OPNFV installer support matrix ------------------------------- - -In the Brahmaputra release, only one installer (Apex) supported the deployment -of the basic doctor framework by configuring Doctor features. 
In the Danube -release, integration of Doctor features progressed in other OPNFV installers. - -TODO: TABLE TO BE UPDATED! - -+-----------+-------------------+--------------+-----------------+------------------+ -| Installer | Aodh | Nova: Force | Nova: Get valid | Congress | -| | integration | compute down | service status | integration | -+===========+===================+==============+=================+==================+ -| Apex | Available | Available | Available | Available | -| | | | (`DOCTOR-67`_), | (`APEX-135`_, | -| | | | Verified only | `APEX-158`_), | -| | | | for admin users | Not Verified | -+-----------+-------------------+--------------+-----------------+------------------+ -| Fuel | Available | Available | Available, | N/A | -| | (`DOCTOR-58`_), | | Verified only | (`FUEL-119`_) | -| | Not verified | | for admin users | | -+-----------+-------------------+--------------+-----------------+------------------+ -| Joid | Available | TBC | TBC | TBC | -| | (`JOID-76`_), | | | (`JOID-73`_) | -| | Not verified | | | | -+-----------+-------------------+--------------+-----------------+------------------+ -| Compass | Available | TBC | TBC | N/A | -| | (`COMPASS-357`_), | | | (`COMPASS-367`_) | -| | Not verified | | | | -+-----------+-------------------+--------------+-----------------+------------------+ - -.. _DOCTOR-67: https://jira.opnfv.org/browse/DOCTOR-67 -.. _APEX-135: https://jira.opnfv.org/browse/APEX-135 -.. _APEX-158: https://jira.opnfv.org/browse/APEX-158 -.. _DOCTOR-58: https://jira.opnfv.org/browse/DOCTOR-58 -.. _FUEL-119: https://jira.opnfv.org/browse/FUEL-119 -.. _JOID-76: https://jira.opnfv.org/browse/JOID-76 -.. _JOID-73: https://jira.opnfv.org/browse/JOID-73 -.. _COMPASS-357: https://jira.opnfv.org/browse/COMPASS-357 -.. _COMPASS-367: https://jira.opnfv.org/browse/COMPASS-367 - -Note: 'Not verified' means that we didn't verify the functionality by having -our own test scenario running in OPNFV CI pipeline yet. 
- -Documentation updates -===================== - -* **Update 1** - - Description including pointer to JIRA ticket (`DOCTOR-46`_). - -.. _DOCTOR-46: https://jira.opnfv.org/browse/DOCTOR-46 - - -Known issues -============ - -* ... diff --git a/docs/releasenotes/releasenotes_colorado.rst b/docs/releasenotes/releasenotes_colorado.rst deleted file mode 100644 index 505fbdb5..00000000 --- a/docs/releasenotes/releasenotes_colorado.rst +++ /dev/null @@ -1,170 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -===================================== -OPNFV Doctor release notes (Colorado) -===================================== - -Version history -=============== - -+------------+--------------+------------+-------------+ -| **Date** | **Ver.** | **Author** | **Comment** | -+============+==============+============+=============+ -| 2016-09-19 | Colorado 1.0 | Ryota Mibu | | -+------------+--------------+------------+-------------+ - -Important notes -=============== - -OPNFV Doctor project started as a requirement project and identified gaps -between "as-is" open source software (OSS) and an "ideal" platform for NFV. -Based on this analysis, the Doctor project proposed missing features to -upstream OSS projects. After those features were implemented, OPNFV installer -projects integrated the features to the OPNFV platform and the OPNFV -infra/testing projects verified the functionalities in the OPNFV Labs. - -This document provides an overview of the Doctor project in the OPNFV Colorado -release, including new features, known issues and documentation updates. - -New features -============ - -* **Congress as a Doctor Inspector** - - Since `Doctor driver`_ in OpenStack Congress has been implemented in Mitaka, - OpenStack Congress can now take the role of the Doctor Inspector to correlate - an error in a physical resource to the affected virtual resource(s) - immediately. - -.. 
_Doctor driver: https://review.openstack.org/#/c/314915/ - -Installer support and verification status -========================================= - -Integrated features -------------------- - -Minimal Doctor functionality of VIM is available in the OPNFV platform from -the Brahmaputra release. The basic Doctor framework in VIM consists of a -Controller (Nova) and a Notifier (Ceilometer+Aodh) along with a sample -Inspector and Monitor developed by the Doctor team. -From the Colorado release, key integrated features are: - -* Immediate notification upon state update of virtual resource enabled by - Ceilometer and Aodh (Aodh integration) - -* Consistent state awareness improved by having nova API to mark nova-compute - service down (Nova: Force compute down) - -* Consistent state awareness improved by exposing host status in server (VM) - information via Nova API (Nova: Get valid service status) - -* OpenStack Congress enabling policy-based flexible failure correlation - (Congress integration) - -OPNFV installer support matrix ------------------------------- - -In the Brahmaputra release, only one installer (Apex) supported the deployment -of the basic doctor framework by configuring Doctor features. In the Colorado -release, integration of Doctor features progressed in other OPNFV installers. 
- -+-----------+-------------------+--------------+-----------------+------------------+ -| Installer | Aodh | Nova: Force | Nova: Get valid | Congress | -| | integration | compute down | service status | integration | -+===========+===================+==============+=================+==================+ -| Apex | Available | Available | Available | Available | -| | | | (`DOCTOR-67`_), | (`APEX-135`_, | -| | | | Verified only | `APEX-158`_), | -| | | | for admin users | Not Verified | -+-----------+-------------------+--------------+-----------------+------------------+ -| Fuel | Available | Available | Available, | N/A | -| | (`DOCTOR-58`_), | | Verified only | (`FUEL-119`_) | -| | Not verified | | for admin users | | -+-----------+-------------------+--------------+-----------------+------------------+ -| Joid | Available | TBC | TBC | TBC | -| | (`JOID-76`_), | | | (`JOID-73`_) | -| | Not verified | | | | -+-----------+-------------------+--------------+-----------------+------------------+ -| Compass | Available | TBC | TBC | N/A | -| | (`COMPASS-357`_), | | | (`COMPASS-367`_) | -| | Not verified | | | | -+-----------+-------------------+--------------+-----------------+------------------+ - -.. _DOCTOR-67: https://jira.opnfv.org/browse/DOCTOR-67 -.. _APEX-135: https://jira.opnfv.org/browse/APEX-135 -.. _APEX-158: https://jira.opnfv.org/browse/APEX-158 -.. _DOCTOR-58: https://jira.opnfv.org/browse/DOCTOR-58 -.. _FUEL-119: https://jira.opnfv.org/browse/FUEL-119 -.. _JOID-76: https://jira.opnfv.org/browse/JOID-76 -.. _JOID-73: https://jira.opnfv.org/browse/JOID-73 -.. _COMPASS-357: https://jira.opnfv.org/browse/COMPASS-357 -.. _COMPASS-367: https://jira.opnfv.org/browse/COMPASS-367 - -Note: 'Not verified' means that we didn't verify the functionality by having -our own test scenario running in OPNFV CI pipeline yet. 
- -Documentation updates -===================== - -* **Alarm comparison** - - A report on the gap analysis across alarm specifications in ETSI NFV IFA, - OPNFV Doctor and OpenStack Aodh has been added, along with some proposals - on how to improve the alignment between SDO specification and OSS - implementation as a future work (`DOCTOR-46`_). - -.. _DOCTOR-46: https://jira.opnfv.org/browse/DOCTOR-46 - -* **Description of test scenario** - - The description of the Doctor scenario, which is running as one of the - feature verification scenarios in Functest, has been updated (`DOCTOR-53`_). - -.. _DOCTOR-53: https://jira.opnfv.org/browse/DOCTOR-53 - -* **Neutron port status update** - - Design documentation for port status update has been added, intending to - propose new features to OpenStack Neutron. - -* **SB I/F specification** - - The initial specification of the Doctor southbound interface, which is for - the Inspector to receive event messages from Monitors, has been added - (`DOCTOR-17`_). - -.. _DOCTOR-17: https://jira.opnfv.org/browse/DOCTOR-17 - -Known issues -============ - -* **Aodh 'event-alarm' is not available as default (Fuel)** - - In Fuel 9.0, Aodh integration for 'event-alarm' is not completed. - Ceilometer and Nova would be mis-configured and cannot pass event - notification to Aodh. - You can use `fuel-plugin-doctor`_ to correct Ceilometer and Nova - configuration as a workaround. See `DOCTOR-62`_. - -.. _fuel-plugin-doctor: https://github.com/openzero-zte/fuel-plugin-doctor -.. _DOCTOR-62: https://jira.opnfv.org/browse/DOCTOR-62 - -* **Security notice** - - Security notice has been raised in [*]_. Please insure that the debug option - of Flask is set to False, before running in production. - -.. 
[*] http://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-September/012610.html - -* **Performance issue in correct resource status (Fuel)** - - Although the Doctor project is aiming to ensure that the time interval - between detection and notification to the user is less than 1 second, we - observed that it takes more than 2 seconds in the default OPNFV deployment - using the Fuel installer [*]_. - This issue will be solved by checking the OpenStack configuration and - improving Doctor testing scenario. - -.. [*] http://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-September/012542.html diff --git a/docs/requirements/01-intro.rst b/docs/requirements/01-intro.rst deleted file mode 100644 index ed666cd1..00000000 --- a/docs/requirements/01-intro.rst +++ /dev/null @@ -1,51 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -Introduction -============ - -The goal of this project is to build an NFVI fault management and maintenance -framework supporting high availability of the Network Services on top of the -virtualized infrastructure. The key feature is immediate notification of -unavailability of virtualized resources from VIM, to support failure recovery, -or failure avoidance of VNFs running on them. Requirement survey and development -of missing features in NFVI and VIM are in scope of this project in order to -fulfil requirements for fault management and maintenance in NFV. 
- -The purpose of this requirement project is to clarify the necessary features of -NFVI fault management, and maintenance, identify missing features in the current -OpenSource implementations, provide a potential implementation architecture and -plan, provide implementation guidelines in relevant upstream projects to realize -those missing features, and define the VIM northbound interfaces necessary to -perform the task of NFVI fault management, and maintenance in alignment with -ETSI NFV [ENFV]_. - -Problem description -------------------- - -A Virtualized Infrastructure Manager (VIM), e.g. OpenStack [OPSK]_, cannot -detect certain Network Functions Virtualization Infrastructure (NFVI) faults. -This feature is necessary to detect the faults and notify the Consumer in order -to ensure the proper functioning of EPC VNFs like MME and S/P-GW. - -* EPC VNFs are often in active standby (ACT-STBY) configuration and need to - switch from STBY mode to ACT mode as soon as relevant faults are detected in - the active (ACT) VNF. - -* NFVI encompasses all elements building up the environment in which VNFs are - deployed, e.g., Physical Machines, Hypervisors, Storage, and Network elements. - -In addition, VIM, e.g. OpenStack, needs to receive maintenance instructions from -the Consumer, i.e. the operator/administrator of the VNF. - -* Change the state of certain Physical Machines (PMs), e.g. empty the PM, so - that maintenance work can be performed at these machines. - -Note: Although fault management and maintenance are different operations in NFV, -both are considered as part of this project as -- except for the trigger -- they -share a very similar work and message flow. Hence, from implementation -perspective, these two are kept together in the Doctor project because of this -high degree of similarity. - -.. 
- vim: set tabstop=4 expandtab textwidth=80: diff --git a/docs/requirements/02-use_cases.rst b/docs/requirements/02-use_cases.rst deleted file mode 100644 index 0a1f6413..00000000 --- a/docs/requirements/02-use_cases.rst +++ /dev/null @@ -1,195 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -Use cases and scenarios -======================= - -Telecom services often have very high requirements on service performance. As a -consequence they often utilize redundancy and high availability (HA) mechanisms -for both the service and the platform. The HA support may be built-in or -provided by the platform. In any case, the HA support typically has a very fast -detection and reaction time to minimize service impact. The main changes -proposed in this document are about making a clear distinction between fault -management and recovery a) within the VIM/NFVI and b) High Availability support -for VNFs on the other, claiming that HA support within a VNF or as a service -from the platform is outside the scope of Doctor and is discussed in the High -Availability for OPNFV project. Doctor should focus on detecting and remediating -faults in the NFVI. This will ensure that applications come back to a fully -redundant configuration faster than before. - -As an example, Telecom services can come with an Active-Standby (ACT-STBY) -configuration which is a (1+1) redundancy scheme. ACT and STBY nodes (aka -Physical Network Function (PNF) in ETSI NFV terminology) are in a hot standby -configuration. If an ACT node is unable to function properly due to fault or any -other reason, the STBY node is instantly made ACT, and affected services can be -provided without any service interruption. - -The ACT-STBY configuration needs to be maintained. 
This means, when a STBY node -is made ACT, either the previously ACT node, after recovery, shall be made STBY, -or, a new STBY node needs to be configured. The actual operations to -instantiate/configure a new STBY are similar to instantiating a new VNF and -therefore are outside the scope of this project. - -The NFVI fault management and maintenance requirements aim at providing fast -failure detection of physical and virtualized resources and remediation of the -virtualized resources provided to Consumers according to their predefined -request to enable applications to recover to a fully redundant mode of -operation. - -1. Fault management/recovery using ACT-STBY configuration (Triggered by critical - error) -2. Preventive actions based on fault prediction (Preventing service stop by - handling warnings) -3. VM Retirement (Managing service during NFVI maintenance, i.e. H/W, - Hypervisor, Host OS, maintenance) - -Faults ------- - -.. _uc-fault1: - -Fault management using ACT-STBY configuration -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -In :numref:`figure1`, a system-wide view of relevant functional blocks is -presented. OpenStack is considered as the VIM implementation (aka Controller) -which has interfaces with the NFVI and the Consumers. The VNF implementation is -represented as different virtual resources marked by different colors. Consumers -(VNFM or NFVO in ETSI NFV terminology) own/manage the respective virtual -resources (VMs in this example) shown with the same colors. - -The first requirement in this use case is that the Controller needs to detect -faults in the NFVI ("1. Fault Notification" in :numref:`figure1`) affecting -the proper functioning of the virtual resources (labelled as VM-x) running on -top of it. It should be possible to configure which relevant fault items should -be detected. The VIM (e.g. OpenStack) itself could be extended to detect such -faults. 
Alternatively, a third party fault monitoring tool could be used which -then informs the VIM about such faults; this third party fault monitoring -element can be considered as a component of VIM from an architectural point of -view. - -Once such fault is detected, the VIM shall find out which virtual resources are -affected by this fault. In the example in :numref:`figure1`, VM-4 is -affected by a fault in the Hardware Server-3. Such mapping shall be maintained -in the VIM, depicted as the "Server-VM info" table inside the VIM. - -Once the VIM has identified which virtual resources are affected by the fault, -it needs to find out who is the Consumer (i.e. the owner/manager) of the -affected virtual resources (Step 2). In the example shown in :numref:`figure1`, -the VIM knows that for the red VM-4, the manager is the red Consumer -through an Ownership info table. The VIM then notifies (Step 3 "Fault -Notification") the red Consumer about this fault, preferably with sufficient -abstraction rather than detailed physical fault information. - -.. figure:: images/figure1.png - :name: figure1 - :width: 100% - - Fault management/recovery use case - -The Consumer then switches to STBY configuration by switching the STBY node to -ACT state (Step 4). It further initiates a process to instantiate/configure a -new STBY. However, switching to STBY mode and creating a new STBY machine is a -VNFM/NFVO level operation and therefore outside the scope of this project. -Doctor project does not create interfaces for such VNFM level configuration -operations. Yet, since the total failover time of a consumer service depends on -both the delay of such processes as well as the reaction time of Doctor -components, minimizing Doctor's reaction time is a necessary basic ingredient to -fast failover times in general. - -Once the Consumer has switched to STBY configuration, it notifies (Step 5 -"Instruction" in :numref:`figure1`) the VIM. The VIM can then take -necessary (e.g. 
pre-determined by the involved network operator) actions on how -to clean up the fault affected VMs (Step 6 "Execute Instruction"). - -The key issue in this use case is that a VIM (OpenStack in this context) shall -not take a standalone fault recovery action (e.g. migration of the affected VMs) -before the ACT-STBY switching is complete, as that might violate the ACT-STBY -configuration and render the node out of service. - -As an extension of the 1+1 ACT-STBY resilience pattern, a STBY instance can act as -backup to N ACT nodes (N+1). In this case, the basic information flow remains -the same, i.e., the consumer is informed of a failure in order to activate the -STBY node. However, in this case it might be useful for the failure notification -to cover a number of failed instances due to the same fault (e.g., more than one -instance might be affected by a switch failure). The reaction of the consumer -might depend on whether only one active instance has failed (similar to the -ACT-STBY case), or if more active instances are needed as well. - -Preventive actions based on fault prediction -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The fault management scenario explained in :ref:`uc-fault1` can also be -performed based on fault prediction. In such cases, in VIM, there is an -intelligent fault prediction module which, based on its NFVI monitoring -information, can predict an imminent fault in the elements of NFVI. -A simple example is raising temperature of a Hardware Server which might -trigger a pre-emptive recovery action. The requirements of such fault -prediction in the VIM are investigated in the OPNFV project "Data Collection -for Failure Prediction" [PRED]_. - -This use case is very similar to :ref:`uc-fault1`. Instead of a fault -detection (Step 1 "Fault Notification in" :numref:`figure1`), the trigger -comes from a fault prediction module in the VIM, or from a third party module -which notifies the VIM about an imminent fault. 
From Step 2~5, the work flow is -the same as in the "Fault management using ACT-STBY configuration" use case, -except in this case, the Consumer of a VM/VNF switches to STBY configuration -based on a predicted fault, rather than an occurred fault. - -NFVI Maintenance ----------------- - -VM Retirement -^^^^^^^^^^^^^ - -All network operators perform maintenance of their network infrastructure, both -regularly and irregularly. Besides the hardware, virtualization is expected to -increase the number of elements subject to such maintenance as NFVI holds new -elements like the hypervisor and host OS. Maintenance of a particular resource -element e.g. hardware, hypervisor etc. may render a particular server hardware -unusable until the maintenance procedure is complete. - -However, the Consumer of VMs needs to know that such resources will be -unavailable because of NFVI maintenance. The following use case is again to -ensure that the ACT-STBY configuration is not violated. A stand-alone action -(e.g. live migration) from VIM/OpenStack to empty a physical machine so that -consequent maintenance procedure could be performed may not only violate the -ACT-STBY configuration, but also have impact on real-time processing scenarios -where dedicated resources to virtual resources (e.g. VMs) are necessary and a -pause in operation (e.g. vCPU) is not allowed. The Consumer is in a position to -safely perform the switch between ACT and STBY nodes, or switch to an -alternative VNF forwarding graph so the hardware servers hosting the ACT nodes -can be emptied for the upcoming maintenance operation. Once the target hardware -servers are emptied (i.e. no virtual resources are running on top), the VIM can -mark them with an appropriate flag (i.e. "maintenance" state) such that these -servers are not considered for hosting of virtual machines until the maintenance -flag is cleared (i.e. nodes are back in "normal" status). 
- -A high-level view of the maintenance procedure is presented in :numref:`figure2`. -VIM/OpenStack, through its northbound interface, receives a maintenance notification -(Step 1 "Maintenance Request") from the Administrator (e.g. a network operator) -including information about which hardware is subject to maintenance. -Maintenance operations include replacement/upgrade of hardware, -update/upgrade of the hypervisor/host OS, etc. - -The subsequent steps to enable the Consumer to perform ACT-STBY switching are -very similar to the fault management scenario. Using its internal -database, VIM/OpenStack finds out which virtual resources (VM-x) are running on those -particular Hardware Servers and who manages those virtual resources -(Step 2). The VIM then informs the respective Consumer (VNFMs or NFVO) in Step 3 -"Maintenance Notification". Based on this, the Consumer takes necessary actions -(Step 4, e.g. switch to STBY configuration or switch VNF forwarding graphs) and -then notifies (Step 5 "Instruction") the VIM. Upon receiving such notification, -the VIM takes necessary actions (Step 6 "Execute Instruction") to empty the -Hardware Servers so that subsequent maintenance operations can be performed. -Because Steps 2~6 are so similar, the maintenance procedure and the fault -management procedure are investigated in the same project. - -.. figure:: images/figure2.png - :name: figure2 - :width: 100% - - Maintenance use case - -.. - vim: set tabstop=4 expandtab textwidth=80: diff --git a/docs/requirements/03-architecture.rst b/docs/requirements/03-architecture.rst deleted file mode 100644 index b7417691..00000000 --- a/docs/requirements/03-architecture.rst +++ /dev/null @@ -1,340 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. 
http://creativecommons.org/licenses/by/4.0 - -High level architecture and general features -============================================ - -Functional overview -------------------- - -The Doctor project circles around two distinct use cases: 1) management of -failures of virtualized resources and 2) planned maintenance, e.g. migration, of -virtualized resources. Both of them may affect a VNF/application and the network -service it provides, but there is a difference in frequency and how they can be -handled. - -Failures are spontaneous events that may or may not have an impact on the -virtual resources. The Consumer should as soon as possible react to the failure, -e.g., by switching to the STBY node. The Consumer will then instruct the VIM on -how to clean up or repair the lost virtual resources, i.e. restore the VM, VLAN -or virtualized storage. How much the applications are affected varies. -Applications with built-in HA support might experience a short decrease in -retainability (e.g. an ongoing session might be lost) while keeping availability -(establishment or re-establishment of sessions are not affected), whereas the -impact on applications without built-in HA may be more serious. How much the -network service is impacted depends on how the service is implemented. With -sufficient network redundancy the service may be unaffected even when a specific -resource fails. - -On the other hand, planned maintenance impacting virtualized resources are events -that are known in advance. This group includes e.g. migration due to software -upgrades of OS and hypervisor on a compute host. Some of these might have been -requested by the application or its management solution, but there is also a -need for coordination on the actual operations on the virtual resources. 
There -may be an impact on the applications and the service, but since they are not -spontaneous events there is room for planning and coordination between the -application management organization and the infrastructure management -organization, including performing whatever actions that would be required to -minimize the problems. - -Failure prediction is the process of pro-actively identifying situations that -may lead to a failure in the future unless acted on by means of maintenance -activities. From applications' point of view, failure prediction may impact them -in two ways: either the warning time is so short that the application or its -management solution does not have time to react, in which case it is equal to -the failure scenario, or there is sufficient time to avoid the consequences by -means of maintenance activities, in which case it is similar to planned -maintenance. - -Architecture Overview ---------------------- - -NFV and the Cloud platform provide virtual resources and related control -functionality to users and administrators. :numref:`figure3` shows the high -level architecture of NFV focusing on the NFVI, i.e., the virtualized -infrastructure. The NFVI provides virtual resources, such as virtual machines -(VM) and virtual networks. Those virtual resources are used to run applications, -i.e. VNFs, which could be components of a network service which is managed by -the consumer of the NFVI. The VIM provides functionalities of controlling and -viewing virtual resources on hardware (physical) resources to the consumers, -i.e., users and administrators. OpenStack is a prominent candidate for this VIM. -The administrator may also directly control the NFVI without using the VIM. - -Although OpenStack is the target upstream project where the new functional -elements (Controller, Notifier, Monitor, and Inspector) are expected to be -implemented, a particular implementation method is not assumed. 
Some of these -elements may sit outside of OpenStack and offer a northbound interface to -OpenStack. - -General Features and Requirements ---------------------------------- - -The following features are required for the VIM to achieve high availability of -applications (e.g., MME, S/P-GW) and the Network Services: - -1. Monitoring: Monitor physical and virtual resources. -2. Detection: Detect unavailability of physical resources. -3. Correlation and Cognition: Correlate faults and identify affected virtual - resources. -4. Notification: Notify unavailable virtual resources to their Consumer(s). -5. Fencing: Shut down or isolate a faulty resource. -6. Recovery action: Execute actions to process fault recovery and maintenance. - -The time interval between the instant that an event is detected by the -monitoring system and the Consumer notification of unavailable resources shall -be < 1 second (e.g., Step 1 to Step 4 in :numref:`figure4`). - -.. figure:: images/figure3.png - :name: figure3 - :width: 100% - - High level architecture - -Monitoring -^^^^^^^^^^ - -The VIM shall monitor physical and virtual resources for unavailability and -suspicious behavior. - -Detection -^^^^^^^^^ - -The VIM shall detect unavailability and failures of physical resources that -might cause errors/faults in virtual resources running on top of them. -Unavailability of physical resource is detected by various monitoring and -managing tools for hardware and software components. This may include also -predicting upcoming faults. Note, fault prediction is out of scope of this -project and is investigated in the OPNFV "Data Collection for Failure -Prediction" project [PRED]_. - -The fault items/events to be detected shall be configurable. - -The configuration shall enable Failure Selection and Aggregation. Failure -aggregation means the VIM determines unavailability of physical resource from -more than two non-critical failures related to the same resource. 
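The failure selection and aggregation rule above can be sketched as follows. This is a minimal illustration rather than the Doctor implementation: the ``RawFailure`` type, the event names and the ``unavailable_resources`` function are hypothetical, and treating a single critical failure as sufficient to mark a resource unavailable is an assumption that the text does not state.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RawFailure:
    """Hypothetical raw event reported by a monitoring tool."""
    resource_id: str   # physical resource the event refers to
    event_type: str    # illustrative name, e.g. "fan_degraded"
    critical: bool


def unavailable_resources(events, selected_types, threshold=2):
    """Return the physical resources considered unavailable.

    Failure selection: only event types in `selected_types` are evaluated.
    Failure aggregation: a resource is unavailable when it has more than
    `threshold` non-critical failures. Assumption: one critical failure
    is enough on its own.
    """
    non_critical = Counter()
    unavailable = set()
    for event in events:
        if event.event_type not in selected_types:
            continue  # not configured for detection
        if event.critical:
            unavailable.add(event.resource_id)
        else:
            non_critical[event.resource_id] += 1
    for resource_id, count in non_critical.items():
        if count > threshold:
            unavailable.add(resource_id)
    return unavailable
```

Keeping the selected event types and the threshold as parameters mirrors the requirement that the fault items to be detected shall be configurable.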
- -There are two types of unavailability - immediate and future: - -* Immediate unavailability can be detected by setting traps of raw failures on - hardware monitoring tools. -* Future unavailability can be found by receiving maintenance instructions - issued by the administrator of the NFVI or by failure prediction mechanisms. - -Correlation and Cognition -^^^^^^^^^^^^^^^^^^^^^^^^^ - -The VIM shall correlate each fault to the impacted virtual resource, i.e., the -VIM shall identify unavailability of virtualized resources that are or will be -affected by failures on the physical resources under them. Unavailability of a -virtualized resource is determined by referring to the mapping of physical and -virtualized resources. - -VIM shall allow configuration of fault correlation between physical and -virtual resources. VIM shall support correlating faults: - -* between a physical resource and another physical resource -* between a physical resource and a virtual resource -* between a virtual resource and another virtual resource - -Failure aggregation is also required in this feature, e.g., a user may request -to be only notified if failures on more than two standby VMs in an (N+M) -deployment model occurred. - -Notification -^^^^^^^^^^^^ - -The VIM shall notify the alarm, i.e., unavailability of virtual resource(s), to -the Consumer owning it over the northbound interface, such that the Consumers -impacted by the failure can take appropriate actions to recover from the -failure. - -The VIM shall also notify the unavailability of physical resources to its -Administrator. - -All notifications shall be transferred immediately in order to minimize the -stalling time of the network service and to avoid over assignment caused by -delay of capability updates. - -There may be multiple consumers, so the VIM has to find out the owner of a -faulty resource. 
Moreover, there may be a large number of virtual and physical -resources in a real deployment, so polling the VIM for the state of all resources -would lead to heavy signaling traffic. Thus, a publication/subscription -messaging model is better suited for these notifications, as notifications are -only sent to subscribed consumers. - -Notifications will be sent out in accordance with the configuration provided by the consumer. -The configuration includes endpoint(s) in which the consumers can specify -multiple targets for the notification subscription, so that various and -multiple receiver functions can consume the notification message. -Also, the conditions for notifications shall be configurable, such that -the consumer can set policies accordingly, e.g. whether it wants to receive -fault notifications or not. - -Note: the VIM should only accept notification subscriptions for each resource -by its owner or administrator. -Notifications to the Consumer about the unavailability of virtualized -resources will include a description of the fault, preferably with sufficient -abstraction rather than detailed physical fault information. - -.. _fencing: - -Fencing -^^^^^^^ -Recovery actions, e.g. safe VM evacuation, have to be preceded by fencing the -failed host. Fencing here means isolating or shutting down a faulty resource. -Without fencing -- when the perceived disconnection is due to some transient -or partial failure -- the evacuation might lead to two identical instances -running together and having a dangerous conflict. - -There is a cross-project definition in OpenStack of how to implement -fencing, but there has not been any progress. The general description is -available here: -https://wiki.openstack.org/wiki/Fencing_Instances_of_an_Unreachable_Host - -OpenStack provides some mechanisms that allow fencing of faulty resources. Some -are automatically invoked by the platform itself (e.g.
Nova disables the -compute service when libvirtd stops running, preventing new VMs from being scheduled -to that node), while other mechanisms are consumer trigger-based actions (e.g. -Neutron port admin-state-up). For other fencing actions not supported by -OpenStack, the Doctor project may suggest ways to address the gap (e.g. by -resorting to external tools and orchestration methods), or -documenting or implementing them upstream. - -The Doctor Inspector component will be responsible for marking resources down in -OpenStack and back up if necessary. - -Recovery Action -^^^^^^^^^^^^^^^ - -In the basic :ref:`uc-fault1` use case, no automatic actions will be taken by -the VIM, but all recovery actions executed by the VIM and the NFVI will be -instructed and coordinated by the Consumer. - -In a more advanced use case, the VIM may be able to recover the failed virtual -resources according to a pre-defined behavior for that resource. In principle -this means that the owner of the resource (i.e., its consumer or administrator) -can define which recovery actions shall be taken by the VIM. Examples are a -restart of the VM or migration/evacuation of the VM. - - - -High level northbound interface specification ---------------------------------------------- - -Fault Management -^^^^^^^^^^^^^^^^ - -This interface allows the Consumer to subscribe to fault notifications from the -VIM. Using a filter, the Consumer can narrow down which faults should be -notified. A fault notification may trigger the Consumer to switch from ACT to -STBY configuration and initiate fault recovery actions. A fault query -request/response message exchange allows the Consumer to find out about active -alarms at the VIM. A filter can be used to narrow down the alarms returned in -the response message. - -.. 
figure:: images/figure4.png - :name: figure4 - :width: 100% - - High-level message flow for fault management - -The high level message flow for the fault management use case is shown in -:numref:`figure4`. -It consists of the following steps: - -1. The VIM monitors the physical and virtual resources and the fault management - workflow is triggered by a monitored fault event. -2. Event correlation, fault detection and aggregation in VIM. Note: this may - also happen after Step 3. -3. Database lookup to find the virtual resources affected by the detected fault. -4. Fault notification to Consumer. -5. The Consumer switches to standby configuration (STBY). -6. Instructions to VIM requesting certain actions to be performed on the - affected resources, for example migrate/update/terminate specific - resource(s). After receiving such instructions, the VIM executes the - requested action, e.g., it will migrate or terminate a virtual resource. - -NFVI Maintenance -^^^^^^^^^^^^^^^^ - -The NFVI maintenance interface allows the Administrator to notify the VIM about -a planned maintenance operation on the NFVI. A maintenance operation may for -example be an update of the server firmware or the hypervisor. The -MaintenanceRequest message contains instructions to change the state of the -physical resource from 'enabled' to 'going-to-maintenance' and a timeout [#timeout]_. -After receiving the MaintenanceRequest, the VIM decides on the actions to be taken -based on maintenance policies predefined by the affected Consumer(s). - -.. [#timeout] Timeout is set by the Administrator and corresponds to the maximum time - to empty the physical resources. - -.. figure:: images/figure5a.png - :name: figure5a - :width: 100% - - High-level message flow for maintenance policy enforcement - -The high level message flow for the NFVI maintenance policy enforcement is shown -in :numref:`figure5a`. It consists of the following steps: - -1. Maintenance trigger received from Administrator. -2. 
VIM switches the affected physical resources to "going-to-maintenance" state, e.g. so that no new - VM will be scheduled on the physical servers. -3. Database lookup to find the Consumer(s) and virtual resources affected by the maintenance - operation. -4. Maintenance policies are enforced in the VIM, e.g. affected VM(s) are shut down - on the physical server(s), or affected Consumer(s) are notified about the planned - maintenance operation (steps 4a/4b). - - -Once the affected Consumer(s) have been notified, they take specific actions (e.g. switch to standby -(STBY) configuration, request to terminate the virtual resource(s)) to allow the maintenance -action to be executed. After the physical resources have been emptied, the VIM puts the physical -resources in "in-maintenance" state and sends a MaintenanceResponse back to the Administrator. - -.. figure:: images/figure5b.png - :name: figure5b - :width: 100% - - Successful NFVI maintenance - -The high level message flow for a successful NFVI maintenance is shown in :numref:`figure5b`. -It consists of the following steps: - -5. The Consumer C3 switches to standby configuration (STBY). -6. Instructions from Consumers C2/C3 are sent to the VIM requesting certain actions to be performed - (steps 6a, 6b). After receiving such instructions, the VIM executes the requested - action in order to empty the physical resources (step 6c) and informs the - Consumer about the result of the actions (steps 6d, 6e). -7. The VIM switches the physical resources to "in-maintenance" state. -8. Maintenance response is sent from VIM to inform the Administrator that the physical - servers have been emptied. -9. The Administrator coordinates and executes the maintenance - operation/work on the NFVI. Note: this step is out of scope of the Doctor project. - -The requested actions to empty the physical resources may not be successful (e.g.
migration fails -or takes too long) and in such a case, the VIM puts the physical resources back to 'enabled' and -informs the Administrator about the problem. - -.. figure:: images/figure5c.png - :name: figure5c - :width: 100% - - Example of failed NFVI maintenance - -An example of a high level message flow to cover the failed NFVI maintenance case is -shown in :numref:`figure5c`. -It consists of the following steps: - -5. The Consumer C3 switches to standby configuration (STBY). -6. Instructions from Consumers C2/C3 are shared to VIM requesting certain actions to be performed - (steps 6a, 6b). The VIM executes the requested actions and sends back a NACK to consumer C2 - (step 6d) as the migration of the virtual resource(s) is not completed by the given timeout. -7. The VIM switches the physical resources to "enabled" state. -8. MaintenanceNotification is sent from VIM to inform the Administrator that the maintenance action - cannot start. - - -.. - vim: set tabstop=4 expandtab textwidth=80: - diff --git a/docs/requirements/04-gaps.rst b/docs/requirements/04-gaps.rst deleted file mode 100644 index b8ff7f2e..00000000 --- a/docs/requirements/04-gaps.rst +++ /dev/null @@ -1,389 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -Gap analysis in upstream projects -================================= - -This section presents the findings of gaps on existing VIM platforms. The focus -was to identify gaps based on the features and requirements specified in Section -3.3. The analysis work determined gaps that are presented here. - -VIM Northbound Interface ------------------------- - -Immediate Notification -^^^^^^^^^^^^^^^^^^^^^^ - -* Type: 'deficiency in performance' -* Description - - + To-be - - - VIM has to notify unavailability of virtual resource (fault) to VIM user - immediately. - - Notification should be passed in '1 second' after fault detected/notified - by VIM. 
- - Also, the following conditions/requirement have to be met: - - - Only the owning user can receive notification of fault related to owned - virtual resource(s). - - + As-is - - - OpenStack Metering 'Ceilometer' can notify unavailability of virtual - resource (fault) to the owner of virtual resource based on alarm - configuration by the user. - - - Ceilometer Alarm API: - http://docs.openstack.org/developer/ceilometer/webapi/v2.html#alarms - - - Alarm notifications are triggered by alarm evaluator instead of - notification agents that might receive faults - - - Ceilometer Architecture: - http://docs.openstack.org/developer/ceilometer/architecture.html#id1 - - - Evaluation interval should be equal to or larger than configured pipeline - interval for collection of underlying metrics. - - - https://github.com/openstack/ceilometer/blob/stable/juno/ceilometer/alarm/service.py#L38-42 - - - The interval for collection has to be set large enough which depends on - the size of the deployment and the number of metrics to be collected. - - The interval may not be less than one second in even small deployments. - The default value is 60 seconds. - - Alternative: OpenStack has a message bus to publish system events. - The operator can allow the user to connect this, but there are no - functions to filter out other events that should not be passed to the user - or which were not requested by the user. - - + Gap - - - Fault notifications cannot be received immediately by Ceilometer. - -* Solved by - - + Event Alarm Evaluator: - https://specs.openstack.org/openstack/ceilometer-specs/specs/liberty/event-alarm-evaluator.html - + New OpenStack alarms and notifications project AODH: - http://docs.openstack.org/developer/aodh/ - -Maintenance Notification -^^^^^^^^^^^^^^^^^^^^^^^^ - -* Type: 'missing' -* Description - - + To-be - - - VIM has to notify unavailability of virtual resource triggered by NFVI - maintenance to VIM user. 
- - Also, the following conditions/requirements have to be met: - - - VIM should accept maintenance message from administrator and mark target - physical resource "in maintenance". - - Only the owner of virtual resource hosted by target physical resource - can receive the notification that can trigger some process for - applications which are running on the virtual resource (e.g. cut off - VM). - - + As-is - - - OpenStack: None - - AWS (just for study) - - - AWS provides API and CLI to view status of resource (VM) and to create - instance status and system status alarms to notify you when an instance - has a failed status check. - http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html - - AWS provides API and CLI to view scheduled events, such as a reboot or - retirement, for your instances. Also, those events will be notified - via e-mail. - http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html - - + Gap - - - VIM user cannot receive maintenance notifications. - -* Solved by - - + https://blueprints.launchpad.net/nova/+spec/service-status-notification - -VIM Southbound interface ------------------------- - -Normalization of data collection models -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -* Type: 'missing' -* Description - - + To-be - - - A normalized data format needs to be created to cope with the many data - models from different monitoring solutions. - - + As-is - - - Data can be collected from many places (e.g. Zabbix, Nagios, Cacti, - Zenoss). Although each solution establishes its own data models, no common - data abstraction models exist in OpenStack. - - + Gap - - - Normalized data format does not exist. - -* Solved by - - + Specification in Section :ref:`southbound`. - -OpenStack ---------- - -Ceilometer -^^^^^^^^^^ - -OpenStack offers a telemetry service, Ceilometer, for collecting measurements of -the utilization of physical and virtual resources [CEIL]_. 
Ceilometer can -collect a number of metrics across multiple OpenStack components and watch for -variations and trigger alarms based upon the collected data. - -Scalability of fault aggregation -________________________________ - -* Type: 'scalability issue' -* Description - - + To-be - - - Be able to scale to a large deployment, where thousands of monitoring - events per second need to be analyzed. - - + As-is - - - Performance issue when scaling to medium-sized deployments. - - + Gap - - - Ceilometer seems to be unsuitable for monitoring medium and large scale - NFVI deployments. - -* Solved by - - + Usage of Zabbix for fault aggregation [ZABB]_. Zabbix can support a much - higher number of fault events (up to 15 thousand events per second), but - obviously also has some upper bound: - http://blog.zabbix.com/scalable-zabbix-lessons-on-hitting-9400-nvps/2615/ - - + Decentralized/hierarchical deployment with multiple instances, where one - instance is only responsible for a small NFVI. - -Monitoring of hardware and software -___________________________________ - -* Type: 'missing (lack of functionality)' -* Description - - + To-be - - - OpenStack (as VIM) should monitor various hardware and software in NFVI so - that faults on them can be handled by Ceilometer. - - OpenStack may have monitoring functionality in itself and can be - integrated with third party monitoring tools. - - OpenStack needs to be able to detect the faults listed in the Annex. - - + As-is - - - For each deployment of OpenStack, the operator is responsible for - configuring monitoring tools with relevant scripts or plugins in order to - monitor hardware and software. - - OpenStack Ceilometer does not monitor hardware and software to capture - faults. - - + Gap - - - Ceilometer is not able to detect and handle all faults listed in the Annex. - -* Solved by - - + Use of dedicated monitoring tools like Zabbix or Monasca. - See :ref:`nfvi_faults`.
- -Nova -^^^^ - -OpenStack Nova [NOVA]_ is a mature and widely known and used component in -OpenStack cloud deployments. It is the main part of an -"infrastructure-as-a-service" system providing a cloud computing fabric -controller, supporting a wide diversity of virtualization and container -technologies. - -Nova has proven throughout these past years to be highly available and -fault-tolerant. Featuring its own API, it also provides a compatibility API with -Amazon EC2 APIs. - -Correct states when compute host is down -________________________________________ - -* Type: 'missing (lack of functionality)' -* Description - - + To-be - - - The API shall support to change VM power state in case host has failed. - - The API shall support to change nova-compute state. - - There could be single API to change different VM states for all VMs - belonging to a specific host. - - Support external systems that are monitoring the infrastructure and resources - that are able to call the API fast and reliable. - - Resource states are reliable such that correlation actions can be fast and automated. - - User shall be able to read states from OpenStack and trust they are correct. - - + As-is - - - When a VM goes down due to a host HW, host OS or hypervisor failure, - nothing happens in OpenStack. The VMs of a crashed host/hypervisor are - reported to be live and OK through the OpenStack API. - - nova-compute state might change too slowly or the state is not reliable - if expecting also VMs to be down. This leads to ability to schedule VMs - to a failed host and slowness blocks evacuation. - - + Gap - - - OpenStack does not change its states fast and reliably enough. - - The API does not support to have an external system to change states and to - trust the states are reliable (external system has fenced failed host). - - User cannot read all the states from OpenStack nor trust they are right. 
- -* Solved by - - + https://blueprints.launchpad.net/nova/+spec/mark-host-down - + https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service - -Evacuate VMs in Maintenance mode -________________________________ - -* Type: 'missing' -* Description - - + To-be - - - When maintenance mode for a compute host is set, trigger VM evacuation to - available compute nodes before bringing the host down for maintenance. - - + As-is - - - If setting a compute node to a maintenance mode, OpenStack only schedules - evacuation of all VMs to available compute nodes if in-maintenance compute - node runs the XenAPI and VMware ESX hypervisors. Other hypervisors (e.g. - KVM) are not supported and, hence, guest VMs will likely stop running due - to maintenance actions administrator may perform (e.g. hardware upgrades, - OS updates). - - + Gap - - - Nova libvirt hypervisor driver does not implement automatic guest VMs - evacuation when compute nodes are set to maintenance mode (``$ nova - host-update --maintenance enable ``). - -Monasca -^^^^^^^ - -Monasca is an open-source monitoring-as-a-service (MONaaS) solution that -integrates with OpenStack. Even though it is still in its early days, it is the -interest of the community that the platform be multi-tenant, highly scalable, -performant and fault-tolerant. It provides a streaming alarm engine, a -notification engine, and a northbound REST API users can use to interact with -Monasca. Hundreds of thousands of metrics per second can be processed -[MONA]_. - -Anomaly detection -_________________ - - -* Type: 'missing (lack of functionality)' -* Description - - + To-be - - - Detect the failure and perform a root cause analysis to filter out other - alarms that may be triggered due to their cascading relation. - - + As-is - - - A mechanism to detect root causes of failures is not available. - - + Gap - - - Certain failures can trigger many alarms due to their dependency on the - underlying root cause of failure. 
Knowing the root cause can help filter - out unnecessary and overwhelming alarms. - -* Status - - + Monasca as of now lacks this feature, although the community is aware and - working toward supporting it. - -Sensor monitoring -_________________ - -* Type: 'missing (lack of functionality)' -* Description - - + To-be - - - It should support monitoring sensor data retrieval, for instance, from - IPMI. - - + As-is - - - Monasca does not monitor sensor data - - + Gap - - - Sensor monitoring is very important. It provides operators status - on the state of the physical infrastructure (e.g. temperature, fans). - -* Addressed by - - + Monasca can be configured to use third-party monitoring solutions (e.g. - Nagios, Cacti) for retrieving additional data. - -Hardware monitoring tools -------------------------- - -Zabbix -^^^^^^ - -Zabbix is an open-source solution for monitoring availability and performance of -infrastructure components (i.e. servers and network devices), as well as -applications [ZABB]_. It can be customized for use with OpenStack. It is a -mature tool and has been proven to be able to scale to large systems with -100,000s of devices. - -Delay in execution of actions -_____________________________ - - -* Type: 'deficiency in performance' -* Description - - + To-be - - - After detecting a fault, the monitoring tool should immediately execute - the appropriate action, e.g. inform the manager through the NB I/F - - + As-is - - - A delay of around 10 seconds was measured in two independent testbed - deployments - - + Gap - - - Cause of the delay is a periodic evaluation and notification. Periodicity is configured - as 30s default value and can be reduced to 5s but not below. - https://github.com/zabbix/zabbix/blob/trunk/conf/zabbix_server.conf#L329 - - -.. 
- vim: set tabstop=4 expandtab textwidth=80: diff --git a/docs/requirements/05-implementation.rst b/docs/requirements/05-implementation.rst deleted file mode 100644 index 84979772..00000000 --- a/docs/requirements/05-implementation.rst +++ /dev/null @@ -1,1050 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -Detailed architecture and interface specification -================================================= - -This section describes a detailed implementation plan, which is based on the -high level architecture introduced in Section 3. Section 5.1 describes the -functional blocks of the Doctor architecture, which is followed by a high level -message flow in Section 5.2. Section 5.3 provides a mapping of selected existing -open source components to the building blocks of the Doctor architecture. -The selection of components is based on their maturity and the gap -analysis executed in Section 4. Sections 5.4 and 5.5 detail the specification of -the related northbound interface and the related information elements. Finally, -Section 5.6 provides a first set of blueprints to address selected gaps, as required -to realize the functionalities of the Doctor project. - -.. _impl_fb: - -Functional Blocks ------------------ - -This section introduces the functional blocks that form the VIM. OpenStack was -selected as the candidate for implementation. Inside the VIM, four different -building blocks are defined (see :numref:`figure6`). - -.. figure:: images/figure6.png - :name: figure6 - :width: 100% - - Functional blocks - -Monitor -^^^^^^^ - -The Monitor module is responsible for monitoring the virtualized -infrastructure. There are already many existing tools and services (e.g. Zabbix) -to monitor different aspects of hardware and software resources which can be -used for this purpose.
- -Inspector -^^^^^^^^^ - -The Inspector module has the ability a) to receive various failure notifications -regarding physical resource(s) from Monitor module(s), b) to find the affected -virtual resource(s) by querying the resource map in the Controller, and c) to -update the state of the virtual resource (and physical resource). - -The Inspector has drivers for different types of events and resources to -integrate any type of Monitor and Controller modules. It also uses a failure -policy database to decide on the failure selection and aggregation from raw -events. This failure policy database is configured by the Administrator. - -The reason for separation of the Inspector and Controller modules is to make the -Controller focus on simple operations by avoiding a tight integration of various -health check mechanisms into the Controller. - -Controller -^^^^^^^^^^ - -The Controller is responsible for maintaining the resource map (i.e. the mapping -from physical resources to virtual resources), accepting update requests for the -resource state(s) (exposing as provider API), and sending all failure events -regarding virtual resources to the Notifier. Optionally, the Controller has the -ability to force the state of a given physical resource to down in the resource -mapping when it receives failure notifications from the Inspector for that -given physical resource. -The Controller also re-calculates the capacity of the NVFI when receiving a -failure notification for a physical resource. - -In a real-world deployment, the VIM may have several controllers, one for each -resource type, such as Nova, Neutron and Cinder in OpenStack. Each controller -maintains a database of virtual and physical resources which shall be the master -source for resource information inside the VIM. - -Notifier -^^^^^^^^ - -The focus of the Notifier is on selecting and aggregating failure events -received from the controller based on policies mandated by the Consumer. 
-Therefore, it allows the Consumer to subscribe for alarms regarding virtual -resources using a method such as API endpoint. After receiving a fault -event from a Controller, it will notify the fault to the Consumer by referring -to the alarm configuration which was defined by the Consumer earlier on. - -To reduce complexity of the Controller, it is a good approach for the -Controllers to emit all notifications without any filtering mechanism and have -another service (i.e. Notifier) handle those notifications properly. This is the -general philosophy of notifications in OpenStack. Note that a fault message -consumed by the Notifier is different from the fault message received by the -Inspector; the former message is related to virtual resources which are visible -to users with relevant ownership, whereas the latter is related to raw devices -or small entities which should be handled with an administrator privilege. - -The northbound interface between the Notifier and the Consumer/Administrator is -specified in :ref:`impl_nbi`. - -Sequence --------- - -Fault Management -^^^^^^^^^^^^^^^^ - -The detailed work flow for fault management is as follows (see also :numref:`figure7`): - -1. Request to subscribe to monitor specific virtual resources. A query filter - can be used to narrow down the alarms the Consumer wants to be informed - about. -2. Each subscription request is acknowledged with a subscribe response message. - The response message contains information about the subscribed virtual - resources, in particular if a subscribed virtual resource is in "alarm" - state. -3. The NFVI sends monitoring events for resources the VIM has been subscribed - to. Note: this subscription message exchange between the VIM and NFVI is not - shown in this message flow. -4. Event correlation, fault detection and aggregation in VIM. -5. Database lookup to find the virtual resources affected by the detected fault. -6. Fault notification to Consumer. -7. 
The Consumer switches to standby configuration (STBY) -8. Instructions to VIM requesting certain actions to be performed on the - affected resources, for example migrate/update/terminate specific - resource(s). After reception of such instructions, the VIM is executing the - requested action, e.g. it will migrate or terminate a virtual resource. -a. Query request from Consumer to VIM to get information about the current - status of a resource. -b. Response to the query request with information about the current status of - the queried resource. In case the resource is in "fault" state, information - about the related fault(s) is returned. - -In order to allow for quick reaction to failures, the time interval between -fault detection in step 3 and the corresponding recovery actions in step 7 and 8 -shall be less than 1 second. - -.. figure:: images/figure7.png - :name: figure7 - :width: 100% - - Fault management work flow - -.. figure:: images/figure8.png - :name: figure8 - :width: 100% - - Fault management scenario - -:numref:`figure8` shows a more detailed message flow (Steps 4 to 6) between -the 4 building blocks introduced in :ref:`impl_fb`. - -4. The Monitor observed a fault in the NFVI and reports the raw fault to the - Inspector. - The Inspector filters and aggregates the faults using pre-configured - failure policies. - -5. - a) The Inspector queries the Resource Map to find the virtual resources - affected by the raw fault in the NFVI. - b) The Inspector updates the state of the affected virtual resources in the - Resource Map. - c) The Controller observes a change of the virtual resource state and informs - the Notifier about the state change and the related alarm(s). - Alternatively, the Inspector may directly inform the Notifier about it. - -6. The Notifier is performing another filtering and aggregation of the changes - and alarms based on the pre-configured alarm configuration. Finally, a fault - notification is sent to northbound to the Consumer. 
- -NFVI Maintenance -^^^^^^^^^^^^^^^^ -.. figure:: images/figure9.png - :name: figure9 - :width: 100% - - NFVI maintenance work flow - -The detailed work flow for NFVI maintenance is shown in :numref:`figure9` -and has the following steps. Note that steps 1, 2, and 5 to 8a in the NFVI -maintenance work flow are very similar to the steps in the fault management work -flow and share a similar implementation plan in Release 1. - -1. Subscribe to fault/maintenance notifications. -2. Response to subscribe request. -3. Maintenance trigger received from administrator. -4. VIM switches NFVI resources to "maintenance" state. This, e.g., means they - should not be used for further allocation/migration requests. -5. Database lookup to find the virtual resources affected by the detected - maintenance operation. -6. Maintenance notification to Consumer. -7. The Consumer switches to standby configuration (STBY). -8. Instructions from Consumer to VIM requesting certain recovery actions to be - performed (step 8a). After reception of such instructions, the VIM - executes the requested action in order to empty the physical resources (step - 8b). -9. Maintenance response from VIM to inform the Administrator that the physical - machines have been emptied (or the operation resulted in an error state). -10. The Administrator coordinates and executes the maintenance operation/work - on the NFVI. -a) Query request from Administrator to VIM to get information about the - current state of a resource. -b) Response to the query request with information about the current state of - the queried resource(s). In case the resource is in "maintenance" state, - information about the related maintenance operation is returned. - -.. figure:: images/figure10.png - :name: figure10 - :width: 100% - - NFVI Maintenance implementation plan - -:numref:`figure10` shows a more detailed message flow (Steps 3 to 6 and 9) -between the 4 building blocks introduced in Section 5.1. - -3.
The Administrator is sending a StateChange request to the Controller residing - in the VIM. -4. The Controller queries the Resource Map to find the virtual resources - affected by the planned maintenance operation. -5. - - a) The Controller updates the state of the affected virtual resources in the - Resource Map database. - - b) The Controller informs the Notifier about the virtual resources that will - be affected by the maintenance operation. - -6. A maintenance notification is sent to northbound to the Consumer. - -... - -9. The Controller informs the Administrator after the physical resources have - been freed. - - - -Implementation plan for OPNFV Release 1 ---------------------------------------- - -Fault management -^^^^^^^^^^^^^^^^ - -:numref:`figure11` shows the implementation plan based on OpenStack and -related components as planned for Release 1. Hereby, the Monitor can be realized -by Zabbix. The Controller is realized by OpenStack Nova [NOVA]_, Neutron -[NEUT]_, and Cinder [CIND]_ for compute, network, and storage, -respectively. The Inspector can be realized by Monasca [MONA]_ or a simple -script querying Nova in order to map between physical and virtual resources. The -Notifier will be realized by Ceilometer [CEIL]_ receiving failure events -on its notification bus. - -:numref:`figure12` shows the inner-workings of Ceilometer. After receiving -an "event" on its notification bus, first a notification agent will grab the -event and send a "notification" to the Collector. The collector writes the -notifications received to the Ceilometer databases. - -In the existing Ceilometer implementation, an alarm evaluator is periodically -polling those databases through the APIs provided. If it finds new alarms, it -will evaluate them based on the pre-defined alarm configuration, and depending -on the configuration, it will hand a message to the Alarm Notifier, which in -turn will send the alarm message northbound to the Consumer. 
:numref:`figure12` -also shows an optimized work flow for Ceilometer with the goal to -reduce the delay for fault notifications to the Consumer. The approach is to -implement a new notification agent (called "publisher" in Ceilometer -terminology) which is directly sending the alarm through the "Notification Bus" -to a new "Notification-driven Alarm Evaluator (NAE)" (see Sections 5.6.2 and -5.6.3), thereby bypassing the Collector and avoiding the additional delay of the -existing polling-based alarm evaluator. The NAE is similar to the OpenStack -"Alarm Evaluator", but is triggered by incoming notifications instead of -periodically polling the OpenStack "Alarms" database for new alarms. The -Ceilometer "Alarms" database can hold three states: "normal", "insufficient -data", and "fired". It is representing a persistent alarm database. In order to -realize the Doctor requirements, we need to define new "meters" in the database -(see Section 5.6.1). - -.. figure:: images/figure11.png - :name: figure11 - :width: 100% - - Implementation plan in OpenStack (OPNFV Release 1 ”Arno”) - - -.. figure:: images/figure12.png - :name: figure12 - :width: 100% - - Implementation plan in Ceilometer architecture - - -NFVI Maintenance -^^^^^^^^^^^^^^^^ - -For NFVI Maintenance, a quite similar implementation plan exists. Instead of a -raw fault being observed by the Monitor, the Administrator is sending a -Maintenance Request through the northbound interface towards the Controller -residing in the VIM. Similar to the Fault Management use case, the Controller -(in our case OpenStack Nova) will send a maintenance event to the Notifier (i.e. -Ceilometer in our implementation). Within Ceilometer, the same workflow as -described in the previous section applies. In addition, the Controller(s) will -take appropriate actions to evacuate the physical machines in order to prepare -them for the planned maintenance operation. 
After the physical machines are -emptied, the Controller will inform the Administrator that it can initiate the -maintenance. Alternatively the VMs can just be shut down and boot up on the -same host after maintenance is over. There needs to be policy for administrator -to know the plan for VMs in maintenance. - -Information elements --------------------- - -This section introduces all attributes and information elements used in the -messages exchange on the northbound interfaces between the VIM and the VNFO and -VNFM. - -Note: The information elements will be aligned with current work in ETSI NFV IFA -working group. - - -Simple information elements: - -* SubscriptionID (Identifier): identifies a subscription to receive fault or maintenance - notifications. -* NotificationID (Identifier): identifies a fault or maintenance notification. -* VirtualResourceID (Identifier): identifies a virtual resource affected by a - fault or a maintenance action of the underlying physical resource. -* PhysicalResourceID (Identifier): identifies a physical resource affected by a - fault or maintenance action. -* VirtualResourceState (String): state of a virtual resource, e.g. "normal", - "maintenance", "down", "error". -* PhysicalResourceState (String): state of a physical resource, e.g. "normal", - "maintenance", "down", "error". -* VirtualResourceType (String): type of the virtual resource, e.g. "virtual - machine", "virtual memory", "virtual storage", "virtual CPU", or "virtual - NIC". -* FaultID (Identifier): identifies the related fault in the underlying physical - resource. This can be used to correlate different fault notifications caused - by the same fault in the physical resource. -* FaultType (String): Type of the fault. The allowed values for this parameter - depend on the type of the related physical resource. For example, a resource - of type "compute hardware" may have faults of type "CPU failure", "memory - failure", "network card failure", etc. 
-* Severity (Integer): value expressing the severity of the fault. The higher the - value, the more severe the fault. -* MinSeverity (Integer): value used in filter information elements. Only faults - with a severity higher than the MinSeverity value will be notified to the - Consumer. -* EventTime (Datetime): Time when the fault was observed. -* EventStartTime and EventEndTime (Datetime): Datetime range that can be used in - a FaultQueryFilter to narrow down the faults to be queried. -* ProbableCause (String): information about the probable cause of the fault. -* CorrelatedFaultID (Integer): list of other faults correlated to this fault. -* isRootCause (Boolean): Parameter indicating if this fault is the root for - other correlated faults. If TRUE, then the faults listed in the parameter - CorrelatedFaultID are caused by this fault. -* FaultDetails (Key-value pair): provides additional information about the - fault, e.g. information about the threshold, monitored attributes, indication - of the trend of the monitored parameter. -* FirmwareVersion (String): current version of the firmware of a physical - resource. -* HypervisorVersion (String): current version of a hypervisor. -* ZoneID (Identifier): Identifier of the resource zone. A resource zone is the - logical separation of physical and software resources in an NFVI deployment - for physical isolation, redundancy, or administrative designation. -* Metadata (Key-value pair): provides additional information of a physical - resource in maintenance/error state. - -Complex information elements (see also UML diagrams in :numref:`figure13` -and :numref:`figure14`): - -* VirtualResourceInfoClass: - - + VirtualResourceID [1] (Identifier) - + VirtualResourceState [1] (String) - + Faults [0..*] (FaultClass): For each resource, all faults - including detailed information about the faults are provided. 
- -* FaultClass: The parameters of the FaultClass are partially based on ETSI TS - 132 111-2 (V12.1.0) [*]_, which is specifying fault management in 3GPP, in - particular describing the information elements used for alarm notifications. - - - FaultID [1] (Identifier) - - FaultType [1] (String) - - Severity [1] (Integer) - - EventTime [1] (Datetime) - - ProbableCause [1] (String) - - CorrelatedFaultID [0..*] (Identifier) - - FaultDetails [0..*] (Key-value pair) - -.. [*] http://www.etsi.org/deliver/etsi_ts/132100_132199/13211102/12.01.00_60/ts_13211102v120100p.pdf - -* SubscribeFilterClass - - - VirtualResourceType [0..*] (String) - - VirtualResourceID [0..*] (Identifier) - - FaultType [0..*] (String) - - MinSeverity [0..1] (Integer) - -* FaultQueryFilterClass: narrows down the FaultQueryRequest, for example it - limits the query to certain physical resources, a certain zone, a given fault - type/severity/cause, or a specific FaultID. - - - VirtualResourceType [0..*] (String) - - VirtualResourceID [0..*] (Identifier) - - FaultType [0..*] (String) - - MinSeverity [0..1] (Integer) - - EventStartTime [0..1] (Datetime) - - EventEndTime [0..1] (Datetime) - -* PhysicalResourceStateClass: - - - PhysicalResourceID [1] (Identifier) - - PhysicalResourceState [1] (String): mandates the new state of the physical - resource. - - Metadata [0..*] (Key-value pair) - -* PhysicalResourceInfoClass: - - - PhysicalResourceID [1] (Identifier) - - PhysicalResourceState [1] (String) - - FirmwareVersion [0..1] (String) - - HypervisorVersion [0..1] (String) - - ZoneID [0..1] (Identifier) - - Metadata [0..*] (Key-value pair) - -* StateQueryFilterClass: narrows down a StateQueryRequest, for example it limits - the query to certain physical resources, a certain zone, or a given resource - state (e.g., only resources in "maintenance" state). - - - PhysicalResourceID [1] (Identifier) - - PhysicalResourceState [1] (String) - - ZoneID [0..1] (Identifier) - -.. 
_impl_nbi: - -Detailed northbound interface specification -------------------------------------------- - -This section specifies the northbound interfaces for fault management and -NFVI maintenance between the VIM on the one end and the Consumer and the -Administrator on the other end. For each interface all messages and related -information elements are provided. - -Note: The interface definition will be aligned with current work in ETSI NFV IFA -working group. - -All of the interfaces described below are produced by the VIM and consumed by -the Consumer or Administrator. - -Fault management interface -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This interface allows the VIM to notify the Consumer about a virtual resource -that is affected by a fault, either within the virtual resource itself or by the -underlying virtualization infrastructure. The messages on this interface are -shown in :numref:`figure13` and explained in detail in the following -subsections. - -Note: The information elements used in this section are described in detail in -Section 5.4. - -.. figure:: images/figure13.png - :name: figure13 - :width: 100% - - Fault management NB I/F messages - - -SubscribeRequest (Consumer -> VIM) -__________________________________ - -Subscription from Consumer to VIM to be notified about faults of specific -resources. The faults to be notified about can be narrowed down using a -subscribe filter. - -Parameters: - -- SubscribeFilter [1] (SubscribeFilterClass): Optional information to narrow - down the faults that shall be notified to the Consumer, for example limit to - specific VirtualResourceID(s), severity, or cause of the alarm. - -SubscribeResponse (VIM -> Consumer) -___________________________________ - -Response to a subscribe request message including information about the -subscribed resources, in particular if they are in "fault/error" state. - -Parameters: - -* SubscriptionID [1] (Identifier): Unique identifier for the subscription.
It - can be used to delete or update the subscription. -* VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional - information about the subscribed resources, i.e., a list of the related - resources, the current state of the resources, etc. - -FaultNotification (VIM -> Consumer) -___________________________________ - -Notification about a virtual resource that is affected by a fault, either within -the virtual resource itself or by the underlying virtualization infrastructure. -After reception of this notification, the Consumer will decide on the optimal -action to resolve the fault. This includes actions like switching to a hot -standby virtual resource, migration of the faulty virtual resource to another -physical machine, termination of the faulty virtual resource and instantiation -of a new virtual resource in order to provide a new hot standby resource. In -some use cases the Consumer can leave virtual resources on the failed host to be -booted up again after the fault is recovered. Existing resource management -interfaces and messages between the Consumer and the VIM can be used for those -actions, and there is no need to define additional actions on the Fault -Management Interface. - -Parameters: - -* NotificationID [1] (Identifier): Unique identifier for the notification. -* VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of faulty - resources with detailed information about the faults. - -FaultQueryRequest (Consumer -> VIM) -___________________________________ - -Request to find out about active alarms at the VIM. A FaultQueryFilter can be -used to narrow down the alarms returned in the response message. - -Parameters: - -* FaultQueryFilter [1] (FaultQueryFilterClass): narrows down the - FaultQueryRequest, for example it limits the query to certain physical - resources, a certain zone, a given fault type/severity/cause, or a specific - FaultID.
- -FaultQueryResponse (VIM -> Consumer) -____________________________________ - -List of active alarms at the VIM matching the FaultQueryFilter specified in the -FaultQueryRequest. - -Parameters: - -* VirtualResourceInfo [0..*] (VirtualResourceInfoClass): List of faulty - resources. For each resource all faults including detailed information about - the faults are provided. - -NFVI maintenance -^^^^^^^^^^^^^^^^ - -The NFVI maintenance interface Consumer-VIM allows the Consumer to subscribe to -maintenance notifications provided by the VIM. The related maintenance interface -Administrator-VIM allows the Administrator to issue maintenance requests to the -VIM, i.e. requesting the VIM to take appropriate actions to empty physical -machine(s) in order to execute maintenance operations on them. The interface -also allows the Administrator to query the state of physical machines, e.g., in -order to get details on the current status of the maintenance operation like a -firmware update. - -The messages defined in these northbound interfaces are shown in :numref:`figure14` -and described in detail in the following subsections. - -.. figure:: images/figure14.png - :name: figure14 - :width: 100% - - NFVI maintenance NB I/F messages - -SubscribeRequest (Consumer -> VIM) -__________________________________ - -Subscription from Consumer to VIM to be notified about maintenance operations -for specific virtual resources. The resources to be informed about can be -narrowed down using a subscribe filter. - -Parameters: - -* SubscribeFilter [1] (SubscribeFilterClass): Information to narrow down the - faults that shall be notified to the Consumer, for example limit to specific - virtual resource type(s). - -SubscribeResponse (VIM -> Consumer) -___________________________________ - -Response to a subscribe request message, including information about the -subscribed virtual resources, in particular if they are in "maintenance" state.
- -Parameters: - -* SubscriptionID [1] (Identifier): Unique identifier for the subscription. It - can be used to delete or update the subscription. -* VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional - information about the subscribed virtual resource(s), e.g., the ID, type and - current state of the resource(s). - -MaintenanceNotification (VIM -> Consumer) -_________________________________________ - -Notification about a physical resource switched to "maintenance" state. After -reception of this notification, the Consumer will decide on the optimal action to -address it, e.g., to switch to the standby (STBY) configuration. - -Parameters: - -* VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of virtual - resources where the state has been changed to maintenance. - -StateChangeRequest (Administrator -> VIM) -_________________________________________ - -Request to change the state of a list of physical resources, e.g. to -"maintenance" state, in order to prepare them for a planned maintenance -operation. - -Parameters: - -* PhysicalResourceState [1..*] (PhysicalResourceStateClass) - -StateChangeResponse (VIM -> Administrator) -__________________________________________ - -Response message to inform the Administrator that the requested resources are -now in maintenance state (or the operation resulted in an error) and the -maintenance operation(s) can be executed. - -Parameters: - -* PhysicalResourceInfo [1..*] (PhysicalResourceInfoClass) - -StateQueryRequest (Administrator -> VIM) -________________________________________ - -In this procedure, the Administrator would like to get the information about -physical machine(s), e.g. their state ("normal", "maintenance"), firmware -version, hypervisor version, update status of firmware and hypervisor, etc. It -can be used to check the progress during firmware update and the confirmation -after update.
A filter can be used to narrow down the resources returned in the -response message. - -Parameters: - -* StateQueryFilter [1] (StateQueryFilterClass): narrows down the - StateQueryRequest, for example it limits the query to certain physical - resources, a certain zone, or a given resource state. - -StateQueryResponse (VIM -> Administrator) -_________________________________________ - -List of physical resources matching the filter specified in the -StateQueryRequest. - -Parameters: - -* PhysicalResourceInfo [0..*] (PhysicalResourceInfoClass): List of physical - resources. For each resource, information about the current state, the - firmware version, etc. is provided. - -NFV IFA, OPNFV Doctor and AODH alarms -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This section compares the alarm interfaces of ETSI NFV IFA with the specifications -of this document and the alarm class of AODH. - -ETSI NFV specifies an interface for alarms from virtualised resources in ETSI GS -NFV-IFA 005 [ENFV]_. The interface specifies an Alarm class and two notifications plus -operations to query alarm instances and to subscribe to the alarm notifications. - -The specification in this document has a structure that is very similar to the -ETSI NFV specifications. The notifications differ in that an alarm notification -in the NFV interface defines a single fault for a single resource while the -notification specified in this document can contain multiple faults for -multiple resources. The Doctor specification is lacking the detailed time stamps -of the NFV specification essential for synchronization of the alarm list -using the query operation. The detailed time stamps are also of value in the event -and alarm history DBs. - -AODH defines a base class for alarms, not the notifications. This means that -some of the dynamic attributes of the ETSI NFV alarm type, like alarmRaisedTime, -are not applicable to the AODH alarm class but are attributes of the actual -notifications.
(Description of these attributes will be added later.) The AODH alarm -class is lacking some attributes present in the NFV specification, fault details -and correlated alarms. Instead the AODH alarm class has attributes for actions, -rules and user and project id. - - -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| ETSI NFV Alarm Type | OPNFV Doctor | AODH Event Alarm | Description / Comment | Recommendations | -| | Requirement Specs | Notification | | | -+========================+========================+=====================+=============================================+=======================================+ -| alarmId | FaultId | alarm_id | Identifier of an alarm. | \- | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| \- | \- | alarm_name | Human readable alarm name. | May be added in ETSI NFV Stage 3. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| managedObjectId | VirtualResourceId | (reason) | Identifier of the affected virtual resource | \- | -| | | | is part of the AODH reason parameter. | | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| \- | \- | user_id, project_id | User and project identifiers. | May be added in ETSI NFV Stage 3. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| alarmRaisedTime | \- | \- | Timestamp when alarm was raised. | To be added to Doctor and AODH. May | -| | | | | be derived (e.g. in a shimlayer) from | -| | | | | the AODH alarm history. 
| -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| alarmChangedTime | \- | \- | Timestamp when alarm was changed/updated. | see above | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| alarmClearedTime | \- | \- | Timestamp when alarm was cleared. | see above | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| eventTime | \- | \- | Timestamp when alarm was first observed by | see above | -| | | | the Monitor. | | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| \- | EventTime | generated | Timestamp of the Notification. | Update parameter name in Doctor spec. | -| | | | | May be added in ETSI NFV Stage 3. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| state: | VirtualResourceState: | current: ok, alarm, | ETSI NFV IFA 005/006 lists example alarm | Maintenance state is missing in AODH. | -| E.g. Fired, Updated | E.g. normal, down | insufficient_data | states. | List of alarm states will be | -| Cleared | maintenance, error | | | specified in ETSI NFV Stage 3. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| perceivedSeverity: | Severity (Integer) | Severity: | ETSI NFV IFA 005/006 lists example | List of alarm states will be | -| E.g. Critical, Major, | | low (default), | perceived severity values. | specified in ETSI NFV Stage 3. 
| -| Minor, Warning, | | moderate, critical | | | -| Indeterminate, Cleared | | | | **OPNFV: Severity (Integer)**: | -| | | | | * update OPNFV Doctor specification | -| | | | | to *Enum* | -| | | | | | -| | | | | **perceivedSeverity=Indetermined**: | -| | | | | * remove value *Indetermined* in | -| | | | | IFA and map undefined values to | -| | | | | “minor” severity, or | -| | | | | * add value *indetermined* in AODH | -| | | | | and make it the default value. | -| | | | | | -| | | | | **perceivedSeverity=Cleared**: | -| | | | | * remove value *Cleared* in IFA as | -| | | | | the information about a cleared | -| | | | | alarm alarm can be derived from | -| | | | | the alarm state parameter, or | -| | | | | * add value *cleared* in AODH and | -| | | | | set a rule that the severity is | -| | | | | “cleared” when the state is *ok*. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| faultType | FaultType | event_type in | Type of the fault, e.g. “CPU failure” of a | OpenStack Alarming (Aodh) can use a | -| | | reason_data | compute resource, in machine interpretable | fuzzy matching with wildcard string, | -| | | | format. | "compute.cpu.failure". | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| N/A | N/A | type = "event" | Type of the notification. For fault | \- | -| | | | notifications the type in AODH is “event”. | | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| probableCause | ProbableCause | \- | Probable cause of the alarm. | May be provided (e.g. in a shimlayer) | -| | | | | based on Vitrage topology awareness / | -| | | | | root-cause-analysis. 
| -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| isRootCause | IsRootCause | \- | Boolean indicating whether the fault is the | see above | -| | | | root cause of other faults. | | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| correlatedAlarmId | CorrelatedFaultId | \- | List of IDs of correlated faults. | see above | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| faultDetails | FaultDetails | \- | Additional details about the fault/alarm. | FaultDetails information element will | -| | | | | be specified in ETSI NFV Stage 3. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| \- | \- | action, previous | Additional AODH alarm related parameters. | \- | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ - -Table: Comparison of alarm attributes - -The primary area of improvement should be alignment of the perceived severity. This -is important for a quick and accurate evaluation of the alarm. AODH thus should -support also the X.733 values Critical, Major, Minor, Warning and Indeterminate. - -The detailed time stamps (raised, changed, cleared) which are essential for -synchronizing the alarm list using a query operation should be added to the -Doctor specification. - -Other areas that need alignment is the so called alarm state in NFV. Here we must -however consider what can be attributes of the notification vs. what should be a -property of the alarm instance. This will be analyzed later. - -.. 
_southbound:
-
-Detailed southbound interface specification
--------------------------------------------
-
-This section specifies the southbound interfaces for fault management
-between the Monitors and the Inspector.
-Although southbound interfaces should be flexible enough to handle various events
-from different types of Monitors, we define a unified event API in order to improve
-interoperability between the Monitors and the Inspector.
-This does not limit the implementation of Monitor and Inspector, as these could be
-extended to support failures from intelligent inspection such as prediction.
-
-Note: The interface definition will be aligned with current work in the ETSI NFV
-IFA working group.
-
-Fault event interface
-^^^^^^^^^^^^^^^^^^^^^
-
-This interface allows the Monitors to notify the Inspector about an event which
-was captured by the Monitor and may affect resources managed in the VIM.
-
-EventNotification
-_________________
-
-
-Event notification including fault description.
-The entity of this notification is an event, not specifically a fault or error.
-This allows us to use a generic event format or framework built outside of the
-Doctor project.
-The parameters below shall be mandatory, but keys in 'Details' can be optional.
-
-Parameters:
-
-* Time [1]: Datetime when the fault was observed in the Monitor.
-* Type [1]: Type of event that will be used to process correlation in Inspector.
-* Details [0..1]: Details containing additional information in key-value pair style.
-  Keys shall be defined depending on the Type of the event.
-
-E.g.:
-
-.. code-block:: json
-
-    {
-        "event": {
-            "time": "2016-04-12T08:00:00",
-            "type": "compute.host.down",
-            "details": {
-                "hostname": "compute-1",
-                "source": "sample_monitor",
-                "cause": "link-down",
-                "severity": "critical",
-                "status": "down",
-                "monitor_id": "monitor-1",
-                "monitor_event_id": "123"
-            }
-        }
-    }
-
-Optional parameters in 'Details':
-
-* Hostname: the hostname on which the event occurred.
-* Source: the display name of the reporter of this event. This is not limited to
-  monitors; other entities such as 'KVM' can be specified.
-* Cause: description of the cause of this event, which may differ from the type
-  of this event.
-* Severity: the severity of this event set by the monitor.
-* Status: the status of the target object in which the error occurred.
-* MonitorID: the ID of the monitor sending this event.
-* MonitorEventID: the ID of the event in the monitor. This can be used by the
-  operator when tracing the monitor log.
-* RelatedTo: the array of IDs related to this event.
-
-Also, we can have a bulk API to receive multiple events in a single HTTP POST
-message by using the 'events' wrapper as follows:
-
-.. code-block:: json
-
-    {
-        "events": [
-            {
-                "event": {
-                    "time": "2016-04-12T08:00:00",
-                    "type": "compute.host.down",
-                    "details": {}
-                }
-            },
-            {
-                "event": {
-                    "time": "2016-04-12T08:00:00",
-                    "type": "compute.host.nic.error",
-                    "details": {}
-                }
-            }
-        ]
-    }
-
-
-
-
-Blueprints
-----------
-
-This section lists a first set of blueprints that have been proposed by the
-Doctor project to the open source community. Further blueprints addressing other
-gaps identified in Section 4 will be submitted at a later stage of OPNFV. In
-this section the following definitions are used:
-
-* "Event" is a message emitted by other OpenStack services such as Nova and
-  Neutron and is consumed by the "Notification Agents" in Ceilometer.
-* "Notification" is a message generated by a "Notification Agent" in Ceilometer
-  based on an "event" and is delivered to the "Collectors" in Ceilometer, which
-  store those notifications (as "sample") in the Ceilometer "Databases".
-
-Instance State Notification (Ceilometer) [*]_
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The Doctor project is planning to handle "events" and "notifications" regarding
-Resource Status: Instance State, Port State, Host State, etc.
Currently,
-Ceilometer already receives "events" to identify the state of those resources,
-but it does not handle and store them yet. This is why we also need a new event
-definition to capture those resource states from "events" created by other
-services.
-
-This BP proposes to add a new compute notification state to handle events from
-an instance (server) from nova. It also creates a new meter "instance.state" in
-OpenStack.
-
-.. [*] https://etherpad.opnfv.org/p/doctor_bps
-
-Event Publisher for Alarm (Ceilometer) [*]_
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-**Problem statement:**
-
-  The existing "Alarm Evaluator" in OpenStack Ceilometer periodically
-  queries/polls the databases in order to check all alarms independently from
-  other processes. This adds delay to the fault notification sent to the
-  Consumer, whereas one requirement of Doctor is to react to faults as fast as
-  possible.
-
-  The existing message flow is shown in :numref:`figure12`: after receiving
-  an "event", a "notification agent" (i.e. "event publisher") will send a
-  "notification" to a "Collector". The "collector" collects the notifications
-  and updates the Ceilometer "Meter" database, which stores information about
-  the "sample" captured from the original "event". The "Alarm Evaluator"
-  periodically polls these databases and then queries the "Meter" database
-  based on each alarm configuration.
-
-  In the current Ceilometer implementation, there is no possibility to directly
-  trigger the "Alarm Evaluator" when a new "event" is received; the "Alarm
-  Evaluator" will only find out that it needs to fire a new notification to the
-  Consumer when polling the database.
-
-**Change/feature request:**
-
-  This BP proposes to add a new "event publisher for alarm", which bypasses
-  several steps in Ceilometer in order to avoid the polling-based approach of
-  the existing Alarm Evaluator, which makes notifications to users slow.
-
-  After receiving an "(alarm) event" by listening on the Ceilometer message
-  queue ("notification bus"), the new "event publisher for alarm" immediately
-  hands a "notification" about this event to a new Ceilometer component
-  "Notification-driven alarm evaluator" proposed in the other BP (see Section
-  5.6.3).
-
-  Note, the term "publisher" refers to an entity in the Ceilometer architecture
-  (it is a "notification agent"). It offers the capability to provide
-  notifications to other services outside of Ceilometer, but it is also used to
-  deliver notifications to other Ceilometer components (e.g. the "Collectors")
-  via the Ceilometer "notification bus".
-
-**Implementation detail**
-
-  * "Event publisher for alarm" is part of Ceilometer.
-  * The standard AMQP message queue is used with a new topic string.
-  * No new interfaces have to be added to Ceilometer.
-  * "Event publisher for Alarm" can be configured by the Administrator of
-    Ceilometer to be used as "Notification Agent" in addition to the existing
-    "Notifier".
-  * Existing alarm mechanisms of Ceilometer can be used, allowing users to
-    configure how to distribute the "notifications" transformed from "events",
-    e.g. there is an option whether an ongoing alarm is re-issued or not
-    ("repeat_actions").
-
-.. [*] https://etherpad.opnfv.org/p/doctor_bps
-
-Notification-driven alarm evaluator (Ceilometer) [*]_
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-**Problem statement:**
-
-The existing "Alarm Evaluator" in OpenStack Ceilometer periodically
-queries/polls the databases in order to check all alarms independently from
-other processes. This adds delay to the fault notification sent
-to the Consumer, whereas one requirement of Doctor is to react to faults as fast
-as possible.
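
The delay added by a polling-based evaluator can be illustrated with a minimal
sketch. It assumes, purely for illustration, that the evaluator polls at fixed
times 0, ``interval``, 2 x ``interval`` and so on; the function name and the
numbers are hypothetical and not taken from Ceilometer.

```python
import math


def polling_delay(event_time, interval):
    """Seconds between an event and the next polling tick.

    Assumption: the evaluator polls at times 0, interval, 2 * interval, ...
    """
    next_tick = math.ceil(event_time / interval) * interval
    return next_tick - event_time


# Hypothetical numbers: an event at t=12s with a 60s polling interval
# is only evaluated at t=60s.
delay = polling_delay(12, 60)
```

Under this assumption the delay is always below one polling interval but can
come arbitrarily close to it, which is the latency a notification-driven
approach is meant to avoid.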
-
-**Change/feature request:**
-
-This BP proposes to add an alternative "Notification-driven Alarm Evaluator"
-for Ceilometer that receives "notifications" sent by the "Event Publisher
-for Alarm" described in the other BP. Once this new "Notification-driven Alarm
-Evaluator" has received a "notification", it finds the "alarm" configurations
-which may relate to the "notification" by querying the "alarm" database with
-some keys, e.g. resource ID, and then evaluates each alarm with the information
-in that "notification".
-
-After the alarm evaluation, it fires alarm notifications to the Consumer in the
-same way as the existing "alarm evaluator" does. Similar to the
-existing Alarm Evaluator, this new "Notification-driven Alarm Evaluator"
-aggregates and correlates different alarms, which are then provided northbound
-to the Consumer via the OpenStack "Alarm Notifier". The user/administrator can
-register the alarm configuration via the existing Ceilometer API [*]_. Thereby,
-they can configure whether to set an alarm or not and where to send the alarms.
-
-**Implementation detail**
-
-* The new "Notification-driven Alarm Evaluator" is part of Ceilometer.
-* Most of the existing source code of the "Alarm Evaluator" can be re-used to
-  implement this BP.
-* No additional application logic is needed.
-* It will access the Ceilometer Databases just like the existing "Alarm
-  evaluator".
-* Only the polling-based approach will be replaced by a listener for
-  "notifications" provided by the "Event Publisher for Alarm" on the Ceilometer
-  "notification bus".
-* No new interfaces have to be added to Ceilometer.
-
-
-.. [*] https://etherpad.opnfv.org/p/doctor_bps
-.. 
[*] https://wiki.openstack.org/wiki/Ceilometer/Alerting
-
-Report host fault to update server state immediately (Nova) [*]_
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-**Problem statement:**
-
-* The Nova state change for a failed or unreachable host is slow and does not
-  reliably state whether the host is down or not. This might cause the same
-  server instance to run twice if action is taken to evacuate the instance to
-  another host.
-* The Nova state for server(s) on a failed host will not change, but remains
-  active and running. This gives the user false information about server state.
-* VIM northbound interface notification of host faults towards VNFM and NFVO
-  should be in line with OpenStack state. This fault notification is a Telco
-  requirement defined in ETSI and will be implemented by the OPNFV Doctor project.
-* An OpenStack user cannot take HA actions quickly and reliably by trusting
-  server state and host state.
-
-**Proposed change:**
-
-There needs to be a new API for the Admin to state that a host is down. This API
-is used to mark services running on the host as down to reflect the real situation.
-
-An example on a compute node is:
-
-* When the compute node is up and running::
-
-    vm_state: active and power_state: running
-    nova-compute state: up status: enabled
-
-* When the compute node goes down and the new API is called to state that the
-  host is down::
-
-    vm_state: stopped power_state: shutdown
-    nova-compute state: down status: enabled
-
-**Alternatives:**
-
-There is no attractive alternative for detecting all the different host faults
-other than having an external tool detect them. For this kind of tool to
-exist, there needs to be a new API in Nova to report faults. Currently, some
-kind of workaround must be implemented, as one cannot trust the states from
-OpenStack or get them fast enough.
-
-.. 
[*] https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately
-
-Other related BPs
-^^^^^^^^^^^^^^^^^
-
-This section lists some BPs related to Doctor, but proposed by drafters outside
-the OPNFV community.
-
-pacemaker-servicegroup-driver [*]_
-__________________________________
-
-This BP will detect and report a host-down condition to OpenStack quite fast.
-However, this might not work properly, for example, when the management network
-has a problem and the host is reported faulty while VMs are still running there.
-This might lead to launching the same VM instance twice, causing problems. Also,
-the northbound interface (NB IF) message needs a fault reason, and for that the
-source needs to be a tool that detects different kinds of faults, as Doctor will
-be doing. This BP might also need enhancement to change server and service
-states correctly.
-
-.. [*] https://blueprints.launchpad.net/nova/+spec/pacemaker-servicegroup-driver
diff --git a/docs/requirements/06-summary.rst b/docs/requirements/06-summary.rst
deleted file mode 100644
index 61bf3f47..00000000
--- a/docs/requirements/06-summary.rst
+++ /dev/null
@@ -1,24 +0,0 @@
-.. This work is licensed under a Creative Commons Attribution 4.0 International License.
-.. http://creativecommons.org/licenses/by/4.0
-
-Summary and conclusion
-======================
-
-The Doctor project aimed at detailing NFVI fault management and NFVI maintenance
-requirements. These are indispensable operations for an Operator, and essential
-to realizing telco-grade high availability. High availability is a large
-topic; the objective of Doctor is not to realize a complete high availability
-architecture and implementation. Instead, Doctor limited itself to addressing
-the fault events in NFVI, and proposes enhancements necessary in VIM, e.g.
-OpenStack, to ensure VNF availability during such fault events, taking a Telco
-VNF application-level management system into account.
- -The Doctor project performed a robust analysis of the requirements from NFVI -fault management and NFVI maintenance operation, concretely found out gaps in -between such requirements and the current implementation of OpenStack, and -proposed potential development plans to fill out such gaps in OpenStack. -Blueprints are already under investigation and the next step is to fill out -those gaps in OpenStack by code development in the coming releases. - -.. - vim: set tabstop=4 expandtab textwidth=80: diff --git a/docs/requirements/07-annex.rst b/docs/requirements/07-annex.rst deleted file mode 100644 index c3a7899d..00000000 --- a/docs/requirements/07-annex.rst +++ /dev/null @@ -1,129 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -.. _nfvi_faults: - -Annex: NFVI Faults -================================================= - -Faults in the listed elements need to be immediately notified to the Consumer in -order to perform an immediate action like live migration or switch to a hot -standby entity. In addition, the Administrator of the host should trigger a -maintenance action to, e.g., reboot the server or replace a defective hardware -element. - -Faults can be of different severity, i.e., critical, warning, or -info. Critical faults require immediate action as a severe degradation of the -system has happened or is expected. Warnings indicate that the system -performance is going down: related actions include closer (e.g. more frequent) -monitoring of that part of the system or preparation for a cold migration to a -backup VM. Info messages do not require any action. We also consider a type -"maintenance", which is no real fault, but may trigger maintenance actions -like a re-boot of the server or replacement of a faulty, but redundant HW. - -Faults can be gathered by, e.g., enabling SNMP and installing some open source -tools to catch and poll SNMP. 
When using for example Zabbix one can also put an -agent running on the hosts to catch any other fault. In any case of failure, the -Administrator should be notified. The following tables provide a list of high -level faults that are considered within the scope of the Doctor project -requiring immediate action by the Consumer. - -**Compute/Storage** - -+-------------------+----------+------------+-----------------+------------------+ -| Fault | Severity | How to | Comment | Immediate action | -| | | detect? | | to recover | -+===================+==========+============+=================+==================+ -| Processor/CPU | Critical | Zabbix | | Switch to hot | -| failure, CPU | | | | standby | -| condition not ok | | | | | -+-------------------+----------+------------+-----------------+------------------+ -| Memory failure/ | Critical | Zabbix | | Switch to hot | -| Memory condition | | (IPMI) | | standby | -| not ok | | | | | -+-------------------+----------+------------+-----------------+------------------+ -| Network card | Critical | Zabbix/ | | Switch to hot | -| failure, e.g. | | Ceilometer | | standby | -| network adapter | | | | | -| connectivity lost | | | | | -+-------------------+----------+------------+-----------------+------------------+ -| Disk crash | Info | RAID | Network storage | Inform OAM | -| | | monitoring | is very | | -| | | | redundant (e.g. 
| | -| | | | RAID system) | | -| | | | and can | | -| | | | guarantee high | | -| | | | availability | | -+-------------------+----------+------------+-----------------+------------------+ -| Storage | Critical | Zabbix | | Live migration | -| controller | | (IPMI) | | if storage | -| | | | | is still | -| | | | | accessible; | -| | | | | otherwise hot | -| | | | | standby | -+-------------------+----------+------------+-----------------+------------------+ -| PDU/power | Critical | Zabbix/ | | Switch to hot | -| failure, power | | Ceilometer | | standby | -| off, server reset | | | | | -+-------------------+----------+------------+-----------------+------------------+ -| Power | Warning | SNMP | | Live migration | -| degration, power | | | | | -| redundancy lost, | | | | | -| power threshold | | | | | -| exceeded | | | | | -+-------------------+----------+------------+-----------------+------------------+ -| Chassis problem | Warning | SNMP | | Live migration | -| (e.g. fan | | | | | -| degraded/failed, | | | | | -| chassis power | | | | | -| degraded), CPU | | | | | -| fan problem, | | | | | -| temperature/ | | | | | -| thermal condition | | | | | -| not ok | | | | | -+-------------------+----------+------------+-----------------+------------------+ -| Mainboard failure | Critical | Zabbix | e.g. PCIe, SAS | Switch to hot | -| | | (IPMI) | link failure | standby | -+-------------------+----------+------------+-----------------+------------------+ -| OS crash (e.g. | Critical | Zabbix | | Switch to hot | -| kernel panic) | | | | standby | -+-------------------+----------+------------+-----------------+------------------+ - -**Hypervisor** - -+----------------+----------+------------+-------------+-------------------+ -| Fault | Severity | How to | Comment | Immediate action | -| | | detect? 
| | to recover | -+================+==========+============+=============+===================+ -| System has | Critical | Zabbix | | Switch to hot | -| restarted | | | | standby | -+----------------+----------+------------+-------------+-------------------+ -| Hypervisor | Warning/ | Zabbix/ | | Evacuation/switch | -| failure | Critical | Ceilometer | | to hot standby | -+----------------+----------+------------+-------------+-------------------+ -| Hypervisor | Warning | Alarming | Zabbix/ | Rebuild VM | -| status not | | service | Ceilometer | | -| retrievable | | | unreachable | | -| after certain | | | | | -| period | | | | | -+----------------+----------+------------+-------------+-------------------+ - -**Network** - -+------------------+----------+---------+----------------+---------------------+ -| Fault | Severity | How to | Comment | Immediate action to | -| | | detect? | | recover | -+==================+==========+=========+================+=====================+ -| SDN/OpenFlow | Critical | Ceilo- | | Switch to | -| switch, | | meter | | hot standby | -| controller | | | | or reconfigure | -| degraded/failed | | | | virtual network | -| | | | | topology | -+------------------+----------+---------+----------------+---------------------+ -| Hardware failure | Warning | SNMP | Redundancy of | Live migration if | -| of physical | | | physical | possible otherwise | -| switch/router | | | infrastructure | evacuation | -| | | | is reduced or | | -| | | | no longer | | -| | | | available | | -+------------------+----------+---------+----------------+---------------------+ diff --git a/docs/requirements/99-references.rst b/docs/requirements/99-references.rst deleted file mode 100644 index 0fd3a36a..00000000 --- a/docs/requirements/99-references.rst +++ /dev/null @@ -1,32 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. 
http://creativecommons.org/licenses/by/4.0 - -References and bibliography -=========================== - -.. [DOCT] OPNFV, "Doctor" requirements project, [Online]. Available at - https://wiki.opnfv.org/doctor -.. [PRED] OPNFV, "Data Collection for Failure Prediction" requirements project - [Online]. Available at https://wiki.opnfv.org/prediction -.. [OPSK] OpenStack, [Online]. Available at https://www.openstack.org/ -.. [CEIL] OpenStack Telemetry (Ceilometer), [Online]. Available at - https://wiki.openstack.org/wiki/Ceilometer -.. [NOVA] OpenStack Nova, [Online]. Available at - https://wiki.openstack.org/wiki/Nova -.. [NEUT] OpenStack Neutron, [Online]. Available at - https://wiki.openstack.org/wiki/Neutron -.. [CIND] OpenStack Cinder, [Online]. Available at - https://wiki.openstack.org/wiki/Cinder -.. [MONA] OpenStack Monasca, [Online], Available at - https://wiki.openstack.org/wiki/Monasca -.. [OSAG] OpenStack Cloud Administrator Guide, [Online]. Available at - http://docs.openstack.org/admin-guide-cloud/content/ -.. [ZABB] ZABBIX, the Enterprise-class Monitoring Solution for Everyone, - [Online]. Available at http://www.zabbix.com/ -.. [ENFV] ETSI NFV, [Online]. Available at - http://www.etsi.org/technologies-clusters/technologies/nfv - - - -.. - vim: set tabstop=4 expandtab textwidth=80: diff --git a/docs/requirements/glossary.rst b/docs/requirements/glossary.rst deleted file mode 100644 index 2c82b37f..00000000 --- a/docs/requirements/glossary.rst +++ /dev/null @@ -1,89 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -**Definition of terms** - -Different SDOs and communities use different terminology related to -NFV/Cloud/SDN. This list tries to define an OPNFV terminology, -mapping/translating the OPNFV terms to terminology used in other contexts. - - -.. glossary:: - - ACT-STBY configuration - Failover configuration common in Telco deployments. 
It enables the - operator to use a standby (STBY) instance to take over the functionality - of a failed active (ACT) instance. - - Administrator - Administrator of the system, e.g. OAM in Telco context. - - Consumer - User-side Manager; consumer of the interfaces produced by the VIM; VNFM, - NFVO, or Orchestrator in ETSI NFV [ENFV]_ terminology. - - EPC - Evolved Packet Core, the main component of the core network architecture - of 3GPP's LTE communication standard. - - MME - Mobility Management Entity, an entity in the EPC dedicated to mobility - management. - - NFV - Network Function Virtualization - - NFVI - Network Function Virtualization Infrastructure; totality of all hardware - and software components which build up the environment in which VNFs are - deployed. - - S/P-GW - Serving/PDN-Gateway, two entities in the EPC dedicated to routing user - data packets and providing connectivity from the UE to external packet - data networks (PDN), respectively. - - Physical resource - Actual resources in NFVI; not visible to Consumer. - - VNFM - Virtualized Network Function Manager; functional block that is - responsible for the lifecycle management of VNF. - - NFVO - Network Functions Virtualization Orchestrator; functional block that - manages the Network Service (NS) lifecycle and coordinates the - management of NS lifecycle, VNF lifecycle (supported by the VNFM) and - NFVI resources (supported by the VIM) to ensure an optimized allocation - of the necessary resources and connectivity. - - VIM - Virtualized Infrastructure Manager; functional block that is responsible - for controlling and managing the NFVI compute, storage and network - resources, usually within one operator's Infrastructure Domain, e.g. - NFVI Point of Presence (NFVI-PoP). - - Virtual Machine (VM) - Virtualized computation environment that behaves very much like a - physical computer/server. 
- - Virtual network - Virtual network routes information among the network interfaces of VM - instances and physical network interfaces, providing the necessary - connectivity. - - Virtual resource - A Virtual Machine (VM), a virtual network, or virtualized storage; - Offered resources to "Consumer" as result of infrastructure - virtualization; visible to Consumer. - - Virtual Storage - Virtualized non-volatile storage allocated to a VM. - - VNF - Virtualized Network Function. Implementation of a Network Function that - can be deployed on a Network Function Virtualization Infrastructure - (NFVI). - -.. - vim: set tabstop=4 expandtab textwidth=80: diff --git a/docs/requirements/images/LICENSE b/docs/requirements/images/LICENSE deleted file mode 100644 index f2b47d20..00000000 --- a/docs/requirements/images/LICENSE +++ /dev/null @@ -1,14 +0,0 @@ -Copyright 2015 Open Platform for NFV Project, Inc. and its contributors - -Open Platform for NFV Project Documentation License -=================================================== -Any documentation developed by the "Open Platform for NFV Project" -is licensed under a Creative Commons Attribution 4.0 International License. -You should have received a copy of the license along with this. If not, -see . - -Unless required by applicable law or agreed to in writing, documentation -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
diff --git a/docs/requirements/images/figure1.png b/docs/requirements/images/figure1.png deleted file mode 100644 index 267ddddc..00000000 Binary files a/docs/requirements/images/figure1.png and /dev/null differ diff --git a/docs/requirements/images/figure10.png b/docs/requirements/images/figure10.png deleted file mode 100755 index d3268018..00000000 Binary files a/docs/requirements/images/figure10.png and /dev/null differ diff --git a/docs/requirements/images/figure11.png b/docs/requirements/images/figure11.png deleted file mode 100755 index b5fe0f8c..00000000 Binary files a/docs/requirements/images/figure11.png and /dev/null differ diff --git a/docs/requirements/images/figure12.png b/docs/requirements/images/figure12.png deleted file mode 100755 index 2d394629..00000000 Binary files a/docs/requirements/images/figure12.png and /dev/null differ diff --git a/docs/requirements/images/figure13.png b/docs/requirements/images/figure13.png deleted file mode 100755 index 5f8227a5..00000000 Binary files a/docs/requirements/images/figure13.png and /dev/null differ diff --git a/docs/requirements/images/figure14.png b/docs/requirements/images/figure14.png deleted file mode 100755 index b65ca9ae..00000000 Binary files a/docs/requirements/images/figure14.png and /dev/null differ diff --git a/docs/requirements/images/figure2.png b/docs/requirements/images/figure2.png deleted file mode 100644 index 9a3b166d..00000000 Binary files a/docs/requirements/images/figure2.png and /dev/null differ diff --git a/docs/requirements/images/figure3.png b/docs/requirements/images/figure3.png deleted file mode 100755 index ee04dfae..00000000 Binary files a/docs/requirements/images/figure3.png and /dev/null differ diff --git a/docs/requirements/images/figure4.png b/docs/requirements/images/figure4.png deleted file mode 100755 index 9eff177a..00000000 Binary files a/docs/requirements/images/figure4.png and /dev/null differ diff --git a/docs/requirements/images/figure5a.png 
b/docs/requirements/images/figure5a.png deleted file mode 100755 index d347b412..00000000 Binary files a/docs/requirements/images/figure5a.png and /dev/null differ diff --git a/docs/requirements/images/figure5b.png b/docs/requirements/images/figure5b.png deleted file mode 100755 index 75a43669..00000000 Binary files a/docs/requirements/images/figure5b.png and /dev/null differ diff --git a/docs/requirements/images/figure5c.png b/docs/requirements/images/figure5c.png deleted file mode 100755 index 4fb2ba03..00000000 Binary files a/docs/requirements/images/figure5c.png and /dev/null differ diff --git a/docs/requirements/images/figure6.png b/docs/requirements/images/figure6.png deleted file mode 100755 index cf0d2be9..00000000 Binary files a/docs/requirements/images/figure6.png and /dev/null differ diff --git a/docs/requirements/images/figure7.png b/docs/requirements/images/figure7.png deleted file mode 100755 index b88a2e65..00000000 Binary files a/docs/requirements/images/figure7.png and /dev/null differ diff --git a/docs/requirements/images/figure8.png b/docs/requirements/images/figure8.png deleted file mode 100755 index 907a0b30..00000000 Binary files a/docs/requirements/images/figure8.png and /dev/null differ diff --git a/docs/requirements/images/figure9.png b/docs/requirements/images/figure9.png deleted file mode 100755 index 61501c4d..00000000 Binary files a/docs/requirements/images/figure9.png and /dev/null differ diff --git a/docs/requirements/index.rst b/docs/requirements/index.rst deleted file mode 100644 index fcbfb88e..00000000 --- a/docs/requirements/index.rst +++ /dev/null @@ -1,62 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. 
http://creativecommons.org/licenses/by/4.0 - -**************************************** -Doctor: Fault Management and Maintenance -**************************************** - -:Project: Doctor, https://wiki.opnfv.org/doctor -:Editors: Ashiq Khan (NTT DOCOMO), Gerald Kunzmann (NTT DOCOMO) -:Authors: Ryota Mibu (NEC), Carlos Goncalves (NEC), Tomi Juvonen (Nokia), - Tommy Lindgren (Ericsson), Bertrand Souville (NTT DOCOMO), - Balazs Gibizer (Ericsson), Ildiko Vancsa (Ericsson) and others. - -:Abstract: Doctor is an OPNFV requirement project [DOCT]_. Its scope is NFVI - fault management, and maintenance and it aims at developing and - realizing the consequent implementation for the OPNFV reference - platform. - - This deliverable is introducing the use cases and operational - scenarios for Fault Management considered in the Doctor project. - From the general features, a high level architecture describing - logical building blocks and interfaces is derived. Finally, - a detailed implementation is introduced, based on available open - source components, and a related gap analysis is done as part of - this project. The implementation plan finally discusses an initial - realization for a NFVI fault management and maintenance solution in - open source software. - -:History: - - ========== ===================================================== - Date Description - ========== ===================================================== - 02.12.2014 Project creation - 14.04.2015 Initial version of the deliverable uploaded to Gerrit - 18.05.2015 Stable version of the Doctor deliverable - 25.02.2016 Updated version for the Brahmaputra release - 26.09.2016 Updated version for the Colorado release - xx.xx.2017 Updated version for the Danube release - ========== ===================================================== - -.. raw:: latex - - \newpage - -.. include:: - glossary.rst - -.. 
toctree:: - :maxdepth: 4 - :numbered: - - 01-intro.rst - 02-use_cases.rst - 03-architecture.rst - 04-gaps.rst - 05-implementation.rst - 06-summary.rst - 07-annex.rst - -.. include:: - 99-references.rst diff --git a/docs/scenarios/functest/doctor-scenario-in-functest.rst b/docs/scenarios/functest/doctor-scenario-in-functest.rst deleted file mode 100644 index b3d73d5c..00000000 --- a/docs/scenarios/functest/doctor-scenario-in-functest.rst +++ /dev/null @@ -1,126 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - - - -Platform overview -""""""""""""""""" - -Doctor platform provides these features in `Danube Release `_: - -* Immediate Notification -* Consistent resource state awareness for compute host down -* Valid compute host status given to VM owner - -These features enable high availability of Network Services on top of -the virtualized infrastructure. Immediate notification allows VNF managers -(VNFM) to process recovery actions promptly once a failure has occurred. - -Consistency of resource state is necessary to execute recovery actions -properly in the VIM. - -Ability to query host status gives VM owner the possibility to get -consistent state information through an API in case of a compute host -fault. - -The Doctor platform consists of the following components: - -* OpenStack Compute (Nova) -* OpenStack Telemetry (Ceilometer) -* OpenStack Alarming (Aodh) -* Doctor Inspector -* Doctor Monitor - -.. note:: - Doctor Inspector and Monitor are sample implementations for reference. - -You can see an overview of the Doctor platform and how components interact in -:numref:`figure-p1`. - -.. 
figure:: ./images/figure-p1.png - :name: figure-p1 - :width: 100% - - Doctor platform and typical sequence - -Detailed information on the Doctor architecture can be found in the Doctor -requirements documentation: -http://artifacts.opnfv.org/doctor/docs/requirements/05-implementation.html - -Use case -"""""""" - -* A consumer of the NFVI wants to receive immediate notifications about faults - in the NFVI affecting the proper functioning of the virtual resources. - Therefore, such faults have to be detected as quickly as possible, and, when - a critical error is observed, the affected consumer is immediately informed - about the fault and can switch over to the STBY configuration. - -The faults to be monitored (and at which detection rate) will be configured by -the consumer. Once a fault is detected, the Inspector in the Doctor -architecture will check the resource map maintained by the Controller, to find -out which virtual resources are affected and then update the resources state. -The Notifier will receive the failure event requests sent from the Controller, -and notify the consumer(s) of the affected resources according to the alarm -configuration. 
- -Detailed workflow information is as follows: - -* Consumer(VNFM): (step 0) creates resources (network, server/instance) and an - event alarm on state down notification of that server/instance - -* Monitor: (step 1) periodically checks nodes, such as ping from/to each - dplane nic to/from gw of node, (step 2) once it fails to send out event - with "raw" fault event information to Inspector - -* Inspector: when it receives an event, it will (step 3) mark the host down - ("mark-host-down"), (step 4) map the PM to VM, and change the VM status to - down - -* Controller: (step 5) sends out instance update event to Ceilometer - -* Notifier: (step 6) Ceilometer transforms and passes the event to Aodh, - (step 7) Aodh will evaluate event with the registered alarm definitions, - then (step 8) it will fire the alarm to the "consumer" who owns the - instance - -* Consumer(VNFM): (step 9) receives the event and (step 10) recreates a new - instance - -Test case -""""""""" - -Functest will call the "run.sh" script in Doctor to run the test job. - -Currently, only 'Apex' and 'local' installer are supported. The test also -can run successfully in 'fuel' installer with the modification of some -configurations of OpenStack in the script. But still need 'fuel' installer -to support these configurations. - -The "run.sh" script will execute the following steps. - -Firstly, get the installer ip according to the installer type. Then ssh to -the installer node to get the private key for accessing to the cloud. As -'fuel' installer, ssh to the controller node to modify nova and ceilometer -configurations. - -Secondly, prepare image for booting VM, then create a test project and test -user (both default to doctor) for the Doctor tests. - -Thirdly, boot a VM under the doctor project and check the VM status to verify -that the VM is launched completely. Then get the compute host info where the VM -is launched to verify connectivity to the target compute host. 
Get the consumer -ip according to the route to compute ip and create an alarm event in Ceilometer -using the consumer ip. - -Fourthly, the Doctor components are started, and, based on the above preparation, -a failure is injected to the system, i.e. the network of compute host is -disabled for 3 minutes. To ensure the host is down, the status of the host -will be checked. - -Finally, the notification time, i.e. the time between the execution of step 2 -(Monitor detects failure) and step 9 (Consumer receives failure notification) -is calculated. - -According to the Doctor requirements, the Doctor test is successful if the -notification time is below 1 second. diff --git a/docs/scenarios/functest/images/LICENSE b/docs/scenarios/functest/images/LICENSE deleted file mode 100644 index f2b47d20..00000000 --- a/docs/scenarios/functest/images/LICENSE +++ /dev/null @@ -1,14 +0,0 @@ -Copyright 2015 Open Platform for NFV Project, Inc. and its contributors - -Open Platform for NFV Project Documentation License -=================================================== -Any documentation developed by the "Open Platform for NFV Project" -is licensed under a Creative Commons Attribution 4.0 International License. -You should have received a copy of the license along with this. If not, -see . - -Unless required by applicable law or agreed to in writing, documentation -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
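As a rough illustration of the pass criterion in the functest description above (the notification time between step 2 and step 9 must be below 1 second), a minimal Python sketch follows. The function names, the timestamp format, and the threshold constant name are assumptions made for illustration; they are not taken from `run.sh`.

```python
from datetime import datetime

# Threshold from the text: the test passes if the notification time
# is below 1 second.
MAX_NOTIFICATION_SECONDS = 1.0

# Hypothetical timestamp format; the real logs may use a different one.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def notification_time(detected_at: str, notified_at: str) -> float:
    """Seconds between fault detection (step 2) and consumer
    notification (step 9)."""
    start = datetime.strptime(detected_at, TIMESTAMP_FORMAT)
    end = datetime.strptime(notified_at, TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


def doctor_test_passed(detected_at: str, notified_at: str) -> bool:
    """Apply the 1-second criterion; out-of-order timestamps fail."""
    elapsed = notification_time(detected_at, notified_at)
    return 0.0 <= elapsed < MAX_NOTIFICATION_SECONDS
```

Rejecting negative intervals is a design choice of this sketch: if the consumer timestamp precedes the monitor timestamp, the clocks or logs are suspect and the run should not count as a pass.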
diff --git a/docs/scenarios/functest/images/figure-p1.png b/docs/scenarios/functest/images/figure-p1.png deleted file mode 100755 index e963d8bd..00000000 Binary files a/docs/scenarios/functest/images/figure-p1.png and /dev/null differ diff --git a/docs/scenarios/index.rst b/docs/scenarios/index.rst deleted file mode 100644 index 3732e027..00000000 --- a/docs/scenarios/index.rst +++ /dev/null @@ -1,12 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -=============== -Doctor scenario -=============== - -.. toctree:: - :maxdepth: 2 - :numbered: - - ./functest/doctor-scenario-in-functest.rst diff --git a/docs/userguide/feature.userguide.rst b/docs/userguide/feature.userguide.rst deleted file mode 100644 index 4ae521bd..00000000 --- a/docs/userguide/feature.userguide.rst +++ /dev/null @@ -1,44 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -Doctor capabilities and usage -============================= -Immediate Notification ----------------------- - -Immediate notification can be used by creating 'event' type alarm via -OpenStack Alarming (Aodh) API with relevant internal components support. - -See, upstream spec document: -http://specs.openstack.org/openstack/ceilometer-specs/specs/liberty/event-alarm-evaluator.html - -An example of a consumer of this notification can be found in the Doctor -repository. It can be executed as follows: - -.. code-block:: bash - - git clone https://gerrit.opnfv.org/gerrit/doctor -b stable/danube - cd doctor/tests - CONSUMER_PORT=12346 - python consumer.py "$CONSUMER_PORT" > consumer.log 2>&1 & - -Consistent resource state awareness ------------------------------------ - -Resource state of compute host can be changed/updated according to a trigger -from a monitor running outside of OpenStack Compute (Nova) by using -force-down API. 
- -See -http://artifacts.opnfv.org/doctor/danube/manuals/mark-host-down_manual.html -for more detail. - -Valid compute host status given to VM owner -------------------------------------------- - -The resource state of a compute host can be retrieved by a user with the -OpenStack Compute (Nova) servers API. - -See -http://artifacts.opnfv.org/doctor/danube/manuals/get-valid-server-state.html -for more detail. diff --git a/docs/userguide/index.rst b/docs/userguide/index.rst deleted file mode 100644 index c6830fd1..00000000 --- a/docs/userguide/index.rst +++ /dev/null @@ -1,12 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -***************** -Doctor User Guide -***************** - -.. toctree:: - :maxdepth: 2 - :numbered: - - feature.userguide.rst -- cgit 1.2.3-korg
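For the force-down trigger mentioned in the user guide text above, the following Python sketch shows what such a request to OpenStack Compute (Nova) could look like. It only builds the request and sends nothing. The helper name and the host name in the usage note are hypothetical. The path, body fields, and microversion header follow the Nova `os-services` force-down call as introduced in microversion 2.11; later microversions may differ, so treat this as a sketch under that assumption.

```python
import json

# Nova force-down endpoint and the microversion that introduced it.
FORCE_DOWN_PATH = "/os-services/force-down"
MIN_MICROVERSION = "2.11"


def build_force_down_request(host: str, forced_down: bool = True) -> dict:
    """Describe the HTTP request an external monitor could send to mark
    a compute host's nova-compute service as forced down (hypothetical
    helper, not part of the Doctor repository)."""
    body = {
        "host": host,
        "binary": "nova-compute",
        "forced_down": forced_down,
    }
    return {
        "method": "PUT",
        "path": FORCE_DOWN_PATH,
        "headers": {
            "Content-Type": "application/json",
            "X-OpenStack-Nova-API-Version": MIN_MICROVERSION,
        },
        "body": json.dumps(body),
    }
```

Calling `build_force_down_request("compute-1")` yields a dictionary that an HTTP client of your choice could send to the Compute endpoint with a valid token; passing `forced_down=False` describes the request that clears the flag once the host has recovered.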