Diffstat (limited to 'docs/development/design')
-rw-r--r--  docs/development/design/index.rst                                                 |  27
-rw-r--r--  docs/development/design/inspector-design-guideline.rst                            |  46
-rw-r--r--  docs/development/design/notification-alarm-evaluator.rst                          | 248
-rw-r--r--  docs/development/design/performance-profiler.rst                                  | 118
-rw-r--r--  docs/development/design/port-data-plane-status.rst                                | 180
-rw-r--r--  docs/development/design/report-host-fault-to-update-server-state-immediately.rst | 248
-rw-r--r--  docs/development/design/rfe-port-status-update.rst                                |  32
7 files changed, 899 insertions, 0 deletions
diff --git a/docs/development/design/index.rst b/docs/development/design/index.rst
new file mode 100644
index 00000000..963002a0
--- /dev/null
+++ b/docs/development/design/index.rst
@@ -0,0 +1,27 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+****************
+Design Documents
+****************
+
+This directory stores design documents, which may include draft
+versions of blueprints written before proposing them to upstream OSS communities
+such as OpenStack, in order to keep the original blueprint as reviewed in
+OPNFV. As a result, some blueprints here may be outdated by further
+refinements in the upstream OSS community. Please refer to the link in each
+document to find the latest version of the blueprint and the status of development
+in the relevant OSS community.
+
+See also https://wiki.opnfv.org/requirements_projects .
+
+.. toctree::
+   :numbered:
+   :maxdepth: 4
+
+   report-host-fault-to-update-server-state-immediately.rst
+   notification-alarm-evaluator.rst
+   rfe-port-status-update.rst
+   port-data-plane-status.rst
+   inspector-design-guideline.rst
+   performance-profiler.rst
diff --git a/docs/development/design/inspector-design-guideline.rst b/docs/development/design/inspector-design-guideline.rst
new file mode 100644
index 00000000..4add8c0f
--- /dev/null
+++ b/docs/development/design/inspector-design-guideline.rst
@@ -0,0 +1,46 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+==========================
+Inspector Design Guideline
+==========================
+
+.. NOTE::
+   This is a draft spec of the design guideline for the inspector component.
+   JIRA ticket to track the update and collect comments: `DOCTOR-73`_.
+
+This document summarizes the best practices in designing a high-performance
+inspector to meet the requirements in `OPNFV Doctor project`_.
+
+Problem Description
+===================
+
+Some pitfalls have been detected during the development of the sample inspector, e.g.
+we suffered a significant `performance degrading in listing VMs in a host`_.
+
+A `patch set for caching the list`_ has been committed to solve the issue. When a
+new inspector is integrated, it would be nice to have an evaluation of the existing
+design and give recommendations for improvements.
+
+This document can be treated as a source of related blueprints in inspector
+projects.
+
+Guidelines
+==========
+
+Host specific VMs list
+----------------------
+
+TBD, see `DOCTOR-76`_.
+
+Parallel execution
+------------------
+
+TBD, see `discussion in mailing list`_.
+
+.. _DOCTOR-73: https://jira.opnfv.org/browse/DOCTOR-73
+.. _OPNFV Doctor project: https://wiki.opnfv.org/doctor
+.. _performance degrading in listing VMs in a host: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-September/012591.html
+.. _patch set for caching the list: https://gerrit.opnfv.org/gerrit/#/c/20877/
+.. _DOCTOR-76: https://jira.opnfv.org/browse/DOCTOR-76
+.. _discussion in mailing list: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2016-October/013036.html
diff --git a/docs/development/design/notification-alarm-evaluator.rst b/docs/development/design/notification-alarm-evaluator.rst
new file mode 100644
index 00000000..d1bf787a
--- /dev/null
+++ b/docs/development/design/notification-alarm-evaluator.rst
@@ -0,0 +1,248 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+============================
+Notification Alarm Evaluator
+============================
+
+.. NOTE::
+   This is a draft spec of a blueprint for OpenStack Ceilometer Liberty.
+   To see current version: https://review.openstack.org/172893
+   To track development activity:
+   https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator
+
+https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator
+
+This blueprint proposes adding a new alarm evaluator for handling alarms on
+events passed from other OpenStack services. It provides event-driven alarm
+evaluation, introducing a new sequence in Ceilometer in place of the polling-based
+approach of the existing Alarm Evaluator, and delivers immediate alarm
+notification to end users.
+
+Problem description
+===================
+
+As an end user, I need to receive alarm notification immediately once
+Ceilometer captures an event that would fire an alarm, so that I can
+perform recovery actions promptly to shorten the downtime of my service.
+The typical use case is that an end user sets an alarm on "compute.instance.update"
+in order to trigger recovery actions once the instance status has changed to
+'shutdown' or 'error'. Ideally, an end user would receive the
+notification within 1 second after the fault is observed, as other
+health-check mechanisms can do in some cases.
+
+The existing Alarm Evaluator periodically queries/polls the databases
+in order to check all alarms independently from other processes. This is a good
+approach for evaluating an alarm on samples stored over a certain period.
+However, it is not efficient for evaluating an alarm on events which are emitted
+by other OpenStack servers once in a while.
+
+The periodic evaluation delays the sending of alarm notifications to users.
+The default period of the evaluation cycle is 60 seconds. It is recommended that
+an operator set an interval longer than the configured pipeline interval for
+the underlying metrics, and also long enough to evaluate all defined alarms
+within that period, taking into account the number of resources, users and
+alarms.
+
+Proposed change
+===============
+
+The proposal is to add a new event-driven alarm evaluator which receives
+messages from the Notification Agent, finds related alarms, and then evaluates
+each alarm:
+
+* The new alarm evaluator could receive event notifications from the Notification
+  Agent by adding a dedicated notifier as a publisher in pipeline.yaml
+  (e.g. notifier://?topic=event_eval).
+
+* When the new alarm evaluator receives an event notification, it queries the alarm
+  database by the Project ID and Resource ID written in the event notification.
+
+* The alarms found are evaluated against the event notification.
+
+* Depending on the result of evaluation, those alarms would be fired through
+  the Alarm Notifier, in the same way as the existing Alarm Evaluator does.
+
+This proposal also adds a new alarm type "notification" and "notification_rule".
+This enables users to create alarms on events. The separation from other alarm
+types (such as the "threshold" type) is intended to show the different timing of
+evaluation and the different format of condition, since the new evaluator will
+check each event notification once it is received, whereas a "threshold" alarm can
+evaluate the average of values over a certain period calculated from multiple samples.
+
+The new alarm evaluator handles notification-type alarms, so we have to change
+the existing alarm evaluator to exclude "notification" type alarms from its
+evaluation targets.
+
+Alternatives
+------------
+
+There was a similar blueprint proposal, "Alarm type based on notification", but
+the approach is different. The old proposal was to add a new step (alarm
+evaluation) in the Notification Agent every time it received an event from other
+OpenStack services, whereas this proposal intends to execute alarm evaluation
+in another component, which minimizes the impact on existing pipeline processing.
+
+Another approach is to enhance the existing alarm evaluator by adding a
+notification listener.
+However, there are two issues: 1) this approach could
+stall periodic evaluations when it receives a burst of notifications,
+and 2) it could break the alarm partitioning, i.e. when an alarm evaluator receives a
+notification, it might have to evaluate some alarms which are not assigned to it.
+
+Data model impact
+-----------------
+
+Resource ID will be added to the Alarm model as an optional attribute.
+This would help the new alarm evaluator filter out unrelated alarms
+while querying alarms; otherwise it has to evaluate all alarms in the project.
+
+REST API impact
+---------------
+
+Alarm API will be extended as follows:
+
+* Add "notification" type into alarm type list
+* Add "resource_id" to "alarm"
+* Add "notification_rule" to "alarm"
+
+Sample data of Notification-type alarm::
+
+    {
+        "alarm_actions": [
+            "http://site:8000/alarm"
+        ],
+        "alarm_id": null,
+        "description": "An alarm",
+        "enabled": true,
+        "insufficient_data_actions": [
+            "http://site:8000/nodata"
+        ],
+        "name": "InstanceStatusAlarm",
+        "notification_rule": {
+            "event_type": "compute.instance.update",
+            "query" : [
+                {
+                    "field" : "traits.state",
+                    "type" : "string",
+                    "value" : "error",
+                    "op" : "eq"
+                }
+            ]
+        },
+        "ok_actions": [],
+        "project_id": "c96c887c216949acbdfbd8b494863567",
+        "repeat_actions": false,
+        "resource_id": "153462d0-a9b8-4b5b-8175-9e4b05e9b856",
+        "severity": "moderate",
+        "state": "ok",
+        "state_timestamp": "2015-04-03T17:49:38.406845",
+        "timestamp": "2015-04-03T17:49:38.406839",
+        "type": "notification",
+        "user_id": "c96c887c216949acbdfbd8b494863567"
+    }
+
+"resource_id" will be referred to when querying alarms; its permission and
+project ownership will not be checked.
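To make the evaluation step concrete, here is a minimal sketch of how a "notification_rule" like the one in the sample could be matched against an incoming event. It is not the Ceilometer implementation: the event layout (a dict with ``event_type``, ``project_id``, ``resource_id`` and nested ``traits``), the operator table beyond ``eq``, and the function names ``lookup``, ``rule_matches`` and ``alarms_to_fire`` are all assumptions made for illustration.

```python
import operator

# Comparison operators a rule query may name. Only "eq" appears in the
# sample alarm; the others are assumed for illustration.
OPS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
}


def lookup(event, dotted_field):
    """Resolve a dotted field such as 'traits.state' in a nested dict."""
    value = event
    for key in dotted_field.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def rule_matches(rule, event):
    """Return True if the event satisfies every condition of the rule."""
    if rule["event_type"] != event.get("event_type"):
        return False
    for cond in rule.get("query", []):
        compare = OPS.get(cond["op"])
        actual = lookup(event, cond["field"])
        if compare is None or actual is None:
            return False
        if not compare(actual, cond["value"]):
            return False
    return True


def alarms_to_fire(alarms, event):
    """Select enabled notification alarms that the event should fire.

    Alarms are narrowed by project and, when set, by the optional
    resource_id before the rule itself is evaluated.
    """
    return [
        alarm for alarm in alarms
        if alarm["type"] == "notification"
        and alarm["enabled"]
        and alarm["project_id"] == event.get("project_id")
        and alarm.get("resource_id") in (None, event.get("resource_id"))
        and rule_matches(alarm["notification_rule"], event)
    ]
```

In this sketch an alarm without "resource_id" is checked against every event in its project, which is the extra evaluation load the optional attribute is meant to avoid.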
+
+Security impact
+---------------
+
+None
+
+Pipeline impact
+---------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance/Scalability Impacts
+-------------------------------
+
+When Ceilometer receives a large number of events from other OpenStack services in
+a short period, this alarm evaluator can keep working since events are queued in
+a messaging queue system, but this can delay alarm notification to users
+and increase the number of read and write accesses to the alarm database.
+
+"resource_id" can be optional, but making it mandatory could reduce the
+performance impact. If a user creates a "notification" alarm without "resource_id",
+that alarm will be evaluated every time an event occurs in the project.
+That may put a heavy load on the new evaluator.
+
+Other deployer impact
+---------------------
+
+A new service process has to be run.
+
+Developer impact
+----------------
+
+Developers should be aware that events could be notified to end users, and should
+avoid passing raw infra information to end users when defining events and traits.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  r-mibu
+
+Other contributors:
+  None
+
+Ongoing maintainer:
+  None
+
+Work Items
+----------
+
+* New event-driven alarm evaluator
+
+* Add new alarm type "notification" as well as AlarmNotificationRule
+
+* Add "resource_id" to Alarm model
+
+* Modify existing alarm evaluator to filter out "notification" alarms
+
+* Add a new config parameter for the alarm request check that controls whether
+  alarms without "resource_id" are accepted
+
+Future lifecycle
+================
+
+This proposal is a key feature for providing information about cloud resources to
+end users in real time, which enables efficient integration with a user-side manager
+or Orchestrator, whereas currently that information is considered to be
+consumed by an admin-side tool or service.
+Based on this change, we will seek orchestrating scenarios including fault
+recovery and add useful event definitions as well as additional traits.
+
+Dependencies
+============
+
+None
+
+Testing
+=======
+
+New unit/scenario tests are required for this change.
+
+Documentation Impact
+====================
+
+* The proposed evaluator will be described in the developer document.
+
+* The new alarm type and how to use it will be explained in the user guide.
+
+References
+==========
+
+* OPNFV Doctor project: https://wiki.opnfv.org/doctor
+
+* Blueprint "Alarm type based on notification":
+  https://blueprints.launchpad.net/ceilometer/+spec/alarm-on-notification
diff --git a/docs/development/design/performance-profiler.rst b/docs/development/design/performance-profiler.rst
new file mode 100644
index 00000000..f834a915
--- /dev/null
+++ b/docs/development/design/performance-profiler.rst
@@ -0,0 +1,118 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+
+====================
+Performance Profiler
+====================
+
+https://goo.gl/98Osig
+
+This blueprint proposes to create a performance profiler for doctor scenarios.
+
+Problem Description
+===================
+
+In the verification job for notification time, we have encountered some
+performance issues, such as:
+
+1. In an environment deployed by APEX, it meets the criteria, while in one
+   deployed by Fuel, the performance is much poorer.
+2. Significant performance degradation was spotted when we increased the total
+   number of VMs.
+
+It takes time to dig through the logs and analyse the reason. People have to collect
+the timestamp at each checkpoint manually to find the bottleneck. A performance
+profiler will make this process automatic.
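As a rough illustration of the bookkeeping that is done by hand today, the sketch below turns an ordered list of checkpoint timestamps into a per-component time cost, sorted from most to least expensive. The tuple layout, the component names and the function name ``time_cost_by_component`` are assumptions for this example only; the blueprint does not define a data format.

```python
from collections import defaultdict


def time_cost_by_component(checkpoints):
    """Aggregate elapsed time per component.

    `checkpoints` is a list of (label, component, timestamp_ms) tuples in
    chronological order. `component` names whoever did the work that ended
    at that checkpoint; it is ignored for the first entry, which only marks
    the start of the measurement.
    """
    costs = defaultdict(float)
    for (_, _, started), (_, component, finished) in zip(checkpoints,
                                                         checkpoints[1:]):
        costs[component] += finished - started

    total = sum(costs.values())
    rows = [
        (component, cost, 100.0 * cost / total if total else 0.0)
        for component, cost in costs.items()
    ]
    # Most expensive component first, so the bottleneck is at the top.
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Demonstration values only, not actual measurements.
    samples = [
        ("hostdown", None, 0),
        ("raw failure", "monitor", 50),
        ("found affected", "inspector", 90),
        ("set VM error", "inspector", 170),
        ("notified VM error", "nova", 220),
    ]
    for component, cost, percent in time_cost_by_component(samples):
        print(f"{component:<10} {cost:>6.0f}ms {percent:5.1f}%")
```

Summing consecutive intervals per component keeps the calculation independent of how many checkpoints each component exposes, so checkpoints can be added gradually.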
+
+Proposed Change
+===============
+
+The current Doctor scenario covers the inspector and notifier in the whole fault
+management cycle::
+
+      start                                          end
+        +       +         +        +       +          +
+        |       |         |        |       |          |
+        |monitor|inspector|notifier|manager|controller|
+        +------>+         |        |       |          |
+    occurred    +-------->+        |       |          |
+        |   detected      +------->+       |          |
+        |       |    identified    +-------+          |
+        |       |              notified    +--------->+
+        |       |                  |   processed   resolved
+        |       |                  |                  |
+        |       +<-----doctor----->+                  |
+        |                                             |
+        |                                             |
+        +<---------------fault management------------>+
+
+The notification time can be split into several parts and visualized as a
+timeline::
+
+      start                                         end
+        0----5---10---15---20---25---30---35---40---45--> (x 10ms)
+        +    +   +   +   +    +      +   +   +   +   +
+      0-hostdown |   |   |    |      |   |   |   |   |
+        +--->+   |   |   |    |      |   |   |   |   |
+        |  1-raw failure |    |      |   |   |   |   |
+        |    +-->+   |   |    |      |   |   |   |   |
+        |    | 2-found affected      |   |   |   |   |
+        |    |   +-->+   |    |      |   |   |   |   |
+        |    |     3-marked host down|   |   |   |   |
+        |    |       +-->+    |      |   |   |   |   |
+        |    |         4-set VM error|   |   |   |   |
+        |    |           +--->+      |   |   |   |   |
+        |    |           |  5-notified VM error  |   |
+        |    |           |    +----->|   |   |   |   |
+        |    |           |    |    6-transformed event
+        |    |           |    |      +-->+   |   |   |
+        |    |           |    |      | 7-evaluated event
+        |    |           |    |      |   +-->+   |   |
+        |    |           |    |      |     8-fired alarm
+        |    |           |    |      |       +-->+   |
+        |    |           |    |      |     9-received alarm
+        |    |           |    |      |           +-->+
+      sample |  sample   |    |      |               |10-handled alarm
+      monitor| inspector |nova| c/m  |     aodh      |
+        |                                        |
+        +<-----------------doctor--------------->+
+
+Note: c/m = ceilometer
+
+And a table of components sorted by time cost from most to least:
+
++----------+---------+----------+
+|Component |Time Cost|Percentage|
++==========+=========+==========+
+|inspector |160ms    | 40%      |
++----------+---------+----------+
+|aodh      |110ms    | 30%      |
++----------+---------+----------+
+|monitor   |50ms     | 14%      |
++----------+---------+----------+
+|...       |         |          |
++----------+---------+----------+
+|...       |         |          |
++----------+---------+----------+
+
+Note: data in the table is for demonstration only, not actual measurement
+
+Timestamps can be collected from various sources:
+
+1. log files
+2. trace points in code
+
+The performance profiler will be integrated into the verification job to provide
+detailed results of the test. It can also be deployed independently to diagnose
+performance issues in a specified environment.
+
+Working Items
+=============
+
+1. PoC with limited checkpoints
+2. Integration with verification job
+3. Collect timestamp at all checkpoints
+4. Display the profiling result in console
+5. Report the profiling result to test database
+6. Independent package which can be installed to specified environment
diff --git a/docs/development/design/port-data-plane-status.rst b/docs/development/design/port-data-plane-status.rst
new file mode 100644
index 00000000..06cfc3c6
--- /dev/null
+++ b/docs/development/design/port-data-plane-status.rst
@@ -0,0 +1,180 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+====================================
+Port data plane status
+====================================
+
+https://bugs.launchpad.net/neutron/+bug/1598081
+
+Neutron does not detect data plane failures affecting its logical resources.
+This spec addresses that issue by allowing external tools to report to
+Neutron about faults in the data plane that are affecting the ports. A new REST
+API field is proposed to that end.
+
+
+Problem Description
+===================
+
+An initial description of the problem was introduced in bug #1598081 [1_]. This
+spec focuses on capturing one (main) part of the problem described there, i.e.
+extending Neutron's REST API to cover the scenario of allowing external tools
+to report network failures to Neutron. Out of scope of this spec is the work to
+enable port status changes to be received and managed by mechanism drivers.
+
+This spec also tries to address bug #1575146 [2_].
+Specifically, as argued by
+the Neutron driver team in [3_]:
+
+ * Neutron should not shut down the port completely upon detection of physnet
+   failure; connectivity between instances on the same node may still be
+   reachable. External tools may or may not want to trigger a status change on
+   the port based on their own logic and orchestration.
+
+ * Port down is not detected when an uplink of a switch is down;
+
+ * The physnet bridge may have multiple physical interfaces plugged; shutting
+   down the logical port may not be needed in case network redundancy is in
+   place.
+
+
+Proposed Change
+===============
+
+A couple of possible approaches were proposed in [1_] (comment #3). This spec
+proposes tackling the problem via a new extension API to the port resource.
+The extension adds a new attribute 'dp-down' (data plane down) to represent the
+status of the data plane. The field should be read-only by tenants and
+read-write by admins.
+
+Neutron should send out an event to the message bus upon toggling the data
+plane status value. The event is relevant for e.g. auditing.
+
+
+Data Model Impact
+-----------------
+
+A new attribute as extension will be added to the 'ports' table.
+
++------------+-------+----------+---------+--------------------+--------------+
+|Attribute   |Type   |Access    |Default  |Validation/         |Description   |
+|Name        |       |          |Value    |Conversion          |              |
++============+=======+==========+=========+====================+==============+
+|dp_down     |boolean|RO, tenant|False    |True/False          |              |
+|            |       |RW, admin |         |                    |              |
++------------+-------+----------+---------+--------------------+--------------+
+
+
+REST API Impact
+---------------
+
+A new API extension to the ports resource is going to be introduced.
+
+.. code-block:: python
+
+    EXTENDED_ATTRIBUTES_2_0 = {
+        'ports': {
+            'dp_down': {'allow_post': False, 'allow_put': True,
+                        'default': False, 'convert_to': convert_to_boolean,
+                        'is_visible': True},
+        },
+    }
+
+
+Examples
+~~~~~~~~
+
+Updating port data plane status to down:
+
+.. code-block:: json
+
+    PUT /v2.0/ports/<port-uuid>
+    Accept: application/json
+    {
+        "port": {
+            "dp_down": true
+        }
+    }
+
+
+
+Command Line Client Impact
+--------------------------
+
+::
+
+    neutron port-update [--dp-down <True/False>] <port>
+    openstack port set [--dp-down <True/False>] <port>
+
+Argument --dp-down is optional. Defaults to False.
+
+
+Security Impact
+---------------
+
+None
+
+Notifications Impact
+--------------------
+
+A notification (event) upon toggling the data plane status (i.e. 'dp-down'
+attribute) value should be sent to the message bus. Such events do not happen
+with high frequency and thus no negative impact on the notification bus is
+expected.
+
+Performance Impact
+------------------
+
+None
+
+IPv6 Impact
+-----------
+
+None
+
+Other Deployer Impact
+---------------------
+
+None
+
+Developer Impact
+----------------
+
+None
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+ * cgoncalves
+
+Work Items
+----------
+
+ * New 'dp-down' attribute in 'ports' database table
+ * API extension to introduce new field to port
+ * Client changes to allow for data plane status (i.e. 'dp-down' attribute)
+   being set
+ * Policy (tenants read-only; admins read-write)
+
+
+Documentation Impact
+====================
+
+Documentation for both administrators and end users will have to be
+contemplated. Administrators will need to know how to set/unset the data plane
+status field.
+
+
+References
+==========
+
+.. [1] RFE: Port status update,
+   https://bugs.launchpad.net/neutron/+bug/1598081
+
+.. [2] RFE: ovs port status should the same as physnet
+   https://bugs.launchpad.net/neutron/+bug/1575146
+
+.. [3] Neutron Drivers meeting, July 21, 2016
+   http://eavesdrop.openstack.org/meetings/neutron_drivers/2016/neutron_drivers.2016-07-21-22.00.html
diff --git a/docs/development/design/report-host-fault-to-update-server-state-immediately.rst b/docs/development/design/report-host-fault-to-update-server-state-immediately.rst
new file mode 100644
index 00000000..2f6ce145
--- /dev/null
+++ b/docs/development/design/report-host-fault-to-update-server-state-immediately.rst
@@ -0,0 +1,248 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+.. NOTE::
+   This is a specification draft of a blueprint proposed for OpenStack Nova
+   Liberty. It was written by project member(s) and agreed within the project
+   before submitting it upstream. No further changes to its content will be
+   made here anymore; please follow it upstream:
+
+   * Current version upstream: https://review.openstack.org/#/c/169836/
+   * Development activity:
+     https://blueprints.launchpad.net/nova/+spec/mark-host-down
+
+   **Original draft is as follows:**
+
+====================================================
+Report host fault to update server state immediately
+====================================================
+
+https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately
+
+A new API is needed to report a host fault to change the state of the
+instances and compute node immediately. This allows usage of the evacuate API
+without a delay. The new API makes it possible for an external monitoring
+system to detect any kind of host failure fast and reliably and inform
+OpenStack about it. Nova updates the compute node state and the states of the
+instances. This way the states in the Nova DB will be in sync with the
+real state of the system.
+
+Problem description
+===================
+* Nova state change for a failed or unreachable host is slow and does not
+  reliably indicate whether the compute node is down.
+  This might cause the same instance
+  to run twice if action is taken to evacuate the instance to another host.
+* Nova state for instances on a failed compute node will not change,
+  but remains active and running. This gives the user false information about
+  instance state. Currently one would need to call "nova reset-state" for each
+  instance to put them in the error state.
+* An OpenStack user cannot take HA actions fast and reliably by trusting instance
+  state and compute node state.
+* As compute node state changes slowly, one cannot evacuate instances.
+
+Use Cases
+---------
+The general use case is that when there is a host fault, one should change
+compute node state fast and reliably when using the DB servicegroup backend.
+On top of this, here are the use cases that are not currently covered to have
+instance states changed correctly:
+* Management network connectivity lost between controller and compute node.
+* Host HW failed.
+
+Generic use case flow:
+
+* The external monitoring system detects a host fault.
+* The external monitoring system fences the host if not down already.
+* The external system calls the new Nova API to force the failed compute node
+  into down state as well as instances running on it.
+* Nova updates the compute node state and state of the affected instances to
+  Nova DB.
+
+Currently the nova-compute state will change to "down", but it takes a long
+time. Server state stays as "vm_state: active" and "power_state:
+running", which is not correct. By having an external tool detect host faults
+fast, fence the host by powering it down and then report host down to OpenStack, all
+these states would reflect the actual situation. Also, if OpenStack will not
+implement automatic actions for fault correlation, an external tool can do that.
+This could be configured, for example, in server instance METADATA easily and be
+read by the external tool.
+
+Project Priority
+-----------------
+Liberty priorities have not yet been defined.
+
+Proposed change
+===============
+A new API is needed for an admin to state that a host is down. This API is used
+to mark the compute node and the instances running on it as down, reflecting the
+real situation.
+
+An example for a compute node:
+
+* When compute node is up and running:
+  vm_state: active and power_state: running
+  nova-compute state: up status: enabled
+* When compute node goes down and new API is called to state host is down:
+  vm_state: stopped power_state: shutdown
+  nova-compute state: down status: enabled
+
+vm_state values: soft-delete, deleted, resized and error
+should not be touched.
+Whether task_state needs to be touched is still to be worked out.
+
+Alternatives
+------------
+There is no attractive alternative for detecting all the different host faults other
+than having an external tool do so. For this kind of tool
+to exist there needs to be a new API in Nova to report a fault. Currently some
+kind of workaround must have been implemented, as the states from OpenStack cannot
+be trusted or obtained fast enough.
+
+Data model impact
+-----------------
+None
+
+REST API impact
+---------------
+* Update CLI to report host is down
+
+  nova host-update command
+
+  usage: nova host-update [--status <enable|disable>]
+                          [--maintenance <enable|disable>]
+                          [--report-host-down]
+                          <hostname>
+
+  Update host settings.
+
+  Positional arguments
+
+  <hostname>
+  Name of host.
+
+  Optional arguments
+
+  --status <enable|disable>
+  Either enable or disable a host.
+
+  --maintenance <enable|disable>
+  Either put or resume host to/from maintenance.
+
+  --down
+  Report host down to update instance and compute node state in db.
+
+* Update Compute API to report host is down:
+
+  /v2.1/{tenant_id}/os-hosts/{host_name}
+
+  Normal response codes: 200
+  Request parameters
+
+  Parameter     Style   Type          Description
+  host_name     URI     xsd:string    The name of the host of interest to you.
+
+  {
+      "host": {
+          "status": "enable",
+          "maintenance_mode": "enable",
+          "host_down_reported": "true"
+
+      }
+
+  }
+
+  {
+      "host": {
+          "host": "65c5d5b7e3bd44308e67fc50f362aee6",
+          "maintenance_mode": "enabled",
+          "status": "enabled",
+          "host_down_reported": "true"
+
+      }
+
+  }
+
+* New method in the nova.compute.api module HostAPI class to
+  mark host-related instances and the compute node down:
+  set_host_down(context, host_name)
+
+* class novaclient.v2.hosts.HostManager(api) method update(host, values)
+  Needs to handle reporting host down.
+
+* Schema does not need changes, as in the db only service and server states are to
+  be changed.
+
+Security impact
+---------------
+API call needs admin privileges (in the default policy configuration).
+
+Notifications impact
+--------------------
+None
+
+Other end user impact
+---------------------
+None
+
+Performance Impact
+------------------
+The only impact is that the user can get information about instance and
+compute node state faster. This also makes it possible to evacuate faster.
+There is no impact that would slow things down. Host down should be a rare occurrence.
+
+Other deployer impact
+---------------------
+Developer can make use of any external tool to detect host fault and report it
+to OpenStack.
+
+Developer impact
+----------------
+None
+
+Implementation
+==============
+Assignee(s)
+-----------
+Primary assignee: Tomi Juvonen
+Other contributors: Ryota Mibu
+
+Work Items
+----------
+* Test cases.
+* API changes.
+* Documentation.
+
+Dependencies
+============
+None
+
+Testing
+=======
+Test cases that exist for enabling a host or putting it into maintenance should be
+altered, or similar new cases made, to test the new functionality.
+
+Documentation Impact
+====================
+
+New API needs to be documented:
+
+* Compute API extensions documentation.
+  http://developer.openstack.org/api-ref-compute-v2.1.html
+* Nova commands documentation.
+  http://docs.openstack.org/user-guide-admin/content/novaclient_commands.html
+* Compute command-line client documentation.
+  http://docs.openstack.org/cli-reference/content/novaclient_commands.html
+* nova.compute.api documentation.
+  http://docs.openstack.org/developer/nova/api/nova.compute.api.html
+* The High Availability guide might have a page explaining that an external tool
+  could provide faster HA, as it is able to update states through the new API.
+  http://docs.openstack.org/high-availability-guide/content/index.html
+
+References
+==========
+* OPNFV Doctor project: https://wiki.opnfv.org/doctor
+* OpenStack Instance HA Proposal:
+  http://blog.russellbryant.net/2014/10/15/openstack-instance-ha-proposal/
+* The Different Facets of OpenStack HA:
+  http://blog.russellbryant.net/2015/03/10/
+  the-different-facets-of-openstack-ha/
diff --git a/docs/development/design/rfe-port-status-update.rst b/docs/development/design/rfe-port-status-update.rst
new file mode 100644
index 00000000..d87d7d7b
--- /dev/null
+++ b/docs/development/design/rfe-port-status-update.rst
@@ -0,0 +1,32 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+==========================
+Neutron Port Status Update
+==========================
+
+.. NOTE::
+   This document represents a Neutron RFE reviewed in the Doctor project before submitting upstream to Launchpad Neutron
+   space. The document is not intended to follow a blueprint format or to be an extensive document.
+   For more information, please visit http://docs.openstack.org/developer/neutron/policies/blueprints.html
+
+   The RFE was submitted to Neutron. You can follow the discussions in https://bugs.launchpad.net/neutron/+bug/1598081
+
+Neutron port status field represents the current status of a port in the cloud infrastructure. The field can take one of
+the following values: 'ACTIVE', 'DOWN', 'BUILD' and 'ERROR'.
+
+At present, if a network event occurs in the data-plane (e.g. a virtual or physical switch or one of its ports fails,
+a cable gets pulled unintentionally, infrastructure topology changes, etc.), connectivity to logical ports may be affected
+and tenants' services interrupted. When tenants/cloud administrators look up their resources' status (e.g. Nova
+instances and services running in them, network ports, etc.), they will wrongly see that everything looks fine. The problem
+is that Neutron will continue reporting port 'status' as 'ACTIVE'.
+
+Many SDN Controllers managing network elements have the ability to detect and report network events to upper layers.
+This allows SDN Controllers' users to be notified of changes and react accordingly. Such information could be consumed
+by Neutron so that Neutron could update the 'status' field of those logical ports, and additionally generate a
+notification message to the message bus.
+
+However, Neutron lacks a way to receive such information through e.g. an ML2 driver or the REST API (the 'status'
+field is read-only). There are pros and cons to both of these approaches as well as to other possible approaches. This RFE
+intends to trigger a discussion on how Neutron could be improved to receive fault/change events from SDN Controllers or
+even from 3rd parties not in charge of controlling the network (e.g. monitoring systems, human admins).