From 72a1f8c92f1692f1ea8dcb5bc706ec9939c30e0a Mon Sep 17 00:00:00 2001 From: Tomi Juvonen Date: Tue, 13 Oct 2020 16:37:57 +0300 Subject: Documents up-to-date According to document guidelines Release notes ETSI FEAT03 support and other minor enhancements JIRA: DOCTOR-143 Signed-off-by: Tomi Juvonen Change-Id: Iefa74004dfada376d1ab05c0149029a26f822275 --- .../fault_management/fault_management.rst | 90 ++++++++++++++++ .../maintenance/images/Fault-management-design.png | Bin 0 -> 237110 bytes docs/release/scenarios/maintenance/images/LICENSE | 14 +++ .../maintenance/images/Maintenance-design.png | Bin 0 -> 316640 bytes .../maintenance/images/Maintenance-workflow.png | Bin 0 -> 81286 bytes docs/release/scenarios/maintenance/maintenance.rst | 120 +++++++++++++++++++++ 6 files changed, 224 insertions(+) create mode 100644 docs/release/scenarios/fault_management/fault_management.rst create mode 100644 docs/release/scenarios/maintenance/images/Fault-management-design.png create mode 100644 docs/release/scenarios/maintenance/images/LICENSE create mode 100644 docs/release/scenarios/maintenance/images/Maintenance-design.png create mode 100644 docs/release/scenarios/maintenance/images/Maintenance-workflow.png create mode 100644 docs/release/scenarios/maintenance/maintenance.rst (limited to 'docs/release/scenarios') diff --git a/docs/release/scenarios/fault_management/fault_management.rst b/docs/release/scenarios/fault_management/fault_management.rst new file mode 100644 index 00000000..99371201 --- /dev/null +++ b/docs/release/scenarios/fault_management/fault_management.rst @@ -0,0 +1,90 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + + +Running test cases +"""""""""""""""""" + +Functest will call the "doctor_tests/main.py" in Doctor to run the test job. +Doctor testing can also be triggered by tox on OPNFV installer jumphost. 
Tox +is normally used for functional, module and coding style testing in Python +projects. + +Currently the 'MCP' and 'devstack' installers are supported. + + +Fault management use case +""""""""""""""""""""""""" + +* A consumer of the NFVI wants to receive immediate notifications about faults +  in the NFVI affecting the proper functioning of the virtual resources. +  Therefore, such faults have to be detected as quickly as possible, and, when +  a critical error is observed, the affected consumer is immediately informed +  about the fault and can switch over to the standby (STBY) configuration. + +The faults to be monitored (and at which detection rate) will be configured by +the consumer. Once a fault is detected, the Inspector in the Doctor +architecture will check the resource map maintained by the Controller to find +out which virtual resources are affected and then update the resource state. +The Notifier will receive the failure event requests sent from the Controller +and notify the consumer(s) of the affected resources according to the alarm +configuration. + +Detailed workflow information is as follows: + +* Consumer(VNFM): (step 0) creates resources (network, server/instance) and an +  event alarm on state down notification of that server/instance or Neutron +  port. + +* Monitor: (step 1) periodically checks nodes, e.g. by pinging from/to each +  data plane NIC to/from the gateway of the node, and (step 2) once a check fails, +  sends out an event with "raw" fault event information to the Inspector. + +* Inspector: when it receives an event, it will (step 3) mark the host down +  ("mark-host-down"), (step 4) map the PM to the VM, and change the VM status to +  down. In the network failure case, the Neutron port is also changed to down. + +* Controller: (step 5) sends out an instance update event to Ceilometer. In the +  network failure case, the Neutron port is also changed to down and the +  corresponding event is sent to Ceilometer.
+ +* Notifier: (step 6) Ceilometer transforms and passes the events to AODH, +  (step 7) AODH will evaluate events with the registered alarm definitions, +  then (step 8) it will fire the alarm to the "consumer" who owns the +  instance. + +* Consumer(VNFM): (step 9) receives the event and (step 10) creates a new +  instance. + +Fault management test case +"""""""""""""""""""""""""" + +Functest will call the 'doctor-test' command in Doctor to run the test job. + +The following steps are executed: + +Firstly, get the installer IP according to the installer type. Then ssh to +the installer node to get the private key for accessing the cloud. For the +'fuel' installer, ssh to the controller node to modify the nova and ceilometer +configurations. + +Secondly, prepare an image for booting the VM, then create a test project and test +user (both default to doctor) for the Doctor tests. + +Thirdly, boot a VM under the doctor project and check the VM status to verify +that the VM is launched completely. Then get the compute host info where the VM +is launched to verify connectivity to the target compute host. Get the consumer +IP according to the route to the compute IP and create an alarm event in Ceilometer +using the consumer IP. + +Fourthly, the Doctor components are started, and, based on the above preparation, +a failure is injected into the system, i.e. the network of the compute host is +disabled for 3 minutes. To ensure the host is down, the status of the host +will be checked. + +Finally, the notification time, i.e. the time between the execution of step 2 +(Monitor detects failure) and step 9 (Consumer receives failure notification), +is calculated. + +According to the Doctor requirements, the Doctor test is successful if the +notification time is below 1 second.
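The pass criterion at the end of the fault management test case can be sketched as a small check. This is only an illustration of the arithmetic described in the text, not code from the Doctor repository; the function names `notification_time` and `doctor_test_passed` and the way the two timestamps are obtained are assumptions.

```python
from datetime import datetime, timedelta

# Threshold taken from the text: the notification time must be below 1 second.
MAX_NOTIFICATION_TIME = timedelta(seconds=1)


def notification_time(detected_at: datetime, notified_at: datetime) -> timedelta:
    """Time between step 2 (Monitor detects failure) and step 9 (Consumer is notified)."""
    if notified_at < detected_at:
        raise ValueError("notification cannot precede detection")
    return notified_at - detected_at


def doctor_test_passed(detected_at: datetime, notified_at: datetime) -> bool:
    """Hypothetical helper: True when the notification time is strictly below the threshold."""
    return notification_time(detected_at, notified_at) < MAX_NOTIFICATION_TIME
```

A run that is notified 250 ms after detection would pass, while one that takes exactly 1 second would not, because the text requires the time to be below 1 second.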
diff --git a/docs/release/scenarios/maintenance/images/Fault-management-design.png b/docs/release/scenarios/maintenance/images/Fault-management-design.png new file mode 100644 index 00000000..6d98cdec Binary files /dev/null and b/docs/release/scenarios/maintenance/images/Fault-management-design.png differ diff --git a/docs/release/scenarios/maintenance/images/LICENSE b/docs/release/scenarios/maintenance/images/LICENSE new file mode 100644 index 00000000..21a2d03d --- /dev/null +++ b/docs/release/scenarios/maintenance/images/LICENSE @@ -0,0 +1,14 @@ +Copyright 2017 Open Platform for NFV Project, Inc. and its contributors + +Open Platform for NFV Project Documentation License +=================================================== +Any documentation developed by the "Open Platform for NFV Project" +is licensed under a Creative Commons Attribution 4.0 International License. +You should have received a copy of the license along with this. If not, +see . + +Unless required by applicable law or agreed to in writing, documentation +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
diff --git a/docs/release/scenarios/maintenance/images/Maintenance-design.png b/docs/release/scenarios/maintenance/images/Maintenance-design.png new file mode 100644 index 00000000..8f21db6a Binary files /dev/null and b/docs/release/scenarios/maintenance/images/Maintenance-design.png differ diff --git a/docs/release/scenarios/maintenance/images/Maintenance-workflow.png b/docs/release/scenarios/maintenance/images/Maintenance-workflow.png new file mode 100644 index 00000000..9b65fd59 Binary files /dev/null and b/docs/release/scenarios/maintenance/images/Maintenance-workflow.png differ diff --git a/docs/release/scenarios/maintenance/maintenance.rst b/docs/release/scenarios/maintenance/maintenance.rst new file mode 100644 index 00000000..ecfe76b1 --- /dev/null +++ b/docs/release/scenarios/maintenance/maintenance.rst @@ -0,0 +1,120 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 + + +Maintenance use case +"""""""""""""""""""" + +* A consumer of the NFVI wants to interact with NFVI maintenance, upgrade, +  scaling and to have graceful retirement. By receiving notifications about these +  NFVI events and responding to them within a given time window, the consumer can +  guarantee zero downtime for its service. + +The maintenance use case adds an `admin tool` and an `app manager` component +to the Doctor platform. An overview of the maintenance components can be seen in +:numref:`figure-p2`. + +.. figure:: ./images/Maintenance-design.png +   :name: figure-p2 +   :width: 100% + +   Doctor platform components in maintenance use case + +In the maintenance use case, the `app manager` (VNFM) will subscribe to maintenance +notifications triggered by project specific alarms through AODH. This is how +it learns about the different NFVI maintenance, upgrade and scaling operations that +affect its instances.
The `app manager` can do actions depicted in `green +color` or tell the `admin tool` to do admin actions depicted in `orange color`. + +Any infrastructure component like `Inspector` can subscribe to maintenance +notifications triggered by host specific alarms through AODH. Subscribing to these +notifications needs admin privileges. The notifications tell when a host is out of use +for maintenance and when it is taken back to production. + +Maintenance test case +""""""""""""""""""""" + +Maintenance test case is currently running in our Apex CI and executed by tox. +This is because of the special limitation mentioned below and also the fact that we +currently have only a sample implementation as a proof of concept and we also +support the unofficial OpenStack project Fenix. The environment variable +TEST_CASE='maintenance' needs to be set when executing "doctor_tests/main.py", +and ADMIN_TOOL_TYPE='fenix' if you want to test with Fenix instead of the sample +implementation. The test case workflow can be seen in :numref:`figure-p3`. + +.. figure:: ./images/Maintenance-workflow.png +   :name: figure-p3 +   :width: 100% + +   Maintenance test case workflow + +In the test case all compute capacity will be consumed with project (VNF) instances. +For redundant services on instances and an empty compute needed for maintenance, +the test case will need at least 3 compute nodes in the system. There will be 2 +instances on each compute, so the minimum number of VCPUs is also 2. Regardless of +how many compute nodes there are, the application will always have 2 redundant +instances (ACT-STDBY) on different compute nodes and the rest of the compute +capacity will be filled with non-redundant instances. + +For each project specific maintenance message there is a time window for the +`app manager` to take any needed action. This will guarantee zero +downtime for its service. All replies are made by calling the `admin tool` API +given in the message.
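The capacity rule described for the maintenance test case can be written out as a short calculation. This is a minimal sketch of one reading of the text (2 instances per compute, one ACT-STDBY pair, everything else non-redundant); the function name `plan_instances` and the returned dictionary layout are hypothetical and not taken from the Doctor code.

```python
def plan_instances(compute_nodes: int, instances_per_compute: int = 2) -> dict:
    """Split the initial instance count into redundant and non-redundant instances."""
    # The text requires at least 3 compute nodes for the maintenance test case.
    if compute_nodes < 3:
        raise ValueError("the maintenance test case needs at least 3 compute nodes")
    # All compute capacity is consumed: 2 instances on every compute node.
    total = compute_nodes * instances_per_compute
    # One ACT-STDBY pair, placed on different compute nodes.
    redundant = 2
    return {
        "total": total,
        "redundant": redundant,
        "non_redundant": total - redundant,
    }
```

With the minimum of 3 compute nodes this gives 6 instances in total, 2 of them redundant and 4 non-redundant.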
+ +The following steps are executed: + +The infrastructure admin will call the `admin tool` API to trigger maintenance for +compute hosts having instances belonging to a VNF. + +A project specific `MAINTENANCE` notification is triggered to tell the `app manager` +that its instances are going to be hit by infrastructure maintenance at a specific +point in time. The `app manager` will call the `admin tool` API to answer back +`ACK_MAINTENANCE`. + +When the time comes to start the actual maintenance workflow in the `admin tool`, +a `DOWN_SCALE` notification is triggered as there is no empty compute node for +maintenance (or compute upgrade). The project receives the corresponding alarm, scales +down instances and calls the `admin tool` API to answer back `ACK_DOWN_SCALE`. + +As it might happen that instances are not scaled down (removed) from a single +compute node, the `admin tool` might need to figure out which compute node should be +made empty first and send `PREPARE_MAINTENANCE` to the project, telling which instance +needs to be migrated to have the needed empty compute. The `app manager` makes sure +it is ready to migrate the instance and calls the `admin tool` API to answer back +`ACK_PREPARE_MAINTENANCE`. The `admin tool` will make the migration and answer +`ADMIN_ACTION_DONE`, so the `app manager` knows the instance can be used again. + +:numref:`figure-p3` next has a light blue section of actions to be done for each +compute. However, as we now have one empty compute, we will maintain/upgrade that +first. So on the first round, we can put the compute straight into maintenance and send +an admin level host specific `IN_MAINTENANCE` message. This is caught by the `Inspector` +so that it knows the host is down for maintenance. The `Inspector` can now disable any automatic +fault management actions for the host as it can be down for a purpose. After the +`admin tool` has completed the maintenance/upgrade, a `MAINTENANCE_COMPLETE` message +is sent to tell that the host is back in production.
+ +In the following rounds we always have instances on the compute, so we need a +`PLANNED_MAINTENANCE` message to tell that those instances are now going to be hit +by maintenance. When the `app manager` receives this message, it knows that the instances +to be moved away from the compute will now move to an already maintained/upgraded host. +In the test case no upgrade is done on the application side to upgrade instances +according to new infrastructure capabilities, but this could be done here as +this information is also passed in the message. This might mean just upgrading +some RPMs, but also totally re-instantiating an instance with a new flavor. Now if the +application runs the active side of a redundant instance on this compute, +a switch over will be done. After the `app manager` is ready, it will call the +`admin tool` API to answer back `ACK_PLANNED_MAINTENANCE`. In the test case the +answer is `migrate`, so the `admin tool` will migrate the instances and reply +`ADMIN_ACTION_DONE`, and then the `app manager` knows the instances can be used again. +Then we are ready to do the actual maintenance as previously, through the +`IN_MAINTENANCE` and `MAINTENANCE_COMPLETE` steps. + +After all computes are maintained, the `admin tool` can send `MAINTENANCE_COMPLETE` +to tell that the maintenance/upgrade is now complete. For the `app manager` this means it +can scale back to full capacity. + +There is currently a sample implementation of the VNFM and the test case. On the +infrastructure side there is a sample implementation of 'admin_tool', and +there is also support for OpenStack Fenix, which extends the use case to +support 'ETSI FEAT03' for VNFM interaction and to optimize the whole +infrastructure maintenance and upgrade. -- cgit 1.2.3-korg
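The project level message exchange described in the maintenance test case can be summarized as a lookup from the received state to the reply the `app manager` sends back. This is a sketch built only from the state names in the text; the planned maintenance state is spelled to match the `ACK_PLANNED_MAINTENANCE` reply, and the names `ACK_FOR_STATE` and `reply_for` are hypothetical. It does not show the real `admin tool` or Fenix API calls.

```python
from typing import Optional

# Project specific states that the text says are answered with an acknowledgement.
ACK_FOR_STATE = {
    "MAINTENANCE": "ACK_MAINTENANCE",
    "DOWN_SCALE": "ACK_DOWN_SCALE",
    "PREPARE_MAINTENANCE": "ACK_PREPARE_MAINTENANCE",
    "PLANNED_MAINTENANCE": "ACK_PLANNED_MAINTENANCE",
}


def reply_for(state: str) -> Optional[str]:
    """Return the reply state for a received state, or None when the text names no reply.

    ADMIN_ACTION_DONE and MAINTENANCE_COMPLETE fall in the second group: the text
    only says the app manager can use its instances again or scale back up.
    """
    return ACK_FOR_STATE.get(state)
```

Keeping the mapping in one table makes it easy to see that every reply in the text is the received state prefixed with `ACK_`.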