.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0
.. SPDX-License-Identifier CC-BY-4.0
.. (c) Open Platform for NFV Project, Inc. and its contributors
================================================================
Auto User Guide: Use Case 2 Resiliency Improvements Through ONAP
================================================================
This document provides the user guide for the Fraser release of Auto,
specifically for Use Case 2: Resiliency Improvements Through ONAP.
Description
===========
This use case illustrates VNF failure recovery time reduction with ONAP, thanks to its automated monitoring and management. It:
* simulates an underlying problem (failure, stress, or any adverse condition in the network that can impact VNFs)
* tracks a VNF
* measures the amount of time it takes for ONAP to restore the VNF functionality.
The benefit for NFV edge service providers is the ability to assess how much additional resilience VNFs gain from ONAP closed-loop control, compared with the self-managed resilience of the VIM+NFVI platform (which may be aware only of underlying resources such as VMs and servers, not of the VNF or the corresponding end-to-end Service).
Also, a problem, or challenge, may not necessarily be a failure (which could also be recovered by other layers): it could be an issue leading to suboptimal performance, without failure. A VNF management layer as provided by ONAP may detect such non-failure problems, and provide a recovery solution which no other layer could provide in a given deployment.
Preconditions:
#. hardware environment in which Edge cloud may be deployed
#. Edge cloud has been deployed and is ready for operation
#. ONAP has been deployed onto a cloud and is interfaced (i.e. provisioned for API access) to the Edge cloud
#. Components of ONAP have been deployed on the Edge cloud as necessary for specific test objectives
In future releases, Auto use cases will also include the deployment of ONAP (if not already installed), the deployment of test VNFs (pre-existing VNFs in a pre-existing ONAP can be used in the test as well), and the configuration of ONAP for monitoring these VNFs (policies, CLAMP, DCAE), in addition to the test scripts which simulate a problem and measure recovery time.
Different types of problems can be simulated, hence the identification of multiple test cases corresponding to this use case, as illustrated in this diagram:
.. image:: auto-UC02-testcases.jpg
Description of simulated problems/challenges, leading to various test cases:
* Physical Infra Failure

  * Migration upon host failure: Compute host power is interrupted, and affected workloads are migrated to other available hosts.
  * Migration upon disk failure: Disk volumes are unmounted, and affected workloads are migrated to other available hosts.
  * Migration upon link failure: Traffic on links is interrupted/corrupted, and affected workloads are migrated to other available hosts.
  * Migration upon NIC failure: NIC ports are disabled by host commands, and affected workloads are migrated to other available hosts.

* Virtual Infra Failure

  * OpenStack compute host service fail: Core OpenStack service processes on compute hosts are terminated, and auto-restored, or affected workloads are migrated to other available hosts.
  * SDNC service fail: Core SDNC service processes are terminated, and auto-restored.
  * OVS fail: OVS bridges are disabled, and affected workloads are migrated to other available hosts.
  * etc.

* Security

  * Host tampering: Host tampering is detected, the host is fenced, and affected workloads are migrated to other available hosts.
  * Host intrusion: Host intrusion attempts are detected, an offending workload, device, or flow is identified and fenced, and as needed affected workloads are migrated to other available hosts.
  * Network intrusion: Network intrusion attempts are detected, and an offending flow is identified and fenced.

Test execution high-level description
=====================================
The following two MSCs (Message Sequence Charts) show the actors and high-level interactions.
The first MSC shows the preparation activities (assuming the hardware, network, cloud, and ONAP have already been installed): onboarding and deployment of VNFs (via ONAP portal and modules in sequence: SDC, VID, SO), and ONAP configuration (policy framework, closed-loops in CLAMP, activation of DCAE).
.. image:: auto-UC02-preparation.jpg
The second MSC illustrates the pattern of all test cases for the Resiliency Improvements:
* simulate the chosen problem (a.k.a. a "Challenge") for this test case, for example suspend a VM which may be used by a VNF
* start tracking the target VNF of this test case
* measure the ONAP-orchestrated VNF Recovery Time
* then the test stops simulating the problem (for example: resume the VM that was suspended)
In parallel, the MSC also shows the sequence of events happening in ONAP, thanks to its configuration to provide Service Assurance for the VNF.
.. image:: auto-UC02-pattern.jpg
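
The test pattern described above (start the challenge, track the VNF, measure the recovery time, stop the challenge) might be sketched as follows. This is only an illustration, not the Auto implementation: the function name, the three callables, the timeout and the polling interval are all assumptions. It also assumes that the health check reports the VNF as unhealthy as soon as the challenge starts; a real test would probably first wait for the failure to be observed.

```python
import time
from typing import Callable, Optional


def measure_recovery_time(
    start_challenge: Callable[[], None],
    stop_challenge: Callable[[], None],
    vnf_is_healthy: Callable[[], bool],
    timeout_s: float = 300.0,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return the recovery time in seconds, or None if there was no recovery.

    All parameters are hypothetical: the callables stand in for whatever
    commands start/stop the challenge and query the tracked VNF.
    """
    start_challenge()
    challenge_start = clock()
    try:
        # Poll the tracked VNF until it is healthy again or the timeout expires.
        while clock() - challenge_start < timeout_s:
            if vnf_is_healthy():
                return clock() - challenge_start
            sleep(poll_interval_s)
        return None
    finally:
        # Always stop simulating the problem, even if the test fails.
        stop_challenge()
```

The ``clock`` and ``sleep`` parameters are injectable only so that the logic can be exercised without real waiting. Returning ``None`` corresponds to a test execution for which no recovery time could be measured.
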
Test design: data model, implementation modules
===============================================
The high-level design of classes identifies several entities, described as follows:
* ``Test Case`` : as identified above, each is a special case of the overall use case (e.g., categorized by challenge type)
* ``Test Definition`` : gathers all the information necessary to run a certain test case
* ``Metric Definition`` : describes a certain metric that may be measured for a Test Case, in addition to Recovery Time
* ``Challenge Definition`` : describes the challenge (problem, failure, stress, ...) simulated by the test case
* ``Recipient`` : entity that can receive commands and send responses, and that is queried by the Test Definition or Challenge Definition (a recipient would typically be a management service, with interfaces (CLI or API) for clients to query)
* ``Resources`` : with 3 types (VNF, cloud virtual resource such as a VM, physical resource such as a server)
Three of these entities have execution-time corresponding classes:
* ``Test Execution`` , which captures all the relevant data of the execution of a Test Definition
* ``Challenge Execution`` , which captures all the relevant data of the execution of a Challenge Definition
* ``Metric Value`` , which captures the quantitative measurement of a Metric Definition (with a timestamp)
.. image:: auto-UC02-data1.jpg
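
As a rough illustration of how definition entities and their execution-time counterparts could relate, the sketch below uses Python dataclasses. The attribute names (``ID``, ``recipient_ID``, ``test_case_ID``, and so on) are assumptions made for this example and are not taken from the Auto code; the diagrams in this section remain the reference for the real attributes.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ChallengeDefinition:
    """Describes the problem simulated by a test case (hypothetical fields)."""
    ID: int
    name: str
    recipient_ID: int  # assumed link to the Recipient receiving the commands


@dataclass
class TestDefinition:
    """Gathers the information needed to run a test case (hypothetical fields)."""
    ID: int
    name: str
    test_case_ID: int
    challenge_def_ID: int


@dataclass
class MetricValue:
    """A timestamped measurement of a Metric Definition."""
    metric_def_ID: int
    value: float
    timestamp: datetime


@dataclass
class TestExecution:
    """Captures the data of one execution of a Test Definition."""
    test_def_ID: int
    start_time: datetime
    finish_time: Optional[datetime] = None
    # default_factory gives every execution its own list of measurements
    metric_values: List[MetricValue] = field(default_factory=list)
```

Keeping definitions and executions in separate classes lets one definition be run many times, with each run accumulating its own metric values.
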
The following diagram illustrates an implementation-independent design of the attributes of these entities:
.. image:: auto-UC02-data2.jpg
This next diagram shows the Python classes and attributes, as implemented by this Use Case (for all test cases):
.. image:: auto-UC02-data3.jpg
Test definition data is stored in serialization files (Python pickles), while test execution data is stored in CSV files, for easier post-analysis.
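
The storage split described above could look like the following minimal sketch, which uses only the standard ``pickle`` and ``csv`` modules. The function names, file layout and CSV column names are assumptions for illustration, not the ones used by Auto.

```python
import csv
import pickle
from pathlib import Path


def save_definitions(definitions, path):
    """Serialize a list of definition objects to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(definitions, f)


def load_definitions(path):
    """Load definitions back; only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


def append_execution_row(path, row, fieldnames):
    """Append one execution record to a CSV file, writing the header once."""
    file_exists = Path(path).exists()
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if not file_exists:
            writer.writeheader()
        writer.writerow(row)
```

Appending one row per execution keeps the CSV file directly usable in a spreadsheet or data-analysis tool, which matches the stated goal of easier post-analysis.
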
The module design is straightforward: functions and classes for managing data, for interfacing with recipients, for executing tests, and for interacting with the test user (choosing a Test Definition, showing the details of a Test Definition, starting the execution).
.. image:: auto-UC02-module1.jpg
This last diagram shows the test user menu functions, when used interactively:
.. image:: auto-UC02-module2.jpg
In future releases of Auto, testing environments such as Robot, FuncTest and Yardstick might be leveraged. Use Case code will then be invoked by API, not by a CLI interaction.
Also, anonymized test results could be collected from users willing to share them, and aggregates could be
maintained as benchmarks.
As further illustration, the next figure shows cardinalities of class instances: one Test Definition per Test Case, multiple Test Executions per Test Definition, zero or one Recovery Time Metric Value per Test Execution (zero if the test failed for any reason, including if ONAP failed to recover from the challenge), etc.
.. image:: auto-UC02-cardinalities.png
In this particular implementation, both Test Definition and Challenge Definition classes have a generic execution method (e.g., ``run_test_code()`` for Test Definition) which can invoke a particular script, by way of an ID (which can be configured, and serves as a script selector for each Test Definition instance). The overall test execution logic between classes is shown in the next figure.
.. image:: auto-UC02-logic.png
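
One possible way to realize a generic execution method that selects a script by ID is a small registry, sketched below. Only the method name ``run_test_code()`` comes from this guide; the registry, the decorator, the ``test_code_ID`` attribute and the example script are hypothetical.

```python
from typing import Callable, Dict

# Registry mapping a configurable script ID to the function implementing it.
TEST_SCRIPTS: Dict[int, Callable] = {}


def register_script(code_ID: int):
    """Decorator that registers a function under the given script ID."""
    def decorator(func):
        TEST_SCRIPTS[code_ID] = func
        return func
    return decorator


class TestDefinition:
    def __init__(self, ID: int, name: str, test_code_ID: int):
        self.ID = ID
        self.name = name
        self.test_code_ID = test_code_ID  # serves as the script selector

    def run_test_code(self):
        """Generic entry point: look up and invoke the script for this ID."""
        script = TEST_SCRIPTS.get(self.test_code_ID)
        if script is None:
            raise ValueError(
                f"no test script registered for ID {self.test_code_ID}")
        return script(self)


@register_script(5)
def vm_failure_script(test_def):
    # Placeholder body; a real script would start the challenge,
    # measure the recovery time and record the results.
    return f"ran VM failure script for '{test_def.name}'"
```

With this arrangement, pointing a Test Definition at a different script is a matter of changing its ``test_code_ID``, without modifying the class itself.
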
The execution of a test case starts with invoking the generic method from Test Definition, which then creates Execution instances, invokes Challenge Definition methods, performs the Recovery time calculation, performs script-specific actions, and writes results to the CSV files.
Finally, the following diagram shows a mapping between these class instances and the initial test case design. It corresponds to the test case which simulates a VM failure, and shows how the OpenStack SDK API is invoked (with a connection object) by the Challenge Definition methods, to suspend and resume a VM.
.. image:: auto-UC02-TC-mapping.png
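
The suspend/resume challenge for the VM failure test case might be sketched as follows. The helper name ``suspended_vm`` and the VM and cloud names are assumptions. The connection object is expected to be an OpenStack SDK connection, whose compute proxy provides ``find_server``, ``suspend_server`` and ``resume_server``; it is passed in rather than created here so that the sketch does not depend on a particular cloud configuration.

```python
from contextlib import contextmanager


@contextmanager
def suspended_vm(conn, vm_name):
    """Suspend a VM for the duration of the ``with`` block, then resume it.

    ``conn`` is assumed to be an OpenStack SDK connection object, e.g.
    created with ``openstack.connect(cloud="my-edge-cloud")``, where
    "my-edge-cloud" is a hypothetical entry in clouds.yaml.
    """
    server = conn.compute.find_server(vm_name)
    if server is None:
        raise LookupError(f"VM not found: {vm_name}")
    conn.compute.suspend_server(server)   # start of the challenge
    try:
        yield server
    finally:
        conn.compute.resume_server(server)  # stop simulating the problem
```

A test script could then wrap its measurement in ``with suspended_vm(conn, "my-test-vm"):`` so that the VM is resumed even if the measurement raises an exception.
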