Merge "Add High Availability Analysis Document"

author: Qiao Fu <fuqiao@chinamobile.com> 2018-03-23 08:54:02 +0000
committer: Gerrit Code Review <gerrit@opnfv.org> 2018-03-23 08:54:02 +0000
commit: 30dcfd148572763ff773e71dd67276e0c23925ea (patch)
tree: 34e5e24322d1c304e21ef62f85b4386e2bc97e20 /R4_HA_Analysis/HA_Analysis.rst
parent: 1c2ce630bf5eba741860232b89e67afc2f7cf71c (diff)
parent: b80e36ccbfa698d6f98f426afadbeb6c0c7a264b (diff)
1 files changed, 406 insertions, 0 deletions
diff --git a/R4_HA_Analysis/HA_Analysis.rst b/R4_HA_Analysis/HA_Analysis.rst
new file mode 100644
index 0000000..06c0487
--- /dev/null
+++ b/R4_HA_Analysis/HA_Analysis.rst
@@ -0,0 +1,406 @@
+.. image:: opnfv-logo.png
+  :height: 40
+  :width: 200
+  :alt: OPNFV
+  :align: left
+
+===============================================
+High Availability Requirement Analysis in OPNFV
+===============================================
+
+******************
+1 Introduction
+******************
+This High Availability Requirement Analysis Document elicits the High Availability (HA)
+requirements of OPNFV. It refines high-level HA goals into detailed HA mechanism designs,
+and relates those mechanisms to the potential failures on the different layers of OPNFV.
+The document can also be used as a reference for designing HA test scenarios.
+
+KAOS, a goal-oriented requirements engineering model, is used throughout this document.
+
+***************************
+2 Terminologies and Symbols
+***************************
+The following concepts in KAOS will be used in the diagrams of this document.
+
+- **Goal**: The objective to be met by the target system.
+
+- **Obstacle**: Condition whose satisfaction may prevent some goals from being achieved.
+
+- **Agent**: Active Object performing operations to achieve goals.
+
+- **Requirement**: Goal assigned to an agent of the software being studied.
+
+- **Domain Property**: Descriptive assertion about objects in the environment of the software.
+
+- **Refinement**: Relationship linking a goal to other goals that are called its subgoals.
+  Each subgoal contributes to the satisfaction of the goal it refines. There are two types of
+  refinement: in an AND refinement the goal is achieved by satisfying all of its subgoals,
+  while in an OR refinement it is achieved by satisfying any one of its subgoals.
+
+- **Conflict**: Relationship linking an obstacle to a goal if the obstacle obstructs the goal
+  from being satisfied.
+
+- **Resolution**: Relationship linking a goal to an obstacle if the goal can resolve the
+  obstacle.
+
+- **Responsibility**: Relationship between an agent and a requirement. Holds when an agent is
+  assigned the responsibility of achieving the linked requirement.
+
+Figure 1 shows how these concepts are displayed in a KAOS diagram.
+
+.. figure:: images/KAOS_Sample.png
+    :alt: KAOS Sample
+    :figclass: align-center
+
+    Fig 1. A KAOS Sample Diagram
+
+**********************************
+3 High Availability Goals of OPNFV
+**********************************
+
+3.1 Overall Goals
+>>>>>>>>>>>>>>>>>>
+
+The final goal of OPNFV High Availability is to provide highly available VNF services. The
+following objectives must be met:
+
+- There should be no single point of failure in the NFV framework.
+
+- All resiliency mechanisms shall be designed for a multi-vendor environment, where for example
+  the NFVI, NFV-MANO, and VNFs may be supplied by different vendors.
+
+- Resiliency related information shall always be explicitly specified and communicated using
+  the reference interfaces (including policies/templates) of the NFV framework.
+
+
+
+3.2 Service Level Agreements of OPNFV HA
+>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
+
+The Service Level Agreements of OPNFV HA focus mainly on time constraints for service outage,
+failure detection and failure recovery. Table 1 outlines the time constraints for the
+different Service Availability Levels (SALs) described in ETSI GS NFV-REL 001 V1.1.1
+(2015-01). In this document, SAL1 is the default benchmark that HA mechanisms are required
+to meet.
+
+*Table 1. Time Constraints for Different Service Availability Levels*
+
++--------------------------------+----------------------------+------------------------+
+| Service Availability Level     | Failure Detection Time     | Failure Recovery Time  |
++================================+============================+========================+
+| SAL1                           | <1s                        | 5-6s                   |
++--------------------------------+----------------------------+------------------------+
+| SAL2                           | <5s                        | 10-15s                 |
++--------------------------------+----------------------------+------------------------+
+| SAL3                           | <10s                       | 20-25s                 |
++--------------------------------+----------------------------+------------------------+
+
+
+******************
+4 Overall Analysis
+******************
+Figure 2 shows the overall decomposition of the high availability goals. The high
+availability of VNF Services can be refined to the high availability of VNFs, MANO, and the
+NFVI where VNFs are deployed. The high availability of NFVI Services can be refined to the
+high availability of Virtual Compute Instances, Virtual Storage and Virtual Network
+Services. The high availability of Virtual Compute Instances is either the high
+availability of containers or that of VMs. These goals can be further decomposed according
+to how the NFV environment is deployed.
+
+.. figure:: images/Total_Framework.png
+    :alt: Overall HA Analysis of OPNFV
+    :figclass: align-center
+
+    Fig 2. Overall HA Analysis of OPNFV
+
+Thus the high availability requirement of VNF services can be classified into high availability
+requirements on different layers in OPNFV. The following layers are mainly discussed in this
+document:
+
+- VNF HA
+
+- MANO HA
+
+- Virtual Infrastructure HA (container HA or VM HA)
+
+- VIM HA
+
+- SDN HA
+
+- Hypervisor HA
+
+- Host OS HA
+
+- Hardware HA
+
+The next section analyses the HA requirements on these layers in detail.
+
+*******************
+5 Detailed Analysis
+*******************
+
+5.1 VNF HA
+>>>>>>>>>>>>>>>>>>
+
+.. TBD
+
+5.2 MANO HA
+>>>>>>>>>>>>>>>>>>
+
+.. TBD
+
+5.3 Virtual Infrastructure HA
+>>>>>>>>>>>>>>>>>>
+
+.. TBD
+
+5.4 VIM HA
+>>>>>>>>>>>>>>>>>>
+
+The VIM in the NFV reference architecture contains different OpenStack components, SDN
+controllers and other virtual resource controllers. VIM components can be classified into three
+types:
+
+- **Entry Point Components**: Components that expose VIM service interfaces to users, such as
+  nova-api and neutron-server.
+
+- **Middleware**: Components that provide load balancer services, message queues, cluster
+  management services, etc.
+
+- **Subcomponents**: Components that implement VIM functions, which are called by Entry Point
+  Components but not by users directly.
+
+Table 2 shows the potential faults that may occur on the VIM layer. Currently the main focus
+of VIM HA is the service crash of VIM components, which may occur on all types of VIM
+components. To keep VIM services available, Active/Active Redundancy, Active/Passive
+Redundancy and Message Queue are used for different types of VIM components, as shown in
+Figure 3.
+
+*Table 2. Potential Faults at the VIM Level*
+
++------------+------------------+-------------------------------------------------+----------------+
+| Service    | Fault            | Description                                     | Severity       |
++============+==================+=================================================+================+
+| General    | Service Crash    | The processes of a service crash abnormally.    | Critical       |
++------------+------------------+-------------------------------------------------+----------------+
+
+.. figure:: images/VIM_Analysis.png
+    :alt: VIM HA Analysis
+    :figclass: align-center
+
+    Fig 3. VIM HA Analysis
+
+
+Active/Active Redundancy
+::::::::::::::::::::::::::::
+Active/Active Redundancy manages both the main and redundant systems concurrently. If a
+component fails, the backups are already online, so users are unlikely to notice that the
+failed VIM component is being repaired. A typical Active/Active deployment has redundant
+instances that are load balanced via a virtual IP address and a load balancer such as
+HAProxy.
+
+When one of the redundant VIM component instances fails, the load balancer should detect the
+failure and isolate the failed instance so that it receives no requests until it is recovered.
+The requirement decomposition of Active/Active Redundancy is shown in Figure 4.
+
+.. figure:: images/Active_Active_Redundancy.png
+    :alt: Active/Active Redundancy Requirement Decomposition
+    :figclass: align-center
+
+    Fig 4. Active/Active Redundancy Requirement Decomposition
+
+The following requirements are elicited for VIM Active/Active Redundancy:
+
+**[Req 5.4.1]** Redundant VIM components should be load balanced by a load balancer.
+
+**[Req 5.4.2]** The load balancer should check the health status of VIM component instances.
+
+**[Req 5.4.3]** The load balancer should isolate the failed VIM component instance until it is
+recovered.
+
+**[Req 5.4.4]** The alarm information of VIM component failure should be reported.
+
+**[Req 5.4.5]** Failed VIM component instances should be recovered by a cluster manager.
+
+Table 3 shows the current VIM components using Active/Active Redundancy and the corresponding
+HA test cases to verify them.
+
+*Table 3. VIM Components using Active/Active Redundancy*
+
++-------------------+-------------------------------------------------------+----------------------+
+| Component         | Description                                           | Related HA Test Case |
++===================+=======================================================+======================+
+| nova-api          | endpoint component of OpenStack Compute Service Nova  | yardstick_tc019      |
++-------------------+-------------------------------------------------------+----------------------+
+| nova-novncproxy   | server daemon that serves the Nova noVNC Websocket    |                      |
+|                   | Proxy service, which provides a websocket proxy that  |                      |
+|                   | is compatible with OpenStack Nova noVNC consoles.     |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| neutron-server    | endpoint component of OpenStack Networking Service    | yardstick_tc045      |
+|                   | Neutron                                               |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| keystone          | component of OpenStack Identity Service               | yardstick_tc046      |
+|                   | Keystone                                              |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| glance-api        | endpoint component of OpenStack Image Service Glance  | yardstick_tc047      |
++-------------------+-------------------------------------------------------+----------------------+
+| glance-registry   | server daemon that serves image metadata through a    |                      |
+|                   | REST-like API.                                        |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| cinder-api        | endpoint component of OpenStack Block Storage Service | yardstick_tc048      |
+|                   | Cinder                                                |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| swift-proxy       | endpoint component of OpenStack Object Storage        | yardstick_tc049      |
+|                   | Service Swift                                         |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| horizon           | component of OpenStack Dashboard Service Horizon      |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| heat-api          | endpoint component of OpenStack Stack Service Heat    |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| mysqld            | database service of VIM components                    |                      |
++-------------------+-------------------------------------------------------+----------------------+
+
+Active/Passive Redundancy
+::::::::::::::::::::::::::::
+
+Active/Passive Redundancy maintains a redundant instance that can be brought online when the
+active service fails; in other words, replacement resources are kept ready until they are
+required. Requests are handled using a virtual IP address (VIP), which facilitates returning
+to service with minimal reconfiguration. A cluster manager (such as Pacemaker or Corosync)
+monitors these components and brings the backup online as necessary.
+
+When the main instance of a VIM component fails, the cluster manager should detect the
+failure and bring the backup instance online. The failed instance should then be recovered
+so that it can act as the new backup instance. The requirement decomposition of
+Active/Passive Redundancy is shown in Figure 5.
+
+.. figure:: images/Active_Passive_Redundancy.png
+    :alt: Active/Passive Redundancy Requirement Decomposition
+    :figclass: align-center
+
+    Fig 5. Active/Passive Redundancy Requirement Decomposition
+
+The following requirements are elicited for VIM Active/Passive Redundancy:
+
+**[Req 5.4.6]** The cluster manager should replace the failed main VIM component instance with
+a backup instance.
+
+**[Req 5.4.7]** The cluster manager should check the health status of VIM component instances.
+
+**[Req 5.4.8]** Failed VIM component instances should be recovered by the cluster manager.
+
+**[Req 5.4.9]** The alarm information of VIM component failure should be reported.
+
+
+Table 4 shows the current VIM components using Active/Passive Redundancy and the corresponding
+HA test cases to verify them.
+
+*Table 4. VIM Components using Active/Passive Redundancy*
+
++-------------------+-------------------------------------------------------+----------------------+
+| Component         | Description                                           | Related HA Test Case |
++===================+=======================================================+======================+
+| haproxy           | load balancer component of VIM components             | yardstick_tc053      |
++-------------------+-------------------------------------------------------+----------------------+
+| rabbitmq-server   | messaging queue service of VIM components             | yardstick_tc056      |
++-------------------+-------------------------------------------------------+----------------------+
+| corosync          | cluster management component of VIM components        | yardstick_tc057      |
++-------------------+-------------------------------------------------------+----------------------+
+
+Message Queue
+::::::::::::::::::::::::::::
+Message Queue provides an asynchronous communication protocol. In OpenStack, some projects
+(such as Nova and Cinder) use Message Queue to call their subcomponents. Although Message
+Queue itself is not an HA mechanism, the way it works ensures high availability when
+redundant components subscribe to the same queue. When a VIM subcomponent fails, requests
+can still be processed because other redundant components are subscribed to the Message
+Queue. Fault isolation is also achieved, since failed components no longer fetch requests.
+The recovery of failed components is still required. Figure 6 shows the requirement
+decomposition of Message Queue.
+
+.. figure:: images/Message_Queue.png
+    :alt: Message Queue Requirement Decomposition
+    :figclass: align-center
+
+    Fig 6. Message Queue Redundancy Requirement Decomposition
+
+The following requirements are elicited for Message Queue:
+
+**[Req 5.4.10]** Redundant component instances should subscribe to the Message Queue, which is
+implemented by the installer.
+
+**[Req 5.4.11]** Failed VIM component instances should be recovered by the cluster manager.
+
+**[Req 5.4.12]** The alarm information of VIM component failure should be reported.
+
+Table 5 shows the current VIM components using Message Queue and the corresponding HA test cases
+to verify them.
+
+*Table 5. VIM Components using Message Queue*
+
++-------------------+-------------------------------------------------------+----------------------+
+| Component         | Description                                           | Related HA Test Case |
++===================+=======================================================+======================+
+| nova-scheduler    | OpenStack compute component that determines how to    |                      |
+|                   | dispatch compute requests                             |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| nova-cert         | OpenStack compute component that serves the Nova Cert |                      |
+|                   | service for X509 certificates. Used to generate       |                      |
+|                   | certificates for euca-bundle-image.                   |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| nova-conductor    | server daemon that serves the Nova Conductor service, |                      |
+|                   | which provides coordination and database query        |                      |
+|                   | support for Nova.                                     |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| nova-compute      | Handles all processes relating to instances (guest    |                      |
+|                   | VMs). nova-compute is responsible for building a disk |                      |
+|                   | image, launching it via the underlying virtualization |                      |
+|                   | driver, responding to calls to check its state,       |                      |
+|                   | attaching persistent storage, and terminating it.     |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| nova-consoleauth  | OpenStack compute component for authentication of     |                      |
+|                   | nova consoles.                                        |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| cinder-scheduler  | OpenStack volume storage component that decides on    |                      |
+|                   | placement for newly created volumes and forwards the  |                      |
+|                   | request to cinder-volume.                             |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| cinder-volume     | OpenStack volume storage component that receives      |                      |
+|                   | volume management requests from cinder-api and        |                      |
+|                   | cinder-scheduler, and routes them to storage backends |                      |
+|                   | using vendor-supplied drivers.                        |                      |
++-------------------+-------------------------------------------------------+----------------------+
+| heat-engine       | OpenStack Heat project server with an internal RPC    |                      |
+|                   | API called by the heat-api server.                    |                      |
++-------------------+-------------------------------------------------------+----------------------+
+
+
+5.5 Hypervisor HA
+>>>>>>>>>>>>>>>>>>
+
+.. TBD
+
+5.6 Host OS HA
+>>>>>>>>>>>>>>>>>>
+
+.. TBD
+
+5.7 Hardware HA
+>>>>>>>>>>>>>>>>>>
+
+.. TBD
+
+
+******************
+6 References
+******************
+
+- A KAOS Tutorial: http://www.objectiver.com/fileadmin/download/documents/KaosTutorial.pdf
+
+- ETSI GS NFV-REL 001 V1.1.1 (2015-01):
+  http://www.etsi.org/deliver/etsi_gs/NFV-REL/001_099/001/01.01.01_60/gs_NFV-REL001v010101p.pdf
+
+- OpenStack High Availability Guide: https://docs.openstack.org/ha-guide/
+
+- Highly Available (Mirrored) Queues: https://www.rabbitmq.com/ha.html
+\ No newline at end of file
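
Illustration only, not part of the patch. The Active/Active requirements in the added document (Req 5.4.1 to Req 5.4.5) describe a load balancer that spreads requests over redundant instances, checks their health, isolates failed instances and reports an alarm. A minimal Python sketch of that behaviour might look as follows. The `LoadBalancer` class, its `probes` mapping, the round-robin policy and the instance names are all hypothetical and chosen for illustration; they are not taken from HAProxy, Pacemaker or any OPNFV installer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class LoadBalancer:
    """Toy round-robin balancer over redundant VIM component instances."""

    # Hypothetical mapping: instance name -> probe returning True when healthy.
    probes: Dict[str, Callable[[], bool]]
    isolated: Set[str] = field(default_factory=set)
    alarms: List[str] = field(default_factory=list)
    _cursor: int = 0

    def check_health(self) -> None:
        """Probe every instance (Req 5.4.2), isolate failures (Req 5.4.3)
        and record an alarm for each newly failed instance (Req 5.4.4)."""
        for name, probe in self.probes.items():
            healthy = probe()
            if not healthy and name not in self.isolated:
                self.isolated.add(name)
                self.alarms.append(f"{name} failed")
            elif healthy and name in self.isolated:
                # Assumed: something else (e.g. a cluster manager, Req 5.4.5)
                # recovered the instance, so it rejoins the pool.
                self.isolated.discard(name)

    def route(self) -> str:
        """Return the next non-isolated instance in round-robin order (Req 5.4.1)."""
        pool = [name for name in self.probes if name not in self.isolated]
        if not pool:
            raise RuntimeError("no healthy instance available")
        chosen = pool[self._cursor % len(pool)]
        self._cursor += 1
        return chosen


if __name__ == "__main__":
    # Simulated health state of two hypothetical nova-api instances.
    status = {"nova-api-1": True, "nova-api-2": True}
    lb = LoadBalancer({name: (lambda n=name: status[n]) for name in status})

    status["nova-api-1"] = False  # simulate a service crash
    lb.check_health()
    print(lb.route(), lb.alarms)
```

Keeping detection (`check_health`) separate from dispatch (`route`) mirrors the split between Req 5.4.2 and Req 5.4.1; recovery itself is left outside the sketch because the document assigns it to the cluster manager rather than to the load balancer.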