.. image:: opnfv-logo.png
  :height: 40
  :width: 200
  :alt: OPNFV
  :align: left

===============================================
High Availability Requirement Analysis in OPNFV
===============================================

******************
1 Introduction
******************
This document elicits the High Availability (HA) requirements of OPNFV. It refines high-level
HA goals into detailed HA mechanism designs, and relates those mechanisms to the potential
failures on different layers of OPNFV. The document can also serve as a reference for designing
HA test scenarios. The analysis uses KAOS, a goal-oriented requirements engineering model.

***************************
2 Terminologies and Symbols
***************************
The following concepts in KAOS will be used in the diagrams of this document.

- **Goal**: The objective to be met by the target system.

- **Obstacle**: Condition whose satisfaction may prevent some goals from being achieved.

- **Agent**: Active Object performing operations to achieve goals.

- **Requirement**: Goal assigned to an agent of the software being studied.

- **Domain Property**: Descriptive assertion about objects in the environment of the software.

- **Refinement**: Relationship linking a goal to other goals that are called its subgoals.
  Each subgoal contributes to the satisfaction of the goal it refines. There are two types of
  refinement: in an AND refinement the goal is achieved by satisfying all of its subgoals,
  while in an OR refinement it is achieved by satisfying any one of them.

- **Conflict**: Relationship linking an obstacle to a goal if the obstacle obstructs the goal
  from being satisfied.

- **Resolution**: Relationship linking a goal to an obstacle if the goal can resolve the
  obstacle.

- **Responsibility**: Relationship between an agent and a requirement. Holds when an agent is
  assigned the responsibility of achieving the linked requirement.

Figure 1 shows how these concepts are displayed in a KAOS diagram.

.. figure:: images/KAOS_Sample.png
    :alt: KAOS Sample
    :figclass: align-center

    Fig 1. A KAOS Sample Diagram
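
The AND and OR refinements defined above can be illustrated with a small sketch. The ``Goal``
class, its fields and the example goal names are hypothetical and chosen only for illustration;
they are not part of KAOS or of any OPNFV tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Goal:
    """A KAOS-style goal; leaf goals carry their own satisfaction flag."""
    name: str
    refinement: str = "AND"          # "AND" or "OR"; ignored for leaf goals
    subgoals: List["Goal"] = field(default_factory=list)
    satisfied: bool = False          # only consulted for leaf goals

    def is_satisfied(self) -> bool:
        if not self.subgoals:
            return self.satisfied
        results = [g.is_satisfied() for g in self.subgoals]
        # AND refinement needs every subgoal, OR refinement needs at least one.
        return all(results) if self.refinement == "AND" else any(results)


# Hypothetical goal tree, for illustration only.
vm_ha = Goal("VM HA", satisfied=True)
container_ha = Goal("Container HA", satisfied=False)
instance_ha = Goal("Virtual Instance HA", "OR", [vm_ha, container_ha])
storage_ha = Goal("Virtual Storage HA", satisfied=True)
network_ha = Goal("Virtual Network HA", satisfied=False)
nfvi_ha = Goal("NFVI HA", "AND", [instance_ha, storage_ha, network_ha])
```

In this sketch ``instance_ha`` is satisfied because one of its OR subgoals holds, whereas
``nfvi_ha`` is not satisfied until every AND subgoal, including ``network_ha``, holds.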

**********************************
3 High Availability Goals of OPNFV
**********************************

3.1 Overall Goals
>>>>>>>>>>>>>>>>>>

The final goal of OPNFV High Availability is to provide highly available VNF services. The
following objectives must be met:

- There should be no single point of failure in the NFV framework.

- All resiliency mechanisms shall be designed for a multi-vendor environment, where for example
  the NFVI, NFV-MANO, and VNFs may be supplied by different vendors.

- Resiliency related information shall always be explicitly specified and communicated using
  the reference interfaces (including policies/templates) of the NFV framework.


3.2 Service Level Agreements of OPNFV HA
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

The Service Level Agreements (SLAs) of OPNFV HA focus mainly on time constraints for service
outage, failure detection and failure recovery. Table 1 lists the time constraints for the
Service Availability Levels (SALs) described in ETSI GS NFV-REL 001 V1.1.1 (2015-01). In this
document, SAL1 is the default benchmark that must be met.

*Table 1. Time Constraints for Different Service Availability Levels*

+--------------------------------+----------------------------+------------------------+
| Service Availability Level     | Failure Detection Time     | Failure Recovery Time  |
+================================+============================+========================+
| SAL1                           | <1s                        | 5-6s                   |
+--------------------------------+----------------------------+------------------------+
| SAL2                           | <5s                        | 10-15s                 |
+--------------------------------+----------------------------+------------------------+
| SAL3                           | <10s                       | 20-25s                 |
+--------------------------------+----------------------------+------------------------+
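
As a rough illustration, measured detection and recovery times could be compared against
Table 1 as follows. This is only a sketch: the function names are hypothetical, the upper end
of each recovery range is assumed to be the limit, and detection and recovery times are
assumed to be measured separately.

```python
from typing import Dict, Optional, Tuple

# (max detection time, max recovery time) in seconds, taken from Table 1.
# Assumption: the upper end of each recovery range is the limit, and the
# detection time must be strictly below its bound.
SAL_LIMITS: Dict[str, Tuple[float, float]] = {
    "SAL1": (1.0, 6.0),
    "SAL2": (5.0, 15.0),
    "SAL3": (10.0, 25.0),
}


def meets_sal(level: str, detection_s: float, recovery_s: float) -> bool:
    """Check one measurement against the limits of a single level."""
    max_detection, max_recovery = SAL_LIMITS[level]
    return detection_s < max_detection and recovery_s <= max_recovery


def strictest_sal(detection_s: float, recovery_s: float) -> Optional[str]:
    """Return the strictest level that is satisfied, or None if none is."""
    for level in ("SAL1", "SAL2", "SAL3"):
        if meets_sal(level, detection_s, recovery_s):
            return level
    return None
```

For example, a failure detected in 0.4 s and recovered in 5.5 s would meet SAL1 under these
assumptions, while one detected in 2 s and recovered in 12 s would only meet SAL2.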


******************
4 Overall Analysis
******************
Figure 2 shows the overall decomposition of the high availability goals. The high availability
of VNF services can be refined into the high availability of VNFs, MANO, and the NFVI where the
VNFs are deployed. The high availability of the NFVI service can in turn be refined into the
high availability of virtual compute instances, virtual storage and virtual network services.
The high availability of a virtual compute instance is either the high availability of
containers or the high availability of VMs. These goals can be further decomposed according to
how the NFV environment is deployed.

.. figure:: images/Total_Framework.png
    :alt: Overall HA Analysis of OPNFV
    :figclass: align-center

    Fig 2. Overall HA Analysis of OPNFV

Thus the high availability requirement of VNF services can be classified into high availability
requirements on different layers in OPNFV. The following layers are mainly discussed in this
document:

- VNF HA

- MANO HA

- Virtual Infrastructure HA (container HA or VM HA)

- VIM HA

- SDN HA

- Hypervisor HA

- Host OS HA

- Hardware HA

The next section presents a detailed analysis of the HA requirements on these layers.

*******************
5 Detailed Analysis
*******************

5.1 VNF HA
>>>>>>>>>>>>>>>>>>

.. TBD

5.2 MANO HA
>>>>>>>>>>>>>>>>>>

.. TBD

5.3 Virtual Infrastructure HA
>>>>>>>>>>>>>>>>>>

.. TBD

5.4 VIM HA
>>>>>>>>>>>>>>>>>>

The VIM in the NFV reference architecture contains different components of OpenStack, SDN
controllers and other virtual resource controllers. VIM components can be classified into three
types:

- **Entry Point Components**: Components that expose VIM service interfaces to users, such as
  nova-api and neutron-server.

- **Middleware**: Components that provide supporting services such as load balancers, message
  queues and cluster management services.

- **Subcomponents**: Components that implement VIM functions, which are called by Entry Point
  Components but not by users directly.

Table 2 shows the potential faults that may occur on the VIM layer. Currently the main focus of
VIM HA is the service crash of VIM components, which may occur on all types of VIM components.
To keep VIM services available, Active/Active Redundancy, Active/Passive Redundancy and Message
Queue are used for different types of VIM components, as shown in Figure 3.

*Table 2. Potential Faults on the VIM Layer*

+------------+------------------+-------------------------------------------------+----------------+
| Service    | Fault            | Description                                     | Severity       |
+============+==================+=================================================+================+
| General    | Service Crash    | The processes of a service crashed abnormally.  | Critical       |
+------------+------------------+-------------------------------------------------+----------------+

.. figure:: images/VIM_Analysis.png
    :alt: VIM HA Analysis
    :figclass: align-center

    Fig 3. VIM HA Analysis


Active/Active Redundancy
::::::::::::::::::::::::::::
Active/Active Redundancy runs both the main and redundant systems concurrently. If a failure
occurs on a component, the backups are already online, so users are unlikely to notice that the
failed VIM component is being repaired. A typical Active/Active Redundancy setup has redundant
instances that are load balanced via a virtual IP address and a load balancer such as HAProxy.

When one of the redundant VIM component instances fails, the load balancer should detect the
instance failure and isolate the failed instance from being called until it is recovered.
The requirement decomposition of Active/Active Redundancy is shown in Figure 4.

.. figure:: images/Active_Active_Redundancy.png
    :alt: Active/Active Redundancy Requirement Decomposition
    :figclass: align-center

    Fig 4. Active/Active Redundancy Requirement Decomposition

The following requirements are elicited for VIM Active/Active Redundancy:

**[Req 5.4.1]** Redundant VIM components should be load balanced by a load balancer.

**[Req 5.4.2]** The load balancer should check the health status of VIM component instances.

**[Req 5.4.3]** The load balancer should isolate the failed VIM component instance until it is
recovered.

**[Req 5.4.4]** The alarm information of VIM component failure should be reported.

**[Req 5.4.5]** Failed VIM component instances should be recovered by a cluster manager.
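
The interplay of Req 5.4.1 to Req 5.4.4 can be sketched as a toy round-robin balancer. This is
an illustrative model only, not how HAProxy is implemented; the class, the instance names and
the in-memory health table are all hypothetical.

```python
from typing import Callable, Dict, List, Set


class LoadBalancer:
    """Round-robin balancer that skips instances failing a health check."""

    def __init__(self, instances: List[str], is_healthy: Callable[[str], bool]):
        self.instances = list(instances)
        self.is_healthy = is_healthy      # hypothetical health probe
        self.isolated: Set[str] = set()
        self.alarms: List[str] = []
        self._next = 0

    def check_health(self) -> None:
        """Isolate failed instances and readmit recovered ones."""
        for inst in self.instances:
            if self.is_healthy(inst):
                self.isolated.discard(inst)
            elif inst not in self.isolated:
                self.isolated.add(inst)
                self.alarms.append(f"{inst} failed health check")

    def pick(self) -> str:
        """Return the next instance in rotation that is not isolated."""
        for _ in range(len(self.instances)):
            inst = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if inst not in self.isolated:
                return inst
        raise RuntimeError("no healthy instance available")


# Hypothetical instance names and an in-memory health table.
health: Dict[str, bool] = {"nova-api-1": True, "nova-api-2": True, "nova-api-3": True}
lb = LoadBalancer(list(health), lambda name: health[name])

health["nova-api-2"] = False      # simulate a service crash
lb.check_health()
served = [lb.pick() for _ in range(4)]

health["nova-api-2"] = True       # simulate recovery by the cluster manager
lb.check_health()
```

While ``nova-api-2`` is marked unhealthy, no request is routed to it and one alarm is recorded.
Once the instance is healthy again, the next health check returns it to the rotation.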

Table 3 shows the current VIM components using Active/Active Redundancy and the corresponding
HA test cases to verify them.

*Table 3. VIM Components using Active/Active Redundancy*

+-------------------+-------------------------------------------------------+----------------------+
| Component         | Description                                           | Related HA Test Case |
+===================+=======================================================+======================+
| nova-api          | endpoint component of OpenStack Compute Service Nova  | yardstick_tc019      |
+-------------------+-------------------------------------------------------+----------------------+
| nova-novncproxy   | server daemon that serves the Nova noVNC Websocket    |                      |
|                   | Proxy service, which provides a websocket proxy that  |                      |
|                   | is compatible with OpenStack Nova noVNC consoles.     |                      |
+-------------------+-------------------------------------------------------+----------------------+
| neutron-server    | endpoint component of OpenStack Networking Service    | yardstick_tc045      |
|                   | Neutron                                               |                      |
+-------------------+-------------------------------------------------------+----------------------+
| keystone          | component of OpenStack Identity Service               | yardstick_tc046      |
|                   | Keystone                                              |                      |
+-------------------+-------------------------------------------------------+----------------------+
| glance-api        | endpoint component of OpenStack Image Service Glance  | yardstick_tc047      |
+-------------------+-------------------------------------------------------+----------------------+
| glance-registry   | server daemon that serves image metadata through a    |                      |
|                   | REST-like API.                                        |                      |
+-------------------+-------------------------------------------------------+----------------------+
| cinder-api        | endpoint component of OpenStack Block Storage Service | yardstick_tc048      |
|                   | Cinder                                                |                      |
+-------------------+-------------------------------------------------------+----------------------+
| swift-proxy       | endpoint component of OpenStack Object Storage        | yardstick_tc049      |
|                   | Service Swift                                         |                      |
+-------------------+-------------------------------------------------------+----------------------+
| horizon           | component of OpenStack Dashboard Service Horizon      |                      |
+-------------------+-------------------------------------------------------+----------------------+
| heat-api          | endpoint component of OpenStack Stack Service Heat    |                      |
+-------------------+-------------------------------------------------------+----------------------+
| mysqld            | database service of VIM components                    |                      |
+-------------------+-------------------------------------------------------+----------------------+

Active/Passive Redundancy
::::::::::::::::::::::::::::

Active/Passive Redundancy maintains a redundant instance that can be brought online when the
active service fails. Requests are handled using a virtual IP address (VIP), which facilitates
returning to service with minimal reconfiguration. A cluster manager (such as Pacemaker or
Corosync) monitors these components and brings the backup online as necessary.

When the main instance of a VIM component fails, the cluster manager should detect the failure
and bring the backup instance online. The failed instance should then be recovered so that it
can serve as a new backup instance. The requirement decomposition of Active/Passive Redundancy
is shown in Figure 5.

.. figure:: images/Active_Passive_Redundancy.png
    :alt: Active/Passive Redundancy Requirement Decomposition
    :figclass: align-center

    Fig 5. Active/Passive Redundancy Requirement Decomposition

The following requirements are elicited for VIM Active/Passive Redundancy:

**[Req 5.4.6]** The cluster manager should replace the failed main VIM component instance with
a backup instance.

**[Req 5.4.7]** The cluster manager should check the health status of VIM component instances.

**[Req 5.4.8]** Failed VIM component instances should be recovered by the cluster manager.

**[Req 5.4.9]** The alarm information of VIM component failure should be reported.
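
The failover behaviour described by Req 5.4.6 to Req 5.4.9 can be sketched as follows. The
``ClusterManager`` class and the instance names are hypothetical; the sketch does not reflect
the actual Pacemaker or Corosync interfaces and only models the state transitions.

```python
from typing import Dict, List, Optional


class ClusterManager:
    """Toy active/passive manager: one active instance, the rest standby."""

    def __init__(self, instances: List[str]):
        self.healthy: Dict[str, bool] = {name: True for name in instances}
        self.active: Optional[str] = instances[0]
        self.alarms: List[str] = []

    def standbys(self) -> List[str]:
        """Healthy instances that are not currently active."""
        return [n for n, ok in self.healthy.items() if ok and n != self.active]

    def report_failure(self, name: str) -> None:
        """Called by a (hypothetical) health monitor when an instance fails."""
        self.healthy[name] = False
        self.alarms.append(f"{name} failed")
        if name == self.active:
            candidates = self.standbys()
            # Promote the first healthy standby; the VIP would move with it.
            self.active = candidates[0] if candidates else None

    def recover(self, name: str) -> None:
        """Restart a failed instance; it rejoins as a standby."""
        self.healthy[name] = True
        if self.active is None:
            self.active = name


cm = ClusterManager(["haproxy-1", "haproxy-2"])
cm.report_failure("haproxy-1")    # active fails, haproxy-2 takes over
active_after_failover = cm.active
cm.recover("haproxy-1")           # haproxy-1 returns as the new standby
```

After the failure of ``haproxy-1`` the standby ``haproxy-2`` becomes active, and the recovered
``haproxy-1`` rejoins as the backup rather than reclaiming the active role.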


Table 4 shows the current VIM components using Active/Passive Redundancy and the corresponding
HA test cases to verify them.

*Table 4. VIM Components using Active/Passive Redundancy*

+-------------------+-------------------------------------------------------+----------------------+
| Component         | Description                                           | Related HA Test Case |
+===================+=======================================================+======================+
| haproxy           | load balancer component of VIM components             | yardstick_tc053      |
+-------------------+-------------------------------------------------------+----------------------+
| rabbitmq-server   | messaging queue service of VIM components             | yardstick_tc056      |
+-------------------+-------------------------------------------------------+----------------------+
| corosync          | cluster management component of VIM components        | yardstick_tc057      |
+-------------------+-------------------------------------------------------+----------------------+

Message Queue
::::::::::::::::::::::::::::
Message Queue provides an asynchronous communication protocol. In OpenStack, some projects
(such as Nova and Cinder) use Message Queue to call their subcomponents. Although Message Queue
itself is not an HA mechanism, the way it works ensures high availability when redundant
components subscribe to it. When a VIM subcomponent fails, requests can still be processed
because other redundant components remain subscribed to the Message Queue. Fault isolation is
also achieved, since failed components do not actively fetch requests. The recovery of failed
components is still required. Figure 6 shows the requirement decomposition of Message Queue.

.. figure:: images/Message_Queue.png
    :alt: Message Queue Requirement Decomposition
    :figclass: align-center

    Fig 6. Message Queue Redundancy Requirement Decomposition

The following requirements are elicited for Message Queue:

**[Req 5.4.10]** Redundant component instances should subscribe to the Message Queue, which is
implemented by the installer.

**[Req 5.4.11]** Failed VIM component instances should be recovered by the cluster manager.

**[Req 5.4.12]** The alarm information of VIM component failure should be reported.
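
The implicit fault isolation provided by a shared queue can be sketched with an in-memory
queue and two redundant subscribers. This is a simplified model that does not use a real
message broker; the ``Worker`` class and the worker names are hypothetical.

```python
from collections import deque
from typing import Deque, List


class Worker:
    """A subscriber that pulls requests only while it is alive."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True
        self.handled: List[str] = []

    def poll(self, queue: Deque[str]) -> None:
        # A crashed worker never fetches, so it is isolated implicitly.
        if self.alive and queue:
            self.handled.append(queue.popleft())


def drain(queue: Deque[str], workers: List[Worker]) -> None:
    """Let workers take turns polling until the queue is empty."""
    while queue:
        if not any(w.alive for w in workers):
            raise RuntimeError("no live subscriber; requests stay queued")
        for worker in workers:
            worker.poll(queue)


# Hypothetical names standing in for redundant subcomponent instances.
workers = [Worker("scheduler-1"), Worker("scheduler-2")]
requests: Deque[str] = deque(f"req-{i}" for i in range(4))

workers[0].alive = False          # simulate a crash of one subscriber
drain(requests, workers)
```

All four requests end up with ``scheduler-2``, while the crashed ``scheduler-1`` receives
nothing. Capacity is still reduced until the failed instance is recovered, which is why
Req 5.4.11 remains necessary.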

Table 5 shows the current VIM components using Message Queue and the corresponding HA test cases
to verify them.

*Table 5. VIM Components using Message Queue*

+-------------------+-------------------------------------------------------+----------------------+
| Component         | Description                                           | Related HA Test Case |
+===================+=======================================================+======================+
| nova-scheduler    | OpenStack compute component that determines how to    |                      |
|                   | dispatch compute requests                             |                      |
+-------------------+-------------------------------------------------------+----------------------+
| nova-cert         | OpenStack compute component that serves the Nova Cert |                      |
|                   | service for X509 certificates. Used to generate       |                      |
|                   | certificates for euca-bundle-image.                   |                      |
+-------------------+-------------------------------------------------------+----------------------+
| nova-conductor    | server daemon that serves the Nova Conductor service, |                      |
|                   | which provides coordination and database query        |                      |
|                   | support for Nova.                                     |                      |
+-------------------+-------------------------------------------------------+----------------------+
| nova-compute      | Handles all processes relating to instances (guest    |                      |
|                   | vms). nova-compute is responsible for building a disk |                      |
|                   | image, launching it via the underlying virtualization |                      |
|                   | driver, responding to calls to check its state,       |                      |
|                   | attaching persistent storage, and terminating it.     |                      |
+-------------------+-------------------------------------------------------+----------------------+
| nova-consoleauth  | OpenStack compute component for authentication of     |                      |
|                   | nova consoles.                                        |                      |
+-------------------+-------------------------------------------------------+----------------------+
| cinder-scheduler  | OpenStack volume storage component that decides on    |                      |
|                   | placement for newly created volumes and forwards the  |                      |
|                   | request to cinder-volume.                             |                      |
+-------------------+-------------------------------------------------------+----------------------+
| cinder-volume     | OpenStack volume storage component that receives      |                      |
|                   | volume management requests from cinder-api and        |                      |
|                   | cinder-scheduler, and routes them to storage backends |                      |
|                   | using vendor-supplied drivers.                        |                      |
+-------------------+-------------------------------------------------------+----------------------+
| heat-engine       | OpenStack Heat project server with an internal RPC    |                      |
|                   | API called by the heat-api server.                    |                      |
+-------------------+-------------------------------------------------------+----------------------+


5.5 Hypervisor HA
>>>>>>>>>>>>>>>>>>

.. TBD

5.6 Host OS HA
>>>>>>>>>>>>>>>>>>

.. TBD

5.7 Hardware HA
>>>>>>>>>>>>>>>>>>

.. TBD


******************
6 References
******************

- A KAOS Tutorial: http://www.objectiver.com/fileadmin/download/documents/KaosTutorial.pdf

- ETSI GS NFV-REL 001 V1.1.1 (2015-01):
  http://www.etsi.org/deliver/etsi_gs/NFV-REL/001_099/001/01.01.01_60/gs_NFV-REL001v010101p.pdf

- OpenStack High Availability Guide: https://docs.openstack.org/ha-guide/

- Highly Available (Mirrored) Queues: https://www.rabbitmq.com/ha.html