.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

================================
VNF high availability across VIM
================================

Problem description
===================

Abstract
--------

A VNF (telecom application) should be able to realize a high availability
deployment across OpenStack instances.

Description
-----------
A VNF (telecom application running over cloud) may already be designed as
Active-Standby, Active-Active or N-Way to achieve high availability.

With a telecoms focus, this generally refers both to availability of service
(i.e. the ability to make new calls) and to maintenance of ongoing control
plane state and active media processing (i.e. “keeping up” existing calls).

Traditionally, telecoms systems are designed to maintain state and calls across
pretty much the full range of single-point failures. This includes failure of a
power supply, hard drive, physical server or network switch, but also covers
software failure and maintenance operations such as software upgrade.

Providing this support typically requires state replication between
application instances (directly, via replicated database services, or via a
privately designed message format). It may also require special-case handling
of media endpoints, to allow transfer of media on short time scales (<1s)
without requiring end-to-end resignalling (e.g. RTP redirection via IP / MAC
address transfers, cf. VRRP).
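
As an illustration of the state replication described above, the following is
a minimal, in-process sketch of active-to-standby replication of call state.
The classes ``ActiveInstance`` and ``StandbyReplica`` are hypothetical names,
and a direct method call stands in for whatever transport (direct messaging, a
replicated database or a private message format) a real application would use.

.. code-block:: python

    from dataclasses import dataclass, field


    @dataclass
    class StandbyReplica:
        """Hypothetical standby that applies replicated call-state updates."""
        calls: dict = field(default_factory=dict)
        last_seq: int = 0

        def apply(self, seq: int, call_id: str, state) -> bool:
            # Discard duplicated or stale updates. A gap (seq > last_seq + 1)
            # would normally trigger a resync; this sketch applies it anyway.
            if seq <= self.last_seq:
                return False
            if state is None:
                self.calls.pop(call_id, None)  # call released
            else:
                self.calls[call_id] = state
            self.last_seq = seq
            return True


    class ActiveInstance:
        """Hypothetical active instance mirroring each change to its standby."""

        def __init__(self, peer: StandbyReplica):
            self.peer = peer
            self.calls = {}
            self.seq = 0

        def update(self, call_id: str, state) -> None:
            if state is None:
                self.calls.pop(call_id, None)
            else:
                self.calls[call_id] = state
            self.seq += 1
            # Synchronous call stands in for replication over the backup plane.
            self.peer.apply(self.seq, call_id, state)

Sequence numbers let the standby ignore duplicated or reordered updates, so
after a failover it holds the latest state the active instance managed to
replicate.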

With a migration to NFV, a commonly expressed desire by carriers is to provide
the same resilience to any single point(s) of failure in the cloud
infrastructure.

This could be done by making each cloud instance fully HA (a non-trivial task to
do right, and to prove it has been done right), but the preferred approach
appears to be to accept the currently limited availability of a given cloud
instance (there is no desire to radically rework this for telecoms), and instead
to provide solution availability by spreading function across multiple cloud
instances (i.e. the same approach used today to deal with hardware and software
failures).

A further advantage of this approach is that it provides a good basis for
seamless upgrade of the infrastructure software: you can spin up an additional
up-level cloud, gradually transfer resources / app instances over from one of
your other clouds, and finally turn down the old cloud instance when it is no
longer required.

If fast media / control failover is still required (which many/most carriers
still seem to believe it is), there are some interesting/hard requirements on
the networking between cloud instances. To help with this, many people appear
willing to provide multiple “independent” cloud instances in a single geographic
site, with special networking between clouds in that physical site.
"Independent" is in quotes because some coordination between cloud instances is
obviously required, but this has to be implemented in a fashion which reduces
the potential for correlated failure to very low levels (at least as low as the
required overall application availability).
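
The argument for spreading a function across several cloud instances, and the
warning about correlated failure, can be made concrete with the standard
availability formulas. The sketch below assumes that instance failures are
statistically independent and models the coordination between clouds as a
single shared dependency; the figures in the usage note are illustrative
assumptions, not measurements.

.. code-block:: python

    from math import prod


    def parallel_availability(availabilities):
        """Probability that at least one instance is up.

        Assumes instance failures are statistically independent.
        """
        return 1 - prod(1 - a for a in availabilities)


    def solution_availability(availabilities, shared=1.0):
        """Scale by the availability of a dependency shared by all instances.

        A failure of the shared dependency (e.g. the coordination between
        clouds) is a correlated failure: it takes every instance down at once.
        """
        return shared * parallel_availability(availabilities)

With two clouds at 0.999 each, ``parallel_availability([0.999, 0.999])`` is
about 0.999999, but a shared dependency at 0.9999 caps
``solution_availability`` at roughly 0.9999. The correlated part therefore has
to be at least as reliable as the availability the application must achieve.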

Analysis of requirements for OpenStack
======================================
A VNF often has different networking planes for different purposes:

* External network plane: used for communication with other VNFs.
* Components inter-communication plane: one VNF often consists of several
  components, and this plane is designed for communication between those
  components.
* Backup plane: used for heartbeat or state replication between the
  components of an active/standby, active/active or N-way cluster.
* Management plane: mainly for management purposes.
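
The backup plane carries heartbeats so that a standby can detect that its
active peer has failed. A minimal sketch of such a detector follows; the class
``HeartbeatMonitor``, the heartbeat interval and the number of missed
heartbeats tolerated are all assumptions, and timestamps are passed in
explicitly instead of being read from a clock or a socket so the logic stays
easy to test.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class HeartbeatMonitor:
        """Hypothetical standby-side failure detector for the backup plane.

        The peer is declared failed once no heartbeat has arrived for more
        than interval * missed_limit seconds.
        """
        interval: float = 0.1     # assumed heartbeat period, in seconds
        missed_limit: int = 3     # assumed number of missed heartbeats tolerated
        last_seen: Optional[float] = None

        def on_heartbeat(self, now: float) -> None:
            # Record the arrival time of the latest heartbeat from the peer.
            self.last_seen = now

        def peer_failed(self, now: float) -> bool:
            if self.last_seen is None:
                return False  # peer never seen yet: nothing to take over from
            return now - self.last_seen > self.interval * self.missed_limit

For example, after ``on_heartbeat(10.0)`` the peer is still considered alive at
``now=10.25`` but declared failed at ``now=10.31``. A shorter threshold speeds
up failover but makes spurious failovers more likely on a congested backup
plane.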

Generally these planes are separated from each other, and for legacy telecom
applications each internal plane will have its own fixed or flexible IP
addressing plan.

For a VNF to work in HA mode across different OpenStack instances in one site
(though not limited to one site), at least the backup plane needs to be
supported across different OpenStack instances:

1) L2 networking across OpenStack instances for heartbeat or state replication.
Overlay L2 networking or shared L2 provider networks can work as the backup
plane for heartbeat or state replication. Overlay L2 networking is preferred,
for the following reasons:

   a. Legacy compatibility: some telecom apps have a built-in internal L2
      network; to make it easy to move these apps to VNFs, it is better to
      provide an L2 network.
   b. An isolated L2 network simplifies security management between
      different network planes.
   c. IP/MAC address floating across OpenStack instances is easy to support.
   d. IP overlapping is supported: multiple VNFs may have overlapping IP
      addresses for cross OpenStack instance networking.

Therefore, an overlay L2 networking across Neutron feature is required in
OpenStack.
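
Item d above can be illustrated with a small check: IP addresses only have to
be unique within one L2 segment, so VNFs attached to separate, isolated
segments may reuse the same addressing plan. The function
``overlapping_pairs`` and the segment names, VNF names and subnets used with
it are hypothetical.

.. code-block:: python

    import ipaddress
    from itertools import combinations


    def overlapping_pairs(attachments):
        """Find VNF pairs whose subnets overlap on the same L2 segment.

        attachments is a list of (segment, vnf, cidr) tuples. Overlap between
        different isolated segments is not reported, because addresses only
        need to be unique within one broadcast domain.
        """
        found = []
        for (seg_a, vnf_a, cidr_a), (seg_b, vnf_b, cidr_b) in combinations(
            attachments, 2
        ):
            if seg_a != seg_b:
                continue  # separate L2 segments: overlapping plans are harmless
            net_a = ipaddress.ip_network(cidr_a)
            net_b = ipaddress.ip_network(cidr_b)
            if net_a.overlaps(net_b):
                found.append((vnf_a, vnf_b))
        return found

Two VNFs that both use ``192.168.10.0/24`` on their own backup segments produce
no conflict; attaching both to one shared segment makes the same pair of
subnets collide.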

2) L3 networking across OpenStack instances for heartbeat or state replication.
For L3 networking, we can leverage the floating IPs provided by current
Neutron, or use VPN or BGPVPN (networking-bgpvpn) to set up the connection.

Using L3 networking to support VNF HA consumes more resources and requires more
security factors to be taken into consideration, which makes the networking
more complex. L3 networking also cannot provide IP floating across OpenStack
instances.

3) The IP address used by a VNF to connect with other VNFs should be able to
float across OpenStack instances. For example, if the master fails, the IP
address should move to the standby, which is running in another OpenStack
instance. Existing methods such as VRRP/GARP can help move the external IP, so
no new feature needs to be added to OpenStack for this.
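
For reference, the gratuitous ARP (GARP) announcement mentioned above is an ARP
request, broadcast on the L2 segment, whose sender and target IP addresses are
both the floating IP; neighbours holding an ARP entry for that IP update it to
the new owner's MAC address. The sketch below only builds the frame bytes with
the Python standard library. The function ``build_garp`` and the MAC and IP
values are hypothetical, and actually sending the frame needs a raw socket and
elevated privileges, which in practice an existing VRRP implementation such as
keepalived takes care of.

.. code-block:: python

    import socket
    import struct


    def build_garp(mac: str, ip: str) -> bytes:
        """Build a gratuitous ARP request announcing that `ip` is at `mac`."""
        src = bytes.fromhex(mac.replace(":", ""))
        addr = socket.inet_aton(ip)
        # Ethernet header: broadcast destination, our MAC, EtherType 0x0806 (ARP).
        ethernet = struct.pack("!6s6sH", b"\xff" * 6, src, 0x0806)
        arp = struct.pack(
            "!HHBBH6s4s6s4s",
            1,                  # hardware type: Ethernet
            0x0800,             # protocol type: IPv4
            6, 4,               # hardware / protocol address lengths
            1,                  # operation: request
            src, addr,          # sender: the new owner of the floating IP
            b"\x00" * 6, addr,  # target IP == sender IP makes it gratuitous
        )
        # 42 bytes, before padding to the 60-byte Ethernet minimum.
        return ethernet + arp

Because the announcement is an L2 broadcast, it only reaches hosts on the same
segment, which is one reason IP floating relies on L2 networking across the
OpenStack instances.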


Prototype
---------
    None.

Proposed solution
-----------------
Several projects are addressing the networking requirements:

* Tricircle: https://github.com/openstack/tricircle/
* Networking-BGPVPN: https://github.com/openstack/networking-bgpvpn/
* VPNaaS: https://github.com/openstack/neutron-vpnaas

Gaps
====
    Inter-networking among OpenStack clouds for application HA is lacking
    in Neutron, and is being covered by several newly created projects.

**NAME-THE-MODULE issues:**

* Neutron

Affected By
-----------
    OPNFV multisite cloud.

References
==========