This work is licensed under a Creative Commons Attribution 3.0 Unported License.
http://creativecommons.org/licenses/by/3.0/legalcode


=======================================
VNF high availability across VIM
=======================================

Problem description
===================

Abstract
------------

A VNF (telecom application) should be able to realize a high availability
deployment across OpenStack instances.

Description
------------
A VNF (a telecom application running over the cloud) may already be designed as
Active-Standby, Active-Active or N-Way to achieve high availability.

With a telecoms focus, high availability generally refers both to availability
of service (i.e. the ability to make new calls) and to maintenance of ongoing
control plane state and active media processing (i.e. "keeping up" existing
calls).

Traditionally, telecom systems are designed to maintain state and calls across
pretty much the full range of single-point failures. This includes hardware
failures (power supply, hard drive, physical server or network switch), but
also covers software failure and maintenance operations such as software
upgrade.

Providing this support typically requires state replication between application
instances (directly, via replicated database services, or via privately
designed message formats). It may also require special-case handling of media
endpoints, to allow transfer of media on short time scales (<1s) without
requiring end-to-end re-signalling (e.g. RTP redirection via IP / MAC address
transfers, cf. VRRP).

With a migration to NFV, a commonly expressed desire by carriers is to provide
the same resilience to any single point(s) of failure in the cloud
infrastructure.

This could be done by making each cloud instance fully HA (a non-trivial task
to do right and to prove it has been done right), but the preferred approach
appears to be to accept the currently limited availability of a given cloud
instance (there is no desire to radically rework this for telecoms), and
instead to provide solution availability by spreading function across multiple
cloud instances (i.e. the same approach used today to deal with hardware and
software failures).

A further advantage of this approach is that it provides a good basis for
seamless upgrade of the infrastructure software revision: you can spin up an
additional up-level cloud, gradually transfer resources / application instances
over from one of your other clouds, and finally turn down the old cloud
instance when it is no longer required.

If fast media / control failover is still required (which many, if not most,
carriers still seem to believe it is), there are some interesting/hard
requirements on the networking between cloud instances. To help with this, many
people appear willing to provide multiple "independent" cloud instances in a
single geographic site, with special networking between the clouds in that
physical site. "Independent" is in quotes because some coordination between
cloud instances is obviously required, but this has to be implemented in a
fashion which reduces the potential for correlated failure to very low levels
(at least as low as the required overall application availability).

Analysis of requirements on OpenStack
=====================================
A VNF often has different networking planes for different purposes:

* external network plane: used for communication with other VNFs
* components inter-communication plane: one VNF often consists of several
  components; this plane is designed for those components to communicate with
  each other
* backup plane: used for the heartbeat or state replication between the
  components of an active/standby, active/active or N-way cluster
* management plane: mainly used for management purposes

Generally these planes are separated from each other, and for a legacy telecom
application each internal plane will have its own fixed or flexible IP
addressing plan.
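
Purely as an illustration, the sketch below models these planes as separate
Neutron networks within one OpenStack instance, using the openstacksdk Python
library; the cloud name, network names and CIDRs are assumptions, not part of
this requirement.

.. code-block:: python

    import openstack

    # Connect to one OpenStack instance; the cloud name is an assumption,
    # taken from a local clouds.yaml entry.
    conn = openstack.connect(cloud='site-a')

    # Hypothetical per-plane networks and addressing plans.
    planes = {
        'vnf-external': '203.0.113.0/24',   # external network plane
        'vnf-internal': '192.168.10.0/24',  # components inter-communication plane
        'vnf-backup': '192.168.20.0/24',    # backup plane (heartbeat / state replication)
        'vnf-mgmt': '192.168.30.0/24',      # management plane
    }

    for name, cidr in planes.items():
        network = conn.network.create_network(name=name)
        conn.network.create_subnet(
            name=name + '-subnet',
            network_id=network.id,
            ip_version=4,
            cidr=cidr,
        )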

To make the VNF work in HA mode across different OpenStack instances in one
site (but not limited to one site), at least the backup plane needs to be
supported across the different OpenStack instances:

1) Overlay L2 networking or shared L2 provider networks as the backup plane
   for heartbeat or state replication. Overlay L2 networking is preferred,
   because it:

   a. supports legacy compatibility: some telecom applications are built with
      an internal L2 network; to make it easier to move these applications to
      VNFs, it is better to provide an L2 network.

   b. supports IP overlapping: multiple VNFs may have overlapping IP addresses
      for networking across OpenStack instances.

   Therefore, overlay L2 networking across Neutron instances is a required new
   feature in OpenStack.

2) L3 networking across OpenStack instances for heartbeat or state
   replication. For L3 networking we can leverage the floating IPs provided by
   the current Neutron, so there is no new feature requirement on OpenStack.

3) The IP address used by the VNF to connect with other VNFs should be able to
   float across OpenStack instances. For example, if the master fails, its IP
   address should be taken over by the standby running in another OpenStack
   instance. Methods such as VRRP/GARP can handle the movement of this
   external IP, so no new feature needs to be added to OpenStack (a sketch of
   the port configuration this relies on follows this list).
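
The following is a minimal sketch, using the openstacksdk Python library, of
the port configuration assumed by point 3: each HA member's external-plane
port is given an allowed address pair for the shared virtual IP, so that
anti-spoofing rules do not block a VRRP/GARP takeover. The cloud names, port
IDs and VIP are placeholders, not part of this requirement.

.. code-block:: python

    import openstack

    # The virtual IP moved by VRRP/GARP between OpenStack instances;
    # placeholder value.
    VNF_VIP = '203.0.113.100'

    # One connection per OpenStack instance hosting an HA member of the VNF.
    for cloud, port_id in [('site-a', 'PORT_ID_OF_ACTIVE_VM'),
                           ('site-b', 'PORT_ID_OF_STANDBY_VM')]:
        conn = openstack.connect(cloud=cloud)
        # Permit traffic to/from the shared VIP on this port, so the address
        # can fail over without being dropped by port security.
        conn.network.update_port(
            port_id,
            allowed_address_pairs=[{'ip_address': VNF_VIP}],
        )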


Prototype
-----------
    None.

Proposed solution
-----------------

From a requirements perspective, it is up to the application to decide whether
to use L2 or L3 networking across Neutron instances.

For Neutron, an L2 network consists of a number of ports. To make cross
Neutron L2 networking workable, we need some fake remote ports in the local
Neutron to represent VMs in the remote site (the remote OpenStack instance).

The fake remote port will reside on some VTEP (for VxLAN); the tunneling IP
address of the VTEP should be an attribute of the fake remote port, so that
local ports can forward packets to the correct tunneling endpoint.
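
As a minimal sketch only (assuming openstacksdk, and assuming purely for
illustration that the remote VTEP tunneling IP is carried in the port's
binding profile under a hypothetical key), creating such a fake remote port
through the Neutron API could look like this:

.. code-block:: python

    import openstack

    conn = openstack.connect(cloud='site-a')  # the local OpenStack instance

    # Hypothetical fake remote port representing a VM that actually runs in
    # the remote OpenStack instance. The binding profile key carrying the
    # remote VTEP tunneling IP is an assumption for illustration only.
    conn.network.create_port(
        network_id='LOCAL_BACKUP_PLANE_NETWORK_ID',
        name='fake-remote-port-vm1',
        mac_address='fa:16:3e:12:34:56',             # MAC of the remote VM port
        fixed_ips=[{'ip_address': '192.168.20.5'}],  # IP of the remote VM port
        binding_profile={'remote_vtep_ip': '198.51.100.10'},
    )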

The idea is to add one more ML2 mechanism driver to capture the CRUD (create,
retrieve, update, delete) operations on these fake remote ports.

When a fake remote port is added/updated/deleted, the ML2 mechanism driver for
these fake ports will activate L2 population, so that the VTEP tunneling
endpoint information can be learned by the other, local ports.
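
The skeleton below sketches what such a mechanism driver could look like,
based on the neutron-lib ML2 driver API. The driver name, the way fake remote
ports are recognised (a device_owner prefix) and the L2 population hook are
illustrative assumptions, not a worked-out design.

.. code-block:: python

    from neutron_lib.plugins.ml2 import api

    # Hypothetical device_owner prefix used to recognise fake remote ports;
    # an assumption for this sketch, not an agreed convention.
    FAKE_REMOTE_OWNER = 'remote:fake'


    class FakeRemotePortMechanismDriver(api.MechanismDriver):
        """Sketch of an ML2 mechanism driver reacting to fake remote port CRUD."""

        def initialize(self):
            # Set up whatever notifier is used to push L2 population (FDB)
            # updates to the L2 agents; left abstract in this sketch.
            pass

        def _is_fake_remote_port(self, port):
            return port.get('device_owner', '').startswith(FAKE_REMOTE_OWNER)

        def create_port_postcommit(self, context):
            if self._is_fake_remote_port(context.current):
                self._notify_l2pop(context.current, added=True)

        def update_port_postcommit(self, context):
            if self._is_fake_remote_port(context.current):
                self._notify_l2pop(context.current, added=True)

        def delete_port_postcommit(self, context):
            if self._is_fake_remote_port(context.current):
                self._notify_l2pop(context.current, added=False)

        def _notify_l2pop(self, port, added):
            # Placeholder: build and send the FDB add/remove entries for the
            # remote VTEP tunneling IP carried by the fake remote port, so
            # that local ports learn where to forward traffic for its MAC/IP.
            # The actual L2 population RPC plumbing is omitted here.
            pass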

It is also required to be able to query a port's VTEP tunneling endpoint
information through the Neutron API, in order to use this information to
create the corresponding fake remote port in the other Neutron.
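
To illustrate the query side, the sketch below reads the (assumed) VTEP
attribute of a port in one Neutron and mirrors it into the other Neutron as a
fake remote port. The binding-profile key, cloud names and IDs are
illustrative assumptions, pending the API change discussed below.

.. code-block:: python

    import openstack

    local = openstack.connect(cloud='site-a')
    remote = openstack.connect(cloud='site-b')

    # Query the real port in site-a, including its (assumed) VTEP attribute.
    port = local.network.get_port('REAL_PORT_ID_IN_SITE_A')
    vtep_ip = (port.binding_profile or {}).get('remote_vtep_ip')  # hypothetical key

    # Mirror it into site-b as a fake remote port on the peer backup-plane
    # network, carrying the VTEP tunneling IP learned from site-a.
    remote.network.create_port(
        network_id='BACKUP_PLANE_NETWORK_ID_IN_SITE_B',
        name='fake-' + port.name,
        mac_address=port.mac_address,
        fixed_ips=[{'ip_address': ip['ip_address']} for ip in port.fixed_ips],
        binding_profile={'remote_vtep_ip': vtep_ip},
    )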

In the past, a port's VTEP IP address has been the IP of the host where the VM
resides, but the BP https://review.openstack.org/#/c/215409/ will make the
port free of binding to the host IP as the tunneling endpoint; it will even be
possible to specify an L2GW IP address as the tunneling endpoint.

Therefore a new BP will be registered for processing the fake remote port, in
order to make cross Neutron L2 networking feasible. An RFE has been registered
first: https://bugs.launchpad.net/neutron/+bug/1484005


Gaps
====
    1) fake remote port for cross Neutron L2 networking


**Neutron issues:**

* Fake remote port for cross Neutron L2 networking is not yet supported.

Affected By
-----------
    OPNFV multisite cloud.

References
==========