Section6_VNF_HA.rst


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329

=======================
6 VNF High Availability
=======================


************************
6.1 Service Availability
************************

In the context of NFV, Service Availability refers to the End-to-End (E2E) Service
Availability which includes all the elements in the end-to-end service (VNFs and
infrastructure components) with the exception of the customer terminal such as
handsets, computers, modems, etc. The service availability requirements for NFV
should be the same as those for legacy systems (for the same service).

Service Availability =
total service available time /
(total service available time + total service recovery time)

The service recovery time among others depends on the number of redundant resources
provisioned and/or instantiated that can be used for restoring the service.

In the E2E relation a Network Service is available only of all the necessary
Network Functions are available and interconnected appropriately to collaborate
according to the NF chain.

General Service Availability Requirements
=========================================

* We need to be able to define the E2E (V)NF chain based on which the E2E availability
  requirements can be decomposed into requirements applicable to individual VNFs and
  their interconnections
* The interconnection of the VNFs should be logical and be maintained by the NFVI with
  guaranteed characteristics, e.g. in case of failure the connection should be
  restored within the acceptable tolerance time
* These characteristics should be maintained in VM migration, failovers and switchover,
  scale in/out, etc. scenarios
* It should be possible to prioritize the different network services and their VNFs.
  These priorities should be used when pre-emption policies are applied due to
  resource shortage for example.
* VIM should support policies to prioritize a certain VNF.
* VIM should be able to provide classified virtual resources to VNFs in different SAL

6.1.1 Service Availability Classification Levels
================================================

The [ETSI-NFV-REL_] defined three Service Availability Levels
(SAL) are classified in Table 1. They are based on the relevant ITU-T recommendations
and reflect the service types and the customer agreements a network operator should
consider.

.. [ETSI-NFV-REL] `ETSI GS NFV-REL 001 V1.1.1 (2015-01) <http://www.etsi.org/deliver/etsi_gs/NFV-REL/001_099/001/01.01.01_60/gs_NFV-REL001v010101p.pdf>`_


*Table 1: Service Availability classification levels*

+-------------+-----------------+-----------------------+---------------------+
|SAL Type     | Customer Type   |  Service/Function     |   Notes             |
+=============+=================+=======================+=====================+
|Level 1      | Network Operator|  * Intra-carrier      |   Sub-levels within |
|             | Control Traffic |    engineering        |   Level 1 may be    |
|             |                 |    traffic            |   created by the    |
|             | Government/     |  * Emergency          |   Network Operator  |
|             | Regulatory      |    telecommunication  |   depending on      |
|             | Emergency       |    service (emergency |   Customer demands  |
|             | Services        |    response, emergency|   E.g.:             |
|             |                 |    dispatch)          |                     |
|             |                 |  * Critical Network   |   * 1A - Control;   |
|             |                 |    Infrastructure     |   * 1B - Real-time; |
|             |                 |    Functions (e.g     |   * 1C - Data;      |
|             |                 |    VoLTE functions    |                     |
|             |                 |    DNS Servers,etc.)  |   May require 1+1   |
|             |                 |                       |   Redundancy with   |
|             |                 |                       |   Instantaneous     |
|             |                 |                       |   Switchover        |
+-------------+-----------------+-----------------------+---------------------+
|Level 2      | Enterprise and/ |  * VPN                |  Sub-levels within  |
|             | or large scale  |  * Real-time traffic  |  Level 2 may be     |
|             | customers       |    (Voice and video)  |  created  by the    |
|             | (e.g.           |  * Network            |  Network Operator   |
|             | Corporations,   |    Infrastructure     |  depending on       |
|             | University)     |    Functions          |  Customer demands.  |
|             |                 |    supporting Level   |  E.g.:              |
|             | Network         |    2 services (e.g.   |                     |
|             | Operators       |    VPN servers,       |  * 2A - VPN;        |
|             | (Tier1/2/3)     |    Corporate Web/     |  * 2B - Real-time;  |
|             | service traffic |    Mail servers)      |  * 2C - Data;       |
|             |                 |                       |                     |
|             |                 |                       |  May require 1:1    |
|             |                 |                       |  Redundancy with    |
|             |                 |                       |  Fast (maybe        |
|             |                 |                       |  Instantaneous)     |
|             |                 |                       |  Switchover         |
+-------------+-----------------+-----------------------+---------------------+
|Level 3      | General Consumer|  * Data traffic       |  While this is      |
|             | Public and ISP  |    (including voice   |  typically          |
|             | Traffic         |    and video traffic  |  considered to be   |
|             |                 |    provided by OTT)   |  "Best Effort"      |
|             |                 |  * Network            |  traffic, it is     |
|             |                 |    Infrastructure     |  expected that      |
|             |                 |    Functions          |  Network Operators  |
|             |                 |    supporting Level   |  will devote        |
|             |                 |    3 services         |  sufficient         |
|             |                 |                       |  resources to       |
|             |                 |                       |  assure             |
|             |                 |                       |  "satisfactory"     |
|             |                 |                       |  levels of          |
|             |                 |                       |  availability.      |
|             |                 |                       |  This level of      |
|             |                 |                       |  service may be     |
|             |                 |                       |  pre-empted by      |
|             |                 |                       |  those with         |
|             |                 |                       |  higher levels of   |
|             |                 |                       |  Service            |
|             |                 |                       |  Availability. May  |
|             |                 |                       |  require M+1        |
|             |                 |                       |  Redundancy with    |
|             |                 |                       |  Fast Switchover;   |
|             |                 |                       |  where M > 1 and    |
|             |                 |                       |  the value of M to  |
|             |                 |                       |  be determined by   |
|             |                 |                       |  further study      |
+-------------+-----------------+-----------------------+---------------------+

Requirements
^^^^^^^^^^^^

* It shall be possible to define different service availability levels
* It shall be possible to classify the virtual resources for the different
  availability class levels
* The VIM shall provide a mechanism by which VNF-specific requirements
  can be mapped to NFVI-specific capabilities.

More specifically, the requirements and capabilities may or may not be made up of the
same KPI-like strings, but the cloud administrator must be able to configure which
HA-specific VNF requirements are satisfied by which HA-specific NFVI capabilities.


6.1.2 Metrics for Service Availability
======================================

The [ETSI-NFV-REL_] identifies four metrics relevant to service
availability:

* Failure recovery time,
* Failure impact fraction,
* Failure frequency, and
* Call drop rate.

6.1.2.1 Failure Recovery Time
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The failure recovery time is the time interval from the occurrence of an abnormal
event (e.g. failure, manual interruption of service, etc.) until the recovery of the
service regardless if it is a scheduled or unscheduled abnormal event. For the
unscheduled case, the recovery time includes the failure detection time and the
failure restoration time.
More specifically restoration also allows for a service recovery by the restart of
the failed provider(s) while failover implies that the service is recovered by a
redundant provider taking over the service. This provider may be a standby
(i.e. synchronizing the service state with the active provider) or a spare
(i.e. having no state information). Accordingly failover also means switchover, that
is, an orederly takeover of the service from the active provider by the standby/spare.

Requirements
^^^^^^^^^^^^

* It should be irrelevant whether the abnormal event is due to a scheduled or
  unscheduled operation or it is caused by a fault.
* Failure detection mechanisms should be available in the NFVI and configurable so
  that the target recovery times can be met
* Abnormal events should be logged and communicated (i.e. notifications and alarms as
  appropriate)

The TL-9000 forum has specified a service interruption time of 15 seconds as outage
for all traditional telecom system services. [ETSI-NFV-REL_]
recommends the setting of different thresholds for the different Service Availability
Levels. An example setting is given in the following table 2. Note that for all
Service Availability levels Real-time Services require the fastest recovery time.
Data services can tolerate longer recovery times. These recovery times are applicable
to the user plane. A failure in the control plane does not have to impact the user plane.
The main concern should be simultaneous failures in the control and user planes
as the user plane cannot typically recover without the control plane. However an HA
mechanism in VNF itself can further mitigate the risk. Note also that the impact on
the user plane depends on the control plane service experiencing the failure,
some of them are more critical than others.


*Table 2: Example service recovery times for the service availability levels*

+------------+-----------------+------------------------------------------+
|SAL         |  Service        |  Notes                                   |
|            |  Recovery       |                                          |
|            |  Time           |                                          |
|            |  Threshold      |                                          |
+============+=================+==========================================+
|1           | 5 - 6 seconds   | Recommendation: Redundant resources to be|
|            |                 | made available on-site to  ensure fast   |
|            |                 | recovery.                                |
+------------+-----------------+------------------------------------------+
|2           | 10 - 15 seconds | Recommendation: Redundant resources to be|
|            |                 | available as a mix of on-site and off-   |
|            |                 | site as appropriate.                     |
|            |                 |                                          |
|            |                 |  * On-site resources to be utilized for  |
|            |                 |    recovery of real-time services.       |
|            |                 |  * Off-site resources to be utilized for |
|            |                 |    recovery of data services.            |
+------------+-----------------+------------------------------------------+
|3           | 20 - 25 seconds | Recommendation: Redundant resources to be|
|            |                 | mostly available off-site. Real-time     |
|            |                 | services should be recovered before data |
|            |                 | services                                 |
+------------+-----------------+------------------------------------------+


6.1.2.2 Failure Impact Fraction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The failure impact fraction is the maximum percentage of the capacity or user
population affected by a failure compared with the total capacity or the user
population supported by a service. It is directly associated with the failure impact
zone which is the set of resources/elements of the system to which the fault may
propagate.

Requirements
^^^^^^^^^^^^

* It should be possible to define the failure impact zone for all the elements of the
  system
* At the detection of a failure of an element, its failure impact zone must be
  isolated before the associated recovery mechanism is triggered
* If the isolation of the failure impact zone is unsuccessful the isolation should be
  attempted at the next higher level as soon as possible to prevent fault propagation.
* It should be possible to define different levels of failure impact zones with
  associated isolation and alarm generation policies
* It should be possible to limit the collocation of VMs to reduce the failure impact
  zone as well as to provide sufficient resources

6.1.2.3 Failure Frequency
^^^^^^^^^^^^^^^^^^^^^^^^^

Failure frequency is the number of failures in a certain period of time.

Requirements
^^^^^^^^^^^^

* There should be a probation period for each failure impact zones within which
  failures are correlated.
* The threshold and the probation period for the failure impact zones should be
  configurable
* It should be possible to define failure escalation policies for the different
  failure impact zones


6.1.2.4 Call Drop Rate
^^^^^^^^^^^^^^^^^^^^^^

Call drop rate reflects service continuity as well as system reliability and
stability. The metric is inside the VNF and therefore is not specified further for
the NFV environment.

Requirements
^^^^^^^^^^^^

* It shall be possible to specify for each service availability class the associated
  availability metrics and their thresholds
* It shall be possible to collect data for the defined metrics
* It shall be possible to delegate the enforcement of some thresholds to the NFVI
* Accordingly it shall be possible to request virtual resources with guaranteed
  characteristics, such as guaranteed latency between VMs (i.e. VNFCs), between a VM
  and storage, between VNFs


**********************
6.2 Service Continuity
**********************

The determining factor with respect to service continuity is the statefulness of the
VNF. If the VNF is stateless, there is no state information which needs to be
preserved to prevent the perception of service discontinuity in case of failure or
other disruptive events.
If the VNF is stateful, the NF has a service state which needs to be preserved
throughout such disruptive events in order to shield the service consumer from these
events and provide the perception of service continuity. A VNF may maintain this state
internally or externally or a combination with or without the NFVI being aware of the
purpose of the stored data.

Requirements
============

* The NFVI should maintain the number of VMs provided to the VNF in the face of
  failures. I.e. the failed VM instances should be replaced by new VM instances
* It should be possible to specify whether the NFVI or the VNF/VNFM handles the
  service recovery and continuity
* If the VNF/VNFM handles the service recovery it should be able to receive error
  reports and/or detect failures in a timely manner.
* The VNF (i.e. between VNFCs) may have its own fault detection mechanism, which might
  be triggered prior to receiving the error report from the underlying NFVI therefore
  the NFVI/VIM should not attempt to preserve the state of a failing VM if not
  configured to do so
* The VNF/VNFM should be able to initiate the repair/reboot of resources of the VNFI
  (e.g. to recover from a fault persisting at the VNF level => failure impact zone
  escalation)
* It should be possible to disallow the live migration of VMs and when it is allowed
  it should be possible to specify the tolerated interruption time.
* It should be possible to restrict the simultaneous migration of VMs hosting a given
  VNF
* It should be possible to define under which circumstances the NFV-MANO in
  collaboration with the NFVI should provide error handling (e.g. VNF handles local
  recoveries while NFV-MANO handles geo-redundancy)
* The NFVI/VIM should provide virtual resource such as storage according to the needs
  of the VNF with the required guarantees (see virtual resource classification).
* The VNF shall be able to define the information to be stored on its associated
  virtual storage
* It should be possible to define HA requirements for the storage, its availability,
  accessibility, resilience options, i.e. the NFVI shall handle the failover for the
  storage.
* The NFVI shall handle the network/connectivity failures transparent to the VNFs
* The VNFs with different requirements should be able to coexist in the NFV Framework
* The scale in/out is triggered by the VNF (VNFM) towards the VIM (to be executed in
  the NFVI)
* It should be possible to define the metrics to monitor and the related thresholds
  that trigger the scale in/out operation
* Scale in operation should not jeopardize availability (managed by the VNF/VNFM),
  i.e. resources can only be removed one at a time with a period in between sufficient
  for the VNF to restore any required redundancy.