|
General Requirements Background and Terminology
-----------------------------------------------
Terminologies and definitions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NFVI
The term is an abbreviation for Network Function Virtualization
Infrastructure; it is sometimes also referred to as the data plane in this
document.
VIM
The term is an abbreviation for Virtual Infrastructure Management;
it is sometimes also referred to as the control plane in this document.
Operator
The term refers to network service providers and Virtual Network
Function (VNF) providers.
End-User
The term refers to a subscriber of the Operator's services.
Network Service
The term refers to a service provided by an Operator to its
End-Users using a set of (virtualized) Network Functions.
Infrastructure Services
The term refers to services provided by the NFV Infrastructure and the
Management & Orchestration functions to the VNFs, i.e.
the virtual resources as perceived by the VNFs.
Smooth Upgrade
The term refers to an upgrade that results in no service outage
for the end-users.
Rolling Upgrade
The term refers to an upgrade strategy that upgrades each node or
a subset of nodes in a wave style rolling through the data centre. It
is a popular upgrade strategy to maintain service availability.
Parallel Universe
The term refers to an upgrade strategy that creates and deploys
a new universe - a system with the new configuration - while the old
system continues running. The state of the old system is transferred
to the new system after sufficient testing of the new system.
Infrastructure Resource Model
The term refers to the representation of infrastructure resources,
namely: the physical resources, the virtualization
facility resources and the virtual resources.
Physical Resource
The term refers to a hardware piece of the NFV infrastructure, which may
also include the firmware that enables the hardware.
Virtual Resource
The term refers to a resource that is provided as a service built on top
of the physical resources via the virtualization facilities; in particular,
virtual resources are the resources on which VNF entities are deployed, e.g.
VMs, virtual switches, virtual routers, virtual disks etc.
.. <MT> I don't think the VNF is the virtual resource. Virtual
resources are the VMs, virtual switches, virtual routers, virtual
disks etc. The VNF uses them, but I don't think they are equal. The
VIM doesn't manage the VNF, but it does manage virtual resources.
Virtualization Facility
The term refers to a resource that enables the creation
of virtual environments on top of the physical resources, e.g.
hypervisor, OpenStack, etc.
Upgrade Plan (or Campaign?)
The term refers to a choreography that describes how the upgrade should
be performed in terms of its targets (i.e. upgrade objects), the
steps/actions required for upgrading each, and the coordination of these
steps so that service availability can be maintained. It is an input to an
upgrade tool (Escalator) that carries out the upgrade.
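As an illustration of the Upgrade Plan definition above, the following is a minimal sketch of how targets, actions and their coordination could be represented. The class name, field names and node names are hypothetical and are not part of Escalator; the sketch also assumes one step per target.

```python
from dataclasses import dataclass, field


@dataclass
class UpgradeStep:
    """One action applied to one upgrade object (all names are illustrative)."""
    target: str                                      # upgrade object, e.g. "compute-1"
    action: str                                      # e.g. "upgrade-nova-compute"
    depends_on: list = field(default_factory=list)   # targets that must finish first


def order_steps(steps):
    """Return the steps in an order that respects depends_on (topological sort)."""
    by_target = {s.target: s for s in steps}
    done, ordered = set(), []

    def visit(step, trail=()):
        if step.target in done:
            return
        if step.target in trail:
            raise ValueError(f"circular dependency at {step.target}")
        for dep in step.depends_on:
            visit(by_target[dep], trail + (step.target,))
        done.add(step.target)
        ordered.append(step)

    for s in steps:
        visit(s)
    return ordered


plan = [
    UpgradeStep("compute-1", "upgrade-nova-compute", depends_on=["controller"]),
    UpgradeStep("controller", "upgrade-control-services"),
    UpgradeStep("compute-2", "upgrade-nova-compute", depends_on=["compute-1"]),
]
print([s.target for s in order_steps(plan)])
```

Expressing the coordination as explicit dependencies lets a tool derive a valid execution order instead of relying on the order in which the steps were written down.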
Upgrade Objects
~~~~~~~~~~~~~~~
Physical Resource
^^^^^^^^^^^^^^^^^
Most cloud infrastructures support the dynamic addition/removal of
hardware. A hardware upgrade could be done by adding the new
hardware node and removing the old one. From the perspective of smooth
upgrade, the orchestration/scheduling of these actions is the primary concern.
Upgrading a physical resource,
e.g. upgrading its firmware and/or modifying its configuration data, may
also be considered in the future.
Virtual Resources
^^^^^^^^^^^^^^^^^
Virtual resource upgrades are mainly done by users. OPNFV may facilitate
the activity, but it is suggested to put this on the long-term roadmap
rather than in the initial release.
.. <MT> same comment here: I don't think the VNF is the virtual
resource. Virtual resources are the VMs, virtual switches, virtual
routers, virtual disks etc. The VNF uses them, but I don't think they
are equal. For example if by some reason the hypervisor is changed and
the current VMs cannot be migrated to the new hypervisor, they are
incompatible, then the VMs need to be upgraded too. This is not
something the NFVI user (i.e. VNFs ) would even know about.
Virtualization Facility Resources
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Based on the functionality they provide, virtualization facility
resources could be divided into computing node, networking node,
storage node and management node.
The possible upgrade objects in these nodes are addressed below:
(Note: hardware based virtualization may be considered a virtualization
facility resource, but from the Escalator perspective it is better to
consider it as part of the hardware upgrade.)
**Computing node**
1. OS Kernel
2. Hypervisor and virtual switch
3. Other kernel modules, like drivers
4. User space software packages, like nova-compute agents and other
control plane programs.
Updating 1 and 2 will cause the loss of the virtualization functionality of
the compute node, which may lead to data plane services interruption
if the virtual resource is not redundant.
Updating 3 might result in the same.
Updating 4 might lead to control plane services interruption if not an
HA deployment.
**Networking node**
1. OS kernel (optional): not all switches/routers allow upgrading their
OS, since it is more like firmware than a generic OS.
2. User space software package, like neutron agents and other control
plane programs
Updating 1, if allowed, will cause a node reboot and therefore lead to
data plane service interruption if the virtual resource is not
redundant.
Updating 2 might lead to control plane services interruption if not an
HA deployment.
**Storage node**
1. OS kernel (optional): not all storage nodes allow upgrading their OS,
since it is more like firmware than a generic OS.
2. Kernel modules
3. User space software packages, control plane programs
Updating 1, if allowed, will cause a node reboot and therefore lead to
data plane services interruption if the virtual resource is not
redundant.
Updating 2 might result in the same.
Updating 3 might lead to control plane services interruption if not an
HA deployment.
**Management node**
1. OS Kernel
2. Kernel modules, like drivers
3. User space software packages, like database, message queue and
control plane programs.
Updating 1 will cause a node reboot and therefore lead to control
plane services interruption if not an HA deployment. Updating 2 might
result in the same.
Updating 3 might lead to control plane services interruption if not an
HA deployment.
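The impact statements in the node lists above can be collected into a lookup table. The following is only a sketch: the key names and the `plane_at_risk` helper are hypothetical, and the boolean records whether the text says the update "will" (True) or "might" (False) cause the effect.

```python
# (node type, upgrade object) -> (plane at risk, stated as certain?)
# Transcribed from the lists above; key names are illustrative only.
IMPACT = {
    ("computing", "os-kernel"): ("data", True),
    ("computing", "hypervisor"): ("data", True),
    ("computing", "kernel-modules"): ("data", False),
    ("computing", "user-space"): ("control", False),
    ("networking", "os-kernel"): ("data", True),
    ("networking", "user-space"): ("control", False),
    ("storage", "os-kernel"): ("data", True),
    ("storage", "kernel-modules"): ("data", False),
    ("storage", "user-space"): ("control", False),
    ("management", "os-kernel"): ("control", True),
    ("management", "kernel-modules"): ("control", False),
    ("management", "user-space"): ("control", False),
}


def plane_at_risk(node_type, upgrade_object, protected):
    """Return the plane that may be interrupted, or None.

    `protected` means the virtual resource is redundant (data plane case)
    or the deployment is HA (control plane case).
    """
    plane, _certain = IMPACT[(node_type, upgrade_object)]
    return None if protected else plane


print(plane_at_risk("computing", "hypervisor", protected=False))
print(plane_at_risk("management", "os-kernel", protected=True))
```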
Upgrade Span
~~~~~~~~~~~~
**Major Upgrade**
Upgrades between major releases may introduce significant changes in
function, configuration and data, such as the upgrade of OPNFV from
Arno to Brahmaputra.
**Minor Upgrade**
Upgrades within one major release, which do not change
the structure of the platform and may not affect the schema of the
system data.
Upgrade Granularity
~~~~~~~~~~~~~~~~~~~
Physical/Hardware Dimension
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Support full / partial upgrade for data centre, cluster, zone. The
upgrade of a data centre or a zone may be divided into
several batches. The upgrade of a cloud environment (cluster) may also
be partial. For example, in a cloud environment running a number of
VNFs, we may first upgrade just one of them to check the stability and
performance before we upgrade all of them.
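The batching described above, with a small trial step before the rest, could look roughly like the sketch below. The function name, the `trial` parameter and the node names are assumptions made for illustration.

```python
def batches(nodes, batch_size, trial=1):
    """Yield a small trial batch first, then the rest in fixed-size batches."""
    nodes = list(nodes)
    if trial:
        yield nodes[:trial]          # upgrade a few first to check stability
    rest = nodes[trial:]
    for i in range(0, len(rest), batch_size):
        yield rest[i:i + batch_size]


for batch in batches(["n1", "n2", "n3", "n4", "n5", "n6"], batch_size=2):
    print(batch)
```

Stability and performance checks would run between batches; a failed check after the trial batch stops the upgrade before most of the zone is touched.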
Software Dimension
^^^^^^^^^^^^^^^^^^
- The upgrade of the host OS or kernel may need a 'hot migration'
- The upgrade of OpenStack’s components
i. the one-shot upgrade of all components
ii. the partial upgrade (or bugfix patch), which only affects some
components (e.g., computing, storage, network, database, message
queue, etc.)
.. <MT> this section seems to overlap with 2.1.
I can see the following dimensions for the software.
.. <MT> different software packages
.. <MT> different functions - Considering that the target versions of all
software are compatible the upgrade needs to ensure that any
dependencies between SW and therefore packages are taken into account
in the upgrade plan, i.e. no version mismatch occurs during the
upgrade therefore dependencies are not broken
.. <MT> same function - This is an upgrade specific question if different
versions can coexist in the system when a SW is being upgraded from
one version to another. This is particularly important for stateful
functions e.g. storage, networking, control services. The upgrade
method must consider the compatibility of the redundant entities.
.. <MT> different versions of the same software package
.. <MT> major version changes - they may introduce incompatibilities. Even
when there are backward compatibility requirements changes may cause
issues at graceful roll-back
.. <MT> minor version changes - they must not introduce incompatibility
between versions, these should be primarily bug fixes, so live
patches should be possible
.. <MT> different installations of the same software package
.. <MT> using different installation options - they may reflect different
users with different needs so redundancy issues are less likely
between installations of different options; but they could be the
reflection of the heterogeneous system in which case they may provide
redundancy for higher availability, i.e. deeper inspection is needed
.. <MT> using the same installation options - they often reflect that they are
used by redundant entities across space
.. <MT> different distribution possibilities in space - same or different
availability zones, multi-site, geo-redundancy
.. <MT> different entities running from the same installation of a software
package
.. <MT> using different start-up options - they may reflect different users so
redundancy may not be an issue between them
.. <MT> using same start-up options - they often reflect redundant
entities
Upgrade duration
~~~~~~~~~~~~~~~~
As the OPNFV end-users are primarily Telecom operators, the network
services provided by the VNFs deployed on the NFVI should meet the
requirement of 'Carrier Grade'.::
In telecommunication, a "carrier grade" or "carrier class" refers to a
system, or a hardware or software component that is extremely reliable,
well tested and proven in its capabilities. Carrier grade systems are
tested and engineered to meet or exceed "five nines" high availability
standards, and provide very fast fault recovery through redundancy
(normally less than 50 milliseconds). [from wikipedia.org]
"Five nines" means at most 5'15" of downtime in one year.
::
We have learnt that a well prepared upgrade of OpenStack needs 10
minutes. Most of the outage time is spent on
synchronizing the database. [from 'Ten minutes OpenStack Upgrade? Done!'
by Symantec]
These 10 minutes of downtime of the OpenStack services, however, did not impact the
users, i.e. the VMs running on the compute nodes. This was an outage of
the control plane only. On the other hand, with respect to the
preparations, this was a manually tailored upgrade specific to the
particular deployment and the versions of each OpenStack service.
The project aims to achieve a more generic methodology, which however
requires that the upgrade objects fulfil certain requirements. Since
this is only possible in the long run, we first target the upgrade
of the different VIM services from version to version.
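The "five nines" figure quoted above can be checked with a short calculation. This is a sketch that assumes a 365-day year; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def yearly_downtime_seconds(availability):
    """Downtime allowed per year for a given availability (e.g. 0.99999)."""
    return (1 - availability) * SECONDS_PER_YEAR


budget = yearly_downtime_seconds(0.99999)
minutes, seconds = divmod(int(budget), 60)
print(f"{minutes}'{seconds:02d}\"")   # prints 5'15"

# The 10-minute control plane outage mentioned above, compared to that budget:
print(10 * 60 > budget)               # prints True
```

A single 10-minute outage therefore already exceeds the yearly five-nines budget, which is why it matters whether the outage affects only the control plane or also the users.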
**Questions:**
1. Can we manage to upgrade OPNFV in only 5 minutes?
.. <MT> The first question is whether we have the same carrier grade
requirement on the control plane as on the user plane. I.e. how
much control plane outage can we tolerate, or are we willing to tolerate?
In the above case, if the database is only half the size, we can
probably do the upgrade in 5 minutes, but is that good? It also means
that if the database is twice as large then the outage is 20
minutes.
For the user plane we should go for less, as with two releases yearly
that means 10 minutes outage per year.
.. <Malla> 10 minutes outage per year to the users? Plus, if we take
control plane into the consideration, then total outage will be
more than 10 minute in whole network, right?
.. <MT> The control plane outage does not have to cause outage to
the users, but it may of course depending on the size of the system
as it's more likely that there's a failure that needs to be handled
by the control plane.
2. Is it acceptable for end users, e.g. a planned service
interruption lasting more than ten minutes for a software
upgrade?
.. <MT> For user plane, no it's not acceptable in case of
carrier-grade. The 5' 15" downtime should include unplanned and
planned downtimes.
.. <Malla> I agree with Maria, it is not acceptable.
3. Will any VNFs still work well when the VIM is down?
.. <MT> In case of OpenStack it seems yes. .:)
The maximum duration of an upgrade
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The duration of an upgrade is related and proportional to the
scale and the complexity of the OPNFV platform as well as the
granularity (in function and in space) of the upgrade.
.. <Malla> Also, if it is a partial upgrade like a module upgrade, it depends
also on the OPNFV modules and their tightly connected entities.
.. <MT> Since the maintenance window is shrinking and becoming non-existent
the duration of the upgrade is secondary to the requirement of smooth upgrade.
But probably we want to be able to put a time constraint on each upgrade
during which it must complete otherwise it is considered failed and the system
should be rolled back. I.e. in case of automatic execution it might not be clear
if an upgrade is long or just hanging. The time constraints may be a function
of the size of the system in terms of the upgrade object(s).
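The time constraint suggested in the comment above could be sketched as follows. The linear formula and its default values are purely hypothetical, and the injectable `clock` parameter exists only to make the sketch testable.

```python
import time


def time_limit_seconds(object_count, per_object=60.0, base=300.0):
    """Hypothetical limit that grows linearly with the number of upgrade objects."""
    return base + per_object * object_count


def run_with_deadline(steps, limit, clock=time.monotonic):
    """Run the steps in order; report failure once the time limit is exceeded.

    The check happens between steps, so a single hanging step would still
    need an external watchdog.
    """
    start = clock()
    executed = []
    for step in steps:
        if clock() - start > limit:
            return "failed", executed    # caller is expected to roll back
        step()
        executed.append(step)
    return "completed", executed


status, done = run_with_deadline([lambda: None, lambda: None],
                                 limit=time_limit_seconds(2))
print(status, len(done))
```

Returning the list of executed steps gives the caller what it needs to roll back only the part of the upgrade that actually ran.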
The maximum duration of a roll back when an upgrade is failed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The duration of a roll back is shorter than that of the corresponding upgrade. It
depends on the time needed to restore the software and configuration data from
the pre-upgrade backup / snapshot.
.. <MT> During the upgrade process two types of failure may happen:
In case we can recover from the failure by undoing the upgrade
actions it is possible to roll back the already executed part of the
upgrade in a graceful manner, introducing no more service outage than
what was introduced during the upgrade. Such a graceful roll back
typically requires the same amount of time as the executed portion of
the upgrade and imposes minimal state/data loss.
.. <MT> Requirement: It should be possible to roll back gracefully the
failed upgrade of stateful services of the control plane.
In case we cannot recover from the failure by just undoing the
upgrade actions, we have to restore the upgraded entities from their
backed up state. In other terms the system falls back to an earlier
state, which is typically a faster recovery procedure than graceful
roll back and depending on the statefulness of the entities involved it
may result in significant state/data loss.
.. <MT> Two possible types of failures can happen during an upgrade
.. <MT> We can recover from the failure that occurred in the upgrade process:
In this case, a graceful rolling back of the executed part of the
upgrade may be possible which would "undo" the executed part in a
similar fashion. Thus, such a roll back introduces no more service
outage during an upgrade than the executed part introduced. This
process typically requires the same amount of time as the executed
portion of the upgrade and imposes minimal state/data loss.
.. <MT> We cannot recover from the failure that occurred in the upgrade
process: In this case, the system needs to fall back to an earlier
consistent state by reloading this backed-up state. This is typically
a faster recovery procedure than the graceful roll back, but can cause
state/data loss. The state/data loss usually depends on the
statefulness of the entities whose state is restored from the backup.
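The two recovery paths discussed above, graceful roll back and restoring from backup, can be sketched as one function that tries the first and falls back to the second. All names here are hypothetical; real undo actions and backup handling would be specific to each upgrade object.

```python
def recover(executed, undo, restore_from_backup):
    """Try a graceful roll back; fall back to the backup if an undo fails.

    executed: names of the steps already applied, in execution order
    undo: maps a step name to a callable that reverses that step
    restore_from_backup: callable that reloads the pre-upgrade state
    """
    try:
        for name in reversed(executed):   # undo in reverse order of execution
            undo[name]()
        return "rolled-back"
    except Exception:
        restore_from_backup()             # faster, but may lose state/data
        return "restored"


log = []
undo_actions = {
    "upgrade-db": lambda: log.append("undo upgrade-db"),
    "upgrade-api": lambda: log.append("undo upgrade-api"),
}
print(recover(["upgrade-db", "upgrade-api"], undo_actions,
              lambda: log.append("restore")))
print(log)
```

A step with no registered undo action raises `KeyError` in this sketch, which is treated like any other failed undo and triggers the restore path.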
The maximum duration of a VNF interruption (Service outage)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Since not every part of a smooth upgrade affects the VNFs,
the duration of the VNF interruption may be shorter than the duration
of the upgrade. In some cases, it is acceptable for the VNF to run
without the control of the VIM.
.. <MT> Should we require explicitly that the NFVI should be able to
provide its services to the VNFs independent of the control plane?
.. <MT> Requirement: The upgrade of the control plane must not cause
interruption of the NFVI services provided to the VNFs.
.. <MT> With respect to carrier-grade the yearly service outage of the
VNF should not exceed 5' 15" regardless of whether it is planned or
unplanned outage. Considering the HA requirements TL-9000 requires an
end-to-end service recovery time of 15 seconds based on which the ETSI
GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
availability levels (SAL). The proposed example service recovery times
for these levels are:
.. <MT> SAL1: 5-6 seconds
.. <MT> SAL2: 10-15 seconds
.. <MT> SAL3: 20-25 seconds
.. <Pva> my comment was actually that the downtime metrics of the
underlying elements, components and services are small fraction of the
total E2E service availability time. No-one on the E2E service path
will get the whole downtime allocation (in this context it includes
upgrade process related outages for the services provided by VIM etc.
elements that are subject to upgrade process).
.. <MT> So what you are saying is that the upgrade of any entity
(component, service) shouldn't cause even this much service
interruption. This was the reason I brought these figures here as well
that they are posing some kind of upper-upper boundary. Ideally the
interruption is in the millisecond range i.e. no more than a
switch-over or a live migration.
.. <MT> Requirement: Any interruption caused to the VNF by the upgrade
of the NFVI should be in the sub-second range.
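The figures in the comments above can be turned into a simple check of a measured interruption. This sketch takes the upper end of each example SAL recovery range quoted above and reads "sub-second" as strictly less than one second; the names are illustrative.

```python
# Upper end of the example service recovery times quoted above, in seconds.
SAL_RECOVERY_LIMIT = {1: 6, 2: 15, 3: 25}
UPGRADE_INTERRUPTION_LIMIT = 1.0  # "sub-second" requirement, read as < 1 s


def check_interruption(seconds, sal):
    """Compare a measured VNF interruption against both limits."""
    return {
        "within_sal": seconds <= SAL_RECOVERY_LIMIT[sal],
        "within_upgrade_requirement": seconds < UPGRADE_INTERRUPTION_LIMIT,
    }


print(check_interruption(0.2, sal=1))
print(check_interruption(3.0, sal=1))
```

The second call shows the point made in the discussion: an interruption can stay within the SAL recovery time and still be too long for an upgrade-induced outage.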
.. <MT> In the future we also need to consider the upgrade of the NFVI,
i.e. HW, firmware, hypervisors, host OS etc.
|