From f24247660e9f6539737d460c59ab66ec2068333b Mon Sep 17 00:00:00 2001
From: zhang-jun3g
Date: Mon, 9 Nov 2015 16:31:16 +0800
Subject: Move files from doc to docs

Move files to docs for automation html release.

JIRA:ESCALATOR-27

Change-Id: I3654b18ad6c7fc94614fd55afe5e3140bf467752
Signed-off-by: zhang-jun3g
---
 docs/02-Background_and_Terminologies.rst | 517 +++++++++++++++++++++++++++++++
 1 file changed, 517 insertions(+)
 create mode 100644 docs/02-Background_and_Terminologies.rst

(limited to 'docs/02-Background_and_Terminologies.rst')

diff --git a/docs/02-Background_and_Terminologies.rst b/docs/02-Background_and_Terminologies.rst
new file mode 100644
index 0000000..36a81f2
--- /dev/null
+++ b/docs/02-Background_and_Terminologies.rst
@@ -0,0 +1,517 @@
+General Requirements Background and Terminology
+-----------------------------------------------
+
+Terminologies and definitions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+NFVI
+  The term is an abbreviation for Network Function Virtualization
+  Infrastructure; sometimes it is also referred to as the data plane in this
+  document.
+
+VIM
+  The term is an abbreviation for Virtual Infrastructure Management;
+  sometimes it is also referred to as the control plane in this document.
+
+Operator
+  The term refers to network service providers and Virtual Network
+  Function (VNF) providers.
+
+End-User
+  The term refers to a subscriber of the Operator's services.
+
+Network Service
+  The term refers to a service provided by an Operator to its
+  End-users using a set of (virtualized) Network Functions.
+
+Infrastructure Services
+  The term refers to services provided by the NFV Infrastructure and the
+  Management & Orchestration functions to the VNFs, i.e.
+  these are the virtual resources as perceived by the VNFs.
+
+Smooth Upgrade
+  The term refers to an upgrade that results in no service outage
+  for the end-users.
+
+Rolling Upgrade
+  The term refers to an upgrade strategy that upgrades each node or
+  a subset of nodes in a wave style rolling through the data centre. It
+  is a popular upgrade strategy to maintain service availability.
+
+Parallel Universe Upgrade
+  The term refers to an upgrade strategy that creates and deploys
+  a new universe - a system with the new configuration - while the old
+  system continues running. The state of the old system is transferred
+  to the new system after sufficient testing of the new system.
+
+Infrastructure Resource Model
+  The term refers to the representation of infrastructure resources,
+  namely: the physical resources, the virtualization
+  facility resources and the virtual resources.
+
+Physical Resource
+  The term refers to a hardware piece of the NFV infrastructure, which may
+  also include the firmware which enables the hardware.
+
+Virtual Resource
+  The term refers to a resource which is provided as a service built on top
+  of the physical resources via the virtualization facilities; in particular,
+  virtual resources are the resources on which VNF entities are deployed, e.g.
+  the VMs, virtual switches, virtual routers, virtual disks etc.
+
+Virtualization Facility
+  The term refers to a resource that enables the creation
+  of virtual environments on top of the physical resources, e.g.
+  hypervisor, OpenStack, etc.
+
+Upgrade Campaign
+  The term refers to a choreography that describes how the upgrade should
+  be performed in terms of its targets (i.e. upgrade objects), the
+  steps/actions required for upgrading each, and the coordination of these
+  steps so that service availability can be maintained. It is an input to an
+  upgrade tool (Escalator) to carry out the upgrade.
+
+Upgrade Duration
+  The duration of an upgrade characterized by the time elapsed between its
+  initiation and its completion. E.g. from the moment the execution of an
+  upgrade campaign has started until it has been committed. Depending on
+  the upgrade method and its target, some parts of the system may be in a more
+  vulnerable state during this period.
+
+Outage
+  The period of time during which a given service is not provided is referred
+  to as the outage of that given service. If a subsystem or the entire system
+  does not provide any service, it is the outage of the given subsystem or the
+  system. Smooth upgrade means upgrade with no outage for the user plane, i.e.
+  no VNF should experience service outage.
+
+Rollback
+  The term refers to a failure handling strategy that reverts the changes
+  done by a potentially failed upgrade execution one by one in reverse order.
+  I.e. it is like undoing the changes done by the upgrade.
+
+Restore
+  The term refers to a failure handling strategy that reverts the changes
+  done by an upgrade by restoring the system from some backup data. This
+  results in the loss of any data persisted since the backup was taken.
+
+Rollforward
+  The term refers to a failure handling strategy applied after a restore
+  (from a backup) operation to recover any loss of data persisted between
+  the time the backup was taken and the moment it is restored. Rollforward
+  requires that data that needs to survive the restore operation is logged at
+  a location not impacted by the restore so that it can be re-applied to the
+  system after its restoration from the backup.
+
+Downgrade
+  The term refers to an upgrade in which an earlier version of the software
+  is restored through the upgrade procedure. A system can be downgraded to any
+  earlier version and the compatibility of the versions will determine the
+  applicable upgrade strategies and whether service outage can be avoided.
+  In particular any data conversion needs special attention.
+
+
+
+Upgrade Objects
+~~~~~~~~~~~~~~~
+
+Physical Resource
+^^^^^^^^^^^^^^^^^
+
+Most cloud infrastructures support the dynamic addition/removal of
+hardware. Accordingly a hardware upgrade could be done by adding the new
+piece of hardware and removing the old one. From the perspective of smooth
+upgrade the orchestration/scheduling of these actions is the primary concern.
+Upgrading a physical resource may also involve the upgrade of its firmware
+and/or modifying its configuration data. This may require the restart of the
+hardware.
+
+
+
+Virtual Resources
+^^^^^^^^^^^^^^^^^
+
+Addition and removal of virtual resources may be initiated by the users or be
+a result of an elasticity action. Users may also request the upgrade of their
+virtual resources using a new VM image.
+
+.. Needs to be moved to requirement section: Escalator should facilitate such an
+   option and allow for a smooth upgrade.
+
+On the other hand changes in the infrastructure, namely, in the hardware and/or
+the virtualization facility resources may result in the upgrade of the virtual
+resources. For example if for some reason the hypervisor is changed and
+the current VMs cannot be migrated to the new hypervisor - they are
+incompatible - then the VMs need to be upgraded too. This is not
+something the NFVI user (i.e. the VNFs) would know about. In such cases
+smooth upgrade is essential.
+
+
+Virtualization Facility Resources
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Based on the functionality they provide, virtualization facility
+resources could be divided into computing node, networking node,
+storage node and management node.
+
+The possible upgrade objects in these nodes are addressed below:
+(Note: hardware based virtualization may be considered as a virtualization
+facility resource, but from the Escalator perspective, it is better to
+consider it as part of the hardware upgrade.)
+
+**Computing node**
+
+1. OS Kernel
+
+2. Hypervisor and virtual switch
+
+3. Other kernel modules, like drivers
+
+4. User space software packages, like nova-compute agents and other
+   control plane programs.
+
+Updating 1 and 2 will cause the loss of virtualization functionality of
+the compute node, which may lead to data plane service interruption
+if the virtual resource is not redundant.
+
+Updating 3 might result in the same.
+
+Updating 4 might lead to control plane service interruption if not an
+HA deployment.
+
+**Networking node**
+
+1. OS kernel, optional, not all switches/routers allow the upgrade of their
+   OS since it is more like a firmware than a generic OS.
+
+2. User space software package, like neutron agents and other control
+   plane programs
+
+Updating 1 if allowed will cause a node reboot and therefore leads to
+data plane service interruption if the virtual resource is not
+redundant.
+
+Updating 2 might lead to control plane service interruption if not an
+HA deployment.
+
+**Storage node**
+
+1. OS kernel, optional, not all storage nodes allow the upgrade of their OS
+   since it is more like a firmware than a generic OS.
+
+2. Kernel modules
+
+3. User space software packages, control plane programs
+
+Updating 1 if allowed will cause a node reboot and therefore leads to
+data plane service interruption if the virtual resource is not
+redundant.
+
+Updating 2 might result in the same.
+
+Updating 3 might lead to control plane service interruption if not an
+HA deployment.
+
+**Management node**
+
+1. OS Kernel
+
+2. Kernel modules, like drivers
+
+3. User space software packages, like database, message queue and
+   control plane programs.
+
+Updating 1 will cause a node reboot and therefore leads to control
+plane service interruption if not an HA deployment. Updating 2 might
+result in the same.
+
+Updating 3 might lead to control plane service interruption if not an
+HA deployment.
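The per-node lists above can be read as a lookup from (node type, upgrade object) to the plane whose services may be interrupted. The Python sketch below is illustrative only: the names `UPGRADE_IMPACT` and `interruption_risk`, the string labels, and the `redundant` / `ha` flags are assumptions made for this example and are not part of Escalator or OpenStack. It only encodes the statements above: data plane objects are a risk when the virtual resource is not redundant, and control plane objects are a risk when the deployment is not HA.

```python
# Illustrative model of the upgrade objects listed above (hypothetical names).
# Each entry maps (node type, upgrade object) to the plane whose services
# may be interrupted while that object is upgraded.
UPGRADE_IMPACT = {
    ("computing", "os_kernel"): "data",
    ("computing", "hypervisor_and_vswitch"): "data",
    ("computing", "kernel_modules"): "data",
    ("computing", "user_space_packages"): "control",
    ("networking", "os_kernel"): "data",
    ("networking", "user_space_packages"): "control",
    ("storage", "os_kernel"): "data",
    ("storage", "kernel_modules"): "data",
    ("storage", "user_space_packages"): "control",
    ("management", "os_kernel"): "control",
    ("management", "kernel_modules"): "control",
    ("management", "user_space_packages"): "control",
}


def interruption_risk(node, upgrade_object, redundant=False, ha=False):
    """Return the plane at risk of interruption, or None if it is protected.

    redundant: the virtual resources hosted on the node are redundant.
    ha: the control plane is an HA deployment.
    """
    plane = UPGRADE_IMPACT[(node, upgrade_object)]
    if plane == "data" and redundant:
        return None
    if plane == "control" and ha:
        return None
    return plane
```

For example, `interruption_risk("computing", "os_kernel")` reports a data plane risk, while the same call with `redundant=True` reports none. Keeping the table as plain data makes it easy to review against the lists above.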
+
+
+
+
+
+Upgrade Granularity
+~~~~~~~~~~~~~~~~~~~
+
+The granularity of an upgrade can be characterized from two perspectives:
+- the physical dimension and
+- the software dimension
+
+
+Physical Dimension
+^^^^^^^^^^^^^^^^^^
+
+The physical dimension characterizes the number of similar upgrade objects
+targeted by the upgrade, i.e. whether it is a full or partial upgrade of a
+data centre, cluster, zone.
+Because of the scale of a data centre or a zone, its upgrade may be divided into
+several batches. Thus there is a need for efficiency in the execution of
+upgrades of a potentially huge number of upgrade objects while still maintaining
+availability to fulfill the requirement of smooth upgrade.
+
+The upgrade of a cloud environment (cluster) may also
+be partial. For example, in one cloud environment running a number of
+VNFs, we may just try to upgrade one of them to check the stability and
+performance, before we upgrade all of them.
+Thus there is a need for proper organization of the artifacts associated with
+the different upgrade objects. Also the different versions should be able
+to coexist beyond the upgrade period.
+
+From this perspective special attention may be needed when upgrading
+objects that are collaborating in a redundancy schema as in this case
+different versions not only need to coexist but also collaborate. This
+puts requirements primarily on the upgrade objects. If this is not possible
+the upgrade campaign should be designed in such a way that the proper
+isolation is ensured.
+
+Software Dimension
+^^^^^^^^^^^^^^^^^^
+
+The software dimension of the upgrade characterizes the upgrade object
+type targeted and the combination in which they are upgraded together.
+
+Even though the upgrade may
+initially target only one type of upgrade object, e.g. the hypervisor,
+the dependency of other upgrade objects on this initial target object may
+require their upgrade as well. I.e. the upgrades need to be combined. From this
+perspective the main concern is compatibility of the dependent and
+sponsor objects. To take these dependencies into consideration,
+they need to be described together with the version compatibility information.
+Breaking dependencies is the major cause of outages during upgrades.
+
+In other cases it is more efficient to upgrade a combination of upgrade
+objects than to do it one by one. One aspect of the combination is how
+the upgrade packages can be combined, whether a new image can be created for
+them beforehand or the different packages can be installed during the upgrade
+independently, but activated together.
+
+The combination of upgrade objects may span across
+layers (e.g. software stack in the host and the VM of the VNF).
+Thus, it may require additional coordination between the management layers.
+
+With respect to each upgrade object type and even stacks we can
+distinguish major and minor upgrades:
+
+**Major Upgrade**
+
+Upgrades between major releases may introduce significant changes in
+function, configuration and data, such as the upgrade of OPNFV from
+Arno to Brahmaputra.
+
+**Minor Upgrade**
+
+Upgrades within one major release, which would not lead to changing
+the structure of the platform and may not affect the schema of the
+system data.
+
+Scope of Impact
+~~~~~~~~~~~~~~~
+
+Considering availability and therefore smooth upgrade, one of the major
+concerns is the predictability and control of the outcome of the different
+upgrade operations. Ideally an upgrade can be performed without impacting any
+entity in the system, which means none of the operations change or potentially
+change the behaviour of any entity in the system in an uncontrolled manner.
+Accordingly the operations of such an upgrade can be performed any time while
+the system is running, while all the entities are online. No entity needs to be
+taken offline to avoid such adverse effects. Hence such upgrade operations
+are referred to as online operations. The effects of the upgrade might be activated
+the next time the entity is used, or may require a special activation action such as a
+restart. Note that the activation action provides more control and predictability.
+
+If an entity's behavior in the system may change due to the upgrade it may
+be better to take it offline for the time of the relevant upgrade operations.
+The main question, however, is which hosted entities are impacted, considering
+the hosting relations of an upgrade object. Accordingly we can identify a scope
+which is impacted by taking the given upgrade object offline. The entities
+that are in the scope of impact may need to be taken offline or moved out of
+this scope i.e. migrated.
+
+If the impacted entity is in a different layer managed by another manager
+this may require coordination, because the infrastructure resources taken
+out of service for the time of their upgrade support virtual
+resources used by VNFs that should not experience outages. The hosted VNFs
+may or may not allow for the hot migration of their VMs. In case of migration
+the VM placement policy should be considered.
+
+
+
+Upgrade duration
+~~~~~~~~~~~~~~~~
+
+As the OPNFV end-users are primarily Telecom operators, the network
+services provided by the VNFs deployed on the NFVI should meet the
+requirement of 'Carrier Grade'. ::
+
+    In telecommunication, a "carrier grade" or "carrier class" refers to a
+    system, or a hardware or software component that is extremely reliable,
+    well tested and proven in its capabilities. Carrier grade systems are
+    tested and engineered to meet or exceed "five nines" high availability
+    standards, and provide very fast fault recovery through redundancy
+    (normally less than 50 milliseconds). [from wikipedia.org]
+
+"Five nines" (99.999% availability) means working all the time in ONE YEAR except 5'15".
+
+::
+
+    We have learnt that a well prepared upgrade of OpenStack needs 10
+    minutes. The major time slot in the outage time is spent on
+    synchronizing the database. [from 'Ten minutes OpenStack Upgrade? Done!'
+    by Symantec]
+
+These 10 minutes of downtime of the OpenStack services however did not impact the
+users, i.e. the VMs running on the compute nodes. This was the outage of
+the control plane only. On the other hand with respect to the
+preparations this was a manually tailored upgrade specific to the
+particular deployment and the versions of each OpenStack service.
+
+The project targets to achieve a more generic methodology, which however
+requires that the upgrade objects fulfil certain requirements. Since
+this is only possible in the long run we target first the upgrade
+of the different VIM services from version to version.
+
+**Questions:**
+
+1. Can we manage to upgrade OPNFV in only 5 minutes?
+
+.. The first question is whether we have the same carrier grade
+   requirement on the control plane as on the user plane. I.e. how
+   much control plane outage we can/are willing to tolerate?
+   In the above case probably if the database is only half of the size
+   we can do the upgrade in 5 minutes, but is that good? It also means
+   that if the database is twice as large then the outage is 20
+   minutes.
+   For the user plane we should go for less as with two releases yearly
+   that means 10 minutes outage per year.
+
+.. 10 minutes outage per year to the users? Plus, if we take
+   control plane into the consideration, then total outage will be
+   more than 10 minutes in the whole network, right?
+
+.. The control plane outage does not have to cause outage to
+   the users, but it may of course depending on the size of the system
+   as it's more likely that there's a failure that needs to be handled
+   by the control plane.
+
+2. Is it acceptable for end users? For example, a planned service
+   interruption lasting more than ten minutes for a software
+   upgrade.
+
+.. For user plane, no it's not acceptable in case of
+   carrier-grade. The 5' 15" downtime should include unplanned and
+   planned downtimes.
+
+.. I do agree with Maria, it is not acceptable.
+
+3. Will any VNFs still work well when the VIM is down?
+
+.. In case of OpenStack it seems yes. :)
+
+The maximum duration of an upgrade
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The duration of an upgrade is related and proportional to the
+scale and the complexity of the OPNFV platform as well as the
+granularity (in function and in space) of the upgrade.
+
+.. Also, if it is a partial upgrade like a module upgrade, it also depends
+   on the OPNFV modules and their tightly connected entities.
+
+.. Since the maintenance window is shrinking and becoming non-existent
+   the duration of the upgrade is secondary to the requirement of smooth upgrade.
+   But probably we want to be able to put a time constraint on each upgrade
+   during which it must complete otherwise it is considered failed and the system
+   should be rolled back. I.e. in case of automatic execution it might not be clear
+   if an upgrade is long or just hanging. The time constraints may be a function
+   of the size of the system in terms of the upgrade object(s).
+
+The maximum duration of a roll back when an upgrade fails
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The duration of a roll back is shorter than that of the corresponding upgrade.
+It depends on the time needed to restore the software and configuration data
+from the pre-upgrade backup / snapshot.
+
+.. During the upgrade process two types of failure may happen:
+   In case we can recover from the failure by undoing the upgrade
+   actions it is possible to roll back the already executed part of the
+   upgrade in a graceful manner introducing no more service outage than
+   what was introduced during the upgrade. Such a graceful roll back
+   typically requires the same amount of time as the executed portion of
+   the upgrade and imposes minimal state/data loss.
+
+.. Requirement: It should be possible to roll back gracefully the
+   failed upgrade of stateful services of the control plane.
+   In case we cannot recover from the failure by just undoing the
+   upgrade actions, we have to restore the upgraded entities from their
+   backed up state. In other terms the system falls back to an earlier
+   state, which is typically a faster recovery procedure than graceful
+   roll back and depending on the statefulness of the entities involved it
+   may result in significant state/data loss.
+
+.. Two possible types of failures can happen during an upgrade
+
+.. We can recover from the failure that occurred in the upgrade process:
+   In this case, a graceful rolling back of the executed part of the
+   upgrade may be possible which would "undo" the executed part in a
+   similar fashion. Thus, such a roll back introduces no more service
+   outage during an upgrade than the executed part introduced. This
+   process typically requires the same amount of time as the executed
+   portion of the upgrade and imposes minimal state/data loss.
+
+.. We cannot recover from the failure that occurred in the upgrade
+   process: In this case, the system needs to fall back to an earlier
+   consistent state by reloading this backed-up state. This is typically
+   a faster recovery procedure than the graceful roll back, but can cause
+   state/data loss. The state/data loss usually depends on the
+   statefulness of the entities whose state is restored from the backup.
+
+The maximum duration of a VNF interruption (Service outage)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Since not the entire process of a smooth upgrade will affect the VNFs,
+the duration of the VNF interruption may be shorter than the duration
+of the upgrade. In some cases, a VNF running without control
+from the VIM is acceptable.
+
+.. Should require explicitly that the NFVI should be able to
+   provide its services to the VNFs independent of the control plane?
+
+.. Requirement: The upgrade of the control plane must not cause
+   interruption of the NFVI services provided to the VNFs.
+
+.. With respect to carrier-grade the yearly service outage of the
+   VNF should not exceed 5' 15" regardless of whether it is a planned or
+   unplanned outage. Considering the HA requirements TL-9000 requires an
+   end-to-end service recovery time of 15 seconds based on which the ETSI
+   GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
+   availability levels (SAL). The proposed example service recovery times
+   for these levels are:
+
+.. SAL1: 5-6 seconds
+
+.. SAL2: 10-15 seconds
+
+.. SAL3: 20-25 seconds
+
+.. my comment was actually that the downtime metrics of the
+   underlying elements, components and services are a small fraction of the
+   total E2E service availability time. No-one on the E2E service path
+   will get the whole downtime allocation (in this context it includes
+   upgrade process related outages for the services provided by VIM etc.
+   elements that are subject to upgrade process).
+
+.. So what you are saying is that the upgrade of any entity
+   (component, service) shouldn't cause even this much service
+   interruption. This was the reason I brought these figures here as well
+   that they are posing some kind of upper-upper boundary. Ideally the
+   interruption is in the millisecond range i.e. no more than a
+   switch-over or a live migration.
+
+.. Requirement: Any interruption caused to the VNF by the upgrade
+   of the NFVI should be in the sub-second range.
+
+.. In the future we also need to consider the upgrade of the NFVI,
+   i.e. HW, firmware, hypervisors, host OS etc.
\ No newline at end of file
--
cgit 1.2.3-korg
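The 5' 15" yearly budget used throughout the "Upgrade duration" section follows directly from the "five nines" availability figure. The Python sketch below works through that arithmetic and checks planned upgrade outages against the budget. The function names and the 365-day year are assumptions of this example; the 10-minute outage and the two releases per year are the figures quoted in the section.

```python
def yearly_downtime_budget_seconds(availability, days_per_year=365):
    """Allowed downtime per year, in seconds, for a given availability."""
    seconds_per_year = days_per_year * 24 * 60 * 60
    return (1.0 - availability) * seconds_per_year


def fits_budget(outage_seconds_per_upgrade, upgrades_per_year,
                availability=0.99999):
    """True if the planned upgrade outages alone stay within the budget."""
    planned = outage_seconds_per_upgrade * upgrades_per_year
    return planned <= yearly_downtime_budget_seconds(availability)


# Five nines: 0.001% of a 365-day year, rounded to whole seconds.
budget = yearly_downtime_budget_seconds(0.99999)
minutes, seconds = divmod(round(budget), 60)
print(f"five nines budget: {minutes}' {seconds}\" per year")
```

With two releases a year and 10 minutes of outage each, `fits_budget(600, 2)` is false, which agrees with the conclusion in the comments that such an outage is not acceptable for the user plane. A sub-second interruption per upgrade, as in the last requirement, fits easily.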