-rw-r--r-- | INFO | 1 |
-rw-r--r-- | doc/00-Authors.rst | 15 |
-rw-r--r-- | doc/01-Scope.rst | 28 |
-rw-r--r-- | doc/02-Background_and_Terminologies.rst | 458 |
-rw-r--r-- | doc/03-Functional_Requirements.rst | 240 |
-rw-r--r-- | doc/04-Use_Cases_and_Scenarios.rst | 32 |
-rw-r--r-- | doc/05-Reference_Architecture.rst | 6 |
-rw-r--r-- | doc/06-Information_Flows.rst | 8 |
-rw-r--r-- | doc/07-Interfaces_and_Files.rst | 27 |
-rw-r--r-- | doc/08-Requirements_from_other_OPNFV_Project.rst | 40 |
-rw-r--r-- | doc/09-Reference.rst | 17 |
-rw-r--r-- | doc/10-Useful_Working_Drafts_of_ETSI_NFV.rst | 11 |
-rw-r--r-- | doc/A1-Appendix.rst | 49 |
-rw-r--r-- | doc/Escalator_Requirement.rst | 814 |
14 files changed, 932 insertions, 814 deletions
@@ -27,6 +27,7 @@ huangzhipeng@huawei.com
 meng.jia@zte.com.cn
 liyi.meng@ericsson.com
 pasi.vaananen@stratus.com
+wang.guobing1@zte.com.cn
 Link to TSC approval of the project:
 http://meetbot.opnfv.org/meetings/opnfv-meeting/2015/opnfv-meeting.2015-04-21-14.00.html
diff --git a/doc/00-Authors.rst b/doc/00-Authors.rst
new file mode 100644
index 0000000..fdbf61b
--- /dev/null
+++ b/doc/00-Authors.rst
@@ -0,0 +1,15 @@
+Authors:
+--------
+
+| Jie Hu (ZTE, hu.jie@zte.com.cn)
+| Qiao Fu (China Mobile, fuqiao@chinamobile.com)
+| Ulrich Kleber (Huawei, Ulrich.Kleber@huawei.com)
+| Maria Toeroe (Ericsson, maria.toeroe@ericsson.com)
+| Sama, Malla Reddy (DOCOMO, sama@docomolab-euro.com)
+| Zhong Chao (ZTE, chao.zhong@zte.com.cn)
+| Julien Zhang (ZTE, zhang.jun3g@zte.com.cn)
+| Yuri Yuan (ZTE, yuan.yue@zte.com.cn)
+| Zhipeng Huang (Huawei, huangzhipeng@huawei.com)
+| Jia Meng (ZTE, meng.jia@zte.com.cn)
+| Liyi Meng (Ericsson, liyi.meng@ericsson.com)
+| Pasi Vaananen (Stratus, pasi.vaananen@stratus.com)
\ No newline at end of file
diff --git a/doc/01-Scope.rst b/doc/01-Scope.rst
new file mode 100644
index 0000000..5247e40
--- /dev/null
+++ b/doc/01-Scope.rst
@@ -0,0 +1,28 @@
+Scope
+-----
+
+This document describes the user requirements on the smooth upgrade
+function of the NFVI and VIM with respect to the upgrades of the OPNFV
+platform from one version to another. Smooth upgrade means that the
+upgrade results in no service outage for the end-users. This requires
+that the process of the upgrade is automatically carried out by a tool
+(code name: Escalator) with pre-configured data. The upgrade process
+includes preparation, validation, execution, monitoring and
+conclusion.
+
+.. <MT> While it is good to have a tool for the entire upgrade process,
+   it is a challenging task, so maybe we shouldn't require automation
+   for the entire process right away. Automation is essential at
+   execution.
+
+.. <hujie> Maybe we can analyse the information flows of the upgrade tool,
+   abstract the basic / essential actions from the tool (or tools), and
+   map them to a command set of NFVI / VIM's interfaces.
+
+The requirements are defined in a stepwise approach, i.e. in the first
+phase focusing on the upgrade of the VIM then widening the scope to the
+NFVI.
+
+The requirements may apply to different NFV functions (NFVI, or VIM, or
+both of them). They will be classified in the Appendix of this
+document.
\ No newline at end of file
diff --git a/doc/02-Background_and_Terminologies.rst b/doc/02-Background_and_Terminologies.rst
new file mode 100644
index 0000000..afb392f
--- /dev/null
+++ b/doc/02-Background_and_Terminologies.rst
@@ -0,0 +1,458 @@
+General Requirements Background and Terminology
+-----------------------------------------------
+
+Terminologies and definitions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+NFVI
+ The term is an abbreviation for Network Function Virtualization
+ Infrastructure; sometimes it is also referred to as the data plane in
+ this document.
+
+VIM
+ The term is an abbreviation for Virtualized Infrastructure Manager;
+ sometimes it is also referred to as the control plane in this document.
+
+Operator
+ The term refers to network service providers and Virtual Network
+ Function (VNF) providers.
+
+End-User
+ The term refers to a subscriber of the Operator's services.
+
+Network Service
+ The term refers to a service provided by an Operator to its
+ End-users using a set of (virtualized) Network Functions.
+
+Infrastructure Services
+ The term refers to services provided by the NFV Infrastructure and
+ the Management & Orchestration functions to the VNFs, i.e.
+ these are the virtual resources as perceived by the VNFs.
+
+Smooth Upgrade
+ The term refers to an upgrade that results in no service outage
+ for the end-users.
+
+Rolling Upgrade
+ The term refers to an upgrade strategy that upgrades each node or
+ a subset of nodes in a wave style rolling through the data centre. It
+ is a popular upgrade strategy to maintain service availability.
+
+Parallel Universe
+ The term refers to an upgrade strategy that creates and deploys
+ a new universe - a system with the new configuration - while the old
+ system continues running. The state of the old system is transferred
+ to the new system after sufficient testing of the new system.
+
+Infrastructure Resource Model
+ The term refers to the representation of infrastructure resources,
+ namely: the physical resources, the virtualization
+ facility resources and the virtual resources.
+
+Physical Resource
+ The term refers to a hardware piece of the NFV infrastructure, which may
+ also include the firmware which enables the hardware.
+
+Virtual Resource
+ The term refers to resources that are provided as services built on top
+ of the physical resources via the virtualization facilities; in particular,
+ they are the resources on which VNF entities are deployed, e.g.
+ the VMs, virtual switches, virtual routers, virtual disks etc.
+
+.. <MT> I don't think the VNF is the virtual resource. Virtual
+ resources are the VMs, virtual switches, virtual routers, virtual
+ disks etc. The VNF uses them, but I don't think they are equal. The
+ VIM doesn't manage the VNF, but it does manage virtual resources.
+
+Virtualization Facility
+ The term refers to a resource that enables the creation
+ of virtual environments on top of the physical resources, e.g.
+ hypervisor, OpenStack, etc.
+
+Upgrade Plan (or Campaign?)
+ The term refers to a choreography that describes how the upgrade should
+ be performed in terms of its targets (i.e. upgrade objects), the
+ steps/actions required for upgrading each, and the coordination of these
+ steps so that service availability can be maintained. It is an input to an
+ upgrade tool (Escalator) to carry out the upgrade.
+
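As an illustration only, the Upgrade Plan defined above could be modelled as a small data structure that lists the upgrade objects, the action for each, and the ordering between them. The sketch below is hypothetical: the names `UpgradeStep`, `UpgradePlan`, `depends_on` and the example targets are assumptions made for this example and are not part of any Escalator interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UpgradeStep:
    """One action applied to one upgrade object (hypothetical model)."""
    target: str                 # upgrade object, e.g. a node or a service
    action: str                 # what to do, e.g. "upgrade-package"
    depends_on: List[str] = field(default_factory=list)  # targets that must finish first


@dataclass
class UpgradePlan:
    """Steps plus the ordering constraints needed to coordinate them."""
    steps: List[UpgradeStep]

    def ordered_targets(self) -> List[str]:
        """Return targets in an order that respects depends_on."""
        done: List[str] = []
        pending = list(self.steps)
        while pending:
            ready = [s for s in pending if all(d in done for d in s.depends_on)]
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependency in plan")
            for step in ready:
                done.append(step.target)
                pending.remove(step)
        return done


plan = UpgradePlan(steps=[
    UpgradeStep("compute-1", "upgrade-package", depends_on=["controller-1"]),
    UpgradeStep("controller-1", "upgrade-package"),
])
print(plan.ordered_targets())
```

A real plan would also carry recovery procedures, verification steps and expected execution times; they are left out here to keep the sketch short.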
+
+Upgrade Objects
+~~~~~~~~~~~~~~~
+
+Physical Resource
+^^^^^^^^^^^^^^^^^
+
+Most cloud infrastructures support dynamic addition/removal of
+hardware. A hardware upgrade could be done by adding the new
+hardware node and removing the old one. From the perspective of smooth
+upgrade the orchestration/scheduling of these actions is the primary concern.
+Upgrading a physical resource,
+like upgrading its firmware and/or modifying its configuration data, may
+also be considered in the future.
+
+
+Virtual Resources
+^^^^^^^^^^^^^^^^^
+
+Virtual resource upgrades are mainly done by users. OPNFV may facilitate
+the activity, but it is suggested to put it on the long-term roadmap
+instead of the initial release.
+
+.. <MT> same comment here: I don't think the VNF is the virtual
+ resource. Virtual resources are the VMs, virtual switches, virtual
+ routers, virtual disks etc. The VNF uses them, but I don't think they
+ are equal. For example if by some reason the hypervisor is changed and
+ the current VMs cannot be migrated to the new hypervisor, they are
+ incompatible, then the VMs need to be upgraded too. This is not
+ something the NFVI user (i.e. VNFs ) would even know about.
+
+
+Virtualization Facility Resources
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Based on the functionality they provide, virtualization facility
+resources could be divided into computing node, networking node,
+storage node and management node.
+
+The possible upgrade objects in these nodes are addressed below:
+(Note: hardware based virtualization may be considered as virtualization
+facility resource, but from the Escalator perspective, it is better to
+consider it as part of the hardware upgrade.)
+
+**Computing node**
+
+1. OS Kernel
+
+2. Hypervisor and virtual switch
+
+3. Other kernel modules, like drivers
+
+4. User space software packages, like nova-compute agents and other
+ control plane programs.
+
+Updating 1 and 2 will cause the loss of virtualization functionality of
+the compute node, which may lead to data plane service interruption
+if the virtual resource is not redundant.
+
+Updating 3 might result in the same.
+
+Updating 4 might lead to control plane services interruption if not an
+HA deployment.
+
+**Networking node**
+
+1. OS kernel, optional, not all switches/routers allow the upgrade of their
+ OS since it is more like a firmware than a generic OS.
+
+2. User space software package, like neutron agents and other control
+ plane programs
+
+Updating 1, if allowed, will cause a node reboot and therefore lead to
+data plane service interruption if the virtual resource is not
+redundant.
+
+Updating 2 might lead to control plane services interruption if not an
+HA deployment.
+
+**Storage node**
+
+1. OS kernel, optional, not all storage nodes allow the upgrade of their OS
+ since it is more like a firmware than a generic OS.
+
+2. Kernel modules
+
+3. User space software packages, control plane programs
+
+Updating 1, if allowed, will cause a node reboot and therefore lead to
+data plane service interruption if the virtual resource is not
+redundant.
+
+Updating 2 might result in the same.
+
+Updating 3 might lead to control plane services interruption if not an
+HA deployment.
+
+**Management node**
+
+1. OS Kernel
+
+2. Kernel modules, like drivers
+
+3. User space software packages, like database, message queue and
+ control plane programs.
+
+Updating 1 will cause a node reboot and therefore lead to control
+plane service interruption if not an HA deployment. Updating 2 might
+result in the same.
+
+Updating 3 might lead to control plane services interruption if not an
+HA deployment.
+
+Upgrade Span
+~~~~~~~~~~~~
+
+**Major Upgrade**
+
+Upgrades between major releases may introduce significant changes in
+function, configuration and data, such as the upgrade of OPNFV from
+Arno to Brahmaputra.
+
+**Minor Upgrade**
+
+Upgrades inside one major release, which would not lead to changing
+the structure of the platform and may not affect the schema of the
+system data.
+
+Upgrade Granularity
+~~~~~~~~~~~~~~~~~~~
+
+Physical/Hardware Dimension
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Support full / partial upgrade for data centre, cluster, zone. The
+upgrade of a data centre or a zone may be divided into
+several batches. The upgrade of a cloud environment (cluster) may also
+be partial. For example, in one cloud environment running a number of
+VNFs, we may just try one of them to check the stability and
+performance, before we upgrade all of them.
+
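The batching idea in the paragraph above can be sketched as follows: one object is upgraded first as a trial, and the remaining ones follow in fixed-size batches. This is only an illustration; the function name `split_into_batches`, the node names and the batch size are assumptions, not something the requirement prescribes.

```python
from typing import List, Sequence


def split_into_batches(nodes: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split nodes into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(nodes[i:i + batch_size]) for i in range(0, len(nodes), batch_size)]


nodes = ["node-1", "node-2", "node-3", "node-4", "node-5"]
# The first batch holds a single trial node; the rest follow in pairs.
trial, rest = nodes[:1], nodes[1:]
batches = [trial] + split_into_batches(rest, batch_size=2)
print(batches)
```

Checking stability and performance after the trial batch, before starting the next one, is where a partial upgrade would stop if problems show up.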
+Software Dimension
+^^^^^^^^^^^^^^^^^^
+
+- The upgrade of host OS or kernel may need a 'hot migration'
+- The upgrade of OpenStack’s components
+
+ i. the one-shot upgrade of all components
+
+ ii. the partial upgrade (or bugfix patch) which only affects some
+ components (e.g., computing, storage, network, database, message
+ queue, etc.)
+
+.. <MT> this section seems to overlap with 2.1.
+ I can see the following dimensions for the software.
+
+.. <MT> different software packages
+
+.. <MT> different functions - Considering that the target versions of all
+ software are compatible the upgrade needs to ensure that any
+ dependencies between SW and therefore packages are taken into account
+ in the upgrade plan, i.e. no version mismatch occurs during the
+ upgrade therefore dependencies are not broken
+
+.. <MT> same function - This is an upgrade specific question if different
+ versions can coexist in the system when a SW is being upgraded from
+ one version to another. This is particularly important for stateful
+ functions e.g. storage, networking, control services. The upgrade
+ method must consider the compatibility of the redundant entities.
+
+.. <MT> different versions of the same software package
+
+.. <MT> major version changes - they may introduce incompatibilities. Even
+ when there are backward compatibility requirements changes may cause
+ issues at graceful roll-back
+
+.. <MT> minor version changes - they must not introduce incompatibility
+ between versions, these should be primarily bug fixes, so live
+ patches should be possible
+
+.. <MT> different installations of the same software package
+
+.. <MT> using different installation options - they may reflect different
+ users with different needs so redundancy issues are less likely
+ between installations of different options; but they could be the
+ reflection of the heterogeneous system in which case they may provide
+ redundancy for higher availability, i.e. deeper inspection is needed
+
+.. <MT> using the same installation options - they often reflect that they are
+ used by redundant entities across space
+
+.. <MT> different distribution possibilities in space - same or different
+ availability zones, multi-site, geo-redundancy
+
+.. <MT> different entities running from the same installation of a software
+ package
+
+.. <MT> using different start-up options - they may reflect different users so
+ redundancy may not be an issue between them
+
+.. <MT> using same start-up options - they often reflect redundant
+ entities
+
+Upgrade duration
+~~~~~~~~~~~~~~~~
+
+As the OPNFV end-users are primarily Telecom operators, the network
+services provided by the VNFs deployed on the NFVI should meet the
+requirement of 'Carrier Grade'::
+
+ In telecommunication, a "carrier grade" or "carrier class" refers to a
+ system, or a hardware or software component that is extremely reliable,
+ well tested and proven in its capabilities. Carrier grade systems are
+ tested and engineered to meet or exceed "five nines" high availability
+ standards, and provide very fast fault recovery through redundancy
+ (normally less than 50 milliseconds). [from wikipedia.org]
+
+"Five nines" means 99.999% availability, i.e. the system is working all the
+time in one year except for about 5 minutes 15 seconds (5'15").
+
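The 5'15" figure follows directly from the availability percentage. The short calculation below is a sketch that assumes a non-leap year of 365 days; the function name is made up for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_seconds(availability: float) -> float:
    """Yearly downtime budget, in seconds, for a given availability ratio."""
    return (1.0 - availability) * MINUTES_PER_YEAR * 60


seconds = allowed_downtime_seconds(0.99999)
minutes, remainder = divmod(seconds, 60)
print(f"{int(minutes)} min {remainder:.0f} s per year")
```

The budget is roughly 315 seconds per year, which is where the "5 minutes 15 seconds" quoted above comes from.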
+::
+
+ We have learnt that a well prepared upgrade of OpenStack needs 10
+ minutes. The major time slot in the outage time is spent on
+ synchronizing the database. [from 'Ten minutes OpenStack Upgrade? Done!'
+ by Symantec]
+
+This 10 minutes of downtime of the OpenStack services however did not impact the
+users, i.e. the VMs running on the compute nodes. This was the outage of
+the control plane only. On the other hand with respect to the
+preparations this was a manually tailored upgrade specific to the
+particular deployment and the versions of each OpenStack service.
+
+The project aims to achieve a more generic methodology, which however
+requires that the upgrade objects fulfil certain requirements. Since
+this is only possible in the long run, we target first the upgrade
+of the different VIM services from version to version.
+
+**Questions:**
+
+1. Can we manage to upgrade OPNFV in only 5 minutes?
+
+.. <MT> The first question is whether we have the same carrier grade
+ requirement on the control plane as on the user plane. I.e. how
+ much control plane outage can we / are we willing to tolerate?
+ In the above case probably if the database is only half of the size
+ we can do the upgrade in 5 minutes, but is that good? It also means
+ that if the database is twice as much then the outage is 20
+ minutes.
+ For the user plane we should go for less as with two releases yearly
+ that means 10 minutes outage per year.
+
+.. <Malla> 10 minutes outage per year to the users? Plus, if we take
+ the control plane into consideration, then the total outage will be
+ more than 10 minutes in the whole network, right?
+
+.. <MT> The control plane outage does not have to cause outage to
+ the users, but it may of course depending on the size of the system
+ as it's more likely that there's a failure that needs to be handled
+ by the control plane.
+
+2. Is it acceptable for end users, e.g. a planned service
+ interruption lasting more than ten minutes for a software
+ upgrade?
+
+.. <MT> For user plane, no it's not acceptable in case of
+ carrier-grade. The 5' 15" downtime should include unplanned and
+ planned downtimes.
+
+.. <Malla> I agree with Maria, it is not acceptable.
+
+3. Will the VNFs still work well when the VIM is down?
+
+.. <MT> In case of OpenStack it seems yes. :)
+
+The maximum duration of an upgrade
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The duration of an upgrade is related and proportional to the
+scale and the complexity of the OPNFV platform as well as the
+granularity (in function and in space) of the upgrade.
+
+.. <Malla> Also, if it is a partial upgrade like a module upgrade, it depends
+ also on the OPNFV modules and their tightly connected entities as well.
+
+.. <MT> Since the maintenance window is shrinking and becoming non-existent
+ the duration of the upgrade is secondary to the requirement of smooth upgrade.
+ But probably we want to be able to put a time constraint on each upgrade
+ during which it must complete otherwise it is considered failed and the system
+ should be rolled back. I.e. in case of automatic execution it might not be clear
+ if an upgrade is long or just hanging. The time constraints may be a function
+ of the size of the system in terms of the upgrade object(s).
+
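The time constraint suggested in the comment above can be sketched as a deadline around the upgrade: if completion is not observed within the limit, the upgrade is reported as timed out so that the caller can roll back. This is an assumption-laden illustration; `wait_for_upgrade`, `limit_for` and the linear rule for the limit are hypothetical and not taken from Escalator.

```python
import time
from typing import Callable


def wait_for_upgrade(is_done: Callable[[], bool], limit_seconds: float,
                     interval_seconds: float = 0.01) -> str:
    """Poll is_done until it returns True or the time limit expires."""
    deadline = time.monotonic() + limit_seconds
    while time.monotonic() < deadline:
        if is_done():
            return "completed"
        time.sleep(interval_seconds)
    return "timed-out"  # the caller would treat this as a failed upgrade and roll back


def limit_for(object_count: int, seconds_per_object: float) -> float:
    """Hypothetical rule: the limit grows linearly with the number of upgrade objects."""
    return object_count * seconds_per_object


print(wait_for_upgrade(lambda: True, limit_for(3, 0.1)))
print(wait_for_upgrade(lambda: False, limit_for(3, 0.02)))
```

With a limit derived from the size of the system, a long-running upgrade and a hanging one can be told apart without manual inspection.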
+The maximum duration of a roll back when an upgrade fails
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The duration of a roll back is shorter than the corresponding upgrade. It
+depends on the time needed to restore the software and configuration data
+from the pre-upgrade backup / snapshot.
+
+.. <MT> During the upgrade process two types of failure may happen:
+ In case we can recover from the failure by undoing the upgrade
+ actions it is possible to roll back the already executed part of the
+ upgrade in a graceful manner introducing no more service outage than
+ what was introduced during the upgrade. Such a graceful roll back
+ requires typically the same amount of time as the executed portion of
+ the upgrade and imposes minimal state/data loss.
+
+.. <MT> Requirement: It should be possible to roll back gracefully the
+ failed upgrade of stateful services of the control plane.
+ In case we cannot recover from the failure by just undoing the
+ upgrade actions, we have to restore the upgraded entities from their
+ backed up state. In other terms the system falls back to an earlier
+ state, which is typically a faster recovery procedure than graceful
+ roll back and depending on the statefulness of the entities involved it
+ may result in significant state/data loss.
+
+.. <MT> Two possible types of failures can happen during an upgrade
+
+.. <MT> We can recover from the failure that occurred in the upgrade process:
+ In this case, a graceful rolling back of the executed part of the
+ upgrade may be possible which would "undo" the executed part in a
+ similar fashion. Thus, such a roll back introduces no more service
+ outage during an upgrade than the executed part introduced. This
+ process typically requires the same amount of time as the executed
+ portion of the upgrade and imposes minimal state/data loss.
+
+.. <MT> We cannot recover from the failure that occurred in the upgrade
+ process: In this case, the system needs to fall back to an earlier
+ consistent state by reloading this backed-up state. This is typically
+ a faster recovery procedure than the graceful roll back, but can cause
+ state/data loss. The state/data loss usually depends on the
+ statefulness of the entities whose state is restored from the backup.
+
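The two recovery paths discussed in the comments above, graceful roll back by undoing the executed steps in reverse order and fall back to a backed-up state, can be contrasted in a few lines. The sketch is illustrative only: `recover`, `graceful_rollback`, the `undoable` flag and the backup identifier are hypothetical names.

```python
from typing import List, Tuple


def graceful_rollback(executed: List[str]) -> List[str]:
    """Undo the already executed steps in reverse order."""
    return [f"undo {step}" for step in reversed(executed)]


def recover(executed: List[str], undoable: bool, backup_id: str) -> Tuple[str, List[str]]:
    """Pick the recovery procedure for a failed upgrade."""
    if undoable:
        # Takes about as long as the executed part, with minimal state/data loss.
        return "rollback", graceful_rollback(executed)
    # Typically faster, but state changed since the backup may be lost.
    return "restore", [f"restore from {backup_id}"]


print(recover(["stop service", "install package"], undoable=True, backup_id="snap-1"))
print(recover(["stop service", "install package"], undoable=False, backup_id="snap-1"))
```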
+The maximum duration of a VNF interruption (Service outage)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Since not the entire process of a smooth upgrade will affect the VNFs,
+the duration of the VNF interruption may be shorter than the duration
+of the upgrade. In some cases, it is acceptable for the VNF to run
+without the control of the VIM.
+
+.. <MT> Should we require explicitly that the NFVI should be able to
+ provide its services to the VNFs independent of the control plane?
+
+.. <MT> Requirement: The upgrade of the control plane must not cause
+ interruption of the NFVI services provided to the VNFs.
+
+.. <MT> With respect to carrier-grade the yearly service outage of the
+ VNF should not exceed 5' 15" regardless of whether it is planned or
+ unplanned outage. Considering the HA requirements TL-9000 requires an
+ end-to-end service recovery time of 15 seconds based on which the ETSI
+ GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
+ availability levels (SAL). The proposed example service recovery times
+ for these levels are:
+
+.. <MT> SAL1: 5-6 seconds
+
+.. <MT> SAL2: 10-15 seconds
+
+.. <MT> SAL3: 20-25 seconds
+
+.. <Pva> my comment was actually that the downtime metrics of the
+ underlying elements, components and services are a small fraction of the
+ total E2E service availability time. No-one on the E2E service path
+ will get the whole downtime allocation (in this context it includes
+ upgrade process related outages for the services provided by VIM etc.
+ elements that are subject to upgrade process).
+
+.. <MT> So what you are saying is that the upgrade of any entity
+ (component, service) shouldn't cause even this much service
+ interruption. This was the reason I brought these figures here as well
+ that they are posing some kind of upper-upper boundary. Ideally the
+ interruption is in the millisecond range i.e. no more than a
+ switch-over or a live migration.
+
+.. <MT> Requirement: Any interruption caused to the VNF by the upgrade
+ of the NFVI should be in the sub-second range.
+
+.. <MT> In the future we also need to consider the upgrade of the NFVI,
+ i.e. HW, firmware, hypervisors, host OS etc.
\ No newline at end of file
diff --git a/doc/03-Functional_Requirements.rst b/doc/03-Functional_Requirements.rst
new file mode 100644
index 0000000..c0695bb
--- /dev/null
+++ b/doc/03-Functional_Requirements.rst
@@ -0,0 +1,240 @@
+Functional Requirements
+-----------------------
+
+Basic Actions
+~~~~~~~~~~~~~
+
+This section describes the basic functions that may be required by Escalator.
+
+Preparation (offline)
+^^^^^^^^^^^^^^^^^^^^^
+
+This is the design phase when the upgrade plan (or upgrade campaign) is
+being designed so that it can be executed automatically with minimal
+service outage. It may include the following work:
+
+1. Check the dependencies of the software modules, their impact and
+   backward compatibility to figure out the appropriate upgrade method
+   and ordering.
+2. Find out if a rolling upgrade could be planned with several rolling
+   steps to avoid any service outage due to upgrading some
+   parts/services at the same time.
+3. Collect the proper version files and check the integration for
+   upgrading.
+4. The preparation step should produce an output (i.e. upgrade
+   campaign/plan), which is executable automatically in an NFV Framework
+   and which can be validated before execution.
+
+   - The upgrade campaign should not be referring to scalable entities
+     directly, but allow for adaptation to the system configuration and
+     state at any given moment.
+   - The upgrade campaign should describe the ordering of the upgrade
+     of different entities so that dependencies, redundancies can be
+     maintained during the upgrade execution
+   - The upgrade campaign should provide information about the
+     applicable recovery procedures and their ordering.
+   - The upgrade campaign should consider information about the
+     verification/testing procedures to be performed during the upgrade
+     so that upgrade failures can be detected as soon as possible and
+     the appropriate recovery procedure can be identified and applied.
+   - The upgrade campaign should provide information on the expected
+     execution time so that hanging execution can be identified
+   - The upgrade campaign should indicate any point in the upgrade when
+     coordination with the users (VNFs) is required.
+
+.. <hujie> Depending on the attributes of the object being upgraded, the
+   upgrade plan may be split into step(s) and/or sub-plan(s), and even
+   smaller sub-plans in the design phase. The plan(s) or sub-plan(s) may
+   include step(s) or sub-plan(s).
+
+Validating the upgrade plan / Checking the pre-requisites of the System (offline / online)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The upgrade plan should be validated before the execution by testing
+it in a test environment which is similar to the production environment.
+
+.. <MT> However it could also mean that we can identify some properties
+   that it should satisfy e.g. what operations can or cannot be executed
+   simultaneously like never take out two VMs of the same VNF.
+
+.. <MT> Another question is if it requires that the system is in a particular
+   state when the upgrade is applied. I.e. if there's certain amount of
+   redundancy in the system, migration is enabled for VMs, when the NFVI
+   is upgraded the VIM is healthy, when the VIM is upgraded the NFVI is
+   healthy, etc.
+
+.. <MT> I'm not sure what online validation means: Is it the validation of the
+   upgrade plan/campaign or the validation of the system that it is in a
+   state that the upgrade can be performed without too much risk?
+
+Before the upgrade plan is executed, the health of the
+online production environment should be checked and confirmed to satisfy
+the requirements which were described in the upgrade plan. The
+sysinfo, e.g. system alarms, performance statistics and
+diagnostic logs, will be collected and analyzed. It is required to
+resolve all of the system faults or exclude the unhealthy part before
+executing the upgrade plan.
+
+
+Backup/Snapshot (online)
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+To avoid loss of data when an upgrade is unsuccessful, the
+data should be backed up and a snapshot of the system state should be taken
+before the execution of the upgrade plan. This should be considered in the
+upgrade plan.
+
+Several backups/snapshots may be generated and stored before the single
+steps of changes. The following data/files are required to be
+considered:
+
+1. running version files for each node.
+2. system components' configuration file and database.
+3. image and storage, if it is necessary.
+
+.. <MT> Does 3 imply VNF image and storage? I.e. VNF state and data?
+
+.. <hujie> The following text is derived from previous "4. Negotiate
+   with the VNF if it's ready for the upgrade"
+
+Although the upper layer, which includes VNFs and VNFMs, is out of the
+scope of Escalator, it is still recommended to make it ready for a
+smooth system upgrade. Escalator cannot guarantee the safety of
+VNFs. The upper layer should have some safeguard mechanism in its design,
+and be ready to avoid failure during a system upgrade.
+
+Execution (online)
+^^^^^^^^^^^^^^^^^^
+
+The execution of the upgrade plan should be a dynamic procedure which is
+controlled by Escalator.
+
+.. <hujie> Revised text to be general.
+
+1. It is required to support execution either in sequence or in
+   parallel.
+2. It is required to check the result of the execution and take
+   action according to the situation and the policies in the upgrade plan.
+3. It is required to execute properly on various configurations of
+   system object, e.g. stand-alone, HA, etc.
+4. It is required to execute on the designated different parts of the
+   system, e.g. physical server, virtualized server, rack, chassis,
+   cluster, even different geographical places.
+
+Testing (online)
+^^^^^^^^^^^^^^^^
+
+Testing after upgrading the whole system or parts of the system makes
+sure that the upgraded system (object) is working normally.
+
+.. <hujie> Revised text to be general.
+
+1. It is recommended to run the prepared test cases to see if the
+   functionalities are available without any problem.
+2. It is recommended to check the sysinfo, e.g. system alarms,
+   performance statistics and diagnostic logs to see if there are any
+   abnormalities.
+
+Restore/Roll-back (online)
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When an upgrade fails, a quick system restore or system
+roll-back should be performed to recover the system and the services.
+
+.. <hujie> Revised text to be general.
+
+1. It is recommended to support system restore from backup when an upgrade
+   has failed.
+2. It is recommended to support graceful roll-back with reverse order
+   steps if possible.
+
+Monitoring (online)
+^^^^^^^^^^^^^^^^^^^
+
+Escalator should continually monitor the process of the upgrade. It
+keeps updating the status of each module, each node and each cluster in a
+status table during the upgrade.
+
+.. <hujie> Revised text to be general.
+
+1. It is required to collect the status of every object being upgraded
+   and to send abnormal alarms during the upgrade.
+2. It is recommended to reuse the existing monitoring system, like alarm.
+3. It is recommended to support proactive query.
+4. It is recommended to support passive wait for notification.
+
+**Two possible ways for monitoring:**
+
+**Pro-Actively Query** requires that the NFVI/VIM provides a proper API or CLI
+interface. If Escalator serves as a service, it should pass on these
+interfaces.
+
+**Passively Wait for Notification** requires that Escalator provides a
+callback interface, which could be used by NFVI/VIM systems or upgrade
+agent to send back notification.
+
+.. <hujie> I am not sure why not to subscribe the notification.
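The two monitoring modes described above, proactive query and passive wait for notification, can both feed the same status table. The sketch below only illustrates that idea; the class `StatusTable`, its methods and the state names are hypothetical and do not describe an existing Escalator, NFVI or VIM interface.

```python
from typing import Callable, Dict, List


class StatusTable:
    """Hypothetical in-memory status table keyed by upgrade object."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}
        self.alarms: List[str] = []

    def update(self, obj: str, state: str) -> None:
        """Record the latest state and raise an alarm on failure."""
        self.status[obj] = state
        if state == "failed":
            self.alarms.append(f"upgrade failed on {obj}")

    def on_notification(self, obj: str, state: str) -> None:
        """Passive mode: callback invoked by the NFVI/VIM or an upgrade agent."""
        self.update(obj, state)

    def poll(self, queries: Dict[str, Callable[[], str]]) -> None:
        """Active mode: call a query function supplied for each object."""
        for obj, query in queries.items():
            self.update(obj, query())


table = StatusTable()
table.poll({"node-1": lambda: "upgraded", "node-2": lambda: "in-progress"})
table.on_notification("node-2", "failed")
print(table.status, table.alarms)
```

In a real deployment the query functions would wrap the API or CLI of the NFVI/VIM, and the callback would be exposed over whatever interface Escalator offers.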
+
+Logging (online)
+^^^^^^^^^^^^^^^^
+
+Record the information generated by Escalator into log files. The log
+file is used for manual diagnosis of exceptions.
+
+1. It is required to support logging.
+2. It is recommended to include time stamp, object id, action name,
+   error code, etc.
+
+Administrative Control (online)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Administrative Control is used to control the privilege to start any of
+Escalator's actions in order to avoid unauthorized operations.
+
+#. It is required to support an administrative control mechanism.
+#. It is recommended to reuse the system's own security system.
+#. It is required to avoid conflicts when the system's own security system
+   is being upgraded.
+
+Requirements on Object being upgraded
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. <hujie> We can develop BPs in future from requirements of this section and
+   gap analysis for upper stream projects
+
+Escalator focuses on smooth upgrade. In a practical implementation, it
+might be combined with an installer/deployer, or act as an independent
+tool/service. Either way, it requires that the target systems (NFVI and
+VIM) are developed/deployed in a way that Escalator can perform
+upgrades on them.
+
+On the NFVI system, live-migration is likely used to maintain availability
+because OPNFV would like to make HA transparent to the end user. This
+requires the VIM system to be able to put a compute node into maintenance
+mode so that it is isolated from normal service. Otherwise, new NFVI
+instances risk being scheduled onto the node being upgraded.
+
+On the VIM system, availability is likely achieved by redundancy. This
+imposes fewer requirements on the system/services being upgraded (see PVA
+comments in early version). However, there should be a way to put the
+target system into standby mode, because starting the upgrade on the
+master node in a cluster is likely a bad idea.
+
+.. <hujie> Revised text to be general.
+
+1. It is required for NFVI/VIM to support a **service handover** mechanism
+   that minimizes interruption to 0.001% (i.e. 99.999% service
+   availability). Possible implementations are live-migration, redundant
+   deployment, etc. (Note: for VIM, the interruption requirement could be
+   less restrictive.)
+
+2. It is required for NFVI/VIM to restore the earlier version in an efficient
+   way, such as **snapshot**.
+
+3. It is required for NFVI/VIM to **migrate data** efficiently between
+   the base and the upgraded system.
+
+4. It is recommended for NFVI/VIM's interface to support upgrade
+   orchestration, e.g. reading/setting system state.
+
+
diff --git a/doc/04-Use_Cases_and_Scenarios.rst b/doc/04-Use_Cases_and_Scenarios.rst
new file mode 100644
index 0000000..13d16cf
--- /dev/null
+++ b/doc/04-Use_Cases_and_Scenarios.rst
@@ -0,0 +1,32 @@
+Use Cases and Scenarios
+-----------------------
+
+This section describes the use cases and scenarios to verify the
+requirements of Escalator.
+
+Upgrade a system with minimal configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A minimal configuration system is normally deployed for experimental or
+development usage, such as an OPNFV test bed. Although it does not have
+a large workload, it is a typical system to be upgraded frequently.
+
+Upgrade a system with HA configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An HA configuration system is very popular in the operator's data centre,
+and it is a typical production environment. It always runs 7 \* 24
+with VNFs running on it to provide services to the end users.
+
+Upgrade a system with Multi-Site configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An upgrade in one site may cause service interruption in the other site, if
+both sites depend on and share the same modules/database (e.g. a
+keystone for both sites).
+
+If a site fails during an upgrade, a roll-back that loses even minimal
+state/data can affect the dependent site or cause it to fail.
+
+.. <hujie> Consider one site of ARNO release first. Then, multi-site
+   in the future.
\ No newline at end of file diff --git a/doc/05-Reference_Architecture.rst b/doc/05-Reference_Architecture.rst new file mode 100644 index 0000000..1b16dbe --- /dev/null +++ b/doc/05-Reference_Architecture.rst @@ -0,0 +1,6 @@ +Reference Architecture +---------------------- + +This section describes the reference architecture, the function blocks, +and the function entities of Escalator, so that the reader can understand how +the basic functions are organized.
\ No newline at end of file diff --git a/doc/06-Information_Flows.rst b/doc/06-Information_Flows.rst new file mode 100644 index 0000000..14f2908 --- /dev/null +++ b/doc/06-Information_Flows.rst @@ -0,0 +1,8 @@ +Information Flows +----------------- + +This section describes the information flows among the function +entities when Escalator is in action. + +.. <hujie> We should consider a generic procedure / framework for upgrading, and + may provide a plug-in interface for specialized tasks.
\ No newline at end of file diff --git a/doc/07-Interfaces_and_Files.rst b/doc/07-Interfaces_and_Files.rst new file mode 100644 index 0000000..87f916e --- /dev/null +++ b/doc/07-Interfaces_and_Files.rst @@ -0,0 +1,27 @@ +Interfaces and Files +-------------------- + +This section describes the required interfaces and files of Escalator. + + +CLI Interface +~~~~~~~~~~~~~~~~ + +This section describes the CLI of Escalator. + +RESTful API +~~~~~~~~~~~ + +This section describes the API of Escalator for developers. + +Configuration File +~~~~~~~~~~~~~~~~~~ + +This section will suggest a format for the configuration files and how to +handle them. + +Log File +~~~~~~~~ + +This section will suggest a format for the log files and how to handle +them.
\ No newline at end of file diff --git a/doc/08-Requirements_from_other_OPNFV_Project.rst b/doc/08-Requirements_from_other_OPNFV_Project.rst new file mode 100644 index 0000000..62e611f --- /dev/null +++ b/doc/08-Requirements_from_other_OPNFV_Project.rst @@ -0,0 +1,40 @@ +Requirements from other OPNFV projects +-------------------------------------- + +We have created a questionnaire_ for collecting other projects' requirements. +Please advertise it. + +.. _questionnaire: https://docs.google.com/forms/d/11o1mt15zcq0WBtXYK0n6lKF8XuIzQTwvv8ePTjmcoF0/viewform?usp=send_form + + + +Doctor Project +~~~~~~~~~~~~~~ + +.. <Malla> This scenario could be out of scope in Escalator project, but + having the option to support this should be better to align with + Doctor requirements. + +The scope of the Doctor project also covers a maintenance scenario in which: + +1. The VIM administrator requests host maintenance from the VIM. + +2. The VIM notifies a consumer, such as the VNFM, to trigger application-level + migration or a switch of active-standby nodes. + +3. The VIM waits a short while for a response from the consumer. + +- The VIM should send out the notification of VM migration to the consumer (VNFM) + as an abstracted message like "maintenance". + +- The VIM could hold off VM migration until it receives a "VM ready to + maintenance" message from the owner (VNFM). + +HA Project +~~~~~~~~~~ + +Multi-site Project +~~~~~~~~~~~~~~~~~~ + +- An Escalator upgrade of one site should, at the very least, not cause + API token validation to fail at the other site.
diff --git a/doc/09-Reference.rst b/doc/09-Reference.rst new file mode 100644 index 0000000..0b5ff17 --- /dev/null +++ b/doc/09-Reference.rst @@ -0,0 +1,17 @@ +Reference +--------- + +[1] ETSI GS NFV 002 (V1.1.1): "Architectural Framework" + +[2] ETSI GS NFV 003 (V1.1.1): "Terminology for Main Concepts in NFV" + +[3] ETSI GS NFV-SWA001: "Virtual Network Function Architecture" + +[4] ETSI GS NFV-MAN001: "Management and Orchestration" + +[5] ETSI GS NFV-REL001: "Resiliency Requirements" + +[6] QuEST Forum TL-9000: "Quality Management System Requirement +Handbook" + +[7] Service Availability Forum AIS: "Software Management Framework" diff --git a/doc/10-Useful_Working_Drafts_of_ETSI_NFV.rst b/doc/10-Useful_Working_Drafts_of_ETSI_NFV.rst new file mode 100644 index 0000000..5c2195b --- /dev/null +++ b/doc/10-Useful_Working_Drafts_of_ETSI_NFV.rst @@ -0,0 +1,11 @@ +Useful Working Drafts of ETSI NFV +--------------------------------- + +Access them with your own ETSI account; please DO NOT disclose the +content. + +[1] Migrate Virtualised Compute Resource operation @ 7.3.1.8 +ftp://docbox.etsi.org/ISG/NFV/Open/Drafts/IFA005_Or-Vi_ref_point_Spec/NFV-IFA005v070.zip + +[2] Reliability issues during NFV Software upgrade and improvement mechanisms @ 8 +ftp://@docbox.etsi.org/ISG/NFV/Open/Drafts/REL003_E2E_reliability_models/NFV-REL003v030.zip diff --git a/doc/A1-Appendix.rst b/doc/A1-Appendix.rst new file mode 100644 index 0000000..85f0717 --- /dev/null +++ b/doc/A1-Appendix.rst @@ -0,0 +1,49 @@ +Appendix +-------- + +A.1 Impact Analysis +~~~~~~~~~~~~~~~~~~~ + +Upgrading the different software modules may have different impacts on +the availability of the infrastructure resources and even on the service +continuity of the vNFs. + +**Software modules in the computing nodes** + +#. Host OS patch + +#. Hypervisor, such as KVM, QEMU, XEN, libvirt +#. OpenStack agent in computing nodes (like Nova agent, Ceilometer + agent...) + +..
<MT> As SW module, we should list the host OS and maybe its + drivers as well. From upgrade perspective do we limit host OS + upgrades to patches only? + +**Software modules in network nodes** + +#. Neutron L2/L3 agent +#. OVS, SR-IOV Driver + +**Software modules in storage nodes** + +#. Ceph + +The table below analyses such an impact - considering a single instance +of each software module - from the following aspects: + +- the function which will be lost during upgrade, +- the duration of the loss of this specific function, +- if this causes the loss of the vNF function, +- if it causes incompatibility in the different parts of the software, +- what should be backed up before the upgrade, +- the duration of restoration time if the upgrade fails + +The values provided come from internal testing and are based on some +assumptions; they may vary depending on the deployment techniques. +Please feel free to add to them if you find more efficient values during your +testing. + +https://wiki.opnfv.org/_media/upgrade_analysis_v0.5.xlsx + +Note that no redundancy of the software modules is considered in the table. diff --git a/doc/Escalator_Requirement.rst b/doc/Escalator_Requirement.rst deleted file mode 100644 index e80a11d..0000000 --- a/doc/Escalator_Requirement.rst +++ /dev/null @@ -1,814 +0,0 @@ -Draft Escalator Requirement v0.4 -================================ - -Authors: --------- - -| Jie Hu (ZTE, hu.jie@zte.com.cn) -| Qiao Fu (China Mobile, fuqiao@chinamobile.com) -| Ulrich Kleber (Huawei, Ulrich.Kleber@huawei.com) -| Maria Toeroe (Ericsson, maria.toeroe@ericsson.com) -| Sama, Malla Reddy (DOCOMO, sama@docomolab-euro.com) -| Zhong Chao (ZTE, chao.zhong@zte.com.cn) -| Julien Zhang (ZTE, zhang.jun3g@zte.com.cn) -| Yuri Yuan (ZTE, yuan.yue@zte.com.cn) -| Zhipeng Huang (Huawei, huangzhipeng@huawei.com) -| Jia Meng (ZTE, meng.jia@zte.com.cn) -| Liyi Meng (Ericsson, liyi.meng@ericsson.com) -| Pasi Vaananen (Stratus, pasi.vaananen@stratus.com) - -1.
Scope --------- - -| This document describes the user requirements on the smooth upgrade - function of the NFVI and VIM with respect to the upgrades of the OPNFV - platform from one version to another. Smooth upgrade means that the - upgrade results in no service outage for the end-users. This requires - that the process of the upgrade is automatically carried out by a tool - (code name: Escalator) with pre-configured data. The upgrade process - includes preparation, validation, execution, monitoring and - conclusion. -| ==[MT] While it is good to have a tool for the entire upgrade process, - but it is a challenging task, so maybe we shouldn't require automation - for the entire process right away. Automation is essential at - execution.== -| ==[hujie] Maybe we can analysis information flows of the upgrade tool, - abstract the basic / essential actions from the tool (or tools), and - map them to a command set of NFVI / VIM's interfaces.== - -The requirements are defined in a stepwise approach, i.e. in the first -phase focusing on the upgrade of the VIM then widening the scope to the -NFVI. - -The requirements may apply to different NFV functions (NFVI, or VIM, or -both of them) . They will be classified in the Appendix of this -document. - -2. General Requirements Background and terminology --------------------------------------------------- - -==[MT] At the moment 2.1-2.3 seem to be more background sections than -requirements. Should we rename this part?== - -2.1 Terminologies and definitions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -- **NFVI** is abbreviation for Network Function Virtualization - Infrastructure; sometimes it is also referred as data plane in this - document. -- **VIM** is abbreviation for Virtual Infrastructure Management; - sometimes it is also referred as control plane in this document. -- **Operators** are network service providers and Virtual Network - Function (VNF) providers. -- **End-Users** are subscribers of Operator's services. 
-- **Network Service** is a service provided by an Operator to its - End-users using a set of (virtualized) Network Functions -- **Infrastructure Services** are those provided by the NFV - Infrastructure and the Management & Orchestration functions to the - VNFs. I.e. these are the virtual resources as perceived by the VNFs. -- **Smooth Upgrade** means that the upgrade results in no service - outage for the end-users. -- **Rolling Upgrade** is an upgrade strategy that upgrades each node or - a subset of nodes in a wave rolling style through the data centre. It - is a popular upgrade strategy to maintains service availability. -- **Parallel Universe** is an upgrade strategy that creates and deploys - a new universe - a system with the new configuration - while the old - system continues running. The state of the old system is transferred - to the new system after sufficient testing of the later. -- **Infrastructure Resource Model** ==(suggested by MT)== is identified - as: physical resources, virtualization facility resources and virtual - resources. -- **Physical Resources** are the hardware of the infrastructure, may - also includes the firmware that enable the hardware. -- **Virtual Resources** are resources provided as services built on top - of the physical resources via the virtualization facilities; in our - case, they are the components that VNF entities are built on, e.g. - the VMs, virtual switches, virtual routers, virtual disks etc - ==[MT] I don't think the VNF is the virtual resource. Virtual - resources are the VMs, virtual switches, virtual routers, virtual - disks etc. The VNF uses them, but I don't think they are equal. The - VIM doesn't manage the VNF, but it does manage virtual resources.== -- **Visualization Facilities** are resources that enable the creation - of virtual environments on top of the physical resources, e.g. - hypervisor, OpenStack, etc. 
- -2.2 Upgrade Objects -~~~~~~~~~~~~~~~~~~~ - -2.2.1 Physical Resource -^^^^^^^^^^^^^^^^^^^^^^^ - -| Most of the cloud infrastructures support dynamic addition/removal of - hardware. A hardware upgrade could be done by removing the old - hardware node and adding the new one. This will not be in the scope of - this project. -| ==[MT] Does this mean that we are excluding firmware upgrades too?== - -2.2.2 Virtual Resources -^^^^^^^^^^^^^^^^^^^^^^^ - -| Virtual resource upgrade mainly done by users. OPNFV may facilitate - the activity, but suggest to have it in long term roadmap instead of - initiate release. -| ==[MT] same comment here: I don't think the VNF is the virtual - resource. Virtual resources are the VMs, virtual switches, virtual - routers, virtual disks etc. The VNF uses them, but I don't think they - are equal. For example if by some reason the hypervisor is changed and - the current VMs cannot be migrated to the new hypervisor, they are - incompatible, then the VMs need to be upgraded too. This is not - something the NFVI user (i.e. VNFs ) would even know about.== - -2.2.3 Virtualization Facility Resources -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -| Based on the functionality they provide, virtualization facility - resources could be divided into computing node, networking node, - storage node and management node. -| The possible upgrade objects in these nodes are addressed below: - (Note: hardware based virtualization may considered as virtualization - facility resource, but from escalator perspective, it is better - considered it as part of hardware upgrade. ) - -**Computing node** - -#. OS Kernel -#. Hypvervisor and virtual switch -#. Other kernel modules, like driver -#. User space software packages, like nova-compute agents and other - control plane programs - -| Updating 1 and 2 will cause the loss of virtualzation functionality of - the compute node, which may lead to data plane services interruption - if the virtual resource is not redudant. 
-| Updating 3 might result the same. -| Updating 4 might lead to control plane services interruption if not an - HA deployment. - -**Networking node** - -#. OS kernel, optional, not all switch/router allow you to upgrade its - OS since it is more like a firmware than a generic OS. -#. User space software package, like neutron agents and other control - plane programs - -| Updating 1 if allowed will cause a node reboot and therefore leads to - data plane services interruption if the virtual resource is not - redudant. -| Updating 2 might lead to control plane services interruption if not an - HA deployment. - -**Storage node** - -#. OS kernel, optional, not all storage node allow you to upgrade its OS - since it is more like a firmware than a generic OS. -#. Kernel modules -#. User space software packages, control plane programs - -| Updating 1 if allowed will cause a node reboot and therefore leads to - data plane services interruption if the virtual resource is not - redudant. -| Update 2 might result in the same. -| Updating 3 might lead to control plane services interruption if not an - HA deployment. - -**Management node** - -#. OS Kernel -#. Kernel modules, like driver -#. User space software packages, like database, message queue and - control plane programs. - -| Updating 1 will cause a node reboot and therefore leads to control - plane services interruption if not an HA deployment. Updating 2 might - result in the same. -| Updating 3 might lead to control plane services interruption if not an - HA deployment. - -2.3 Upgrade Span -~~~~~~~~~~~~~~~~ - -| **Major Upgrade** -| Upgrades between major releases may introducing significent changes in - function, configuration and data, such as the upgrade of OPNFV from - Arno to Brahmaputra. - -| **Minor Upgrade** -| Upgrades inside one major releases which would not leads to changing - the stucture of the platform and may not infect the schema of the - system data. 
- -2.4 Upgrade Granularity -~~~~~~~~~~~~~~~~~~~~~~~ - -2.4.1 Physical/Hardware Dimension -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Support full / partial upgrade for data centre, cluster, zone. Because -of the upgrade of a data centre or a zone, it may be divided into -several batches. The upgrade of a cloud environment (cluster) may also -be partial. For example, in one cloud environment running a number of -VNFs, we may just try one of them to check the stability and -performance, before we upgrade all of them. - -2.4.2 Software Dimension -^^^^^^^^^^^^^^^^^^^^^^^^ - -- The upgrade of host OS or kernel may need a 'hot migration' -- The upgrade of OpenStack’s components - i.the one-shot upgrade of all components - ii.the partial upgrade (or bugfix patch) which only affects some - components (e.g., computing, storage, network, database, message - queue, etc.) - -| ==[MT] this section seems to overlap with 2.1.== -| I can see the following dimensions for the software - -- different software packages -- different funtions - Considering that the target versions of all - software are compatible the upgrade needs to ensure that any - dependencies between SW and therefore packages are taken into account - in the upgrade plan, i.e. no version mismatch occurs during the - upgrade therefore dependencies are not broken -- same function - This is an upgrade specific question if different - versions can coexist in the system when a SW is being upgraded from - one version to another. This is particularly important for stateful - functions e.g. storage, networking, control services. The upgrade - method must consider the compatibility of the redundant entities. - -- different versions of the same software package -- major version changes - they may introduce incompatibilities. 
Even - when there are backward compatibility requirements changes may cause - issues at graceful rollback -- minor version changes - they must not introduce incompatibility - between versions, these should be primarily bug fixes, so live - patches should be possible - -- different installations of the same software package -- using different installation options - they may reflect different - users with different needs so redundancy issues are less likely - between installations of different options; but they could be the - reflection of the heterogeneous system in which case they may provide - redundancy for higher availability, i.e. deeper inspection is needed -- using the same installation options - they often reflect that the are - used by redundant entities across space - -- different distribution possibilities in space - same or different - availability zones, multi-site, geo-redundancy - -- different entities running from the same installation of a software - package -- using different startup options - they may reflect different users so - redundancy may not be an issues between them -- using same startup options - they often reflect redundant - entities==== - -3.5 Upgrade duration -~~~~~~~~~~~~~~~~~~~~ - -As the OPNFV end-users are primarily Telco operators, the network -services provided by the VNFs deployed on the NFVI should meet the -requirement of 'Carrier Grade'. - -In telecommunication, a "carrier grade" or"carrier class" refers to a -system, or a hardware or software component that is extremely reliable, -well tested and proven in its capabilities. Carrier grade systems are -tested and engineered to meet or exceed "five nines" high availability -standards, and provide very fast fault recovery through redundancy -(normally less than 50 milliseconds). [from wikipedia.org] - -"five nines" means working all the time in ONE YEAR except 5'15". - -We have learnt that a well prepared upgrade of OpenStack needs 10 -minutes. 
The major time slot in the outage time is used spent on -synchronizing the database. [from ' Ten minutes OpenStack Upgrade? Done! -' by Symantec] - -This 10 minutes of downtime of OpenStack however did not impact the -users, i.e. the VMs running on the compute nodes. This was the outage of -the control plane only. On the other hand with respect to the -preparations this was a manually tailored upgrade specific to the -particular deployment and the versions of each OpenStack service. - -The project targets to achieve a more generic methodology, which however -requires that the upgrade objects fulfill ceratin requirements. Since -this is only possible on the long run we target first upgrades from -version to version for the different VIM services. - -**Questions:** - -#. | Can we manage to upgrade OPNFV in only 5 minutes? - | ==[MT] The first question is whether we have the same carrier grade - requirement on the control plane as on the user plane. I.e. how - much control plane outage we can/willing to tolerate? - | In the above case probably if the database is only half of the size - we can do the upgrade in 5 minutes, but is that good? It also means - that if the database is twice as much then the outage is 20 - minutes. - | For the user plane we should go for less as with two release yearly - that means 10 minutes outage per year.== - | ==[Malla] 10 minutes outage per year to the users? Plus, if we take - control plane into the consideration, then total outage will be - more than 10 minute in whole network, right?== - | ==[MT] The control plane outage does not have to cause outage to - the users, but it may of course depending on the size of the system - as it's more likely that there's a failure that needs to be handled - by the control plane.== - -#. | Is it acceptable for end users ? Such as a planed service - interruption will lasting more than ten minutes for software - upgrade. - | ==[MT] For user plane, no it's not acceptable in case of - carrier-grade. 
The 5' 15" downtime should include unplanned and - planned downtimes.== - | ==[Malla] I go agree with Maria, it is not acceptable.== - -#. | Will any VNFs still working well when VIM is down? - | ==[MT] In case of OpenStack it seems yes. .:)== - -2.5.1 The maximum duration of an upgrade -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -| The duration of an upgrade is related to and proportional with the - scale and the complexity of the OPNFV platform as well as the - granularity (in function and in space) of the upgrade. -| [Malla] Also, if is a partial upgrade like module upgrade, it depends - also on the OPNFV modules and their tight connection entites as well. - -2.5.2 The maximum duration of a rollback when an upgrade is failed - this should be about rollback duration -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -| The duration of a rollback is short than the corresponding upgrade. It - depends on the duration of restore the software and configue data from - pre-upgrade backup / snapshot. -| ==[MT] During the upgrade process two types of failure may happen: -| In case we can recover from the failure by undoing the upgrade - actions it is possible to roll back the already executed part of the - upgrade in graceful manner introducing no more service outage than - what was introduced during the upgrade. Such a graceful rollback - requires typically the same amount of time as the executed portion of - the upgrade and impose minimal state/data loss.== -| ==[MT] Requirement: It should be possible to roll back gracefully the - failed upgrade of stateful services of the control plane. -| In case we cannot recover from the failure by just undoing the - upgrade actions, we have to restore the upgraded entities from their - backed up state. 
In other terms the system falls back to an earlier - state, which is typically a faster recovery procedure than graceful - rollback and depending on the statefulness of the entities involved it - may result in significant state/data loss.== -| **Two possible types of failures can happen during an upgrade** - -#. We can recover from the failure that occured in the upgrade process: - In this case, a graceful rolling back of the executed part of the - upgrade may be possible which would "undo" the executed part in a - similar fashion. Thus, such a roll back introduces no more service - outage during an upgrade than the executed part introduced. This - process typically requires the same amount of time as the executed - portion of the upgrade and impose minimal state/data loss. -#. We cannot recover from the failure that occured in the upgrade - process: In this case, the system needs to fall back to an earlier - consistent state by reloading this backed-up state. This is typically - a faster recovery procedure than the graceful rollback, but can cause - state/data loss. The state/data loss usually depends on the - statefulness of the entities whose state is restored from the backup. - -2.5.3 The maximum duration of a VNF interruption -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -| Since not the entire process of a smooth upgrade will affect the VNFs, - the duration of the VNF interruption may be shorter than the duration - of the upgrade. In some cases, the VNF running without the control - from of the VIM is acceptable. -| ==[MT] Should require explicitly that the NFVI should be able to - provide its services to the VNFs independent of the control plane?== -| ==[MT] Requirement: The upgrade of the control plane must not cause - interruption of the NFVI services provided to the VNFs.== -| ==[MT] With respect to carrier-grade the yearly service outage of the - VNF should not exceed 5' 15" regardless whether it is planned or - unplanned outage. 
Considering the HA requirements TL-9000 requires an - ent-to-end service recovery time of 15 seconds based on which the ETSI - GS NFV-REL 001 V1.1.1 (2015-01) document defines three service - availability levels (SAL). The proposed example service recovery times - for these levels are: -| SAL1: 5-6 seconds -| SAL2: 10-15 seconds -| SAL3: 20-25 seconds== -| ==[Pva] my comment was actually that the downtime metrics of the - underlying elements, components and services are small fraction of the - total E2E service availability time. No-one on the E2E service path - will get the whole downtime allocation (in this context it includes - upgrade process related outages for the services provided by VIM etc. - elements that are subject to upgrade process).== -| ==[MT] So what you are saying is that the upgrade of any entity - (component, service) shouldn't cause even this much service - interruption. This was the reason I brought these figures here as well - that they are posing some kind of upper-upper boundary. Ideally the - interruption is in the millisecond range i.e. no more than a - switchover or a live migration.== -| ==[MT] Requirement: Any interruption caused to the VNF by the upgrade - of the NFVI should be in the sub-second range.== - -==[MT] In the future we also need to consider the upgrade of the NFVI, -i.e. HW, firmware, hypervisors, host OS etc.== - -3. Functional Considerations ----------------------------- - -3.1 Requirement of Escalator's Basic Actions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -This section describes the basic functions may required by Escalator. - -3.1.1 Preparation (offline) -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This is the design phase when the upgrade plan (or upgrade campaign) is -being designed so that it can be executed automatically with minimal -service outage. It may include the following work: - -#. 
Check the dependencies of the software modules and their impact, - backward compatibilities to figure out the appropriate upgrade method - and ordering. -#. Find out if a rolling upgrade could be planned with several rolling - steps to avoid any service outage due to the upgrade some - parts/services at the same time. -#. Collect the proper version files and check the integration for - upgrading. -#. The preparation step should produce an output (i.e. upgrade - campaign/plan), which is executable automatically in an NFV Famawork - and which can be validated before execution. - - - The upgrade campaign should not be referring to scalable entities - directly, but allow for adaptation to the system configuration and - state at any given moment. - - The upgrade campaign should describe the ordering of the upgrade - of different entities so that dependencies, redundancies can be - maintained during the upgrade execution - - The upgrade campaign should provide information about the - applicable recovery procedures and their ordering. - - The upgrade campaign should consider information about the - verification/testing procedures to be performed during the upgrade - so that upgrade failures can be detected as soon as possible and - the appropriate recovery procedure can be identified and applied. - - The upgrade campaign should provide information on the expected - execution time so that hanging execution can be identified - - The upgrade campaign should indicate any point in the upgrade when - coordination with the users (VNFs) is required. - -==[hujie]Depends on the attributes of the object being upgraded, the -upgrade plan may be slitted into step(s) and/or sub-plan(s), and even -more small sub-plans in design phase. 
The plan(s) or sub-plan(s) my -include step(s) or sub-plan(s).== - -3.1.2 Validation the upgrade plan / Checking the pre-requisites of System( offline / online) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -| The upgrade plan should be validated before the execution by testing - it in a test environment which is similar to the product environment. -| ==[MT]However it could also mean that we can identify some properties - that it should satisfy e.g. what operations can or cannot be executed - simultaneously like never take out two VMs of the same VNF. -| Another question is if it requires that the system is in a particular - state when the upgrade is applied. I.e. if there's certain amount of - redundacy in the system, migration is enabled for VMs, when the NFVI - is upgraded the VIM is healthy, when the VIM is upgraded the NFVI is - healthy, etc. -| I'm not sure what online validation means: Is it the validation of the - upgrade plan/campaign or the validation of the system that it is in a - state that the upgrade can be performed without too much risk?== - -| Before the upgrade plan being executed, the system heathly of the - online product environment should be checked and confirmed to satisfy - the requirements which were described in the upgrade plan. The - sysinfo, e.g. which included system alarms, performance statistics and - diagnostic logs, will be collected and analyized. It is required to - resolve all of the system faults or exclud the unhealthy part before - executing the upgrade plan. -| ==[hujie] Text merged.== - -3.1.3 Backup/Snapshot (online) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -For avoid loss of data when a unsuccessful upgrade was encountered, the -data should be backuped and the system state snapshot should be taken -before the excution of upgrade plan. This would be considered in the -upgrade plan. - -Several backups/Snapshots may be generated and stored before the single -steps of changes. 
The following data/files are required to be -considered: - -#. running version files for each node. -#. system components' configuration file and database. -#. image and storage, if it is necessary. - ==[MT] Does 3 imply VNF image and storage? I.e. VNF state and data?== - -| ==[hujie] The following text is derived from previous "4. Negotiate - with the VNF if it's ready for the upgrade"== -| Although the upper layer, which include VNFs and VNFMs, is out of the - scope of Escalator, but it is still recommended to let it ready for a - smooth system upgrade. The escalator could not garanttee the safe of - VNFs. The upper layer should have some safe guard mechanism in design, - and ready for avoiding failure in system upgrade. - -3.1.4 Execution (online) -^^^^^^^^^^^^^^^^^^^^^^^^ - -| The execution of upgrade plan should be a dynamical procedure which is - controlled by Escalator. -| ==[hujie] Revised text to be general.== - -#. It is required to supporting execution ether in sequence or in - parallel. -#. It is required to checke the result of the execution and take the - action according the situation and the policies in the upgrade plan. -#. It is required to execute properly on various configurations of - system object. I.e. stand-alone, HA, etc. -#. It is required to excecute on the designated different parts of the - system. I.e. physical server, virtualized server, rack, chassis, - cluster, even different geographical places. - -3.1.5 Testing (online) -^^^^^^^^^^^^^^^^^^^^^^ - -| The testing after upgrade the whole system or parts of system to make - sure the upgraded system(object) is working normally. -| ==[hujie] Revised text to be general.== - -#. It is recommended to run the prepared test cases to see if the - functionalities are availiable without any problem. -#. It is recommended to check the sysinfo, e.g. system alarms, - performance statistics and diagnostic logs to see if there are any - abnormal. 
-
-3.1.6 Restore/Rollback (online)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| When upgrade is failure unfortunatly, a quick system restore or system
-  rollback should be taken to recovery the system and the services.
-| ==[hujie] Revised text to be general.==
-
-#. It is recommend to support system restore from backup when upgrade
-   was failed.
-#. It is recommend to support gracefull rollback with reverse order
-   steps if possible.
-
-3.1.7 Monitoring (online)
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| Escalator should continually monitor the process of upgrade. It is
-  keeping update status of each module, each node, each cluster into a
-  status table during upgrade.
-| ==[hujie] Revised text to be general.==
-
-#. It is required to collect the status of every objects being upgraded
-   and sending abnormal alerms during the upgrade.
-#. It is recommend to reuse the existing monitoring system, like alarm.
-#. It is recommend to support pro-actively query.
-#. It is recommend to support passively wait for notification.
-
-| **Two possible ways for monitoring:**
-| **Pro-Actively Query** requires NFVI/VIM provides proper API or CLI
-  interface. If Escalator serves as a service, it should pass on these
-  interfaces.
-| **Passively Wait for Notification** requires Escalator provides
-  callback interface, which could be used by NFVI/VIM systems or upgrade
-  agent to send back notification.
-| [hujie] I am not sure why not to subscribe the notification.
-
-3.1.8 Logging (online)
-^^^^^^^^^^^^^^^^^^^^^^
-
-Record the information generated by escalator into log files. The log
-file is used for manual diagnostic of exceptions.
-
-#. It is required to support logging.
-#. It is recommended to include time stamp, object id, action name,
-   error code, etc.
-
-3.1.9 Administrative Control (online)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Administrative Control is used for control the privilege to start any
-escalator's actions for avoding unauthorized operations.
-
-#. It is required to support administrative control mechenism
-#. It is recommed to reuse the system's own secure system.
-#. It is required to avoid conflicts when the system's own secure system
-   being upgraded.
-
-3.2 Requirements on system object being upgraded
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-| ==We can develope BPs in future from req of this section and GA for
-  upper stream projects==
-| Escalator focus on smooth upgrade. In practical implementation, it
-  might be combined with installer/deployer, or act as an independent
-  tool/service. In either way, it requires targeting systems(NFVI and
-  VIM) are developed/deployed in a way that Escalator could perform
-  upgrade on them.
-
-On NFVI system, live-migration is likely used to maintain availability
-because OPNFV would like to make HA transparent from end user. This
-requires VIM system being able to put compute node into maintenance mode
-and then isolated from normal service. Otherwise, new NFVI instances
-might risk at being schedule into the upgrading node.
-
-| On VIM system, availability is likely achieved by redundancy. This
-  impose less requirements on system/services being upgrade (see PVA
-  comments in early version). However, there should be a way to put the
-  target system into standby mode. Because starting upgrade on the
-  master node in a cluster is likely a bad idea.
-| ==[hujie] Revised text to be general.==
-
-#. It is required for NFVI/VIM to support **service handover** mechanism
-   that minimize interruption to 0.001%(i.e. 99.999% service
-   availability). Possible implementations are live-migration, redundant
-   deployment, etc, (Note: for VIM, interruption could be less
-   restrictive)
-#. It is required for NFVI/VIM to restore the early verion in a efficent
-   way, such as **snapshot**.
-#. It is required for NFVI/VIM to **migration data** efficiently between
-   base and upgraded system.
-   ==[hujie] What is exact meaning of "base" here?==
-#. It is recomend for NFV/VIM's interface to support upgrade
-   orchestration, e.g. reading/setting system state
-   ==[hujie] I am not sure if it reflect the previous text.==
-
-4. Use Cases
-------------
-
-This section describes the use cases to verify the requirements of
-Escalator.
-
-4.1 Upgrade a system with minimal configuration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-A minimal configuration system is normally depolyed for experimental or
-developement ussage, such as a OPNFV test bed. Althouth it dose not have
-large workload, but it is a typical system to be upgraded frequently.
-
-4.2 Upgrade a system with HA configuration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-A HA configuration system is very popular in the operator's data centre.
-And it is a typical product environment. It always running 7 \* 24 a
-week with VNFs running on it to provide services to the end users.
-
-4.3 Upgrade a system with Multi-Site configuration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Upgrade in one site may cause service interruption to other site, if
-both sites are depended and sharing the same modules/data base (e.g. a
-keystone for both sites).
-
-If a site failure during an upgrade, the rollback missing any minimal
-state/data loss can cause an affect/failure to the depended site.
-
-==Consider one site of ARNO release first. Then, multi-site in the
-future.==
-
-5. RA of Escalator
-------------------
-
-This section describes the reference architecture, the function blocks,
-the function entities of Escalator for the reader to well understand how
-the basic functions be organized.
-
-6. Information Flows
---------------------
-
-| This section describes the information flows among the function
-  entities when Escalator is in actions.
-| We should consider a generic procedure / frameworks of upgrading. And
-  may provide a plugin interface for specialized tasks
-
-7. Interfaces
--------------
-
-This section describes the required interfaces of Escalator.
-
-7.1 Manual Interface (CLI / GUI)
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-7.2 RESTful API
-~~~~~~~~~~~~~~~
-
-To support 3.3 Negotiate with the VNF if it's ready for the upgrade
-
-7.3 Configuration File
-~~~~~~~~~~~~~~~~~~~~~~
-
-This section will suggest a format of the configuration files and how to
-deal with it.
-
-7.4 Log File
-------------
-
-This section will suggest a format of the log files and how to deal with
-it.
-
-8. Requirements from other OPNFV projects
------------------------------------------
-
-| We have created a questionnaire for collecting other projects
-  requirments
-  (https://docs.google.com/forms/d/11o1mt15zcq0WBtXYK0n6lKF8XuIzQTwvv8ePTjmcoF0/viewform?usp=send_form),
-  please advertise it.
-| ==[hujie] Can we force other OPNFV projects to complete the survey by
-  using JIRA dependence?==
-
-8.1 Doctor Project
-~~~~~~~~~~~~~~~~~~
-
-| ==Note: This scenario could be out of scope in Escalator project, but
-  having the option to support this should be better to align with
-  Doctor requirements.==
-| The scope of Doctor project also covers maintenance scenario in which
-  1) the VIM administorator requests host maintenance to VIM, 2) VIM
-  will notifiy it to consumer such as VNFM to trigger application level
-  migration or switching active-standby nodes, and 3) VIM waits responce
-  from the consumer for a short while.
-
-- VIM should send out notification of VM migration to consumer (VNFM)
-  as abstracted message like "maintenance".
-- VIM could wait VM migration until it receives "VM ready to
-  maintenance" message from the owner (VNFM)
-
-8.2 HA Project
-~~~~~~~~~~~~~~
-
-8.3 Multi-site Project
-~~~~~~~~~~~~~~~~~~~~~~
-
-- Escalator upgrade one site should at least not lead to the other site
-  API token validation failed.
-
-9. Reference
-------------
-
-| [1] ETSI GS NFV 002 (V1.1.1): “Architectural Framework”
-| [2] ETSI GS NFV 003 (V1.1.1): "Terminology for Main Concepts in NFV".
-| [3] ETSI GS NFV-SWA001:“Virtual Network Function Architecture”
-| [4] ETSI GS NFV-MAN001:“Management and Orchestration”
-| [5] ETSI GS NFV-REL001:"Resiliency Requirements"
-| [6] QuEST Forum TL-9000:"Quality Management System Requirement
-  Handbook"
-| [7] Service Availabilty Forum AIS:"Software Management Framework"
-
-10. Useful Working Drafts of ETSI NFV
--------------------------------------
-
-| Access them with your own ETSI account, please DO NOT disclose the
-  content.
-| [1] Migrate Virtualised Compute Resource operation @ 7.3.1.8
-| ftp://docbox.etsi.org/ISG/NFV/Open/Drafts/IFA005_Or-Vi_ref_point_Spec/NFV-IFA005v070.zip
-| [2] Reliability issues during NFV Software upgrade and improvement
-  mechanisms @ 8
-| ftp://@docbox.etsi.org/ISG/NFV/Open/Drafts/REL003_E2E_reliability_models/NFV-REL003v030.zip
-
-Appendix
---------
-
-A.1 Impact Analysis
-~~~~~~~~~~~~~~~~~~~
-
-Upgrading the different software modules may cause different impact on
-the availability of the infrastracture resources and even on the service
-continuity of the vNFs.
-
-**Software modules in the computing nodes**
-
-#. Host OS patch
-   ==[MT] As SW module, we should list the host OS and maybe ====its
-   drivers as well. From upgrade persepctive do we limit host OS
-   upgrades to patches only?==
-#. Hypervisor, such as KVM, QEMU, XEN, libvirt
-#. Openstack agent in computing nodes (like Nova agent, Ceilometer
-   agent...)
-
-**Software modules in network nodes**
-
-#. Neutron L2/L3 agent
-#. OVS, SR-IOV Driver
-
-**Software modules storage nodes**
-
-#. Ceph
-
-The table below analyses such an impact - considering a single instance
-of each software module - from the following aspects:
-
-- the function which will be lost during upgrade,
-- the duration of the loss of this specific function,
-- if this causes the loss of the vNF function,
-- if it causes incompatibility in the different parts of the software,
-- what should be backed up before the upgrade,
-- the duration of restoration time if the upgrade fails
-
-| These values provided come from internal testing and based on some
-  assumptions, they may vary depending on the deployment techniques.
-  Please feel free to add if you find more efficient values during your
-  testing.
-| https://wiki.opnfv.org/_media/upgrade_analysis_v0.5.xlsx
-| Note that no redundancy of the software modules is considered in the
-  table.