|author||Jie Hu <firstname.lastname@example.org>||2015-12-16 18:52:19 +0800|
|committer||Jie Hu <email@example.com>||2015-12-16 19:11:03 +0800|
diff --git a/docs/requirements/104-Requirements.rst b/docs/requirements/104-Requirements.rst
new file mode 100644
@@ -0,0 +1,478 @@
+As the OPNFV end-users are primarily Telecom operators, the network
+services provided by the VNFs deployed on the NFVI should meet the
+requirement of 'Carrier Grade'.::
+ In telecommunication, a "carrier grade" or "carrier class" refers to a
+ system, or a hardware or software component that is extremely reliable,
+ well tested and proven in its capabilities. Carrier grade systems are
+ tested and engineered to meet or exceed "five nines" high availability
+ standards, and provide very fast fault recovery through redundancy
+ (normally less than 50 milliseconds). [from wikipedia.org]
+"Five nines" means being available all year round except for about 5 minutes and 15 seconds (5'15").
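+The 5'15" figure can be reproduced with a short calculation. This is a minimal sketch, assuming a 365-day year; the function name is illustrative and not part of any OPNFV tool.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def downtime_budget_seconds(availability: float) -> float:
    """Return the yearly downtime allowed by an availability ratio."""
    return (1.0 - availability) * SECONDS_PER_YEAR


budget = downtime_budget_seconds(0.99999)       # "five nines"
minutes, seconds = divmod(round(budget), 60)
print(f"{minutes} min {seconds} s per year")    # 5 min 15 s per year
```

+Under the same assumption, a single 10 minute (600 s) outage alone would exceed this yearly budget.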
+ We have learnt that a well prepared upgrade of OpenStack needs 10
+ minutes. Most of the outage time is spent on synchronizing the
+ database. [from 'Ten minutes OpenStack Upgrade? Done!' by Symantec]
+These 10 minutes of downtime of the OpenStack services, however, did not impact the
+users, i.e. the VMs running on the compute nodes. This was an outage of
+the control plane only. On the other hand, with respect to the
+preparations, this was a manually tailored upgrade specific to the
+particular deployment and the versions of each OpenStack service.
+The project aims to achieve a more generic methodology, which however
+requires that the upgrade objects fulfil certain requirements. Since
+this is only possible in the long run, we first target the upgrade
+of the different VIM services from version to version.
+1. Can we manage to upgrade OPNFV in only 5 minutes?
+.. <MT> The first question is whether we have the same carrier grade
+ requirement on the control plane as on the user plane. I.e. how
+ much control plane outage can we tolerate, or are we willing to tolerate?
+ In the above case, if the database were only half the size we could
+ probably do the upgrade in 5 minutes, but is that good? It also means
+ that if the database is twice as large then the outage is 20 minutes.
+ For the user plane we should go for less, as with two releases yearly
+ that means 10 minutes outage per year.
+.. <Malla> 10 minutes outage per year to the users? Plus, if we take
+ the control plane into consideration, then the total outage will be
+ more than 10 minutes in the whole network, right?
+.. <MT> The control plane outage does not have to cause outage to
+ the users, but it may of course depending on the size of the system
+ as it's more likely that there's a failure that needs to be handled
+ by the control plane.
+2. Is it acceptable for end users? For example, a planned service
+ interruption lasting more than ten minutes for a software upgrade?
+.. <MT> For user plane, no it's not acceptable in case of
+ carrier-grade. The 5' 15" downtime should include unplanned and
+ planned downtimes.
+.. <Malla> I agree with Maria, it is not acceptable.
+3. Will any VNFs still work well when the VIM is down?
+.. <MT> In case of OpenStack it seems yes. :)
+The maximum duration of an upgrade
+The duration of an upgrade is related and proportional to the
+scale and the complexity of the OPNFV platform as well as the
+granularity (in function and in space) of the upgrade.
+.. <Malla> Also, if it is a partial upgrade such as a module upgrade, it
+ also depends on the OPNFV modules and the entities tightly connected to them.
+.. <MT> Since the maintenance window is shrinking and becoming non-existent,
+ the duration of the upgrade is secondary to the requirement of smooth upgrade.
+ But we probably want to be able to put a time constraint on each upgrade,
+ within which it must complete; otherwise it is considered failed and the system
+ should be rolled back. I.e. in case of automatic execution it might not be clear
+ whether an upgrade is just long or is hanging. The time constraints may be a function
+ of the size of the system in terms of the upgrade object(s).
+The maximum duration of a roll back when an upgrade has failed
+The duration of a roll back is shorter than that of the corresponding upgrade. It
+depends on the time needed to restore the software and configuration data from the
+pre-upgrade backup / snapshot.
+.. <MT> During the upgrade process two types of failure may happen:
+ In case we can recover from the failure by undoing the upgrade
+ actions it is possible to roll back the already executed part of the
+ upgrade in a graceful manner, introducing no more service outage than
+ what was introduced during the upgrade. Such a graceful roll back
+ typically requires the same amount of time as the executed portion of
+ the upgrade and imposes minimal state/data loss.
+.. <MT> Requirement: It should be possible to roll back gracefully the
+ failed upgrade of stateful services of the control plane.
+ In case we cannot recover from the failure by just undoing the
+ upgrade actions, we have to restore the upgraded entities from their
+ backed up state. In other terms the system falls back to an earlier
+ state, which is typically a faster recovery procedure than graceful
+ roll back and depending on the statefulness of the entities involved it
+ may result in significant state/data loss.
+.. <MT> Two possible types of failures can happen during an upgrade
+.. <MT> We can recover from the failure that occurred in the upgrade process:
+ In this case, a graceful rolling back of the executed part of the
+ upgrade may be possible which would "undo" the executed part in a
+ similar fashion. Thus, such a roll back introduces no more service
+ outage during an upgrade than the executed part introduced. This
+ process typically requires the same amount of time as the executed
+ portion of the upgrade and imposes minimal state/data loss.
+.. <MT> We cannot recover from the failure that occurred in the upgrade
+ process: In this case, the system needs to fall back to an earlier
+ consistent state by reloading this backed-up state. This is typically
+ a faster recovery procedure than the graceful roll back, but can cause
+ state/data loss. The state/data loss usually depends on the
+ statefulness of the entities whose state is restored from the backup.
+The maximum duration of a VNF interruption (Service outage)
+Since only part of the smooth upgrade process affects the VNFs,
+the duration of the VNF interruption may be shorter than the duration
+of the upgrade. In some cases, it is acceptable for the VNF to run without
+control from the VIM.
+.. <MT> Should we require explicitly that the NFVI should be able to
+ provide its services to the VNFs independent of the control plane?
+.. <MT> Requirement: The upgrade of the control plane must not cause
+ interruption of the NFVI services provided to the VNFs.
+.. <MT> With respect to carrier-grade, the yearly service outage of the
+ VNF should not exceed 5' 15" regardless of whether it is planned or
+ unplanned outage. Considering the HA requirements, TL-9000 requires an
+ end-to-end service recovery time of 15 seconds based on which the ETSI
+ GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
+ availability levels (SAL). The proposed example service recovery times
+ for these levels are:
+.. <MT> SAL1: 5-6 seconds
+.. <MT> SAL2: 10-15 seconds
+.. <MT> SAL3: 20-25 seconds
+.. <Pva> My comment was actually that the downtime metrics of the
+ underlying elements, components and services are a small fraction of the
+ total E2E service availability time. No-one on the E2E service path
+ will get the whole downtime allocation (in this context it includes
+ upgrade process related outages for the services provided by VIM etc.
+ elements that are subject to upgrade process).
+.. <MT> So what you are saying is that the upgrade of any entity
+ (component, service) shouldn't cause even this much service
+ interruption. This was the reason I brought these figures here as well
+ that they are posing some kind of upper-upper boundary. Ideally the
+ interruption is in the millisecond range i.e. no more than a
+ switch-over or a live migration.
+.. <MT> Requirement: Any interruption caused to the VNF by the upgrade
+ of the NFVI should be in the sub-second range.
+.. <MT> In the future we also need to consider the upgrade of the NFVI,
+ i.e. HW, firmware, hypervisors, host OS etc.
+The system is running normally. If there are any faults before the upgrade,
+it is difficult to distinguish between faults introduced by the upgrade and those of the environment itself.
+The environment should have redundant resources. Because the upgrade
+process is based on service migration, in the absence of resource
+redundancy, it is impossible to migrate the service and therefore to
+achieve a smooth upgrade.
+Resource redundancy in two levels:
+NFVI level: This level is mainly the redundancy of compute node resources.
+During the upgrade, the virtual machines in service can be migrated to another
+free compute node.
+VNF level: This level depends on the HA mechanism in the VNF, such as
+active-standby or load balancing. In this case, as long as the service of the VMs on the target
+node is moved to other free nodes, the migration of the VMs might not be
+necessary.
+Which kind of redundancy to use depends on the specific environment.
+Generally speaking, during the upgrade, the VNF's service level availability
+mechanism should take priority over the NFVI's. This will help
+us to reduce the service outage.
+Release version of software components
+This is primarily a compatibility requirement. You can refer to Linux/Python
+Compatible Semantic Versioning 3.0.0:
+Given a version number MAJOR.MINOR.PATCH, increment the:
+MAJOR version when you make incompatible API changes,
+MINOR version when you add functionality in a backwards-compatible manner,
+PATCH version when you make backwards-compatible bug fixes.
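+The rule above can be turned into a simple compatibility check. This is a minimal sketch, assuming plain ``MAJOR.MINOR.PATCH`` strings without pre-release or build suffixes; the function names are illustrative and not taken from any existing library.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_backward_compatible(current: str, candidate: str) -> bool:
    """True if `candidate` keeps the API of `current`:
    same MAJOR version and not older than `current`."""
    cur = parse_version(current)
    new = parse_version(candidate)
    return new[0] == cur[0] and new >= cur


print(is_backward_compatible("2.3.1", "2.4.0"))  # True: MINOR bump only
print(is_backward_compatible("2.3.1", "3.0.0"))  # False: MAJOR bump
```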
+Some internal interfaces of OpenStack will be used by Escalator indirectly,
+such as the VM migration related interface between VIM and NFVI. These interfaces
+are therefore required to be backward compatible. Refer to "Interface" chapter
+Describes the different types of requirements. To have a table to label the source of
+the requirements, e.g. Doctor, Multi-site, etc.
+This section describes the basic functions that may be required by Escalator.
+This is the design phase when the upgrade plan (or upgrade campaign) is
+being designed so that it can be executed automatically with minimal
+service outage. It may include the following work:
+1. Check the dependencies of the software modules, their impact and their
+ backward compatibility to figure out the appropriate upgrade method
+ and ordering.
+2. Find out if a rolling upgrade could be planned with several rolling
+ steps to avoid any service outage due to upgrading some
+ parts/services at the same time.
+3. Collect the proper version files and check the integration for
+4. The preparation step should produce an output (i.e. upgrade
+ campaign/plan), which is executable automatically in an NFV Framework
+ and which can be validated before execution.
+ - The upgrade campaign should not be referring to scalable entities
+ directly, but allow for adaptation to the system configuration and
+ state at any given moment.
+ - The upgrade campaign should describe the ordering of the upgrade
+ of different entities so that dependencies and redundancies can be
+ maintained during the upgrade execution.
+ - The upgrade campaign should provide information about the
+ applicable recovery procedures and their ordering.
+ - The upgrade campaign should consider information about the
+ verification/testing procedures to be performed during the upgrade
+ so that upgrade failures can be detected as soon as possible and
+ the appropriate recovery procedure can be identified and applied.
+ - The upgrade campaign should provide information on the expected
+ execution time so that hanging execution can be identified.
+ - The upgrade campaign should indicate any point in the upgrade when
+ coordination with the users (VNFs) is required.
+.. <hujie> Depending on the attributes of the object being upgraded, the
+ upgrade plan may be split into step(s) and/or sub-plan(s), and even
+ smaller sub-plans, in the design phase. The plan(s) or sub-plan(s) may
+ include step(s) or sub-plan(s).
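+The ordering and expected execution time requirements above can be sketched as follows. This is a minimal illustration only, assuming Python 3.9+ for ``graphlib``; the ``Step`` structure, the step names and the ``slack`` factor are hypothetical and do not describe Escalator's actual plan format.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Dict, List, Set


@dataclass
class Step:
    """One step of a hypothetical upgrade campaign."""
    name: str
    expected_seconds: float            # expected execution time
    depends_on: Set[str] = field(default_factory=set)


def execution_order(steps: List[Step]) -> List[str]:
    """Order the steps so that every dependency is upgraded first."""
    graph: Dict[str, Set[str]] = {s.name: set(s.depends_on) for s in steps}
    return list(TopologicalSorter(graph).static_order())


def is_hanging(step: Step, elapsed_seconds: float, slack: float = 2.0) -> bool:
    """Treat a step as hanging once it exceeds its expected time by `slack`."""
    return elapsed_seconds > step.expected_seconds * slack


plan = [
    Step("compute", 300, {"controller"}),
    Step("controller", 120, {"database"}),
    Step("database", 600),
]
order = execution_order(plan)
print(order)  # ['database', 'controller', 'compute']
```

+A step flagged by ``is_hanging`` would be a candidate for the recovery procedures named in the campaign.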
+Validating the upgrade plan / Checking the pre-requisites of the System (offline / online)
+The upgrade plan should be validated before the execution by testing
+it in a test environment which is similar to the production environment.
+.. <MT> However it could also mean that we can identify some properties
+ that it should satisfy e.g. what operations can or cannot be executed
+ simultaneously like never take out two VMs of the same VNF.
+.. <MT> Another question is if it requires that the system is in a particular
+ state when the upgrade is applied. I.e. if there's certain amount of
+ redundancy in the system, migration is enabled for VMs, when the NFVI
+ is upgraded the VIM is healthy, when the VIM is upgraded the NFVI is
+ healthy, etc.
+.. <MT> I'm not sure what online validation means: Is it the validation of the
+ upgrade plan/campaign or the validation of the system that it is in a
+ state that the upgrade can be performed without too much risk?
+Before the upgrade plan is executed, the system health of the
+online production environment should be checked and confirmed to satisfy
+the requirements described in the upgrade plan. The
+sysinfo, which includes e.g. system alarms, performance statistics and
+diagnostic logs, will be collected and analyzed. It is required to
+resolve all of the system faults or exclude the unhealthy part before
+executing the upgrade plan.
+To avoid loss of data when an upgrade is unsuccessful, the
+data should be backed up and a snapshot of the system state should be taken
+before the execution of the upgrade plan. This would be considered in the
+Several backups/snapshots may be generated and stored before the individual
+steps of changes. The following data/files are required to be
+1. running version files for each node.
+2. system components' configuration file and database.
+3. image and storage, if it is necessary.
+.. <MT> Does 3 imply VNF image and storage? I.e. VNF state and data?
+.. <hujie> The following text is derived from previous "4. Negotiate
+ with the VNF if it's ready for the upgrade"
+Although the upper layer, which includes VNFs and VNFMs, is out of the
+scope of Escalator, it is still recommended to prepare it for a
+smooth system upgrade. Escalator cannot guarantee the safety of the
+VNFs. The upper layer should have some safeguard mechanism in its design
+and be ready to avoid failures during system upgrade.
+The execution of the upgrade plan should be a dynamic procedure which is
+ controlled by Escalator.
+.. <hujie> Revised text to be general.
+1. It is required to support execution either in sequence or in
+ parallel.
+2. It is required to check the result of the execution and take
+ action according to the situation and the policies in the upgrade plan.
+3. It is required to execute properly on various configurations of
+ system object, e.g. stand-alone, HA, etc.
+4. It is required to execute on the designated different parts of the
+ system, e.g. physical server, virtualized server, rack, chassis,
+ cluster, even different geographical places.
+Testing after upgrading the whole system or parts of the system makes
+sure the upgraded system (object) is working normally.
+.. <hujie> Revised text to be general.
+1. It is recommended to run the prepared test cases to see if the
+ functionalities are available without any problem.
+2. It is recommended to check the sysinfo, e.g. system alarms,
+ performance statistics and diagnostic logs to see if there are any
+When an upgrade fails, a quick system restore or system
+roll-back should be performed to recover the system and the services.
+.. <hujie> Revised text to be general.
+1. It is recommended to support system restore from backup when an upgrade
+ has failed.
+2. It is recommended to support graceful roll-back with reverse order
+ steps if possible.
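+The two recovery options above can be sketched as follows. This is a minimal illustration, assuming each step provides its own ``apply`` and ``undo`` callables and that the failed step itself needs no undo; all names are hypothetical and not Escalator's actual interface.

```python
from typing import Callable, List, Tuple

Action = Callable[[], None]


def run_upgrade(steps: List[Tuple[str, Action, Action]],
                restore_from_backup: Action) -> str:
    """Apply the steps in order; on failure undo the completed steps in
    reverse order, and fall back to the backup if undoing fails too."""
    done: List[Tuple[str, Action]] = []
    try:
        for name, apply, undo in steps:
            apply()
            done.append((name, undo))
        return "upgraded"
    except Exception:
        try:
            for _name, undo in reversed(done):   # graceful roll-back
                undo()
            return "rolled back"
        except Exception:
            restore_from_backup()                # restore from backup/snapshot
            return "restored"


log: List[str] = []


def failing_step() -> None:
    raise RuntimeError("simulated upgrade failure")


steps = [
    ("db", lambda: log.append("apply db"), lambda: log.append("undo db")),
    ("api", lambda: log.append("apply api"), lambda: log.append("undo api")),
    ("scheduler", failing_step, lambda: log.append("undo scheduler")),
]
result = run_upgrade(steps, lambda: log.append("restore"))
print(result, log)
# rolled back ['apply db', 'apply api', 'undo api', 'undo db']
```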
+Escalator should continually monitor the upgrade process. It
+keeps the status of each module, each node and each cluster up to date in a
+status table during the upgrade.
+.. <hujie> Revised text to be general.
+1. It is required to collect the status of every object being upgraded
+ and to send alarms on abnormal conditions during the upgrade.
+2. It is recommended to reuse the existing monitoring system, like alarm.
+3. It is recommended to support pro-active querying.
+4. It is recommended to support passively waiting for notification.
+**Two possible ways for monitoring:**
+**Pro-Actively Query** requires that NFVI/VIM provide a proper API or CLI
+interface. If Escalator serves as a service, it should pass on these
+**Passively Wait for Notification** requires that Escalator provide a
+callback interface, which could be used by NFVI/VIM systems or an upgrade
+agent to send back notifications.
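+The two monitoring styles can feed the same status table. This is a minimal sketch; the ``StatusTable`` class, the ``query`` callable standing in for an NFVI/VIM API or CLI call, and the state strings are all assumptions made for illustration.

```python
from typing import Callable, Dict


class StatusTable:
    """Hypothetical per-object status table kept during an upgrade."""

    def __init__(self) -> None:
        self.status: Dict[str, str] = {}

    def poll(self, object_id: str, query: Callable[[str], str]) -> str:
        """Pro-active query: ask the target system for the current state."""
        self.status[object_id] = query(object_id)
        return self.status[object_id]

    def on_notification(self, object_id: str, state: str) -> None:
        """Passive wait: callback invoked by the target system or an agent."""
        self.status[object_id] = state

    def abnormal(self) -> Dict[str, str]:
        """Objects whose state should raise an alarm."""
        return {k: v for k, v in self.status.items() if v == "failed"}


def fake_query(object_id: str) -> str:
    # Stand-in for a real API/CLI call; always reports progress.
    return "upgrading"


table = StatusTable()
table.poll("compute-1", fake_query)
table.on_notification("controller-1", "failed")
print(table.abnormal())  # {'controller-1': 'failed'}
```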
+.. <hujie> I am not sure why we would not subscribe to the notifications.
+Record the information generated by Escalator into log files. The log
+file is used for manual diagnosis of exceptions.
+1. It is required to support logging.
+2. It is recommended to include time stamp, object id, action name,
+ error code, etc.
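+A log record carrying the recommended fields could look like this. This is a minimal sketch using Python's standard ``logging`` module; the logger name, the field names ``object_id``, ``action`` and ``error_code``, and the sample values are assumptions for illustration.

```python
import io
import logging

# Time stamp, object id, action name and error code in every record.
LOG_FORMAT = ("%(asctime)s object=%(object_id)s action=%(action)s "
              "code=%(error_code)s %(message)s")


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes the fields above to `stream`."""
    logger = logging.getLogger("escalator.sketch")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.error("upgrade step failed",
          extra={"object_id": "compute-1", "action": "upgrade",
                 "error_code": 42})
print(buffer.getvalue(), end="")
```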
+Administrative Control (online)
+Administrative Control is used to control the privilege to start any
+of Escalator's actions, in order to avoid unauthorized operations.
+#. It is required to support an administrative control mechanism.
+#. It is recommended to reuse the system's own security system.
+#. It is required to avoid conflicts when the system's own security system
+ is being upgraded.
+Requirements on Object being upgraded
+.. <hujie> We can develop BPs in future from requirements of this section and
+ gap analysis for upstream projects
+Escalator focuses on smooth upgrade. In practical implementation, it
+might be combined with an installer/deployer, or act as an independent
+tool/service. Either way, it requires that the target systems (NFVI and
+VIM) are developed/deployed in a way that Escalator can perform
+upgrade on them.
+On the NFVI system, live-migration is likely to be used to maintain availability
+because OPNFV would like to make HA transparent to the end user. This
+requires the VIM system to be able to put a compute node into maintenance mode
+and then isolate it from normal service. Otherwise, new NFVI instances
+risk being scheduled onto the node being upgraded.
+On the VIM system, availability is likely to be achieved by redundancy. This
+imposes fewer requirements on the systems/services being upgraded (see PVA
+comments in early version). However, there should be a way to put the
+target system into standby mode, because starting the upgrade on the
+master node in a cluster is likely a bad idea.
+.. <hujie> Revised text to be general.
+1. It is required for NFVI/VIM to support a **service handover** mechanism
+ that minimizes interruption to 0.001% (i.e. 99.999% service
+ availability). Possible implementations are live-migration, redundant
+ deployment, etc. (Note: for VIM, interruption could be less
+2. It is required for NFVI/VIM to restore the earlier version in an efficient
+ way, such as **snapshot**.
+3. It is required for NFVI/VIM to **migrate data** efficiently between
+ the base and the upgraded system.
+4. It is recommended for NFVI/VIM's interface to support upgrade
+ orchestration, e.g. reading/setting system state.
+Availability mechanism, etc.