14 files changed, 932 insertions, 814 deletions
diff --git a/INFO b/INFO
index cc2abd0..a134220 100644
--- a/INFO
+++ b/INFO
@@ -27,6 +27,7 @@ huangzhipeng@huawei.com
 meng.jia@zte.com.cn
 liyi.meng@ericsson.com
 pasi.vaananen@stratus.com
+wang.guobing1@zte.com.cn
 
 Link to TSC approval of the project:
 http://meetbot.opnfv.org/meetings/opnfv-meeting/2015/opnfv-meeting.2015-04-21-14.00.html
diff --git a/doc/00-Authors.rst b/doc/00-Authors.rst
new file mode 100644
index 0000000..fdbf61b
--- /dev/null
+++ b/doc/00-Authors.rst
@@ -0,0 +1,15 @@
+Authors:
+--------
+
+| Jie Hu (ZTE, hu.jie@zte.com.cn)
+| Qiao Fu (China Mobile, fuqiao@chinamobile.com)
+| Ulrich Kleber (Huawei, Ulrich.Kleber@huawei.com)
+| Maria Toeroe (Ericsson, maria.toeroe@ericsson.com)
+| Sama, Malla Reddy (DOCOMO, sama@docomolab-euro.com)
+| Zhong Chao (ZTE, chao.zhong@zte.com.cn)
+| Julien Zhang (ZTE, zhang.jun3g@zte.com.cn)
+| Yuri Yuan (ZTE, yuan.yue@zte.com.cn)
+| Zhipeng Huang (Huawei, huangzhipeng@huawei.com)
+| Jia Meng (ZTE, meng.jia@zte.com.cn)
+| Liyi Meng (Ericsson, liyi.meng@ericsson.com)
+| Pasi Vaananen (Stratus, pasi.vaananen@stratus.com)
+\ No newline at end of file
diff --git a/doc/01-Scope.rst b/doc/01-Scope.rst
new file mode 100644
index 0000000..5247e40
--- /dev/null
+++ b/doc/01-Scope.rst
@@ -0,0 +1,28 @@
+Scope
+-----
+
+This document describes the user requirements on the smooth upgrade
+function of the NFVI and VIM with respect to the upgrades of the OPNFV
+platform from one version to another. Smooth upgrade means that the
+upgrade results in no service outage for the end-users. This requires
+that the process of the upgrade is automatically carried out by a tool
+(code name: Escalator) with pre-configured data. The upgrade process
+includes preparation, validation, execution, monitoring and
+conclusion.
+  
+.. <MT> While it is good to have a tool for the entire upgrade process,
+   but it is a challenging task, so maybe we shouldn't require automation
+   for the entire process right away. Automation is essential at
+   execution.
+  
+.. <hujie> Maybe we can analyze information flows of the upgrade tool,
+   abstract the basic / essential actions from the tool (or tools), and
+   map them to a command set of NFVI / VIM's interfaces.
+
+The requirements are defined in a stepwise approach, i.e. in the first
+phase focusing on the upgrade of the VIM then widening the scope to the
+NFVI.
+
+The requirements may apply to different NFV functions (NFVI, or VIM, or
+both of them). They will be classified in the Appendix of this
+document.
+\ No newline at end of file
diff --git a/doc/02-Background_and_Terminologies.rst b/doc/02-Background_and_Terminologies.rst
new file mode 100644
index 0000000..afb392f
--- /dev/null
+++ b/doc/02-Background_and_Terminologies.rst
@@ -0,0 +1,458 @@
+General Requirements Background and Terminology
+-----------------------------------------------
+
+Terminologies and definitions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+NFVI
+  The term is an abbreviation for Network Function Virtualization
+  Infrastructure; sometimes it is also referred to as the data plane in
+  this document.
+
+VIM
+  The term is an abbreviation for Virtualized Infrastructure Manager;
+  sometimes it is also referred to as the control plane in this document.
+   
+Operator
+  The term refers to network service providers and Virtual Network
+  Function (VNF) providers.
+
+End-User
+  The term refers to a subscriber of the Operator's services.
+
+Network Service
+  The term refers to a service provided by an Operator to its
+  End-users using a set of (virtualized) Network Functions.
+
+Infrastructure Services
+  The term refers to services provided by the NFV Infrastructure and
+  the Management & Orchestration functions to the VNFs, i.e.
+  these are the virtual resources as perceived by the VNFs.
+
+Smooth Upgrade
+  The term refers to an upgrade that results in no service outage 
+  for the end-users.
+
+Rolling Upgrade
+  The term refers to an upgrade strategy that upgrades each node or
+  a subset of nodes in a wave style rolling through the data centre. It
+  is a popular upgrade strategy to maintain service availability.
+
+Parallel Universe
+  The term refers to an upgrade strategy that creates and deploys
+  a new universe - a system with the new configuration - while the old
+  system continues running. The state of the old system is transferred
+  to the new system after sufficient testing of the new system.
+
+Infrastructure Resource Model
+  The term refers to the representation of infrastructure resources,
+  namely: the physical resources, the virtualization
+  facility resources and the virtual resources.
+
+Physical Resource
+  The term refers to a piece of hardware in the NFV infrastructure, which
+  may also include the firmware which enables the hardware.
+
+Virtual Resource
+  The term refers to a resource that is provided as a service built on top
+  of the physical resources via the virtualization facilities; in particular,
+  virtual resources are the resources on which VNF entities are deployed, e.g.
+  the VMs, virtual switches, virtual routers, virtual disks etc.
+
+.. <MT> I don't think the VNF is the virtual resource. Virtual
+   resources are the VMs, virtual switches, virtual routers, virtual
+   disks etc. The VNF uses them, but I don't think they are equal. The
+   VIM doesn't manage the VNF, but it does manage virtual resources.
+   
+Virtualization Facility
+   The term refers to a resource that enables the creation
+   of virtual environments on top of the physical resources, e.g.
+   hypervisor, OpenStack, etc.
+
+Upgrade Plan (or Campaign?) 
+   The term refers to a choreography that describes how the upgrade should
+   be performed in terms of its targets (i.e. upgrade objects), the
+   steps/actions required for upgrading each, and the coordination of these
+   steps so that service availability can be maintained. It is an input to an
+   upgrade tool (Escalator) to carry out the upgrade.
+
+
+Upgrade Objects
+~~~~~~~~~~~~~~~
+
+Physical Resource
+^^^^^^^^^^^^^^^^^
+
+Most cloud infrastructures support dynamic addition/removal of
+hardware. A hardware upgrade could be done by adding the new
+hardware node and removing the old one. From the perspective of smooth
+upgrade the orchestration/scheduling of these actions is the primary concern.
+Upgrading a physical resource,
+like upgrading its firmware and/or modifying its configuration data, may
+also be considered in the future.
+
+
+Virtual Resources
+^^^^^^^^^^^^^^^^^
+
+Virtual resource upgrade is mainly done by users. OPNFV may facilitate
+the activity, but it is suggested to put it on the long-term roadmap
+rather than in the initial release.
+
+.. <MT> same comment here: I don't think the VNF is the virtual
+  resource. Virtual resources are the VMs, virtual switches, virtual
+  routers, virtual disks etc. The VNF uses them, but I don't think they
+  are equal. For example if by some reason the hypervisor is changed and
+  the current VMs cannot be migrated to the new hypervisor, they are
+  incompatible, then the VMs need to be upgraded too. This is not
+  something the NFVI user (i.e. VNFs ) would even know about.
+
+
+Virtualization Facility Resources
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Based on the functionality they provide, virtualization facility
+resources could be divided into computing node, networking node,
+storage node and management node.
+
+The possible upgrade objects in these nodes are addressed below:
+(Note: hardware based virtualization may be considered as virtualization
+facility resource, but from escalator perspective, it is better to
+consider it as part of the hardware upgrade. )
+
+**Computing node**
+
+1. OS Kernel
+
+2. Hypervisor and virtual switch
+
+3. Other kernel modules, like drivers
+
+4. User space software packages, like nova-compute agents and other
+   control plane programs.
+
+Updating 1 and 2 will cause the loss of virtualization functionality of
+the compute node, which may lead to data plane services interruption
+if the virtual resource is not redundant.
+
+Updating 3 might result in the same.
+
+Updating 4 might lead to control plane services interruption if not an
+HA deployment.
+
+**Networking node**
+
+1. OS kernel, optional, not all switches/routers allow the upgrade of
+   their OS since it is more like a firmware than a generic OS.
+
+2. User space software package, like neutron agents and other control
+   plane programs
+
+Updating 1 if allowed will cause a node reboot and therefore leads to
+data plane service interruption if the virtual resource is not
+redundant.
+
+Updating 2 might lead to control plane services interruption if not an
+HA deployment.
+
+**Storage node**
+
+1. OS kernel, optional, not all storage nodes allow the upgrade of their
+   OS since it is more like a firmware than a generic OS.
+
+2. Kernel modules
+
+3. User space software packages, control plane programs
+
+Updating 1 if allowed will cause a node reboot and therefore leads to
+data plane services interruption if the virtual resource is not
+redundant.
+
+Updating 2 might result in the same.
+
+Updating 3 might lead to control plane services interruption if not an
+HA deployment.
+
+**Management node**
+
+1. OS Kernel
+
+2. Kernel modules, like drivers
+
+3. User space software packages, like database, message queue and
+   control plane programs.
+
+Updating 1 will cause a node reboot and therefore leads to control
+plane services interruption if not an HA deployment. Updating 2 might
+result in the same.
+
+Updating 3 might lead to control plane services interruption if not an
+HA deployment.
+
+Upgrade Span
+~~~~~~~~~~~~
+
+**Major Upgrade**
+
+Upgrades between major releases may introduce significant changes in
+function, configuration and data, such as the upgrade of OPNFV from
+Arno to Brahmaputra.
+
+**Minor Upgrade**
+
+Upgrades inside one major release, which would not lead to changing
+the structure of the platform and may not affect the schema of the
+system data.
+
+Upgrade Granularity
+~~~~~~~~~~~~~~~~~~~
+
+Physical/Hardware Dimension
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Support full / partial upgrade for data centre, cluster, zone. The
+upgrade of a data centre or a zone may be divided into several
+batches. The upgrade of a cloud environment (cluster) may also
+be partial. For example, in one cloud environment running a number of
+VNFs, we may just try one of them to check the stability and
+performance, before we upgrade all of them.
+
+Software Dimension
+^^^^^^^^^^^^^^^^^^
+
+-  The upgrade of host OS or kernel may need a 'hot migration'
+-  The upgrade of OpenStack’s components
+
+    i. the one-shot upgrade of all components
+
+    ii. the partial upgrade (or bugfix patch) which only affects some
+        components (e.g., computing, storage, network, database, message
+        queue, etc.)
+
+.. <MT> this section seems to overlap with 2.1.
+  I can see the following dimensions for the software.
+
+.. <MT> different software packages
+
+.. <MT> different functions - Considering that the target versions of all
+   software are compatible the upgrade needs to ensure that any
+   dependencies between SW and therefore packages are taken into account
+   in the upgrade plan, i.e. no version mismatch occurs during the
+   upgrade therefore dependencies are not broken
+   
+.. <MT> same function - This is an upgrade specific question if different
+   versions can coexist in the system when a SW is being upgraded from
+   one version to another. This is particularly important for stateful
+   functions e.g. storage, networking, control services. The upgrade
+   method must consider the compatibility of the redundant entities.
+
+.. <MT> different versions of the same software package
+
+.. <MT> major version changes - they may introduce incompatibilities. Even
+   when there are backward compatibility requirements changes may cause
+   issues at graceful roll-back
+   
+.. <MT> minor version changes - they must not introduce incompatibility
+   between versions, these should be primarily bug fixes, so live
+   patches should be possible
+   
+.. <MT> different installations of the same software package
+
+.. <MT> using different installation options - they may reflect different
+   users with different needs so redundancy issues are less likely
+   between installations of different options; but they could be the
+   reflection of the heterogeneous system in which case they may provide
+   redundancy for higher availability, i.e. deeper inspection is needed
+   
+.. <MT> using the same installation options - they often reflect that the are
+   used by redundant entities across space
+   
+.. <MT> different distribution possibilities in space - same or different
+   availability zones, multi-site, geo-redundancy
+   
+.. <MT> different entities running from the same installation of a software
+   package
+   
+.. <MT>  using different start-up options - they may reflect different users so
+   redundancy may not be an issues between them
+   
+.. <MT> using same start-up options - they often reflect redundant
+   entities
+
+Upgrade duration
+~~~~~~~~~~~~~~~~
+
+As the OPNFV end-users are primarily Telecom operators, the network
+services provided by the VNFs deployed on the NFVI should meet the
+requirement of 'Carrier Grade'::
+
+  In telecommunication, a "carrier grade" or "carrier class" refers to a
+  system, or a hardware or software component that is extremely reliable,
+  well tested and proven in its capabilities. Carrier grade systems are
+  tested and engineered to meet or exceed "five nines" high availability
+  standards, and provide very fast fault recovery through redundancy
+  (normally less than 50 milliseconds). [from wikipedia.org]
+
+"Five nines" (99.999% availability) allows at most about 5'15" of downtime per year.
+
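+As a rough check of this figure, assuming a 365-day year, the yearly
+downtime budget at 99.999% availability works out as:
+
+.. math::
+
+   (1 - 0.99999) \times 365 \times 24 \times 60\ \text{min}
+   \approx 5.26\ \text{min} \approx 5\ \text{min}\ 15\ \text{s}
+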
+::
+
+  We have learnt that a well prepared upgrade of OpenStack needs 10
+  minutes. The major time slot in the outage time is spent on
+  synchronizing the database. [from ' Ten minutes OpenStack Upgrade? Done!
+  ' by Symantec]
+
+This 10 minutes of downtime of the OpenStack services however did not impact the
+users, i.e. the VMs running on the compute nodes. This was the outage of
+the control plane only. On the other hand with respect to the
+preparations this was a manually tailored upgrade specific to the
+particular deployment and the versions of each OpenStack service.
+
+The project aims to achieve a more generic methodology, which however
+requires that the upgrade objects fulfil certain requirements. Since
+this is only possible in the long run, we first target the upgrade
+of the different VIM services from version to version.
+
+**Questions:**
+
+1. Can we manage to upgrade OPNFV in only 5 minutes?
+ 
+.. <MT> The first question is whether we have the same carrier grade
+   requirement on the control plane as on the user plane. I.e. how
+   much control plane outage we can/willing to tolerate?
+   In the above case probably if the database is only half of the size
+   we can do the upgrade in 5 minutes, but is that good? It also means
+   that if the database is twice as much then the outage is 20
+   minutes.
+   For the user plane we should go for less as with two release yearly
+   that means 10 minutes outage per year.
+
+.. <Malla> 10 minutes outage per year to the users? Plus, if we take
+   control plane into the consideration, then total outage will be
+   more than 10 minute in whole network, right?
+
+.. <MT> The control plane outage does not have to cause outage to
+   the users, but it may of course depending on the size of the system
+   as it's more likely that there's a failure that needs to be handled
+   by the control plane.
+
+2. Is it acceptable for end users? For example, a planned service
+   interruption lasting more than ten minutes for a software
+   upgrade.
+
+.. <MT> For user plane, no it's not acceptable in case of
+   carrier-grade. The 5' 15" downtime should include unplanned and
+   planned downtimes.
+   
+.. <Malla> I agree with Maria, it is not acceptable.
+
+3. Will any VNFs still work well when the VIM is down?
+
+.. <MT> In case of OpenStack it seems yes. .:)
+
+The maximum duration of an upgrade
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The duration of an upgrade is related and proportional to the
+scale and the complexity of the OPNFV platform as well as the
+granularity (in function and in space) of the upgrade.
+
+.. <Malla> Also, if is a partial upgrade like module upgrade, it depends
+  also on the OPNFV modules and their tight connection entities as well.
+
+.. <MT> Since the maintenance window is shrinking and becoming non-existent
+  the duration of the upgrade is secondary to the requirement of smooth upgrade.
+  But probably we want to be able to put a time constraint on each upgrade
+  during which it must complete otherwise it is considered failed and the system
+  should be rolled back. I.e. in case of automatic execution it might not be clear
+  if an upgrade is long or just hanging. The time constraints may be a function
+  of the size of the system in terms of the upgrade object(s).
+
+The maximum duration of a roll back when an upgrade is failed 
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The duration of a roll back is shorter than the corresponding upgrade. It
+depends on the time needed to restore the software and configuration data
+from the pre-upgrade backup / snapshot.
+
+.. <MT> During the upgrade process two types of failure may happen:
+  In case we can recover from the failure by undoing the upgrade
+  actions it is possible to roll back the already executed part of the
+  upgrade in graceful manner introducing no more service outage than
+  what was introduced during the upgrade. Such a graceful roll back
+  requires typically the same amount of time as the executed portion of
+  the upgrade and impose minimal state/data loss.
+  
+.. <MT> Requirement: It should be possible to roll back gracefully the
+  failed upgrade of stateful services of the control plane.
+  In case we cannot recover from the failure by just undoing the
+  upgrade actions, we have to restore the upgraded entities from their
+  backed up state. In other terms the system falls back to an earlier
+  state, which is typically a faster recovery procedure than graceful
+  roll back and depending on the statefulness of the entities involved it
+  may result in significant state/data loss.
+  
+.. <MT> Two possible types of failures can happen during an upgrade
+
+.. <MT> We can recover from the failure that occurred in the upgrade process:
+  In this case, a graceful rolling back of the executed part of the
+  upgrade may be possible which would "undo" the executed part in a
+  similar fashion. Thus, such a roll back introduces no more service
+  outage during an upgrade than the executed part introduced. This
+  process typically requires the same amount of time as the executed
+  portion of the upgrade and impose minimal state/data loss.
+
+.. <MT> We cannot recover from the failure that occurred in the upgrade
+   process: In this case, the system needs to fall back to an earlier
+   consistent state by reloading this backed-up state. This is typically
+   a faster recovery procedure than the graceful roll back, but can cause
+   state/data loss. The state/data loss usually depends on the
+   statefulness of the entities whose state is restored from the backup.
+
+The maximum duration of a VNF interruption (Service outage)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Since not the entire process of a smooth upgrade will affect the VNFs,
+the duration of the VNF interruption may be shorter than the duration
+of the upgrade. In some cases, it is acceptable for the VNF to run
+without the control of the VIM.
+
+.. <MT> Should require explicitly that the NFVI should be able to
+  provide its services to the VNFs independent of the control plane?
+
+.. <MT> Requirement: The upgrade of the control plane must not cause
+  interruption of the NFVI services provided to the VNFs.
+
+.. <MT> With respect to carrier-grade the yearly service outage of the
+  VNF should not exceed 5' 15" regardless whether it is planned or
+  unplanned outage. Considering the HA requirements TL-9000 requires an
+  end-to-end service recovery time of 15 seconds based on which the ETSI
+  GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
+  availability levels (SAL). The proposed example service recovery times
+  for these levels are:
+
+.. <MT> SAL1: 5-6 seconds
+
+.. <MT> SAL2: 10-15 seconds
+
+.. <MT> SAL3: 20-25 seconds
+
+.. <Pva> my comment was actually that the downtime metrics of the
+  underlying elements, components and services are small fraction of the
+  total E2E service availability time. No-one on the E2E service path
+  will get the whole downtime allocation (in this context it includes
+  upgrade process related outages for the services provided by VIM etc.
+  elements that are subject to upgrade process).
+  
+.. <MT> So what you are saying is that the upgrade of any entity
+  (component, service) shouldn't cause even this much service
+  interruption. This was the reason I brought these figures here as well
+  that they are posing some kind of upper-upper boundary. Ideally the
+  interruption is in the millisecond range i.e. no more than a
+  switch-over or a live migration.
+  
+.. <MT> Requirement: Any interruption caused to the VNF by the upgrade
+  of the NFVI should be in the sub-second range.
+
+.. <MT> In the future we also need to consider the upgrade of the NFVI,
+  i.e. HW, firmware, hypervisors, host OS etc.
+\ No newline at end of file
diff --git a/doc/03-Functional_Requirements.rst b/doc/03-Functional_Requirements.rst
new file mode 100644
index 0000000..c0695bb
--- /dev/null
+++ b/doc/03-Functional_Requirements.rst
@@ -0,0 +1,240 @@
+Functional Requirements
+-----------------------
+
+Basic Actions
+~~~~~~~~~~~~~
+
+This section describes the basic functions that may be required by Escalator.
+
+Preparation (offline)
+^^^^^^^^^^^^^^^^^^^^^
+
+This is the design phase when the upgrade plan (or upgrade campaign) is
+being designed so that it can be executed automatically with minimal
+service outage. It may include the following work:
+
+1. Check the dependencies of the software modules and their impact,
+   backward compatibilities to figure out the appropriate upgrade method
+   and ordering.
+2. Find out if a rolling upgrade could be planned with several rolling
+   steps to avoid any service outage due to upgrading some
+   parts/services at the same time.
+3. Collect the proper version files and check the integration for
+   upgrading.
+4. The preparation step should produce an output (i.e. upgrade
+   campaign/plan), which is executable automatically in an NFV Framework
+   and which can be validated before execution.
+
+   -  The upgrade campaign should not be referring to scalable entities
+      directly, but allow for adaptation to the system configuration and
+      state at any given moment.
+   -  The upgrade campaign should describe the ordering of the upgrade
+      of different entities so that dependencies, redundancies can be
+      maintained during the upgrade execution
+   -  The upgrade campaign should provide information about the
+      applicable recovery procedures and their ordering.
+   -  The upgrade campaign should consider information about the
+      verification/testing procedures to be performed during the upgrade
+      so that upgrade failures can be detected as soon as possible and
+      the appropriate recovery procedure can be identified and applied.
+   -  The upgrade campaign should provide information on the expected
+      execution time so that hanging execution can be identified
+   -  The upgrade campaign should indicate any point in the upgrade when
+      coordination with the users (VNFs) is required.
+
+.. <hujie> Depending on the attributes of the object being upgraded, the
+  upgrade plan may be split into step(s) and/or sub-plan(s), and even
+  smaller sub-plans in the design phase. The plan(s) or sub-plan(s) may
+  include step(s) or sub-plan(s).
+
+Validating the upgrade plan / Checking the pre-requisites of System (offline / online)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The upgrade plan should be validated before the execution by testing
+it in a test environment which is similar to the production environment.
+
+.. <MT> However it could also mean that we can identify some properties
+  that it should satisfy e.g. what operations can or cannot be executed
+  simultaneously like never take out two VMs of the same VNF.
+  
+.. <MT> Another question is if it requires that the system is in a particular
+  state when the upgrade is applied. I.e. if there's certain amount of
+  redundancy in the system, migration is enabled for VMs, when the NFVI
+  is upgraded the VIM is healthy, when the VIM is upgraded the NFVI is
+  healthy, etc.
+  
+.. <MT> I'm not sure what online validation means: Is it the validation of the
+  upgrade plan/campaign or the validation of the system that it is in a
+  state that the upgrade can be performed without too much risk?
+
+Before the upgrade plan is executed, the system health of the
+online production environment should be checked and confirmed to satisfy
+the requirements which were described in the upgrade plan. The
+sysinfo, which includes e.g. system alarms, performance statistics and
+diagnostic logs, will be collected and analyzed. It is required to
+resolve all of the system faults or exclude the unhealthy part before
+executing the upgrade plan.
+
+
+Backup/Snapshot (online)
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+To avoid loss of data when an unsuccessful upgrade is encountered, the
+data should be backed up and a snapshot of the system state should be taken
+before the execution of the upgrade plan. This should be considered in the
+upgrade plan.
+
+Several backups/Snapshots may be generated and stored before the single
+steps of changes. The following data/files are required to be
+considered:
+
+1. running version files for each node.
+2. system components' configuration file and database.
+3. image and storage, if it is necessary.
+
+.. <MT> Does 3 imply VNF image and storage? I.e. VNF state and data?
+
+.. <hujie> The following text is derived from previous "4. Negotiate
+  with the VNF if it's ready for the upgrade"
+  
+Although the upper layer, which includes VNFs and VNFMs, is out of the
+scope of Escalator, it is still recommended to make it ready for a
+smooth system upgrade. Escalator cannot guarantee the safety of
+VNFs. The upper layer should have some safeguard mechanism in its design,
+and be ready to avoid failures during a system upgrade.
+
+Execution (online)
+^^^^^^^^^^^^^^^^^^
+
+The execution of the upgrade plan should be a dynamic procedure which is
+controlled by Escalator.
+
+.. <hujie> Revised text to be general.
+
+1. It is required to support execution either in sequence or in
+   parallel.
+2. It is required to check the result of the execution and take the
+   action according to the situation and the policies in the upgrade plan.
+3. It is required to execute properly on various configurations of
+   system object, e.g. stand-alone, HA, etc.
+4. It is required to execute on the designated different parts of the
+   system, e.g. physical server, virtualized server, rack, chassis,
+   cluster, even different geographical places.
+
+Testing (online)
+^^^^^^^^^^^^^^^^
+
+Testing is performed after upgrading the whole system or parts of the
+system to make sure the upgraded system (object) is working normally.
+
+.. <hujie> Revised text to be general.
+
+1. It is recommended to run the prepared test cases to see if the
+   functionalities are available without any problem.
+2. It is recommended to check the sysinfo, e.g. system alarms,
+   performance statistics and diagnostic logs to see if there are any
+   abnormalities.
+
+Restore/Roll-back (online)
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When an upgrade fails, a quick system restore or system
+roll-back should be performed to recover the system and the services.
+
+.. <hujie> Revised text to be general.
+
+1. It is recommended to support system restore from backup when an
+   upgrade has failed.
+2. It is recommended to support graceful roll-back with reverse order
+   steps if possible.
+
+Monitoring (online)
+^^^^^^^^^^^^^^^^^^^
+
+Escalator should continually monitor the process of the upgrade. It
+keeps the status of each module, each node, and each cluster up to date
+in a status table during the upgrade.
+
+.. <hujie> Revised text to be general.
+
+1. It is required to collect the status of every object being upgraded
+   and to send alarms on abnormal conditions during the upgrade.
+2. It is recommended to reuse the existing monitoring system, like alarm.
+3. It is recommended to support pro-active querying.
+4. It is recommended to support passively waiting for notification.
+
+**Two possible ways for monitoring:**
+
+**Pro-Actively Query** requires that the NFVI/VIM provides a proper API or
+CLI interface. If Escalator serves as a service, it should pass on these
+interfaces.
+
+**Passively Wait for Notification** requires that Escalator provides a
+callback interface, which could be used by NFVI/VIM systems or upgrade
+agents to send back notifications.
+
+.. <hujie> I am not sure why not to subscribe the notification.
+
+Logging (online)
+^^^^^^^^^^^^^^^^
+
+Record the information generated by Escalator into log files. The log
+file is used for manual diagnosis of exceptions.
+
+1. It is required to support logging.
+2. It is recommended to include time stamp, object id, action name,
+   error code, etc.
+
+Administrative Control (online)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Administrative Control is used to control the privilege to start any
+of Escalator's actions in order to avoid unauthorized operations.
+
+#. It is required to support an administrative control mechanism.
+#. It is recommended to reuse the system's own security system.
+#. It is required to avoid conflicts when the system's own security system
+   is being upgraded.
+
+Requirements on Object being upgraded
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. <hujie> We can develop BPs in future from requirements of this section and
+  gap analysis for upper stream projects
+  
+Escalator focuses on smooth upgrade. In a practical implementation, it
+might be combined with an installer/deployer, or act as an independent
+tool/service. Either way, it requires that the target systems (NFVI and
+VIM) are developed/deployed in a way that Escalator can perform
+upgrades on them.
+
+On the NFVI system, live-migration is likely used to maintain availability
+because OPNFV would like to make HA transparent to the end user. This
+requires the VIM system to be able to put a compute node into maintenance
+mode and then isolate it from normal service. Otherwise, new NFVI instances
+risk being scheduled onto the node being upgraded.
+
+On the VIM system, availability is likely achieved by redundancy. This
+imposes fewer requirements on the system/services being upgraded (see PVA
+comments in early version). However, there should be a way to put the
+target system into standby mode, because starting the upgrade on the
+master node in a cluster is likely a bad idea.
+
+.. <hujie>Revised text to be general.
+
+1. It is required for NFVI/VIM to support a **service handover** mechanism
+   that minimizes interruption to 0.001% (i.e. 99.999% service
+   availability). Possible implementations are live-migration, redundant
+   deployment, etc. (Note: for VIM, interruption could be less
+   restrictive.)
+
+2. It is required for NFVI/VIM to restore the earlier version in an efficient
+   way, such as **snapshot**.
+
+3. It is required for NFVI/VIM to **migrate data** efficiently between
+   base and upgraded system.
+
+4. It is recommended for the NFVI/VIM's interface to support upgrade
+   orchestration, e.g. reading/setting system state.
+
+
diff --git a/doc/04-Use_Cases_and_Scenarios.rst b/doc/04-Use_Cases_and_Scenarios.rst
new file mode 100644
index 0000000..13d16cf
--- /dev/null
+++ b/doc/04-Use_Cases_and_Scenarios.rst
@@ -0,0 +1,32 @@
+Use Cases and Scenarios
+-----------------------
+
+This section describes the use cases and scenarios to verify the 
+requirements of Escalator.
+
+Upgrade a system with minimal configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A minimal configuration system is normally deployed for experimental or
+development use, such as an OPNFV test bed. Although it does not carry a
+large workload, it is a typical system to be upgraded frequently.
+
+Upgrade a system with HA configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An HA configuration system is very popular in operators' data centres,
+and it is a typical production environment. It runs 24 hours a day, 7 days
+a week, with VNFs running on it to provide services to the end users.
+
+Upgrade a system with Multi-Site configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+An upgrade in one site may cause service interruption in another site if
+the two sites depend on each other and share the same modules/database
+(e.g. one Keystone for both sites).
+
+If a site fails during an upgrade, even minimal state/data loss in the
+roll-back can affect, or cause a failure in, the dependent site.
+
+.. <hujie> Consider one site of ARNO release first. Then, multi-site 
+  in the future.
+\ No newline at end of file
diff --git a/doc/05-Reference_Architecture.rst b/doc/05-Reference_Architecture.rst
new file mode 100644
index 0000000..1b16dbe
--- /dev/null
+++ b/doc/05-Reference_Architecture.rst
@@ -0,0 +1,6 @@
+Reference Architecture
+----------------------
+
+This section describes the reference architecture, the function blocks
+and the function entities of Escalator, so that the reader can understand
+how the basic functions are organized.
+\ No newline at end of file
diff --git a/doc/06-Information_Flows.rst b/doc/06-Information_Flows.rst
new file mode 100644
index 0000000..14f2908
--- /dev/null
+++ b/doc/06-Information_Flows.rst
@@ -0,0 +1,8 @@
+Information Flows
+-----------------
+
+This section describes the information flows among the function
+entities when Escalator is in action.
+
+.. <hujie> We should consider a generic procedure / framework for upgrading,
+  and may provide a plug-in interface for specialized tasks
+\ No newline at end of file
diff --git a/doc/07-Interfaces_and_Files.rst b/doc/07-Interfaces_and_Files.rst
new file mode 100644
index 0000000..87f916e
--- /dev/null
+++ b/doc/07-Interfaces_and_Files.rst
@@ -0,0 +1,27 @@
+Interfaces and Files
+--------------------
+
+This section describes the required interfaces and files of Escalator.
+
+
+CLI Interface
+~~~~~~~~~~~~~~~~
+
+This section describes the CLI of Escalator.
+
+RESTful API
+~~~~~~~~~~~
+
+This section describes the API of Escalator for developers.
+
+Configuration File
+~~~~~~~~~~~~~~~~~~
+
+This section will suggest a format of the configuration files and how to
+deal with them.
+
+Log File
+~~~~~~~~
+
+This section will suggest a format of the log files and how to deal with
+them.
+\ No newline at end of file
diff --git a/doc/08-Requirements_from_other_OPNFV_Project.rst b/doc/08-Requirements_from_other_OPNFV_Project.rst
new file mode 100644
index 0000000..62e611f
--- /dev/null
+++ b/doc/08-Requirements_from_other_OPNFV_Project.rst
@@ -0,0 +1,40 @@
+Requirements from other OPNFV projects
+--------------------------------------
+
+We have created a questionnaire_ for collecting other projects' requirements.
+Please advertise it.
+
+.. _questionnaire: https://docs.google.com/forms/d/11o1mt15zcq0WBtXYK0n6lKF8XuIzQTwvv8ePTjmcoF0/viewform?usp=send_form
+  
+
+
+Doctor Project
+~~~~~~~~~~~~~~
+
+.. <Malla> This scenario could be out of scope in Escalator project, but
+  having the option to support this should be better to align with
+  Doctor requirements.
+  
+The scope of the Doctor project also covers a maintenance scenario in which:
+
+1. The VIM administrator requests host maintenance from the VIM.
+
+2. The VIM notifies a consumer such as the VNFM, to trigger application-level
+   migration or an active-standby switchover.
+
+3. The VIM waits a short while for a response from the consumer.
+
+-  The VIM should send the notification of VM migration to the consumer
+   (VNFM) as an abstracted message such as "maintenance".
+
+-  The VIM could postpone VM migration until it receives a "VM ready to
+   maintenance" message from the owner (VNFM).
+
+HA Project
+~~~~~~~~~~
+
+Multi-site Project
+~~~~~~~~~~~~~~~~~~
+
+-  Upgrading one site with Escalator should at least not cause API token
+   validation to fail at the other site.
diff --git a/doc/09-Reference.rst b/doc/09-Reference.rst
new file mode 100644
index 0000000..0b5ff17
--- /dev/null
+++ b/doc/09-Reference.rst
@@ -0,0 +1,17 @@
+Reference
+---------
+
+[1] ETSI GS NFV 002 (V1.1.1): "Architectural Framework"
+
+[2] ETSI GS NFV 003 (V1.1.1): "Terminology for Main Concepts in NFV"
+
+[3] ETSI GS NFV-SWA001: "Virtual Network Function Architecture"
+
+[4] ETSI GS NFV-MAN001: "Management and Orchestration"
+
+[5] ETSI GS NFV-REL001: "Resiliency Requirements"
+
+[6] QuEST Forum TL-9000: "Quality Management System Requirement
+Handbook"
+
+[7] Service Availability Forum AIS: "Software Management Framework"
diff --git a/doc/10-Useful_Working_Drafts_of_ETSI_NFV.rst b/doc/10-Useful_Working_Drafts_of_ETSI_NFV.rst
new file mode 100644
index 0000000..5c2195b
--- /dev/null
+++ b/doc/10-Useful_Working_Drafts_of_ETSI_NFV.rst
@@ -0,0 +1,11 @@
+Useful Working Drafts of ETSI NFV
+---------------------------------
+
+Access them with your own ETSI account. Please DO NOT disclose the
+content.
+
+[1] Migrate Virtualised Compute Resource operation @ 7.3.1.8
+ftp://docbox.etsi.org/ISG/NFV/Open/Drafts/IFA005_Or-Vi_ref_point_Spec/NFV-IFA005v070.zip
+
+[2] Reliability issues during NFV Software upgrade and improvement mechanisms @ 8
+ftp://@docbox.etsi.org/ISG/NFV/Open/Drafts/REL003_E2E_reliability_models/NFV-REL003v030.zip
diff --git a/doc/A1-Appendix.rst b/doc/A1-Appendix.rst
new file mode 100644
index 0000000..85f0717
--- /dev/null
+++ b/doc/A1-Appendix.rst
@@ -0,0 +1,49 @@
+Appendix
+--------
+
+A.1 Impact Analysis
+~~~~~~~~~~~~~~~~~~~
+
+Upgrading the different software modules may have a different impact on
+the availability of the infrastructure resources and even on the service
+continuity of the vNFs.
+
+**Software modules in the computing nodes**
+
+#. Host OS patch
+
+#. Hypervisor, such as KVM, QEMU, Xen, libvirt
+#. OpenStack agents in computing nodes (such as the Nova agent, Ceilometer
+   agent, ...)
+   
+.. <MT> As SW module, we should list the host OS and maybe its
+   drivers as well. From upgrade perspective do we limit host OS
+   upgrades to patches only?
+
+**Software modules in network nodes**
+
+#. Neutron L2/L3 agent
+#. OVS, SR-IOV Driver
+
+**Software modules in storage nodes**
+
+#. Ceph
+
+The table below analyses such an impact - considering a single instance
+of each software module - from the following aspects:
+
+-  the function which will be lost during upgrade,
+-  the duration of the loss of this specific function,
+-  if this causes the loss of the vNF function,
+-  if it causes incompatibility in the different parts of the software,
+-  what should be backed up before the upgrade,
+-  the duration of restoration time if the upgrade fails
+
+The values provided come from internal testing and are based on some
+assumptions; they may vary depending on the deployment techniques.
+Please feel free to add yours if you find more efficient values during
+your testing.
+
+https://wiki.opnfv.org/_media/upgrade_analysis_v0.5.xlsx
+
+Note that no redundancy of the software modules is considered in the table.
diff --git a/doc/Escalator_Requirement.rst b/doc/Escalator_Requirement.rst
deleted file mode 100644
index e80a11d..0000000
--- a/doc/Escalator_Requirement.rst
+++ /dev/null
@@ -1,814 +0,0 @@
-Draft Escalator Requirement v0.4
-================================
-
-Authors:
---------
-
-| Jie Hu (ZTE, hu.jie@zte.com.cn)
-| Qiao Fu (China Mobile, fuqiao@chinamobile.com)
-| Ulrich Kleber (Huawei, Ulrich.Kleber@huawei.com)
-| Maria Toeroe (Ericsson, maria.toeroe@ericsson.com)
-| Sama, Malla Reddy (DOCOMO, sama@docomolab-euro.com)
-| Zhong Chao (ZTE, chao.zhong@zte.com.cn)
-| Julien Zhang (ZTE, zhang.jun3g@zte.com.cn)
-| Yuri Yuan (ZTE, yuan.yue@zte.com.cn)
-| Zhipeng Huang (Huawei, huangzhipeng@huawei.com)
-| Jia Meng (ZTE, meng.jia@zte.com.cn)
-| Liyi Meng (Ericsson, liyi.meng@ericsson.com)
-| Pasi Vaananen (Stratus, pasi.vaananen@stratus.com)
-
-1. Scope
---------
-
-| This document describes the user requirements on the smooth upgrade
-  function of the NFVI and VIM with respect to the upgrades of the OPNFV
-  platform from one version to another. Smooth upgrade means that the
-  upgrade results in no service outage for the end-users. This requires
-  that the process of the upgrade is automatically carried out by a tool
-  (code name: Escalator) with pre-configured data. The upgrade process
-  includes preparation, validation, execution, monitoring and
-  conclusion.
-| ==[MT] While it is good to have a tool for the entire upgrade process,
-  but it is a challenging task, so maybe we shouldn't require automation
-  for the entire process right away. Automation is essential at
-  execution.==
-| ==[hujie] Maybe we can analysis information flows of the upgrade tool,
-  abstract the basic / essential actions from the tool (or tools), and
-  map them to a command set of NFVI / VIM's interfaces.==
-
-The requirements are defined in a stepwise approach, i.e. in the first
-phase focusing on the upgrade of the VIM then widening the scope to the
-NFVI.
-
-The requirements may apply to different NFV functions (NFVI, or VIM, or
-both of them) . They will be classified in the Appendix of this
-document.
-
-2. General Requirements Background and terminology
---------------------------------------------------
-
-==[MT] At the moment 2.1-2.3 seem to be more background sections than
-requirements. Should we rename this part?==
-
-2.1 Terminologies and definitions
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
--  **NFVI** is abbreviation for Network Function Virtualization
-   Infrastructure; sometimes it is also referred as data plane in this
-   document.
--  **VIM** is abbreviation for Virtual Infrastructure Management;
-   sometimes it is also referred as control plane in this document.
--  **Operators** are network service providers and Virtual Network
-   Function (VNF) providers.
--  **End-Users** are subscribers of Operator's services.
--  **Network Service** is a service provided by an Operator to its
-   End-users using a set of (virtualized) Network Functions
--  **Infrastructure Services** are those provided by the NFV
-   Infrastructure and the Management & Orchestration functions to the
-   VNFs. I.e. these are the virtual resources as perceived by the VNFs.
--  **Smooth Upgrade** means that the upgrade results in no service
-   outage for the end-users.
--  **Rolling Upgrade** is an upgrade strategy that upgrades each node or
-   a subset of nodes in a wave rolling style through the data centre. It
-   is a popular upgrade strategy to maintains service availability.
--  **Parallel Universe** is an upgrade strategy that creates and deploys
-   a new universe - a system with the new configuration - while the old
-   system continues running. The state of the old system is transferred
-   to the new system after sufficient testing of the later.
--  **Infrastructure Resource Model** ==(suggested by MT)== is identified
-   as: physical resources, virtualization facility resources and virtual
-   resources.
--  **Physical Resources** are the hardware of the infrastructure, may
-   also includes the firmware that enable the hardware.
--  **Virtual Resources** are resources provided as services built on top
-   of the physical resources via the virtualization facilities; in our
-   case, they are the components that VNF entities are built on, e.g.
-   the VMs, virtual switches, virtual routers, virtual disks etc
-   ==[MT] I don't think the VNF is the virtual resource. Virtual
-   resources are the VMs, virtual switches, virtual routers, virtual
-   disks etc. The VNF uses them, but I don't think they are equal. The
-   VIM doesn't manage the VNF, but it does manage virtual resources.==
--  **Visualization Facilities** are resources that enable the creation
-   of virtual environments on top of the physical resources, e.g.
-   hypervisor, OpenStack, etc.
-
-2.2 Upgrade Objects
-~~~~~~~~~~~~~~~~~~~
-
-2.2.1 Physical Resource
-^^^^^^^^^^^^^^^^^^^^^^^
-
-| Most of the cloud infrastructures support dynamic addition/removal of
-  hardware. A hardware upgrade could be done by removing the old
-  hardware node and adding the new one. This will not be in the scope of
-  this project.
-| ==[MT] Does this mean that we are excluding firmware upgrades too?==
-
-2.2.2 Virtual Resources
-^^^^^^^^^^^^^^^^^^^^^^^
-
-| Virtual resource upgrade mainly done by users. OPNFV may facilitate
-  the activity, but suggest to have it in long term roadmap instead of
-  initiate release.
-| ==[MT] same comment here: I don't think the VNF is the virtual
-  resource. Virtual resources are the VMs, virtual switches, virtual
-  routers, virtual disks etc. The VNF uses them, but I don't think they
-  are equal. For example if by some reason the hypervisor is changed and
-  the current VMs cannot be migrated to the new hypervisor, they are
-  incompatible, then the VMs need to be upgraded too. This is not
-  something the NFVI user (i.e. VNFs ) would even know about.==
-
-2.2.3 Virtualization Facility Resources
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| Based on the functionality they provide, virtualization facility
-  resources could be divided into computing node, networking node,
-  storage node and management node.
-| The possible upgrade objects in these nodes are addressed below:
-  (Note: hardware based virtualization may considered as virtualization
-  facility resource, but from escalator perspective, it is better
-  considered it as part of hardware upgrade. )
-
-**Computing node**
-
-#. OS Kernel
-#. Hypvervisor and virtual switch
-#. Other kernel modules, like driver
-#. User space software packages, like nova-compute agents and other
-   control plane programs
-
-| Updating 1 and 2 will cause the loss of virtualzation functionality of
-  the compute node, which may lead to data plane services interruption
-  if the virtual resource is not redudant.
-| Updating 3 might result the same.
-| Updating 4 might lead to control plane services interruption if not an
-  HA deployment.
-
-**Networking node**
-
-#. OS kernel, optional, not all switch/router allow you to upgrade its
-   OS since it is more like a firmware than a generic OS.
-#. User space software package, like neutron agents and other control
-   plane programs
-
-| Updating 1 if allowed will cause a node reboot and therefore leads to
-  data plane services interruption if the virtual resource is not
-  redudant.
-| Updating 2 might lead to control plane services interruption if not an
-  HA deployment.
-
-**Storage node**
-
-#. OS kernel, optional, not all storage node allow you to upgrade its OS
-   since it is more like a firmware than a generic OS.
-#. Kernel modules
-#. User space software packages, control plane programs
-
-| Updating 1 if allowed will cause a node reboot and therefore leads to
-  data plane services interruption if the virtual resource is not
-  redudant.
-| Update 2 might result in the same.
-| Updating 3 might lead to control plane services interruption if not an
-  HA deployment.
-
-**Management node**
-
-#. OS Kernel
-#. Kernel modules, like driver
-#. User space software packages, like database, message queue and
-   control plane programs.
-
-| Updating 1 will cause a node reboot and therefore leads to control
-  plane services interruption if not an HA deployment. Updating 2 might
-  result in the same.
-| Updating 3 might lead to control plane services interruption if not an
-  HA deployment.
-
-2.3 Upgrade Span
-~~~~~~~~~~~~~~~~
-
-| **Major Upgrade**
-| Upgrades between major releases may introducing significent changes in
-  function, configuration and data, such as the upgrade of OPNFV from
-  Arno to Brahmaputra.
-
-| **Minor Upgrade**
-| Upgrades inside one major releases which would not leads to changing
-  the stucture of the platform and may not infect the schema of the
-  system data.
-
-2.4 Upgrade Granularity
-~~~~~~~~~~~~~~~~~~~~~~~
-
-2.4.1 Physical/Hardware Dimension
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Support full / partial upgrade for data centre, cluster, zone. Because
-of the upgrade of a data centre or a zone, it may be divided into
-several batches. The upgrade of a cloud environment (cluster) may also
-be partial. For example, in one cloud environment running a number of
-VNFs, we may just try one of them to check the stability and
-performance, before we upgrade all of them.
-
-2.4.2 Software Dimension
-^^^^^^^^^^^^^^^^^^^^^^^^
-
--  The upgrade of host OS or kernel may need a 'hot migration'
--  The upgrade of OpenStack’s components
-    i.the one-shot upgrade of all components
-    ii.the partial upgrade (or bugfix patch) which only affects some
-   components (e.g., computing, storage, network, database, message
-   queue, etc.)
-
-| ==[MT] this section seems to overlap with 2.1.==
-| I can see the following dimensions for the software
-
--  different software packages
--  different funtions - Considering that the target versions of all
-   software are compatible the upgrade needs to ensure that any
-   dependencies between SW and therefore packages are taken into account
-   in the upgrade plan, i.e. no version mismatch occurs during the
-   upgrade therefore dependencies are not broken
--  same function - This is an upgrade specific question if different
-   versions can coexist in the system when a SW is being upgraded from
-   one version to another. This is particularly important for stateful
-   functions e.g. storage, networking, control services. The upgrade
-   method must consider the compatibility of the redundant entities.
-
--  different versions of the same software package
--  major version changes - they may introduce incompatibilities. Even
-   when there are backward compatibility requirements changes may cause
-   issues at graceful rollback
--  minor version changes - they must not introduce incompatibility
-   between versions, these should be primarily bug fixes, so live
-   patches should be possible
-
--  different installations of the same software package
--  using different installation options - they may reflect different
-   users with different needs so redundancy issues are less likely
-   between installations of different options; but they could be the
-   reflection of the heterogeneous system in which case they may provide
-   redundancy for higher availability, i.e. deeper inspection is needed
--  using the same installation options - they often reflect that the are
-   used by redundant entities across space
-
--  different distribution possibilities in space - same or different
-   availability zones, multi-site, geo-redundancy
-
--  different entities running from the same installation of a software
-   package
--  using different startup options - they may reflect different users so
-   redundancy may not be an issues between them
--  using same startup options - they often reflect redundant
-   entities====
-
-3.5 Upgrade duration
-~~~~~~~~~~~~~~~~~~~~
-
-As the OPNFV end-users are primarily Telco operators, the network
-services provided by the VNFs deployed on the NFVI should meet the
-requirement of 'Carrier Grade'.
-
-In telecommunication, a "carrier grade" or"carrier class" refers to a
-system, or a hardware or software component that is extremely reliable,
-well tested and proven in its capabilities. Carrier grade systems are
-tested and engineered to meet or exceed "five nines" high availability
-standards, and provide very fast fault recovery through redundancy
-(normally less than 50 milliseconds). [from wikipedia.org]
-
-"five nines" means working all the time in ONE YEAR except 5'15".
-
-We have learnt that a well prepared upgrade of OpenStack needs 10
-minutes. The major time slot in the outage time is used spent on
-synchronizing the database. [from ' Ten minutes OpenStack Upgrade? Done!
-' by Symantec]
-
-This 10 minutes of downtime of OpenStack however did not impact the
-users, i.e. the VMs running on the compute nodes. This was the outage of
-the control plane only. On the other hand with respect to the
-preparations this was a manually tailored upgrade specific to the
-particular deployment and the versions of each OpenStack service.
-
-The project targets to achieve a more generic methodology, which however
-requires that the upgrade objects fulfill ceratin requirements. Since
-this is only possible on the long run we target first upgrades from
-version to version for the different VIM services.
-
-**Questions:**
-
-#. | Can we manage to upgrade OPNFV in only 5 minutes?
-   | ==[MT] The first question is whether we have the same carrier grade
-     requirement on the control plane as on the user plane. I.e. how
-     much control plane outage we can/willing to tolerate?
-   | In the above case probably if the database is only half of the size
-     we can do the upgrade in 5 minutes, but is that good? It also means
-     that if the database is twice as much then the outage is 20
-     minutes.
-   | For the user plane we should go for less as with two release yearly
-     that means 10 minutes outage per year.==
-   | ==[Malla] 10 minutes outage per year to the users? Plus, if we take
-     control plane into the consideration, then total outage will be
-     more than 10 minute in whole network, right?==
-   | ==[MT] The control plane outage does not have to cause outage to
-     the users, but it may of course depending on the size of the system
-     as it's more likely that there's a failure that needs to be handled
-     by the control plane.==
-
-#. | Is it acceptable for end users ? Such as a planed service
-     interruption will lasting more than ten minutes for software
-     upgrade.
-   | ==[MT] For user plane, no it's not acceptable in case of
-     carrier-grade. The 5' 15" downtime should include unplanned and
-     planned downtimes.==
-   | ==[Malla] I go agree with Maria, it is not acceptable.==
-
-#. | Will any VNFs still working well when VIM is down?
-   | ==[MT] In case of OpenStack it seems yes. .:)==
-
-2.5.1 The maximum duration of an upgrade
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| The duration of an upgrade is related to and proportional with the
-  scale and the complexity of the OPNFV platform as well as the
-  granularity (in function and in space) of the upgrade.
-| [Malla] Also, if is a partial upgrade like module upgrade, it depends
-  also on the OPNFV modules and their tight connection entites as well.
-
-2.5.2 The maximum duration of a rollback when an upgrade is failed - this should be about rollback duration
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| The duration of a rollback is short than the corresponding upgrade. It
-  depends on the duration of restore the software and configue data from
-  pre-upgrade backup / snapshot.
-| ==[MT] During the upgrade process two types of failure may happen:
-|  In case we can recover from the failure by undoing the upgrade
-  actions it is possible to roll back the already executed part of the
-  upgrade in graceful manner introducing no more service outage than
-  what was introduced during the upgrade. Such a graceful rollback
-  requires typically the same amount of time as the executed portion of
-  the upgrade and impose minimal state/data loss.==
-| ==[MT] Requirement: It should be possible to roll back gracefully the
-  failed upgrade of stateful services of the control plane.
-|  In case we cannot recover from the failure by just undoing the
-  upgrade actions, we have to restore the upgraded entities from their
-  backed up state. In other terms the system falls back to an earlier
-  state, which is typically a faster recovery procedure than graceful
-  rollback and depending on the statefulness of the entities involved it
-  may result in significant state/data loss.==
-| **Two possible types of failures can happen during an upgrade**
-
-#. We can recover from the failure that occured in the upgrade process:
-   In this case, a graceful rolling back of the executed part of the
-   upgrade may be possible which would "undo" the executed part in a
-   similar fashion. Thus, such a roll back introduces no more service
-   outage during an upgrade than the executed part introduced. This
-   process typically requires the same amount of time as the executed
-   portion of the upgrade and impose minimal state/data loss.
-#. We cannot recover from the failure that occured in the upgrade
-   process: In this case, the system needs to fall back to an earlier
-   consistent state by reloading this backed-up state. This is typically
-   a faster recovery procedure than the graceful rollback, but can cause
-   state/data loss. The state/data loss usually depends on the
-   statefulness of the entities whose state is restored from the backup.
-
-2.5.3 The maximum duration of a VNF interruption
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| Since not the entire process of a smooth upgrade will affect the VNFs,
-  the duration of the VNF interruption may be shorter than the duration
-  of the upgrade. In some cases, the VNF running without the control
-  from of the VIM is acceptable.
-| ==[MT] Should require explicitly that the NFVI should be able to
-  provide its services to the VNFs independent of the control plane?==
-| ==[MT] Requirement: The upgrade of the control plane must not cause
-  interruption of the NFVI services provided to the VNFs.==
-| ==[MT] With respect to carrier-grade the yearly service outage of the
-  VNF should not exceed 5' 15" regardless whether it is planned or
-  unplanned outage. Considering the HA requirements TL-9000 requires an
-  ent-to-end service recovery time of 15 seconds based on which the ETSI
-  GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
-  availability levels (SAL). The proposed example service recovery times
-  for these levels are:
-| SAL1: 5-6 seconds
-| SAL2: 10-15 seconds
-| SAL3: 20-25 seconds==
-| ==[Pva] my comment was actually that the downtime metrics of the
-  underlying elements, components and services are small fraction of the
-  total E2E service availability time. No-one on the E2E service path
-  will get the whole downtime allocation (in this context it includes
-  upgrade process related outages for the services provided by VIM etc.
-  elements that are subject to upgrade process).==
-| ==[MT] So what you are saying is that the upgrade of any entity
-  (component, service) shouldn't cause even this much service
-  interruption. This was the reason I brought these figures here as well
-  that they are posing some kind of upper-upper boundary. Ideally the
-  interruption is in the millisecond range i.e. no more than a
-  switchover or a live migration.==
-| ==[MT] Requirement: Any interruption caused to the VNF by the upgrade
-  of the NFVI should be in the sub-second range.==
-
-==[MT] In the future we also need to consider the upgrade of the NFVI,
-i.e. HW, firmware, hypervisors, host OS etc.==
-
-3. Functional Considerations
-----------------------------
-
-3.1 Requirement of Escalator's Basic Actions
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-This section describes the basic functions may required by Escalator.
-
-3.1.1 Preparation (offline)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-This is the design phase when the upgrade plan (or upgrade campaign) is
-being designed so that it can be executed automatically with minimal
-service outage. It may include the following work:
-
-#. Check the dependencies of the software modules and their impact,
-   backward compatibilities to figure out the appropriate upgrade method
-   and ordering.
-#. Find out if a rolling upgrade could be planned with several rolling
-   steps to avoid any service outage due to the upgrade some
-   parts/services at the same time.
-#. Collect the proper version files and check the integration for
-   upgrading.
-#. The preparation step should produce an output (i.e. upgrade
-   campaign/plan), which is executable automatically in an NFV Famawork
-   and which can be validated before execution.
-
-   -  The upgrade campaign should not be referring to scalable entities
-      directly, but allow for adaptation to the system configuration and
-      state at any given moment.
-   -  The upgrade campaign should describe the ordering of the upgrade
-      of different entities so that dependencies, redundancies can be
-      maintained during the upgrade execution
-   -  The upgrade campaign should provide information about the
-      applicable recovery procedures and their ordering.
-   -  The upgrade campaign should consider information about the
-      verification/testing procedures to be performed during the upgrade
-      so that upgrade failures can be detected as soon as possible and
-      the appropriate recovery procedure can be identified and applied.
-   -  The upgrade campaign should provide information on the expected
-      execution time so that hanging execution can be identified
-   -  The upgrade campaign should indicate any point in the upgrade when
-      coordination with the users (VNFs) is required.
-
-==[hujie]Depends on the attributes of the object being upgraded, the
-upgrade plan may be slitted into step(s) and/or sub-plan(s), and even
-more small sub-plans in design phase. The plan(s) or sub-plan(s) my
-include step(s) or sub-plan(s).==
-
-3.1.2 Validation the upgrade plan / Checking the pre-requisites of System( offline / online)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| The upgrade plan should be validated before the execution by testing
-  it in a test environment which is similar to the product environment.
-| ==[MT]However it could also mean that we can identify some properties
-  that it should satisfy e.g. what operations can or cannot be executed
-  simultaneously like never take out two VMs of the same VNF.
-| Another question is if it requires that the system is in a particular
-  state when the upgrade is applied. I.e. if there's certain amount of
-  redundacy in the system, migration is enabled for VMs, when the NFVI
-  is upgraded the VIM is healthy, when the VIM is upgraded the NFVI is
-  healthy, etc.
-| I'm not sure what online validation means: Is it the validation of the
-  upgrade plan/campaign or the validation of the system that it is in a
-  state that the upgrade can be performed without too much risk?==
-
-| Before the upgrade plan being executed, the system heathly of the
-  online product environment should be checked and confirmed to satisfy
-  the requirements which were described in the upgrade plan. The
-  sysinfo, e.g. which included system alarms, performance statistics and
-  diagnostic logs, will be collected and analyized. It is required to
-  resolve all of the system faults or exclud the unhealthy part before
-  executing the upgrade plan.
-| ==[hujie] Text merged.==
-
-3.1.3 Backup/Snapshot (online)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-For avoid loss of data when a unsuccessful upgrade was encountered, the
-data should be backuped and the system state snapshot should be taken
-before the excution of upgrade plan. This would be considered in the
-upgrade plan.
-
-Several backups/snapshots may be generated and stored before the
-individual change steps. The following data/files are required to be
-considered:
-
-#. running version files for each node.
-#. system components' configuration file and database.
-#. image and storage, if it is necessary.
-   ==[MT] Does 3 imply VNF image and storage? I.e. VNF state and data?==
-
-| ==[hujie] The following text is derived from previous "4. Negotiate
-  with the VNF if it's ready for the upgrade"==
-| Although the upper layer, which includes VNFs and VNFMs, is out of the
-  scope of Escalator, it is still recommended to prepare it for a
-  smooth system upgrade. Escalator cannot guarantee the safety of
-  VNFs. The upper layer should have a safeguard mechanism by design
-  and be ready to avoid failures during a system upgrade.
-
-3.1.4 Execution (online)
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-| The execution of the upgrade plan should be a dynamic procedure which is
-  controlled by Escalator.
-| ==[hujie] Revised text to be general.==
-
-#. It is required to support execution either in sequence or in
-   parallel.
-#. It is required to check the result of the execution and take
-   action according to the situation and the policies in the upgrade plan.
-#. It is required to execute properly on various configurations of
-   the system object, e.g. stand-alone, HA, etc.
-#. It is required to execute on the designated parts of the
-   system, e.g. physical server, virtualized server, rack, chassis,
-   cluster, or even different geographical locations.
-
-3.1.5 Testing (online)
-^^^^^^^^^^^^^^^^^^^^^^
-
-| Testing is performed after upgrading the whole system or parts of it to
-  make sure the upgraded system (object) is working normally.
-| ==[hujie] Revised text to be general.==
-
-#. It is recommended to run the prepared test cases to see if the
-   functionalities are available without any problem.
-#. It is recommended to check the sysinfo, e.g. system alarms,
-   performance statistics and diagnostic logs, to see if there are any
-   abnormalities.
-
-3.1.6 Restore/Rollback (online)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| If the upgrade fails, a quick system restore or system
-  rollback should be performed to recover the system and the services.
-| ==[hujie] Revised text to be general.==
-
-#. It is recommended to support system restore from backup when the upgrade
-   has failed.
-#. It is recommended to support graceful rollback with the steps in reverse
-   order if possible.
-
-3.1.7 Monitoring (online)
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-| Escalator should continually monitor the upgrade process. It
-  keeps the status of each module, each node and each cluster up to date
-  in a status table during the upgrade.
-| ==[hujie] Revised text to be general.==
-
-#. It is required to collect the status of every object being upgraded
-   and to send alarms on abnormal conditions during the upgrade.
-#. It is recommended to reuse the existing monitoring system, such as alarms.
-#. It is recommended to support proactive query.
-#. It is recommended to support passive wait for notification.
-
-| **Two possible ways for monitoring:**
-| **Pro-Actively Query** requires the NFVI/VIM to provide a proper API or CLI
-  interface. If Escalator serves as a service, it should pass on these
-  interfaces.
-| **Passively Wait for Notification** requires Escalator to provide a
-  callback interface, which can be used by NFVI/VIM systems or an upgrade
-  agent to send back notifications.
-| [hujie] I am not sure why not to subscribe the notification.
-
-3.1.8 Logging (online)
-^^^^^^^^^^^^^^^^^^^^^^
-
-The information generated by Escalator is recorded in log files. The log
-files are used for manual diagnosis of exceptions.
-
-#. It is required to support logging.
-#. It is recommended to include time stamp, object id, action name,
-   error code, etc.
-
-3.1.9 Administrative Control (online)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Administrative control restricts the privilege to start any of
-Escalator's actions, in order to avoid unauthorized operations.
-
-#. It is required to support an administrative control mechanism.
-#. It is recommended to reuse the system's own security system.
-#. It is required to avoid conflicts when the system's own security system
-   is being upgraded.
-
-3.2 Requirements on system object being upgraded
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-| ==We can develop BPs in future from the requirements of this section and
-  GA for upstream projects==
-| Escalator focuses on smooth upgrade. In a practical implementation, it
-  might be combined with an installer/deployer, or act as an independent
-  tool/service. Either way, it requires that the target systems (NFVI and
-  VIM) are developed/deployed in a way that allows Escalator to perform
-  upgrades on them.
-
-On the NFVI system, live migration is likely to be used to maintain
-availability because OPNFV would like to make HA transparent to the end user.
-This requires the VIM system to be able to put a compute node into maintenance
-mode and isolate it from normal service. Otherwise, new NFVI instances
-risk being scheduled onto the node being upgraded.
-
-| On the VIM system, availability is likely to be achieved by redundancy. This
-  imposes fewer requirements on the system/services being upgraded (see PVA
-  comments in an earlier version). However, there should be a way to put the
-  target system into standby mode, because starting the upgrade on the
-  master node of a cluster is likely a bad idea.
-| ==[hujie] Revised text to be general.==
-
-#. It is required for NFVI/VIM to support a **service handover** mechanism
-   that limits interruption to 0.001% (i.e. 99.999% service
-   availability). Possible implementations are live migration, redundant
-   deployment, etc. (Note: for VIM, the interruption limit could be less
-   restrictive.)
-#. It is required for NFVI/VIM to restore the earlier version in an
-   efficient way, such as by **snapshot**.
-#. It is required for NFVI/VIM to **migrate data** efficiently between
-   the base and the upgraded system.
-   ==[hujie] What is exact meaning of "base" here?==
-#. It is recommended for the NFVI/VIM's interface to support upgrade
-   orchestration, e.g. reading/setting system state.
-   ==[hujie] I am not sure if it reflect the previous text.==
-
-4. Use Cases
-------------
-
-This section describes the use cases to verify the requirements of
-Escalator.
-
-4.1 Upgrade a system with minimal configuration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-A minimal configuration system is normally deployed for experimental or
-development usage, such as an OPNFV test bed. Although it does not have a
-large workload, it is a typical system to be upgraded frequently.
-
-4.2 Upgrade a system with HA configuration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-An HA configuration system is very popular in operators' data centres,
-and it is a typical production environment. It runs 7 \* 24,
-with VNFs running on it to provide services to the end users.
-
-4.3 Upgrade a system with Multi-Site configuration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-An upgrade at one site may cause service interruption at another site if
-the sites depend on each other and share the same modules/database (e.g.
-one Keystone for both sites).
-
-If a site fails during an upgrade, any state/data loss during the rollback,
-however small, can affect or cause a failure at the dependent site.
-
-==Consider one site of ARNO release first. Then, multi-site in the
-future.==
-
-5. RA of Escalator
-------------------
-
-This section describes the reference architecture, the function blocks and
-the function entities of Escalator, so that the reader can understand how
-the basic functions are organized.
-
-6. Information Flows
---------------------
-
-| This section describes the information flows among the function
-  entities when Escalator is in action.
-| We should consider a generic procedure/framework for upgrading, and
-  may provide a plugin interface for specialized tasks.
-
-7. Interfaces
--------------
-
-This section describes the required interfaces of Escalator.
-
-7.1 Manual Interface (CLI / GUI)
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-7.2 RESTful API
-~~~~~~~~~~~~~~~
-
-To support 3.3 Negotiate with the VNF if it's ready for the upgrade
-
-7.3 Configuration File
-~~~~~~~~~~~~~~~~~~~~~~
-
-This section will suggest a format of the configuration files and how to
-deal with it.
-
-7.4 Log File
-~~~~~~~~~~~~
-
-This section will suggest a format of the log files and how to deal with
-it.
-
-8. Requirements from other OPNFV projects
------------------------------------------
-
-| We have created a questionnaire for collecting other projects'
-  requirements
-  (https://docs.google.com/forms/d/11o1mt15zcq0WBtXYK0n6lKF8XuIzQTwvv8ePTjmcoF0/viewform?usp=send_form),
-  please advertise it.
-| ==[hujie] Can we force other OPNFV projects to complete the survey by
-  using JIRA dependence?==
-
-8.1 Doctor Project
-~~~~~~~~~~~~~~~~~~
-
-| ==Note: This scenario could be out of scope in Escalator project, but
-  having the option to support this should be better to align with
-  Doctor requirements.==
-| The scope of the Doctor project also covers the maintenance scenario in
-  which 1) the VIM administrator requests host maintenance from the VIM, 2)
-  the VIM notifies a consumer such as the VNFM to trigger application-level
-  migration or switching of active-standby nodes, and 3) the VIM waits for a
-  response from the consumer for a short while.
-
--  VIM should send out notification of VM migration to the consumer (VNFM)
-   as an abstracted message like "maintenance".
--  VIM could postpone VM migration until it receives a "VM ready to
-   maintenance" message from the owner (VNFM).
-
-8.2 HA Project
-~~~~~~~~~~~~~~
-
-8.3 Multi-site Project
-~~~~~~~~~~~~~~~~~~~~~~
-
--  Upgrading one site with Escalator should at least not cause API token
-   validation to fail at the other site.
-
-9. Reference
-------------
-
-| [1] ETSI GS NFV 002 (V1.1.1): "Architectural Framework"
-| [2] ETSI GS NFV 003 (V1.1.1): "Terminology for Main Concepts in NFV"
-| [3] ETSI GS NFV-SWA001: "Virtual Network Function Architecture"
-| [4] ETSI GS NFV-MAN001: "Management and Orchestration"
-| [5] ETSI GS NFV-REL001: "Resiliency Requirements"
-| [6] QuEST Forum TL-9000: "Quality Management System Requirement
-  Handbook"
-| [7] Service Availability Forum AIS: "Software Management Framework"
-
-10. Useful Working Drafts of ETSI NFV
--------------------------------------
-
-| Access them with your own ETSI account, please DO NOT disclose the
-  content.
-| [1] Migrate Virtualised Compute Resource operation @ 7.3.1.8
-| ftp://docbox.etsi.org/ISG/NFV/Open/Drafts/IFA005_Or-Vi_ref_point_Spec/NFV-IFA005v070.zip
-| [2] Reliability issues during NFV Software upgrade and improvement
-  mechanisms @ 8
-| ftp://@docbox.etsi.org/ISG/NFV/Open/Drafts/REL003_E2E_reliability_models/NFV-REL003v030.zip
-
-Appendix
---------
-
-A.1 Impact Analysis
-~~~~~~~~~~~~~~~~~~~
-
-Upgrading the different software modules may have a different impact on
-the availability of the infrastructure resources and even on the service
-continuity of the VNFs.
-
-**Software modules in the computing nodes**
-
-#. Host OS patch
-   ==[MT] As SW module, we should list the host OS and maybe its
-   drivers as well. From upgrade perspective do we limit host OS
-   upgrades to patches only?==
-#. Hypervisor, such as KVM, QEMU, XEN, libvirt
-#. OpenStack agents in computing nodes (like Nova agent, Ceilometer
-   agent...)
-
-**Software modules in network nodes**
-
-#. Neutron L2/L3 agent
-#. OVS, SR-IOV Driver
-
-**Software modules in storage nodes**
-
-#. Ceph
-
-The table below analyses such an impact - considering a single instance
-of each software module - from the following aspects:
-
--  the function which will be lost during upgrade,
--  the duration of the loss of this specific function,
--  if this causes the loss of the vNF function,
--  if it causes incompatibility in the different parts of the software,
--  what should be backed up before the upgrade,
--  the duration of restoration time if the upgrade fails
-
-| The values provided come from internal testing and are based on some
-  assumptions; they may vary depending on the deployment techniques.
-  Please feel free to add values if you find more efficient ones during
-  your testing.
-| https://wiki.opnfv.org/_media/upgrade_analysis_v0.5.xlsx
-| Note that no redundancy of the software modules is considered in the
-  table.