4 files changed, 133 insertions, 270 deletions
diff --git a/docs/requirements/102-Terminologies.rst b/docs/requirements/102-Terminologies.rst
index 221196b..f065bca 100644
--- a/docs/requirements/102-Terminologies.rst
+++ b/docs/requirements/102-Terminologies.rst
@@ -5,30 +5,59 @@ Terminology
 Terminologies
 =============
 
-Operator
-  The term refers to network service providers and Virtual Network
-  Function (VNF) providers.
+Backup
+  The term refers to making a copy of the system's persistent data to storage,
+  so that it can be used to restore the system or a given part of it to the same
+  state it was in when the backup was created. Restoring from a backup loses
+  volatile state such as CPU and memory content. Changes made to the system from
+  the moment the backup was created to the moment it is used to restore the
+  (sub)system are also lost in the restoration process.
+
+Carrier Grade
+  The term refers to a system, or a hardware or software component, that is extremely
+  reliable, well tested and proven in its capabilities. Carrier grade systems are
+  tested and engineered to meet or exceed "five nines" high availability standards,
+  and provide very fast fault recovery through redundancy (normally less than 50
+  milliseconds). Carrier grade is sometimes also referred to as carrier class.
+
+Downgrade
+  The term refers to an upgrade operation in which an earlier version of the
+  software is restored through the upgrade procedure. Compared to rollback, a
+  downgrade is normally initiated by the Operator, who may select any earlier
+  version, provided that the versions are compatible or the applicable upgrade
+  strategies are allowed (i.e. service outage or data loss can be tolerated).
 
 End-User
   The term refers to a subscriber of the Operator's services.
 
-Network Service
-  The term refers to a service provided by an Operator to its
-  end-users using a set of (virtualized) Network Functions
+High Availability (HA)
+  High Availability refers to a system or component that is continuously
+  operational for a desirably long length of time even if a part of it is out of
+  service. Carrier grade availability is a typical HA example. HA systems are popular
+  in Operators' data centers for critical tasks. Non-HA systems are normally deployed
+  for experimental or non-critical tasks in favor of their simplicity.
 
 Infrastructure Services
   The term refers to services provided by the NFV Infrastructure to the VNFs
   as required by the Management & Orchestration functions and especially the VIM.
   I.e. these are the virtual resources as perceived by the VNFs.
 
-Smooth Upgrade
-  The term refers to an upgrade that results in no service outage
-  for the end-users.
+Infrastructure Resource Model
+  The term refers to the representation of infrastructure resources,
+  namely: the physical resources, the virtualization
+  facility resources and the virtual resources.
 
-Rolling Upgrade
-  The term refers to an upgrade strategy, which upgrades a node or a subset
-  of nodes at a time in a wave style rolling through the data centre. It
-  is a popular upgrade strategy to maintain service availability.
+Network Service
+  The term refers to a service provided by an Operator to its
+  end-users using a set of (virtualized) Network Functions.
+
+Operator
+  The term refers to network service providers and Virtual Network
+  Function (VNF) providers.
+
+Outage
+  The term refers to the period of time when a given service is not available
+  to End-Users.
 
 Parallel Universe Upgrade
   The term refers to an upgrade strategy, which creates and deploys
@@ -36,25 +65,42 @@ Parallel Universe Upgrade
   system continues running. The state of the old system is transferred
   to the new system after sufficient testing of the new system.
 
-Infrastructure Resource Model
-  The term refers to the representation of infrastructure resources,
-  namely: the physical resources, the virtualization
-  facility resources and the virtual resources.
-
 Physical Resource
   The term refers to a piece of hardware in the NFV infrastructure that may
   also include firmware enabling this piece of hardware.
 
-Virtual Resource
-  The term refers to a resource, which is provided as services built on top
-  of the physical resources via the virtualization facilities; in particular,
-  virtual resources are the resources on which VNFs are deployed. Examples of
-  virtual resources are: VMs, virtual switches, virtual routers, virtual disks.
+Restore
+  The term refers to a failure handling strategy that reverts the changes
+  done, for example, by an upgrade by restoring the system from some backup
+  data. This results in the loss of any change and data persisted after the
+  backup was taken. To recover those, additional measures need to be taken
+  if necessary (e.g. Rollforward).
 
-Visualization Facility
-  The term refers to a resource that enables the creation
-  of virtual environments on top of the physical resources, e.g.
-  hypervisor, OpenStack, etc.
+Rollback
+  The term refers to a failure handling strategy that reverts the changes
+  done by a potentially failed upgrade execution one by one in a reverse order.
+  I.e. it is like undoing the changes done by the upgrade.
+
+Rollforward
+  The term refers to a failure handling strategy applied after a restore
+  (from a backup) operation to recover any loss of data persisted between
+  the time the backup has been taken and the moment it is restored. Rollforward
+  requires that data that needs to survive the restore operation is logged at
+  a location not impacted by the restore so that it can be re-applied to the
+  system after its restoration from the backup.
+
+Rolling Upgrade
+  The term refers to an upgrade strategy, which upgrades a node or a subset
+  of nodes at a time in a wave style rolling through the data centre. It
+  is a popular upgrade strategy to maintain service availability.
+
+Smooth Upgrade
+  The term refers to an upgrade that results in no service outage
+  for the end-users.
+
+Snapshot
+  The term refers to the state of a system at a particular point in time, or
+  the action of capturing such a state.
 
 Upgrade Campaign
   The term refers to a choreography that describes how the upgrade should
@@ -69,48 +115,18 @@ Upgrade Duration
   upgrade campaign has started until it has been committed. Depending on
   the upgrade strategy, the state of the configuration and the upgrade target
   some parts of the system may be in a more vulnerable state with respect to
-  service availbility.
-
-Outage
-  The period of time during which a given service is not provided is referred
-  as the outage of that given service. If a subsystem or the entire system
-  does not provide any service, it is the outage of the given subsystem or the
-  system. Smooth upgrade means upgrade with no outage for the user plane, i.e.
-  no VNF should experience service outage.
-
-Rollback
-  The term refers to a failure handling strategy that reverts the changes
-  done by a potentially failed upgrade execution one by one in a reverse order.
-  I.e. it is like undoing the changes done by the upgrade.
+  service availability.
 
-Backup
-  The term refers to data persisted to a storage, so that it can be used to
-  restore the system or a given part of it in the same state as it was when the
-  backup was created assuming a cold restart. Changes made to the system from
-  the moment the backup was created till the moment it is used to restore the
-  (sub)system are lost in the restoration process.
-
-Restore
-  The term refers to a failure handling strategy that reverts the changes
-  done, for example, by an upgrade by restoring the system from some backup
-  data. This results in the loss of any change and data persisted after the
-  backup was been taken. To recover those additional measures need to be taken
-  if necessary (e.g. rollforward).
-
-Rollforward
-  The term refers to a failure handling strategy applied after a restore
-  (from a backup) opertaion to recover any loss of data persisted between
-  the time the backup has been taken and the moment it is restored. Rollforward
-  requires that data that needs to survive the restore operation is logged at
-  a location not impacted by the restore so that it can be re-applied to the
-  system after its restoration from the backup.
+Virtualization Facility
+  The term refers to a resource that enables the creation
+  of virtual environments on top of the physical resources, e.g.
+  hypervisor, OpenStack, etc.
 
-Downgrade
-  The term refers to an upgrade in which an earlier version of the software
-  is restored through the upgrade procedure. A system can be downgraded to any
-  earlier version and the compatibility of the versions will determine the
-  applicable upgrade strategies and whether service outage can be avoided.
-  In particular any data conversion needs special attention.
+Virtual Resource
+  The term refers to a resource, which is provided as services built on top
+  of the physical resources via the virtualization facilities; in particular,
+  virtual resources are the resources on which VNFs are deployed. Examples of
+  virtual resources are: VMs, virtual switches, virtual routers, virtual disks.
 
 Abbreviations
 =============
@@ -126,4 +142,3 @@ VIM
   sometimes it is also referred as control plane in this document.
   The VIM controls and manages the NFVI compute, network and storage
   resources to provide the required virtual resources to the VNFs.
-
diff --git a/docs/requirements/104-Requirements.rst b/docs/requirements/104-Requirements.rst
index b6e7f57..3dd66dc 100644
--- a/docs/requirements/104-Requirements.rst
+++ b/docs/requirements/104-Requirements.rst
@@ -5,180 +5,42 @@ Requirements
 Upgrade duration
 ================
 
-As the OPNFV end-users are primarily Telecom operators, the network
-services provided by the VNFs deployed on the NFVI should meet the
-requirement of 'Carrier Grade'.::
-
-  In telecommunication, a "carrier grade" or"carrier class" refers to a
-  system, or a hardware or software component that is extremely reliable,
-  well tested and proven in its capabilities. Carrier grade systems are
-  tested and engineered to meet or exceed "five nines" high availability
-  standards, and provide very fast fault recovery through redundancy
-  (normally less than 50 milliseconds). [from wikipedia.org]
-
-"five nines" means working all the time in ONE YEAR except 5'15".
-
-::
-
-  We have learnt that a well prepared upgrade of OpenStack needs 10
-  minutes. The major time slot in the outage time is used spent on
-  synchronizing the database. [from ' Ten minutes OpenStack Upgrade? Done!
-  ' by Symantec]
-
-This 10 minutes of downtime of the OpenStack services however did not impact the
-users, i.e. the VMs running on the compute nodes. This was the outage of
-the control plane only. On the other hand with respect to the
-preparations this was a manually tailored upgrade specific to the
-particular deployment and the versions of each OpenStack service.
-
-The project targets to achieve a more generic methodology, which however
-requires that the upgrade objects fulfil certain requirements. Since
-this is only possible on the long run we target first the upgrade
-of the different VIM services from version to version.
-
-**Questions:**
-
-1. Can we manage to upgrade OPNFV in only 5 minutes?
-
-.. <MT> The first question is whether we have the same carrier grade
-   requirement on the control plane as on the user plane. I.e. how
-   much control plane outage we can/willing to tolerate?
-   In the above case probably if the database is only half of the size
-   we can do the upgrade in 5 minutes, but is that good? It also means
-   that if the database is twice as much then the outage is 20
-   minutes.
-   For the user plane we should go for less as with two release yearly
-   that means 10 minutes outage per year.
-
-.. <Malla> 10 minutes outage per year to the users? Plus, if we take
-   control plane into the consideration, then total outage will be
-   more than 10 minute in whole network, right?
-
-.. <MT> The control plane outage does not have to cause outage to
-   the users, but it may of course depending on the size of the system
-   as it's more likely that there's a failure that needs to be handled
-   by the control plane.
-
-2. Is it acceptable for end users ? Such as a planed service
-   interruption will lasting more than ten minutes for software
-   upgrade.
-
-.. <MT> For user plane, no it's not acceptable in case of
-   carrier-grade. The 5' 15" downtime should include unplanned and
-   planned downtimes.
-
-.. <Malla> I go agree with Maria, it is not acceptable.
-
-3. Will any VNFs still working well when VIM is down?
-
-.. <MT> In case of OpenStack it seems yes. .:)
-
-The maximum duration of an upgrade
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The duration of an upgrade is related to and proportional with the
-scale and the complexity of the OPNFV platform as well as the
-granularity (in function and in space) of the upgrade.
-
-.. <Malla> Also, if is a partial upgrade like module upgrade, it depends
-  also on the OPNFV modules and their tight connection entities as well.
-
-.. <MT> Since the maintenance window is shrinking and becoming non-existent
-  the duration of the upgrade is secondary to the requirement of smooth upgrade.
-  But probably we want to be able to put a time constraint on each upgrade
-  during which it must complete otherwise it is considered failed and the system
-  should be rolled back. I.e. in case of automatic execution it might not be clear
-  if an upgrade is long or just hanging. The time constraints may be a function
-  of the size of the system in terms of the upgrade object(s).
-
-The maximum duration of a roll back when an upgrade is failed
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The duration of a roll back is short than the corresponding upgrade. It
-depends on the duration of restore the software and configure data from
-pre-upgrade backup / snapshot.
-
-.. <MT> During the upgrade process two types of failure may happen:
-  In case we can recover from the failure by undoing the upgrade
-  actions it is possible to roll back the already executed part of the
-  upgrade in graceful manner introducing no more service outage than
-  what was introduced during the upgrade. Such a graceful roll back
-  requires typically the same amount of time as the executed portion of
-  the upgrade and impose minimal state/data loss.
-
-.. <MT> Requirement: It should be possible to roll back gracefully the
-  failed upgrade of stateful services of the control plane.
-  In case we cannot recover from the failure by just undoing the
-  upgrade actions, we have to restore the upgraded entities from their
-  backed up state. In other terms the system falls back to an earlier
-  state, which is typically a faster recovery procedure than graceful
-  roll back and depending on the statefulness of the entities involved it
-  may result in significant state/data loss.
-
-.. <MT> Two possible types of failures can happen during an upgrade
-
-.. <MT> We can recover from the failure that occurred in the upgrade process:
-  In this case, a graceful rolling back of the executed part of the
-  upgrade may be possible which would "undo" the executed part in a
-  similar fashion. Thus, such a roll back introduces no more service
-  outage during an upgrade than the executed part introduced. This
-  process typically requires the same amount of time as the executed
-  portion of the upgrade and impose minimal state/data loss.
-
-.. <MT> We cannot recover from the failure that occurred in the upgrade
-   process: In this case, the system needs to fall back to an earlier
-   consistent state by reloading this backed-up state. This is typically
-   a faster recovery procedure than the graceful roll back, but can cause
-   state/data loss. The state/data loss usually depends on the
-   statefulness of the entities whose state is restored from the backup.
-
-The maximum duration of a VNF interruption (Service outage)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Since not the entire process of a smooth upgrade will affect the VNFs,
-the duration of the VNF interruption may be shorter than the duration
-of the upgrade. In some cases, the VNF running without the control
-from of the VIM is acceptable.
-
-.. <MT> Should require explicitly that the NFVI should be able to
-  provide its services to the VNFs independent of the control plane?
-
-.. <MT> Requirement: The upgrade of the control plane must not cause
-  interruption of the NFVI services provided to the VNFs.
-
-.. <MT> With respect to carrier-grade the yearly service outage of the
-  VNF should not exceed 5' 15" regardless whether it is planned or
-  unplanned outage. Considering the HA requirements TL-9000 requires an
-  end-to-end service recovery time of 15 seconds based on which the ETSI
-  GS NFV-REL 001 V1.1.1 (2015-01) document defines three service
-  availability levels (SAL). The proposed example service recovery times
-  for these levels are:
-
-.. <MT> SAL1: 5-6 seconds
-
-.. <MT> SAL2: 10-15 seconds
-
-.. <MT> SAL3: 20-25 seconds
-
-.. <Pva> my comment was actually that the downtime metrics of the
-  underlying elements, components and services are small fraction of the
-  total E2E service availability time. No-one on the E2E service path
-  will get the whole downtime allocation (in this context it includes
-  upgrade process related outages for the services provided by VIM etc.
-  elements that are subject to upgrade process).
-
-.. <MT> So what you are saying is that the upgrade of any entity
-  (component, service) shouldn't cause even this much service
-  interruption. This was the reason I brought these figures here as well
-  that they are posing some kind of upper-upper boundary. Ideally the
-  interruption is in the millisecond range i.e. no more than a
-  switch-over or a live migration.
-
-.. <MT> Requirement: Any interruption caused to the VNF by the upgrade
-  of the NFVI should be in the sub-second range.
-
-.. <MT]> In the future we also need to consider the upgrade of the NFVI,
-  i.e. HW, firmware, hypervisors, host OS etc.
+As a telecom service system, OPNFV shall target carrier grade availability,
+which allows only about 5 minutes of outage per year. Based on this input
+and discussions of the current solutions, the following requirements are defined
+from the perspective of time constraints:
+
+- The OPNFV platform must be deployed with HA to make live upgrade possible. Considering
+  the scale, complexity, and life cycle of an OPNFV system, allocating less than
+  5 minutes per year for upgrades is unrealistic. Therefore OPNFV should
+  be deployed with HA, allowing part of the system to be upgraded while its
+  redundant parts continue to serve End-Users. This is expected to relax the time
+  constraint on the upgrade operation to an achievable level.
+
+- VNF service interruption for each switch-over should be in the sub-second range. In
+  an HA system, switching from an in-service system/component to the redundant
+  one normally causes a service interruption. For example, live-migrating a
+  virtual machine from one hypervisor to another typically takes the virtual
+  machine out of service for about 500ms. The sum of all these interruptions in
+  a year shall be less than 5 minutes in order to fulfill the five-nines carrier
+  grade availability. In addition, when an interruption exceeds one second, the End-User
+  experience is likely to be impacted. This document therefore recommends that each
+  service switch-over should take less than a second.
+
+- VIM interruption shall not result in NFVI interruption. The VIM in general has more
+  built-in logic and is therefore more complicated, and likely less reliable, than the NFVI.
+  To minimize the impact of the VIM on the NFVI, unless the VIM explicitly orders the NFVI
+  to stop functioning, the NFVI shall continue working as it should.
+
+- Total upgrade duration should be less than 2 hours. Even though the time constraint
+  is relaxed by the HA design, it is recommended to limit the total time of the upgrade
+  operation to 2 hours. The reason is that an upgrade might interfere with End-Users
+  unexpectedly, and a shorter maintenance window carries less risk. In this
+  document, the upgrade duration runs from the moment that End-User services
+  may be impacted to the moment that the upgrade is concluded with either
+  commit or rollback. Given the scale and complexity of an OPNFV system,
+  this requirement looks challenging; however, OPNFV implementations should
+  target it by introducing novel designs and solutions.
 
 Pre-upgrading Environment
 =========================
diff --git a/docs/requirements/105-Use_Cases.rst b/docs/requirements/105-Use_Cases.rst
index 9f13110..9183f0b 100644
--- a/docs/requirements/105-Use_Cases.rst
+++ b/docs/requirements/105-Use_Cases.rst
@@ -5,29 +5,6 @@ Use Cases
 This section describes the use cases in different system configuration
 to verify the requirements of Escalator.
 
-System Configurations
-=====================
-
-HA configuration
-^^^^^^^^^^^^^^^^
-
-A HA configuration system is very popular in the operator's data centre.
-It is a typical product environment. It is always running 7\*24 with VNFs
-running on it to provide services to the end users.
-
-
-Non-HA configuration
-^^^^^^^^^^^^^^^^^^^^
-
-A non-HA configuration system is normally deployed for experimental or
-development usages, such as a Vagrant/VM environment.
-
-Escalator supports the upgrade system in this configuration, but it may
-not guarantee a smooth upgrade.
-
-Use cases
-=========
-
 Use case #1: Smooth upgrade in a HA configuration
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 For a system with HA configuration, the operator can use Escalator to
diff --git a/docs/requirements/etc/conf.py b/docs/requirements/etc/conf.py
index 0066035..c933038 100644
--- a/docs/requirements/etc/conf.py
+++ b/docs/requirements/etc/conf.py
@@ -1,6 +1,17 @@
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+# implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import datetime
-import sys
-import os
 
 try:
     __import__('imp').find_module('sphinx.ext.numfig')
@@ -20,9 +31,7 @@ html_use_index = False
 
 pdf_documents = [('index', u'OPNFV', u'OPNFV Project', u'OPNFV')]
 pdf_fit_mode = "shrink"
-pdf_stylesheets = ['sphinx','kerning','a4']
-#latex_domain_indices = False
-#latex_use_modindex = False
+pdf_stylesheets = ['sphinx', 'kerning', 'a4']
 
 latex_elements = {
     'printindex': '',