diff --git a/Section6_VNF_HA.rst b/Section6_VNF_HA.rst
deleted file mode 100644
index afc84ac..0000000
--- a/Section6_VNF_HA.rst
+++ /dev/null
@@ -1,329 +0,0 @@
-=======================
-6 VNF High Availability
-=======================
-
-
-************************
-6.1 Service Availability
-************************
-
-In the context of NFV, Service Availability refers to the End-to-End (E2E) Service
-Availability, which covers all the elements of the end-to-end service (VNFs and
-infrastructure components) except the customer terminal, such as handsets,
-computers, or modems. The service availability requirements for NFV should be
-the same as those for legacy systems providing the same service.
-
-Service Availability is defined as:
-
-.. math::
-
-   \text{Service Availability} =
-   \frac{\text{total service available time}}
-   {\text{total service available time} + \text{total service recovery time}}
-
-Among other factors, the service recovery time depends on the number of redundant
-resources, provisioned and/or instantiated, that can be used to restore the service.
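The short sketch below works through the availability formula. It computes availability from accumulated available time and recovery time, and it converts an availability target into the recovery time that target permits per year. The function names and the 365-day year are assumptions made for this illustration. They do not come from ETSI GS NFV-REL 001.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a 365-day year


def service_availability(available_s: float, recovery_s: float) -> float:
    """Availability = available time / (available time + recovery time)."""
    total = available_s + recovery_s
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return available_s / total


def allowed_downtime_per_year(availability: float) -> float:
    """Seconds of recovery time per year permitted by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be in [0, 1]")
    return (1.0 - availability) * SECONDS_PER_YEAR


if __name__ == "__main__":
    # One year of operation containing 300 s of cumulative recovery time.
    a = service_availability(SECONDS_PER_YEAR - 300, 300)
    print(f"availability = {a:.7f}")
    # "Five nines" leaves roughly 315 s (about 5.3 minutes) per year.
    print(f"{allowed_downtime_per_year(0.99999):.1f} s/year")
```

The example shows how demanding high targets are. At 99.999 %, a single recovery per year that takes a few minutes already uses up the entire budget.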
-
-In the E2E relation, a Network Service is available only if all the necessary
-Network Functions are available and appropriately interconnected to collaborate
-according to the NF chain.
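The network service is available only when every NF in the chain is available, so the chain behaves like a series system. This sketch adds a simplifying assumption that the text above does not state: the NFs and their interconnections fail independently. Under that assumption, E2E availability is the product of the element availabilities. The example figures are hypothetical.

```python
from math import prod


def chain_availability(element_availabilities: list[float]) -> float:
    """E2E availability of a serial NF chain, assuming independent failures."""
    for a in element_availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError(f"availability out of range: {a}")
    # The service is up only if every element (VNF or interconnection) is up.
    return prod(element_availabilities)


if __name__ == "__main__":
    # Hypothetical chain: three VNFs and the two links that interconnect them.
    chain = [0.9999, 0.99999, 0.9999, 0.99999, 0.99999]
    print(f"E2E availability = {chain_availability(chain):.6f}")
```

Every factor is at most 1, so the E2E availability can never exceed that of the weakest element in the chain.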
-
-General Service Availability Requirements
-=========================================
-
-* It should be possible to define the E2E (V)NF chain, from which the E2E
-  availability requirements can be decomposed into requirements applicable to
-  individual VNFs and their interconnections.
-* The interconnection of the VNFs should be logical and maintained by the NFVI
-  with guaranteed characteristics, e.g. in case of failure, the connection should
-  be restored within the acceptable tolerance time.
-* These characteristics should be maintained during VM migration, failover and
-  switchover, scale in/out, and similar scenarios.
-* It should be possible to prioritize the different network services and their
-  VNFs. These priorities should be used when pre-emption policies are applied,
-  for example, due to a resource shortage.
-* The VIM should support policies to prioritize a certain VNF.
-* The VIM should be able to provide classified virtual resources to VNFs of
-  different Service Availability Levels (SALs).
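One simple way to decompose an E2E availability requirement is to split it equally across the chain. This sketch assumes independent failures and gives each of the n elements (VNFs and interconnections) the target A_E2E^(1/n). Equal apportionment is only one possible policy, chosen here for illustration. An operator could weight the elements differently.

```python
def per_element_target(e2e_target: float, n_elements: int) -> float:
    """Equal apportionment of an E2E availability target over a serial chain.

    Assumes independent failures, so the product of n equal per-element
    targets reproduces the E2E target.
    """
    if n_elements < 1:
        raise ValueError("the chain needs at least one element")
    if not 0.0 < e2e_target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return e2e_target ** (1.0 / n_elements)


if __name__ == "__main__":
    # Hypothetical: a 99.999 % E2E target over 3 VNFs and 2 interconnections.
    target = per_element_target(0.99999, 5)
    print(f"each element needs availability >= {target:.7f}")
```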
-
-6.1.1 Service Availability Classification Levels
-================================================
-
-The three Service Availability Levels (SALs) defined in [ETSI-NFV-REL]_ are
-classified in Table 1. They are based on the relevant ITU-T recommendations
-and reflect the service types and the customer agreements a network operator
-should consider.
-
-.. [ETSI-NFV-REL] `ETSI GS NFV-REL 001 V1.1.1 (2015-01) <http://www.etsi.org/deliver/etsi_gs/NFV-REL/001_099/001/01.01.01_60/gs_NFV-REL001v010101p.pdf>`_
-
-
-*Table 1: Service Availability classification levels*
-
-+----------+------------------------+------------------------------+--------------------------------+
-| SAL Type | Customer Type          | Service/Function             | Notes                          |
-+==========+========================+==============================+================================+
-| Level 1  | Network Operator       | * Intra-carrier engineering  | Sub-levels within Level 1      |
-|          | Control Traffic        |   traffic                    | may be created by the          |
-|          |                        | * Emergency                  | Network Operator depending     |
-|          | Government/Regulatory  |   telecommunication service  | on Customer demands, e.g.:     |
-|          | Emergency Services     |   (emergency response,       |                                |
-|          |                        |   emergency dispatch)        | * 1A - Control;                |
-|          |                        | * Critical Network           | * 1B - Real-time;              |
-|          |                        |   Infrastructure Functions   | * 1C - Data;                   |
-|          |                        |   (e.g. VoLTE functions,     |                                |
-|          |                        |   DNS servers, etc.)         | May require 1+1 Redundancy     |
-|          |                        |                              | with Instantaneous             |
-|          |                        |                              | Switchover.                    |
-+----------+------------------------+------------------------------+--------------------------------+
-| Level 2  | Enterprise and/or      | * VPN                        | Sub-levels within Level 2      |
-|          | large-scale customers  | * Real-time traffic (voice   | may be created by the          |
-|          | (e.g. corporations,    |   and video)                 | Network Operator depending     |
-|          | universities)          | * Network Infrastructure     | on Customer demands, e.g.:     |
-|          |                        |   Functions supporting       |                                |
-|          | Network Operators      |   Level 2 services (e.g.     | * 2A - VPN;                    |
-|          | (Tier1/2/3) service    |   VPN servers, corporate     | * 2B - Real-time;              |
-|          | traffic                |   Web/Mail servers)          | * 2C - Data;                   |
-|          |                        |                              |                                |
-|          |                        |                              | May require 1:1 Redundancy     |
-|          |                        |                              | with Fast (possibly            |
-|          |                        |                              | Instantaneous) Switchover.     |
-+----------+------------------------+------------------------------+--------------------------------+
-| Level 3  | General Consumer       | * Data traffic (including    | While this is typically        |
-|          | Public and ISP Traffic |   voice and video traffic    | considered "Best Effort"       |
-|          |                        |   provided by OTT)           | traffic, Network Operators     |
-|          |                        | * Network Infrastructure     | are expected to devote         |
-|          |                        |   Functions supporting       | sufficient resources to        |
-|          |                        |   Level 3 services           | assure "satisfactory" levels   |
-|          |                        |                              | of availability. This level    |
-|          |                        |                              | of service may be pre-empted   |
-|          |                        |                              | by services with higher        |
-|          |                        |                              | Service Availability Levels.   |
-|          |                        |                              | May require M+1 Redundancy     |
-|          |                        |                              | with Fast Switchover, where    |
-|          |                        |                              | M > 1 and the value of M is    |
-|          |                        |                              | to be determined by further    |
-|          |                        |                              | study.                         |
-+----------+------------------------+------------------------------+--------------------------------+
-
-Requirements
-^^^^^^^^^^^^
-
-* It shall be possible to define different service availability levels.
-* It shall be possible to classify the virtual resources for the different
-  availability class levels.
-* The VIM shall provide a mechanism by which VNF-specific requirements
-  can be mapped to NFVI-specific capabilities.
-
-More specifically, the requirements and capabilities may or may not be made up of the
-same KPI-like strings, but the cloud administrator must be able to configure which
-HA-specific VNF requirements are satisfied by which HA-specific NFVI capabilities.
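The administrator-configured mapping described above can be pictured as a lookup table. Each HA-specific VNF requirement string maps to the set of NFVI capability strings that can satisfy it. Every requirement and capability name below is a hypothetical placeholder, not defined by ETSI or by any VIM. The sketch only illustrates the matching logic.

```python
# Hypothetical, administrator-configured mapping: each VNF requirement is
# satisfied by any one of the listed NFVI capabilities.
REQUIREMENT_TO_CAPABILITIES: dict[str, set[str]] = {
    "sal-1": {"host-ha-1plus1", "host-ha-1to1"},
    "fast-failure-detection": {"hw-watchdog", "bfd-1s"},
    "persistent-state": {"replicated-block-storage"},
}


def unmet_requirements(vnf_requirements: set[str], host_capabilities: set[str]) -> set[str]:
    """Return the VNF requirements that the host's capabilities do not satisfy."""
    unmet = set()
    for req in vnf_requirements:
        # Requirements missing from the mapping are treated as unsatisfiable.
        satisfying = REQUIREMENT_TO_CAPABILITIES.get(req, set())
        if not satisfying & host_capabilities:
            unmet.add(req)
    return unmet


if __name__ == "__main__":
    host = {"host-ha-1to1", "hw-watchdog"}
    print(unmet_requirements({"sal-1", "fast-failure-detection", "persistent-state"}, host))
```

Requirement and capability strings are kept as separate vocabularies here. This reflects the point above that they need not share the same KPI-like strings.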
-
-
-
-6.1.2 Metrics for Service Availability
-======================================
-
-[ETSI-NFV-REL]_ identifies four metrics relevant to service availability:
-
-* Failure recovery time,
-* Failure impact fraction,
-* Failure frequency, and
-* Call drop rate.
-
-6.1.2.1 Failure Recovery Time
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The failure recovery time is the time interval from the occurrence of an abnormal
-event (e.g. a failure or a manual interruption of service) until the recovery of the
-service, regardless of whether the abnormal event is scheduled or unscheduled. For
-the unscheduled case, the recovery time includes the failure detection time and the
-failure restoration time.
-More specifically, restoration also allows the service to be recovered by restarting
-the failed provider(s), while failover implies that the service is recovered by a
-redundant provider taking over the service. This provider may be a standby
-(i.e. it synchronizes the service state with the active provider) or a spare
-(i.e. it has no state information). Accordingly, failover also covers switchover,
-that is, an orderly takeover of the service from the active provider by the
-standby/spare.
-
-Requirements
-^^^^^^^^^^^^
-
-* It should be irrelevant whether the abnormal event is due to a scheduled or
-  unscheduled operation or is caused by a fault.
-* Failure detection mechanisms should be available in the NFVI and configurable
-  so that the target recovery times can be met.
-* Abnormal events should be logged and communicated (i.e. notifications and
-  alarms, as appropriate).
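For an unscheduled failure, the recovery time is the detection time plus the restoration time. A configurable detection mechanism therefore has to be tuned against the recovery target. This sketch assumes a heartbeat-based detector, which is one common technique but is not mandated by the text. Its worst-case detection time is approximated as the heartbeat interval multiplied by the number of consecutive missed heartbeats tolerated. The parameter names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatConfig:
    interval_s: float  # time between heartbeats (hypothetical parameter)
    missed_limit: int  # consecutive misses before declaring a failure

    def worst_case_detection_s(self) -> float:
        # A failure just after a heartbeat is noticed only once
        # `missed_limit` further heartbeats have failed to arrive.
        return self.interval_s * self.missed_limit


def meets_recovery_target(cfg: HeartbeatConfig, restoration_s: float, target_s: float) -> bool:
    """Unscheduled recovery time = detection time + restoration time."""
    return cfg.worst_case_detection_s() + restoration_s <= target_s


def max_interval_for_target(missed_limit: int, restoration_s: float, target_s: float) -> float:
    """Largest heartbeat interval that still leaves the target achievable."""
    budget = target_s - restoration_s
    if budget <= 0:
        raise ValueError("restoration alone already exceeds the target")
    return budget / missed_limit


if __name__ == "__main__":
    # Hypothetical: 3 missed heartbeats, 2 s restoration, 5 s recovery target.
    print(max_interval_for_target(3, 2.0, 5.0))  # 1.0
    print(meets_recovery_target(HeartbeatConfig(1.0, 3), 2.0, 5.0))  # True
```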
-
-The TL-9000 forum has specified a service interruption time of 15 seconds as an
-outage for all traditional telecom system services. [ETSI-NFV-REL]_ recommends
-setting different thresholds for the different Service Availability Levels; Table 2
-gives an example setting. Note that at every Service Availability Level,
-real-time services require the fastest recovery time, while data services can
-tolerate longer recovery times. These recovery times apply to the user plane.
-A failure in the control plane does not necessarily impact the user plane. The
-main concern should be simultaneous failures in the control and user planes, as
-the user plane typically cannot recover without the control plane. However, an
-HA mechanism in the VNF itself can further mitigate this risk. Note also that
-the impact on the user plane depends on which control plane service experiences
-the failure, as some of them are more critical than others.
-
-
-*Table 2: Example service recovery times for the service availability levels*
-
-+------+-------------------+--------------------------------------------+
-| SAL  | Service Recovery  | Notes                                      |
-|      | Time Threshold    |                                            |
-+======+===================+============================================+
-| 1    | 5 - 6 seconds     | Recommendation: Redundant resources to be  |
-|      |                   | made available on-site to ensure fast      |
-|      |                   | recovery.                                  |
-+------+-------------------+--------------------------------------------+
-| 2    | 10 - 15 seconds   | Recommendation: Redundant resources to be  |
-|      |                   | available as a mix of on-site and off-site |
-|      |                   | as appropriate:                            |
-|      |                   |                                            |
-|      |                   | * On-site resources to be used for         |
-|      |                   |   recovery of real-time services.          |
-|      |                   | * Off-site resources to be used for        |
-|      |                   |   recovery of data services.               |
-+------+-------------------+--------------------------------------------+
-| 3    | 20 - 25 seconds   | Recommendation: Redundant resources to be  |
-|      |                   | mostly available off-site. Real-time       |
-|      |                   | services should be recovered before data   |
-|      |                   | services.                                  |
-+------+-------------------+--------------------------------------------+
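The example thresholds in Table 2 can be encoded directly as data. Table 2 gives ranges rather than single limits, so this sketch does not pick one bound as the hard limit. It classifies a measured recovery time as at or below the range, inside it, or above it. The ``Verdict`` names are assumptions made for this illustration.

```python
from enum import Enum

# Example service recovery time thresholds from Table 2, in seconds.
RECOVERY_THRESHOLDS_S: dict[int, tuple[float, float]] = {
    1: (5.0, 6.0),
    2: (10.0, 15.0),
    3: (20.0, 25.0),
}


class Verdict(Enum):
    WITHIN = "within"      # at or below the lower bound of the range
    MARGINAL = "marginal"  # inside the example range
    EXCEEDED = "exceeded"  # above the upper bound of the range


def check_recovery(sal: int, recovery_s: float) -> Verdict:
    low, high = RECOVERY_THRESHOLDS_S[sal]  # KeyError for an undefined SAL
    if recovery_s <= low:
        return Verdict.WITHIN
    if recovery_s <= high:
        return Verdict.MARGINAL
    return Verdict.EXCEEDED


if __name__ == "__main__":
    print(check_recovery(1, 5.5))   # Verdict.MARGINAL
    print(check_recovery(2, 9.0))   # Verdict.WITHIN
    print(check_recovery(3, 30.0))  # Verdict.EXCEEDED
```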
-
-
-6.1.2.2 Failure Impact Fraction
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The failure impact fraction is the maximum percentage of the capacity or user
-population affected by a failure, compared with the total capacity or the user
-population supported by a service. It is directly associated with the failure
-impact zone, which is the set of resources/elements of the system to which the
-fault may propagate.
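This minimal sketch computes the failure impact fraction of a single host failure. It makes two assumptions: each VNF instance serves a known number of users, and a host failure takes down every instance placed on that host. Under those assumptions, the fraction is the share of users served by instances on the worst-affected host. The placement data is hypothetical. Comparing the two placements shows why limiting VM collocation reduces the fraction.

```python
from collections import defaultdict


def max_impact_fraction(placement: dict[str, str], users: dict[str, int]) -> float:
    """Worst-case fraction of users affected by the failure of a single host.

    placement maps instance -> host; users maps instance -> users served.
    Assumes a host failure takes down every instance placed on that host.
    """
    total = sum(users.values())
    per_host: dict[str, int] = defaultdict(int)
    for instance, host in placement.items():
        per_host[host] += users.get(instance, 0)
    if total == 0 or not per_host:
        return 0.0
    return max(per_host.values()) / total


if __name__ == "__main__":
    users = {"vm1": 1000, "vm2": 1000, "vm3": 1000, "vm4": 1000}
    collocated = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostA", "vm4": "hostB"}
    spread = {"vm1": "hostA", "vm2": "hostB", "vm3": "hostC", "vm4": "hostD"}
    print(max_impact_fraction(collocated, users))  # 0.75
    print(max_impact_fraction(spread, users))      # 0.25
```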
-
-Requirements
-^^^^^^^^^^^^
-
-* It should be possible to define the failure impact zone for all the elements
-  of the system.
-* When a failure of an element is detected, its failure impact zone must be
-  isolated before the associated recovery mechanism is triggered.
-* If the isolation of the failure impact zone is unsuccessful, isolation should
-  be attempted at the next higher level as soon as possible to prevent fault
-  propagation.
-* It should be possible to define different levels of failure impact zones with
-  associated isolation and alarm generation policies.
-* It should be possible to limit the collocation of VMs to reduce the failure
-  impact zone as well as to provide sufficient resources.
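The isolation and escalation rules above can be sketched as a loop over an ordered list of zone levels. The loop isolates first and widens the zone whenever isolation fails, raising an alarm at each escalation. The level names and the ``isolate`` and ``alarm`` callbacks are hypothetical stand-ins for whatever isolation actions (e.g. fencing a VM or a host) and alarm mechanisms an actual NFVI provides.

```python
from typing import Callable, Optional, Sequence

# Hypothetical failure impact zone levels, from narrowest to widest.
ZONE_LEVELS: Sequence[str] = ("vm", "host", "rack", "site")


def isolate_with_escalation(
    start_level: str,
    isolate: Callable[[str], bool],
    alarm: Callable[[str], None],
) -> Optional[str]:
    """Try to isolate the fault, escalating to wider zones on failure.

    Returns the level at which isolation succeeded, or None if every level
    failed. Recovery should only be triggered after a non-None result.
    """
    start = ZONE_LEVELS.index(start_level)
    for level in ZONE_LEVELS[start:]:
        if isolate(level):
            return level
        alarm(f"isolation failed at {level} level, escalating")
    return None


if __name__ == "__main__":
    alarms: list[str] = []
    # Simulated NFVI in which fencing the VM fails but fencing the host works.
    result = isolate_with_escalation("vm", lambda lvl: lvl == "host", alarms.append)
    print(result, alarms)
```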
-
-6.1.2.3 Failure Frequency
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Failure frequency is the number of failures in a certain period of time.
-
-Requirements
-^^^^^^^^^^^^
-
-* There should be a probation period for each failure impact zone within which
-  failures are correlated.
-* The failure threshold and the probation period for the failure impact zones
-  should be configurable.
-* It should be possible to define failure escalation policies for the different
-  failure impact zones.
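This sketch illustrates the probation-period requirement. It assumes that failures within a zone are correlated simply by counting them in a sliding time window. If the count within the configurable probation period reaches the configurable threshold, the zone's escalation policy should be triggered. The class, its parameters, and the non-decreasing timestamps are assumptions made for this illustration.

```python
from collections import deque


class FailureFrequencyMonitor:
    """Counts failures of one failure impact zone within a sliding window."""

    def __init__(self, probation_period_s: float, threshold: int) -> None:
        if probation_period_s <= 0 or threshold < 1:
            raise ValueError("invalid probation period or threshold")
        self.probation_period_s = probation_period_s
        self.threshold = threshold
        self._failures: deque[float] = deque()

    def record_failure(self, timestamp_s: float) -> bool:
        """Record a failure (timestamps must not decrease).

        Returns True if the escalation policy should be triggered.
        """
        self._failures.append(timestamp_s)
        # Drop failures that fall outside the probation period.
        while self._failures and timestamp_s - self._failures[0] > self.probation_period_s:
            self._failures.popleft()
        return len(self._failures) >= self.threshold


if __name__ == "__main__":
    monitor = FailureFrequencyMonitor(probation_period_s=600, threshold=3)
    for t in (0, 200, 900, 1000, 1100):
        print(t, monitor.record_failure(t))
```

The failures at 0 s and 200 s have left the window by 900 s, so only the last three failures, which fall within 600 s of each other, trigger escalation.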
-
-
-6.1.2.4 Call Drop Rate
-^^^^^^^^^^^^^^^^^^^^^^
-
-Call drop rate reflects service continuity as well as system reliability and
-stability. The metric is internal to the VNF and is therefore not specified
-further for the NFV environment.
-
-Requirements
-^^^^^^^^^^^^
-
-* It shall be possible to specify, for each service availability class, the
-  associated availability metrics and their thresholds.
-* It shall be possible to collect data for the defined metrics.
-* It shall be possible to delegate the enforcement of some thresholds to the NFVI.
-* Accordingly, it shall be possible to request virtual resources with guaranteed
-  characteristics, such as guaranteed latency between VMs (i.e. VNFCs), between a
-  VM and storage, and between VNFs.
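The requirements above can be captured in a small data model: per-class availability metrics with thresholds, collection of metric data, and delegation of some thresholds to the NFVI. The metric names, the limits, and the ``enforced_by`` field are illustrative assumptions. The recovery-time limits reuse the upper bounds of Table 2, and the other limits are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricThreshold:
    metric: str       # e.g. "failure_recovery_time_s" (hypothetical name)
    limit: float      # maximum acceptable value
    enforced_by: str  # "NFVI" if delegated to the NFVI, otherwise "VNF"


# Hypothetical per-SAL configuration; placeholder values except where noted.
SAL_METRICS: dict[int, list[MetricThreshold]] = {
    1: [MetricThreshold("failure_recovery_time_s", 6.0, "NFVI"),  # Table 2
        MetricThreshold("failure_impact_fraction", 0.05, "NFVI")],
    3: [MetricThreshold("failure_recovery_time_s", 25.0, "NFVI"),  # Table 2
        MetricThreshold("call_drop_rate", 0.01, "VNF")],
}


def violations(sal: int, samples: dict[str, float], enforcer: str) -> list[str]:
    """Over-limit metrics of one SAL that the given enforcer is responsible for.

    Metrics with no collected sample are skipped, since they cannot be judged.
    """
    return [
        t.metric
        for t in SAL_METRICS.get(sal, [])
        if t.enforced_by == enforcer and t.metric in samples and samples[t.metric] > t.limit
    ]


if __name__ == "__main__":
    collected = {"failure_recovery_time_s": 7.2, "failure_impact_fraction": 0.01}
    print(violations(1, collected, "NFVI"))  # ['failure_recovery_time_s']
```

The call drop rate is kept on the VNF side in this sketch, consistent with the statement that this metric is internal to the VNF.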
-
-
-**********************
-6.2 Service Continuity
-**********************
-
-The determining factor with respect to service continuity is the statefulness of
-the VNF. If the VNF is stateless, there is no state information that needs to be
-preserved to prevent the perception of service discontinuity in case of failure or
-other disruptive events.
-If the VNF is stateful, the NF has a service state that needs to be preserved
-throughout such disruptive events in order to shield the service consumer from
-these events and provide the perception of service continuity. A VNF may maintain
-this state internally, externally, or in a combination of both, with or without the
-NFVI being aware of the purpose of the stored data.
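The sketch below illustrates the stateful case with externally maintained state. An active VNFC writes its per-session state to an external store after every update. This lets a standby, or a freshly started replacement, resume the sessions after a failover. The in-memory ``StateStore`` and ``SessionHandler`` classes are hypothetical stand-ins for whatever external or NFVI-provided storage a real VNF would use. A stateless VNF would not need the store at all.

```python
class StateStore:
    """Hypothetical external state store (in memory here for illustration)."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def put(self, key: str, value: dict) -> None:
        self._data[key] = dict(value)  # store a copy, not a live reference

    def get_all(self) -> dict[str, dict]:
        return {k: dict(v) for k, v in self._data.items()}


class SessionHandler:
    """A VNFC instance that keeps session state and checkpoints it externally."""

    def __init__(self, store: StateStore) -> None:
        self.store = store
        self.sessions: dict[str, dict] = {}

    def update(self, session_id: str, **fields) -> None:
        state = self.sessions.setdefault(session_id, {})
        state.update(fields)
        self.store.put(session_id, state)  # checkpoint after every change

    @classmethod
    def take_over(cls, store: StateStore) -> "SessionHandler":
        # A standby or spare rebuilds the service state from the store.
        handler = cls(store)
        handler.sessions = store.get_all()
        return handler


if __name__ == "__main__":
    store = StateStore()
    active = SessionHandler(store)
    active.update("call-1", state="connected", codec="AMR-WB")
    standby = SessionHandler.take_over(store)  # after the active instance fails
    print(standby.sessions)
```

Checkpointing after every change keeps the lost-state window minimal but costs a write per update. A real VNF might batch checkpoints, accepting some state loss on failover.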
-
-Requirements
-============
-
-* The NFVI should maintain the number of VMs provided to the VNF in the face of
-  failures, i.e. failed VM instances should be replaced by new VM instances.
-* It should be possible to specify whether the NFVI or the VNF/VNFM handles
-  service recovery and continuity.
-* If the VNF/VNFM handles service recovery, it should be able to receive error
-  reports and/or detect failures in a timely manner.
-* The VNF (i.e. between VNFCs) may have its own fault detection mechanism, which
-  might be triggered before the error report from the underlying NFVI is
-  received; therefore, the NFVI/VIM should not attempt to preserve the state of a
-  failing VM unless configured to do so.
-* The VNF/VNFM should be able to initiate the repair/reboot of NFVI resources
-  (e.g. to recover from a fault persisting at the VNF level, which amounts to
-  escalating the failure impact zone).
-* It should be possible to disallow the live migration of VMs and, when it is
-  allowed, to specify the tolerated interruption time.
-* It should be possible to restrict the simultaneous migration of VMs hosting a
-  given VNF.
-* It should be possible to define under which circumstances NFV-MANO, in
-  collaboration with the NFVI, should provide error handling (e.g. the VNF handles
-  local recoveries while NFV-MANO handles geo-redundancy).
-* The NFVI/VIM should provide virtual resources such as storage according to the
-  needs of the VNF, with the required guarantees (see virtual resource
-  classification).
-* The VNF shall be able to define the information to be stored on its associated
-  virtual storage.
-* It should be possible to define HA requirements for the storage (its
-  availability, accessibility, and resilience options), i.e. the NFVI shall handle
-  the failover for the storage.
-* The NFVI shall handle network/connectivity failures transparently to the VNFs.
-* VNFs with different requirements should be able to coexist in the NFV
-  Framework.
-* Scale in/out is triggered by the VNF (VNFM) towards the VIM (to be executed in
-  the NFVI).
-* It should be possible to define the metrics to monitor and the related
-  thresholds that trigger the scale in/out operation.
-* A scale-in operation should not jeopardize availability (managed by the
-  VNF/VNFM), i.e. resources can only be removed one at a time, with a period in
-  between sufficient for the VNF to restore any required redundancy.
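The scale-in rule in the last requirement can be sketched as a loop. It removes resources one at a time and pauses after each removal long enough for the VNF to restore redundancy. The ``remove_instance`` and ``redundancy_restored`` callbacks, the minimum instance count, the choice of removing the last-listed instance, and the timing defaults are all hypothetical. A real VNFM would issue the removal towards the VIM rather than call a local function.

```python
import time
from typing import Callable


def scale_in(
    instances: list[str],
    count: int,
    min_instances: int,
    remove_instance: Callable[[str], None],
    redundancy_restored: Callable[[], bool],
    settle_s: float = 30.0,
    poll_s: float = 1.0,
    max_wait_s: float = 300.0,
) -> list[str]:
    """Remove up to `count` instances, one at a time; return the removed ones."""
    remaining = list(instances)
    removed: list[str] = []
    for _ in range(count):
        if len(remaining) <= min_instances:
            break  # never go below the minimum the availability class needs
        victim = remaining.pop()  # assumption: remove the last-listed instance
        remove_instance(victim)   # hypothetical request towards the VIM
        removed.append(victim)
        # Give the VNF time to re-establish redundancy before the next removal.
        time.sleep(settle_s)
        waited = 0.0
        while not redundancy_restored():
            if waited >= max_wait_s:
                raise TimeoutError(f"redundancy not restored after removing {victim}")
            time.sleep(poll_s)
            waited += poll_s
    return removed
```

Stopping with a ``TimeoutError``, instead of continuing, keeps a failed redundancy restoration from being compounded by further removals.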
-