From 2c49ef0d126104728aca2851bd98b209c563b2c2 Mon Sep 17 00:00:00 2001 From: Maria Toeroe Date: Tue, 11 Aug 2015 11:22:43 -0400 Subject: Create Section 6 in rst format Add line breaks and remove trailing white space Remove some more spaces Remove comments from etherpad Fix table and create a reference to the ETSI NFV-REL document Clarifications in section 6.1.2.1 and some formatting change Fix table and references and modify requirements in 6.1.1. JIRA: HA-2 Change-Id: I4ac99548d5e9e729ad150a3ad4166ae84f8a9b8e Signed-off-by: Maria Toeroe --- Section6_VNF_HA.rst | 329 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 329 insertions(+) create mode 100644 Section6_VNF_HA.rst diff --git a/Section6_VNF_HA.rst b/Section6_VNF_HA.rst new file mode 100644 index 0000000..afc84ac --- /dev/null +++ b/Section6_VNF_HA.rst @@ -0,0 +1,329 @@ +======================= +6 VNF High Availability +======================= + + +************************ +6.1 Service Availability +************************ + +In the context of NFV, Service Availability refers to the End-to-End (E2E) Service +Availability which includes all the elements in the end-to-end service (VNFs and +infrastructure components) with the exception of the customer terminal such as +handsets, computers, modems, etc. The service availability requirements for NFV +should be the same as those for legacy systems (for the same service). + +Service Availability = +total service available time / +(total service available time + total service recovery time) + +The service recovery time among others depends on the number of redundant resources +provisioned and/or instantiated that can be used for restoring the service. + +In the E2E relation a Network Service is available only of all the necessary +Network Functions are available and interconnected appropriately to collaborate +according to the NF chain. + +General Service Availability Requirements +========================================= + +* We need to be able to define the E2E (V)NF chain based on which the E2E availability + requirements can be decomposed into requirements applicable to individual VNFs and + their interconnections +* The interconnection of the VNFs should be logical and be maintained by the NFVI with + guaranteed characteristics, e.g. in case of failure the connection should be + restored within the acceptable tolerance time +* These characteristics should be maintained in VM migration, failovers and switchover, + scale in/out, etc. scenarios +* It should be possible to prioritize the different network services and their VNFs. + These priorities should be used when pre-emption policies are applied due to + resource shortage for example. +* VIM should support policies to prioritize a certain VNF. +* VIM should be able to provide classified virtual resources to VNFs in different SAL + +6.1.1 Service Availability Classification Levels +================================================ + +The [ETSI-NFV-REL_] defined three Service Availability Levels +(SAL) are classified in Table 1. They are based on the relevant ITU-T recommendations +and reflect the service types and the customer agreements a network operator should +consider. + +.. [ETSI-NFV-REL] `ETSI GS NFV-REL 001 V1.1.1 (2015-01) `_ + + +*Table 1: Service Availability classification levels* + ++-------------+-----------------+-----------------------+---------------------+ +|SAL Type | Customer Type | Service/Function | Notes | ++=============+=================+=======================+=====================+ +|Level 1 | Network Operator| * Intra-carrier | Sub-levels within | +| | Control Traffic | engineering | Level 1 may be | +| | | traffic | created by the | +| | Government/ | * Emergency | Network Operator | +| | Regulatory | telecommunication | depending on | +| | Emergency | service (emergency | Customer demands | +| | Services | response, emergency| E.g.: | +| | | dispatch) | | +| | | * Critical Network | * 1A - Control; | +| | | Infrastructure | * 1B - Real-time; | +| | | Functions (e.g | * 1C - Data; | +| | | VoLTE functions | | +| | | DNS Servers,etc.) | May require 1+1 | +| | | | Redundancy with | +| | | | Instantaneous | +| | | | Switchover | ++-------------+-----------------+-----------------------+---------------------+ +|Level 2 | Enterprise and/ | * VPN | Sub-levels within | +| | or large scale | * Real-time traffic | Level 2 may be | +| | customers | (Voice and video) | created by the | +| | (e.g. | * Network | Network Operator | +| | Corporations, | Infrastructure | depending on | +| | University) | Functions | Customer demands. | +| | | supporting Level | E.g.: | +| | Network | 2 services (e.g. | | +| | Operators | VPN servers, | * 2A - VPN; | +| | (Tier1/2/3) | Corporate Web/ | * 2B - Real-time; | +| | service traffic | Mail servers) | * 2C - Data; | +| | | | | +| | | | May require 1:1 | +| | | | Redundancy with | +| | | | Fast (maybe | +| | | | Instantaneous) | +| | | | Switchover | ++-------------+-----------------+-----------------------+---------------------+ +|Level 3 | General Consumer| * Data traffic | While this is | +| | Public and ISP | (including voice | typically | +| | Traffic | and video traffic | considered to be | +| | | provided by OTT) | "Best Effort" | +| | | * Network | traffic, it is | +| | | Infrastructure | expected that | +| | | Functions | Network Operators | +| | | supporting Level | will devote | +| | | 3 services | sufficient | +| | | | resources to | +| | | | assure | +| | | | "satisfactory" | +| | | | levels of | +| | | | availability. | +| | | | This level of | +| | | | service may be | +| | | | pre-empted by | +| | | | those with | +| | | | higher levels of | +| | | | Service | +| | | | Availability. May | +| | | | require M+1 | +| | | | Redundancy with | +| | | | Fast Switchover; | +| | | | where M > 1 and | +| | | | the value of M to | +| | | | be determined by | +| | | | further study | ++-------------+-----------------+-----------------------+---------------------+ + +Requirements +^^^^^^^^^^^^ + +* It shall be possible to define different service availability levels +* It shall be possible to classify the virtual resources for the different + availability class levels +* The VIM shall provide a mechanism by which VNF-specific requirements + can be mapped to NFVI-specific capabilities. + +More specifically, the requirements and capabilities may or may not be made up of the +same KPI-like strings, but the cloud administrator must be able to configure which +HA-specific VNF requirements are satisfied by which HA-specific NFVI capabilities. + + + +6.1.2 Metrics for Service Availability +====================================== + +The [ETSI-NFV-REL_] identifies four metrics relevant to service +availability: + +* Failure recovery time, +* Failure impact fraction, +* Failure frequency, and +* Call drop rate. + +6.1.2.1 Failure Recovery Time +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The failure recovery time is the time interval from the occurrence of an abnormal +event (e.g. failure, manual interruption of service, etc.) until the recovery of the +service regardless if it is a scheduled or unscheduled abnormal event. For the +unscheduled case, the recovery time includes the failure detection time and the +failure restoration time. +More specifically restoration also allows for a service recovery by the restart of +the failed provider(s) while failover implies that the service is recovered by a +redundant provider taking over the service. This provider may be a standby +(i.e. synchronizing the service state with the active provider) or a spare +(i.e. having no state information). Accordingly failover also means switchover, that +is, an orederly takeover of the service from the active provider by the standby/spare. + +Requirements +^^^^^^^^^^^^ + +* It should be irrelevant whether the abnormal event is due to a scheduled or + unscheduled operation or it is caused by a fault. +* Failure detection mechanisms should be available in the NFVI and configurable so + that the target recovery times can be met +* Abnormal events should be logged and communicated (i.e. notifications and alarms as + appropriate) + +The TL-9000 forum has specified a service interruption time of 15 seconds as outage +for all traditional telecom system services. [ETSI-NFV-REL_] +recommends the setting of different thresholds for the different Service Availability +Levels. An example setting is given in the following table 2. Note that for all +Service Availability levels Real-time Services require the fastest recovery time. +Data services can tolerate longer recovery times. These recovery times are applicable +to the user plane. A failure in the control plane does not have to impact the user plane. +The main concern should be simultaneous failures in the control and user planes +as the user plane cannot typically recover without the control plane. However an HA +mechanism in VNF itself can further mitigate the risk. Note also that the impact on +the user plane depends on the control plane service experiencing the failure, +some of them are more critical than others. + + +*Table 2: Example service recovery times for the service availability levels* + ++------------+-----------------+------------------------------------------+ +|SAL | Service | Notes | +| | Recovery | | +| | Time | | +| | Threshold | | ++============+=================+==========================================+ +|1 | 5 - 6 seconds | Recommendation: Redundant resources to be| +| | | made available on-site to ensure fast | +| | | recovery. | ++------------+-----------------+------------------------------------------+ +|2 | 10 - 15 seconds | Recommendation: Redundant resources to be| +| | | available as a mix of on-site and off- | +| | | site as appropriate. | +| | | | +| | | * On-site resources to be utilized for | +| | | recovery of real-time services. | +| | | * Off-site resources to be utilized for | +| | | recovery of data services. | ++------------+-----------------+------------------------------------------+ +|3 | 20 - 25 seconds | Recommendation: Redundant resources to be| +| | | mostly available off-site. Real-time | +| | | services should be recovered before data | +| | | services | ++------------+-----------------+------------------------------------------+ + + +6.1.2.2 Failure Impact Fraction +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The failure impact fraction is the maximum percentage of the capacity or user +population affected by a failure compared with the total capacity or the user +population supported by a service. It is directly associated with the failure impact +zone which is the set of resources/elements of the system to which the fault may +propagate. + +Requirements +^^^^^^^^^^^^ + +* It should be possible to define the failure impact zone for all the elements of the + system +* At the detection of a failure of an element, its failure impact zone must be + isolated before the associated recovery mechanism is triggered +* If the isolation of the failure impact zone is unsuccessful the isolation should be + attempted at the next higher level as soon as possible to prevent fault propagation. +* It should be possible to define different levels of failure impact zones with + associated isolation and alarm generation policies +* It should be possible to limit the collocation of VMs to reduce the failure impact + zone as well as to provide sufficient resources + +6.1.2.3 Failure Frequency +^^^^^^^^^^^^^^^^^^^^^^^^^ + +Failure frequency is the number of failures in a certain period of time. + +Requirements +^^^^^^^^^^^^ + +* There should be a probation period for each failure impact zones within which + failures are correlated. +* The threshold and the probation period for the failure impact zones should be + configurable +* It should be possible to define failure escalation policies for the different + failure impact zones + + +6.1.2.4 Call Drop Rate +^^^^^^^^^^^^^^^^^^^^^^ + +Call drop rate reflects service continuity as well as system reliability and +stability. The metric is inside the VNF and therefore is not specified further for +the NFV environment. + +Requirements +^^^^^^^^^^^^ + +* It shall be possible to specify for each service availability class the associated + availability metrics and their thresholds +* It shall be possible to collect data for the defined metrics +* It shall be possible to delegate the enforcement of some thresholds to the NFVI +* Accordingly it shall be possible to request virtual resources with guaranteed + characteristics, such as guaranteed latency between VMs (i.e. VNFCs), between a VM + and storage, between VNFs + + +********************** +6.2 Service Continuity +********************** + +The determining factor with respect to service continuity is the statefulness of the +VNF. If the VNF is stateless, there is no state information which needs to be +preserved to prevent the perception of service discontinuity in case of failure or +other disruptive events. +If the VNF is stateful, the NF has a service state which needs to be preserved +throughout such disruptive events in order to shield the service consumer from these +events and provide the perception of service continuity. A VNF may maintain this state +internally or externally or a combination with or without the NFVI being aware of the +purpose of the stored data. + +Requirements +============ + +* The NFVI should maintain the number of VMs provided to the VNF in the face of + failures. I.e. the failed VM instances should be replaced by new VM instances +* It should be possible to specify whether the NFVI or the VNF/VNFM handles the + service recovery and continuity +* If the VNF/VNFM handles the service recovery it should be able to receive error + reports and/or detect failures in a timely manner. +* The VNF (i.e. between VNFCs) may have its own fault detection mechanism, which might + be triggered prior to receiving the error report from the underlying NFVI therefore + the NFVI/VIM should not attempt to preserve the state of a failing VM if not + configured to do so +* The VNF/VNFM should be able to initiate the repair/reboot of resources of the VNFI + (e.g. to recover from a fault persisting at the VNF level => failure impact zone + escalation) +* It should be possible to disallow the live migration of VMs and when it is allowed + it should be possible to specify the tolerated interruption time. +* It should be possible to restrict the simultaneous migration of VMs hosting a given + VNF +* It should be possible to define under which circumstances the NFV-MANO in + collaboration with the NFVI should provide error handling (e.g. VNF handles local + recoveries while NFV-MANO handles geo-redundancy) +* The NFVI/VIM should provide virtual resource such as storage according to the needs + of the VNF with the required guarantees (see virtual resource classification). +* The VNF shall be able to define the information to be stored on its associated + virtual storage +* It should be possible to define HA requirements for the storage, its availability, + accessibility, resilience options, i.e. the NFVI shall handle the failover for the + storage. +* The NFVI shall handle the network/connectivity failures transparent to the VNFs +* The VNFs with different requirements should be able to coexist in the NFV Framework +* The scale in/out is triggered by the VNF (VNFM) towards the VIM (to be executed in + the NFVI) +* It should be possible to define the metrics to monitor and the related thresholds + that trigger the scale in/out operation +* Scale in operation should not jeopardize availability (managed by the VNF/VNFM), + i.e. resources can only be removed one at a time with a period in between sufficient + for the VNF to restore any required redundancy. + -- cgit 1.2.3-korg