Diffstat (limited to 'Section6_VNF_HA.rst')
-rw-r--r-- | Section6_VNF_HA.rst | 329 |
1 files changed, 0 insertions, 329 deletions
diff --git a/Section6_VNF_HA.rst b/Section6_VNF_HA.rst
deleted file mode 100644
index afc84ac..0000000
--- a/Section6_VNF_HA.rst
+++ /dev/null
@@ -1,329 +0,0 @@
-=======================
-6 VNF High Availability
-=======================
-
-
-************************
-6.1 Service Availability
-************************
-
-In the context of NFV, Service Availability refers to the End-to-End (E2E) Service
-Availability which includes all the elements in the end-to-end service (VNFs and
-infrastructure components) with the exception of the customer terminal such as
-handsets, computers, modems, etc. The service availability requirements for NFV
-should be the same as those for legacy systems (for the same service).
-
-Service Availability =
-total service available time /
-(total service available time + total service recovery time)
-
-The service recovery time depends, among other things, on the number of redundant
-resources provisioned and/or instantiated that can be used for restoring the service.
-
-In the E2E relation a Network Service is available only if all the necessary
-Network Functions are available and interconnected appropriately to collaborate
-according to the NF chain.
-
-General Service Availability Requirements
-=========================================
-
-* We need to be able to define the E2E (V)NF chain based on which the E2E availability
-  requirements can be decomposed into requirements applicable to individual VNFs and
-  their interconnections
-* The interconnection of the VNFs should be logical and be maintained by the NFVI with
-  guaranteed characteristics, e.g. in case of failure the connection should be
-  restored within the acceptable tolerance time
-* These characteristics should be maintained in VM migration, failover and switchover,
-  scale in/out, etc. scenarios
-* It should be possible to prioritize the different network services and their VNFs.
-  These priorities should be used when pre-emption policies are applied due to
-  resource shortage for example.
-* The VIM should support policies to prioritize a certain VNF.
-* The VIM should be able to provide classified virtual resources to VNFs in different SALs
-
-6.1.1 Service Availability Classification Levels
-================================================
-
-The three Service Availability Levels (SAL) defined in [ETSI-NFV-REL_]
-are classified in Table 1. They are based on the relevant ITU-T recommendations
-and reflect the service types and the customer agreements a network operator should
-consider.
-
-.. [ETSI-NFV-REL] `ETSI GS NFV-REL 001 V1.1.1 (2015-01) <http://www.etsi.org/deliver/etsi_gs/NFV-REL/001_099/001/01.01.01_60/gs_NFV-REL001v010101p.pdf>`_
-
-
-*Table 1: Service Availability classification levels*
-
-+-------------+-----------------+-----------------------+---------------------+
-|SAL Type     | Customer Type   | Service/Function      | Notes               |
-+=============+=================+=======================+=====================+
-|Level 1      | Network Operator|  * Intra-carrier      | Sub-levels within   |
-|             | Control Traffic |    engineering        | Level 1 may be      |
-|             |                 |    traffic            | created by the      |
-|             | Government/     |  * Emergency          | Network Operator    |
-|             | Regulatory      |    telecommunication  | depending on        |
-|             | Emergency       |    service (emergency | Customer demands.   |
-|             | Services        |    response, emergency| E.g.:               |
-|             |                 |    dispatch)          |                     |
-|             |                 |  * Critical Network   | * 1A - Control;     |
-|             |                 |    Infrastructure     | * 1B - Real-time;   |
-|             |                 |    Functions (e.g.    | * 1C - Data;        |
-|             |                 |    VoLTE functions,   |                     |
-|             |                 |    DNS Servers, etc.) | May require 1+1     |
-|             |                 |                       | Redundancy with     |
-|             |                 |                       | Instantaneous       |
-|             |                 |                       | Switchover          |
-+-------------+-----------------+-----------------------+---------------------+
-|Level 2      | Enterprise and/ |  * VPN                | Sub-levels within   |
-|             | or large scale  |  * Real-time traffic  | Level 2 may be      |
-|             | customers       |    (Voice and video)  | created by the      |
-|             | (e.g.           |  * Network            | Network Operator    |
-|             | Corporations,   |    Infrastructure     | depending on        |
-|             | University)     |    Functions          | Customer demands.   |
-|             |                 |    supporting Level   | E.g.:               |
-|             | Network         |    2 services (e.g.   |                     |
-|             | Operators       |    VPN servers,       | * 2A - VPN;         |
-|             | (Tier1/2/3)     |    Corporate Web/     | * 2B - Real-time;   |
-|             | service traffic |    Mail servers)      | * 2C - Data;        |
-|             |                 |                       |                     |
-|             |                 |                       | May require 1:1     |
-|             |                 |                       | Redundancy with     |
-|             |                 |                       | Fast (maybe         |
-|             |                 |                       | Instantaneous)      |
-|             |                 |                       | Switchover          |
-+-------------+-----------------+-----------------------+---------------------+
-|Level 3      | General Consumer|  * Data traffic       | While this is       |
-|             | Public and ISP  |    (including voice   | typically           |
-|             | Traffic         |    and video traffic  | considered to be    |
-|             |                 |    provided by OTT)   | "Best Effort"       |
-|             |                 |  * Network            | traffic, it is      |
-|             |                 |    Infrastructure     | expected that       |
-|             |                 |    Functions          | Network Operators   |
-|             |                 |    supporting Level   | will devote         |
-|             |                 |    3 services         | sufficient          |
-|             |                 |                       | resources to        |
-|             |                 |                       | assure              |
-|             |                 |                       | "satisfactory"      |
-|             |                 |                       | levels of           |
-|             |                 |                       | availability.       |
-|             |                 |                       | This level of       |
-|             |                 |                       | service may be      |
-|             |                 |                       | pre-empted by       |
-|             |                 |                       | those with          |
-|             |                 |                       | higher levels of    |
-|             |                 |                       | Service             |
-|             |                 |                       | Availability. May   |
-|             |                 |                       | require M+1         |
-|             |                 |                       | Redundancy with     |
-|             |                 |                       | Fast Switchover;    |
-|             |                 |                       | where M > 1 and     |
-|             |                 |                       | the value of M to   |
-|             |                 |                       | be determined by    |
-|             |                 |                       | further study       |
-+-------------+-----------------+-----------------------+---------------------+
-
-Requirements
-^^^^^^^^^^^^
-
-* It shall be possible to define different service availability levels
-* It shall be possible to classify the virtual resources for the different
-  availability class levels
-* The VIM shall provide a mechanism by which VNF-specific requirements
-  can be mapped to NFVI-specific capabilities.
-
-More specifically, the requirements and capabilities may or may not be made up of the
-same KPI-like strings, but the cloud administrator must be able to configure which
-HA-specific VNF requirements are satisfied by which HA-specific NFVI capabilities.
-
-
-
-6.1.2 Metrics for Service Availability
-======================================
-
-[ETSI-NFV-REL_] identifies four metrics relevant to service
-availability:
-
-* Failure recovery time,
-* Failure impact fraction,
-* Failure frequency, and
-* Call drop rate.
-
-6.1.2.1 Failure Recovery Time
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The failure recovery time is the time interval from the occurrence of an abnormal
-event (e.g. failure, manual interruption of service, etc.) until the recovery of the
-service, regardless of whether the abnormal event is scheduled or unscheduled. For the
-unscheduled case, the recovery time includes the failure detection time and the
-failure restoration time.
-More specifically, restoration also allows for a service recovery by the restart of
-the failed provider(s) while failover implies that the service is recovered by a
-redundant provider taking over the service. This provider may be a standby
-(i.e. synchronizing the service state with the active provider) or a spare
-(i.e. having no state information). Accordingly, failover also means switchover, that
-is, an orderly takeover of the service from the active provider by the standby/spare.
-
-Requirements
-^^^^^^^^^^^^
-
-* It should be irrelevant whether the abnormal event is due to a scheduled or
-  unscheduled operation or it is caused by a fault.
-* Failure detection mechanisms should be available in the NFVI and configurable so
-  that the target recovery times can be met
-* Abnormal events should be logged and communicated (i.e. notifications and alarms as
-  appropriate)
-
-The TL-9000 forum has specified a service interruption time of 15 seconds as an outage
-for all traditional telecom system services. [ETSI-NFV-REL_]
-recommends the setting of different thresholds for the different Service Availability
-Levels. An example setting is given in Table 2. Note that for all
-Service Availability levels Real-time Services require the fastest recovery time.
-Data services can tolerate longer recovery times. These recovery times are applicable
-to the user plane. A failure in the control plane does not have to impact the user plane.
-The main concern should be simultaneous failures in the control and user planes
-as the user plane cannot typically recover without the control plane. However, an HA
-mechanism in the VNF itself can further mitigate the risk. Note also that the impact on
-the user plane depends on the control plane service experiencing the failure;
-some of them are more critical than others.
-
-
-*Table 2: Example service recovery times for the service availability levels*
-
-+------------+-----------------+------------------------------------------+
-|SAL         | Service         | Notes                                    |
-|            | Recovery        |                                          |
-|            | Time            |                                          |
-|            | Threshold       |                                          |
-+============+=================+==========================================+
-|1           | 5 - 6 seconds   | Recommendation: Redundant resources to be|
-|            |                 | made available on-site to ensure fast    |
-|            |                 | recovery.                                |
-+------------+-----------------+------------------------------------------+
-|2           | 10 - 15 seconds | Recommendation: Redundant resources to be|
-|            |                 | available as a mix of on-site and off-   |
-|            |                 | site as appropriate.                     |
-|            |                 |                                          |
-|            |                 | * On-site resources to be utilized for   |
-|            |                 |   recovery of real-time services.        |
-|            |                 | * Off-site resources to be utilized for  |
-|            |                 |   recovery of data services.             |
-+------------+-----------------+------------------------------------------+
-|3           | 20 - 25 seconds | Recommendation: Redundant resources to be|
-|            |                 | mostly available off-site. Real-time     |
-|            |                 | services should be recovered before data |
-|            |                 | services                                 |
-+------------+-----------------+------------------------------------------+
-
-
-6.1.2.2 Failure Impact Fraction
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The failure impact fraction is the maximum percentage of the capacity or user
-population affected by a failure compared with the total capacity or the user
-population supported by a service. It is directly associated with the failure impact
-zone which is the set of resources/elements of the system to which the fault may
-propagate.
-
-Requirements
-^^^^^^^^^^^^
-
-* It should be possible to define the failure impact zone for all the elements of the
-  system
-* At the detection of a failure of an element, its failure impact zone must be
-  isolated before the associated recovery mechanism is triggered
-* If the isolation of the failure impact zone is unsuccessful the isolation should be
-  attempted at the next higher level as soon as possible to prevent fault propagation.
-* It should be possible to define different levels of failure impact zones with
-  associated isolation and alarm generation policies
-* It should be possible to limit the collocation of VMs to reduce the failure impact
-  zone as well as to provide sufficient resources
-
-6.1.2.3 Failure Frequency
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Failure frequency is the number of failures in a certain period of time.
-
-Requirements
-^^^^^^^^^^^^
-
-* There should be a probation period for each failure impact zone within which
-  failures are correlated.
-* The threshold and the probation period for the failure impact zones should be
-  configurable
-* It should be possible to define failure escalation policies for the different
-  failure impact zones
-
-
-6.1.2.4 Call Drop Rate
-^^^^^^^^^^^^^^^^^^^^^^
-
-Call drop rate reflects service continuity as well as system reliability and
-stability. The metric is internal to the VNF and therefore is not specified further for
-the NFV environment.
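As a rough illustration of how the metrics of this section could be evaluated, the following Python sketch computes the availability ratio defined in 6.1 and checks a measured recovery time against the example thresholds of Table 2. The function names, and the choice of the upper bound of each Table 2 range (6, 15 and 25 seconds) as the limit, are assumptions made for this example and are not part of the requirements.

```python
# Hypothetical limits: upper bounds of the example ranges in Table 2, in seconds.
RECOVERY_THRESHOLD_S = {1: 6.0, 2: 15.0, 3: 25.0}


def service_availability(available_s, recovery_s):
    """Return available time / (available time + recovery time), as in 6.1."""
    total = available_s + recovery_s
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return available_s / total


def meets_recovery_threshold(sal, recovery_s):
    """Check a measured recovery time against the assumed limit for a SAL."""
    return recovery_s <= RECOVERY_THRESHOLD_S[sal]
```

For instance, a 10 second recovery would satisfy the assumed limits for SAL 2 and SAL 3 but not for SAL 1.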
-
-Requirements
-^^^^^^^^^^^^
-
-* It shall be possible to specify for each service availability class the associated
-  availability metrics and their thresholds
-* It shall be possible to collect data for the defined metrics
-* It shall be possible to delegate the enforcement of some thresholds to the NFVI
-* Accordingly it shall be possible to request virtual resources with guaranteed
-  characteristics, such as guaranteed latency between VMs (i.e. VNFCs), between a VM
-  and storage, between VNFs
-
-
-**********************
-6.2 Service Continuity
-**********************
-
-The determining factor with respect to service continuity is the statefulness of the
-VNF. If the VNF is stateless, there is no state information which needs to be
-preserved to prevent the perception of service discontinuity in case of failure or
-other disruptive events.
-If the VNF is stateful, the NF has a service state which needs to be preserved
-throughout such disruptive events in order to shield the service consumer from these
-events and provide the perception of service continuity. A VNF may maintain this state
-internally, externally or in a combination of both, with or without the NFVI being
-aware of the purpose of the stored data.
-
-Requirements
-============
-
-* The NFVI should maintain the number of VMs provided to the VNF in the face of
-  failures. I.e. the failed VM instances should be replaced by new VM instances
-* It should be possible to specify whether the NFVI or the VNF/VNFM handles the
-  service recovery and continuity
-* If the VNF/VNFM handles the service recovery it should be able to receive error
-  reports and/or detect failures in a timely manner.
-* The VNF (i.e. between VNFCs) may have its own fault detection mechanism, which might
-  be triggered prior to receiving the error report from the underlying NFVI; therefore
-  the NFVI/VIM should not attempt to preserve the state of a failing VM if not
-  configured to do so
-* The VNF/VNFM should be able to initiate the repair/reboot of resources of the NFVI
-  (e.g. to recover from a fault persisting at the VNF level => failure impact zone
-  escalation)
-* It should be possible to disallow the live migration of VMs and when it is allowed
-  it should be possible to specify the tolerated interruption time.
-* It should be possible to restrict the simultaneous migration of VMs hosting a given
-  VNF
-* It should be possible to define under which circumstances the NFV-MANO in
-  collaboration with the NFVI should provide error handling (e.g. VNF handles local
-  recoveries while NFV-MANO handles geo-redundancy)
-* The NFVI/VIM should provide virtual resources such as storage according to the needs
-  of the VNF with the required guarantees (see virtual resource classification).
-* The VNF shall be able to define the information to be stored on its associated
-  virtual storage
-* It should be possible to define HA requirements for the storage, its availability,
-  accessibility, resilience options, i.e. the NFVI shall handle the failover for the
-  storage.
-* The NFVI shall handle the network/connectivity failures transparently to the VNFs
-* The VNFs with different requirements should be able to coexist in the NFV Framework
-* The scale in/out is triggered by the VNF (VNFM) towards the VIM (to be executed in
-  the NFVI)
-* It should be possible to define the metrics to monitor and the related thresholds
-  that trigger the scale in/out operation
-* Scale in operation should not jeopardize availability (managed by the VNF/VNFM),
-  i.e. resources can only be removed one at a time with a period in between sufficient
-  for the VNF to restore any required redundancy.
-
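The last requirement, removing resources one at a time with a settling period in between, can be sketched as a simple schedule. This is a minimal illustration only: `scale_in_schedule`, its parameters and the fixed `settle_seconds` pause are hypothetical names chosen for this example, and a real VNFM would more likely wait for the VNF to confirm that redundancy has been restored rather than rely on a fixed delay.

```python
def scale_in_schedule(instances, target_count, settle_seconds):
    """Plan a scale in that removes one instance at a time.

    Returns (start_offset_seconds, instance) pairs. Consecutive removals are
    separated by settle_seconds, an assumed period long enough for the VNF
    to restore any required redundancy.
    """
    if target_count < 1:
        raise ValueError("at least one instance must remain")
    # Instances beyond the target count are removed, in list order.
    to_remove = instances[target_count:]
    return [(index * settle_seconds, name) for index, name in enumerate(to_remove)]
```

Calling `scale_in_schedule(["vm-a", "vm-b", "vm-c", "vm-d"], 2, 30)` plans the removal of `vm-c` at offset 0 and `vm-d` 30 seconds later, so no two instances are ever removed at the same time.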