diff options
author | fuqiao <fuqiao@chinamobile.com> | 2016-01-28 10:05:33 +0800 |
---|---|---|
committer | fuqiao <fuqiao@chinamobile.com> | 2016-01-28 10:05:33 +0800 |
commit | c3d65029c1f07fd11c5ec754b194ac1357b8e0ce (patch) | |
tree | 1353e16ebffe84ae16e914c263ae7f6aa3919dc4 /UseCases | |
parent | 96ea75a56d8bc93a51cbd8a6ecffdcc5456e3d8b (diff) |
Remove all the old seperate sections. Will Reorganize and upload again
Will upload the seperate sections in the next patch.
JIRA:HA-1
Diffstat (limited to 'UseCases')
30 files changed, 0 insertions, 888 deletions
diff --git a/UseCases/UseCases.rst b/UseCases/UseCases.rst deleted file mode 100644 index 57dfbc0..0000000 --- a/UseCases/UseCases.rst +++ /dev/null @@ -1,731 +0,0 @@ -============ -HA Use Cases -============ - -************** -1 Introduction -************** - -This use case document outlines the model and failure modes for NFV systems. Its goal is along -with the requirements documents and gap analysis help set context for engagement with various -upstream projects. The OPNFV HA project team continuously evolving these documents, and in -particular this use case document starting with a set of basic use cases. - -***************** -2 Basic Use Cases -***************** - - -In this section we review some of the basic use cases related to service high availability, -that is, the availability of the service or function provided by a VNF. The goal is to -understand the different scenarios that need to be considered and the specific requirements -to provide service high availability. More complex use cases will be discussed in -other sections. - -With respect to service high availability we need to consider whether a VNF implementation is -statefull or stateless and if it includes or not an HA manager which handles redundancy. -For statefull VNFs we can also distinguish the cases when the state is maintained inside -of the VNF or it is stored in an external shared storage making the VNF itself virtually -stateless. - -Managing availability usually implies a fault detection mechanism, which triggers the -actions necessary for fault isolation followed by the recovery from the fault. -This recovery includes two parts: - -* the recovery of the service and -* the repair of the failed entity. - -Very often the recovery of the service and the repair actions are perceived to be the same, for -example, restarting a failed application repairs the application, which then provides the service again. -Such a restart may take significant time causing service outage, for which redundancy is the solution. -In cases when the service is protected by redundancy of the providing entities (e.g. application -processes), the service is "failed over" to the standby or a spare entity, which replaces the -failed entity while it is being repaired. E.g. when an application process providing the service fails, -the standby application process takes over providing the service, while the failed one is restarted. -Such a failover often allows for faster recovery of the service. - -We also need to distinguish between the failed and the faulty entities as a fault may or -may not manifest in the entity containing the fault. Faults may propagate, i.e. cause other entities -to fail or misbehave, i.e. an error, which in turn might be detected by a different failure or -error detector entity each of which has its own scope. Similarly, the managers acting on these -detected errors may have a limited scope. E.g. an HA manager contained in a VNF can only repair -entities within the VNF. It cannot repair a failed VM, in fact due to the layered architecture -in the VNF it cannot even know whether the VM failed, its hosting hypervisor, or the physical host. -But its error detection mechanism will detect the result of such failures - a failure in the VNF - -and the service can be recovered at the VNF level. -On the other hand, the failure should be detected in the NFVI and the VIM should repair the failed -entity (e.g. the VM). Accordingly a failure may be detected by different managers in different layers -of the system, each of which may react to the event. This may cause interference. -Thus, to resolve the problem in a consistent manner and completely recover from -a failure the managers may need to collaborate and coordinate their actions. - -Considering all these issues the following basic use cases can be identified (see table 1.). -These use cases assume that the failure is detected in the faulty entity (VNF component -or the VM). - - -*Table 1: VNF high availability use cases* - -+---------+-------------------+----------------+-------------------+----------+ -| | VNF Statefullness | VNF Redundancy | Failure detection | Use Case | -+=========+===================+================+===================+==========+ -| VNF | yes | yes | VNF level only | UC1 | -| | | +-------------------+----------+ -| | | | VNF & NFVI levels | UC2 | -| | +----------------+-------------------+----------+ -| | | no | VNF level only | UC3 | -| | | +-------------------+----------+ -| | | | VNF & NFVI levels | UC4 | -| +-------------------+----------------+-------------------+----------+ -| | no | yes | VNF level only | UC5 | -| | | +-------------------+----------+ -| | | | VNF & NFVI levels | UC6 | -| | +----------------+-------------------+----------+ -| | | no | VNF level only | UC7 | -| | | +-------------------+----------+ -| | | | VNF & NFVI levels | UC8 | -+---------+-------------------+----------------+-------------------+----------+ - -As discussed, there is no guarantee that a fault manifests within the faulty entity. For -example, a memory leak in one process may impact or even crash any other process running in -the same execution environment. Accordingly, the repair of a failing entity (i.e. the crashed process) -may not resolve the problem and soon the same or another process may fail within this execution -environment indicating that the fault has remained in the system. -Thus, there is a need for extrapolating the failure to a wider scope and perform the -recovery at that level to get rid of the problem (at least temporarily till a patch is available -for our leaking process). -This requires the correlation of repeated failures in a wider scope and the escalation of the -recovery action to this wider scope. In the layered architecture this means that the manager detecting the -failure may not be the one in charge of the scope at which it can be resolved, so the escalation needs to -be forwarded to the manager in charge of that scope, which brings us to an additional use case UC9. - -We need to consider for each of these use cases the events detected, their impact on other entities, -and the actions triggered to recover the service provided by the VNF, and to repair the -faulty entity. - -We are going to describe each of the listed use cases from this perspective to better -understand how the problem of service high availability can be tackled the best. - -Before getting into the details it is worth mentioning the example end-to-end service recovery -times provided in the ETSI NFV REL document [REL]_ (see table 2.). These values may change over time -including lowering these thresholds. - -*Table 2: Service availability levels (SAL)* - -+----+---------------+----------------------+------------------------------------+ -|SAL |Service |Customer Type | Recommendation | -| |Recovery | | | -| |Time | | | -| |Threshold | | | -+====+===============+======================+====================================+ -|1 |5 - 6 seconds |Network Operator |Redundant resources to be | -| | |Control Traffic |made available on-site to | -| | | |ensure fastrecovery. | -| | |Government/Regulatory | | -| | |Emergency Services | | -+----+---------------+----------------------+------------------------------------+ -|2 |10 - 15 seconds|Enterprise and/or |Redundant resources to be available | -| | |large scale customers |as a mix of on-site and off-site | -| | | |as appropriate: On-site resources to| -| | |Network Operators |be utilized for recovery of | -| | |service traffic |real-time service; Off-site | -| | | |resources to be utilized for | -| | | |recovery of data services | -+----+---------------+----------------------+------------------------------------+ -|3 |20 - 25 seconds|General Consumer |Redundant resources to be mostly | -| | |Public and ISP |available off-site. Real-time | -| | |Traffic |services should be recovered before | -| | | |data services | -+----+---------------+----------------------+------------------------------------+ - -Note that even though SAL 1 of [REL]_ allows for 5-6 seconds of service recovery, -for many services this is too long and such outage causes a service level reset or -the loss of significant amount of data. Also the end-to-end service or network service -may be served by multiple VNFs. Therefore for a single VNF the desired -service recovery time is sub-second. - -Note that failing over the service to another provider entity implies the redirection of the traffic -flow the VNF is handling. This could be achieved in different ways ranging from floating IP addresses -to load balancers. The topic deserves its own investigation, therefore in these first set of -use cases we assume that it is part of the solution without going into the details, which -we will address as a complementary set of use cases. - -.. [REL] ETSI GS NFV-REL 001 V1.1.1 (2015-01) - - -2.1 Use Case 1: VNFC failure in a statefull VNF with redundancy -============================================================== - -Use case 1 represents a statefull VNF with redundancy managed by an HA manager, -which is part of the VNF (Fig 1). The VNF consists of VNFC1, VNFC2 and the HA Manager. -The latter managing the two VNFCs, e.g. the role they play in providing the service -named "Provided NF" (Fig 2). - -The failure happens in one of the VNFCs and it is detected and handled by the HA manager. -On practice the HA manager could be part of the VNFC implementations or it could -be a separate entity in the VNF. The point is that the communication of these -entities inside the VNF is not visible to the rest of the system. The observable -events need to cross the boundary represented by the VNF box. - - -.. figure:: images/Slide4.png - :alt: VNFC failure in a statefull VNF - :figclass: align-center - - Fig 1. VNFC failure in a statefull VNF with built-in HA manager - - -.. figure:: images/StatefullVNF-VNFCfailure.png - :alt: MSC of the VNFC failure in a statefull VNF - :figclass: align-center - - Fig 2. Sequence of events for use case 1 - - -As shown in Fig 2. initially VNFC2 is active, i.e. provides the Provided NF and VNFC1 -is a standby. It is not shown, but it is expected that VNFC1 has some means to get the update -of the state of the Provided NF from the active VNFC2, so that it is prepared to continue to -provide the service in case VNFC2 fails. -The sequence of events starts with the failure of VNFC2, which also interrupts the -Provided NF. This failure is detected somehow and/or reported to the HA Manager, which -in turn may report the failure to the VNFM and simultaneously it tries to isolate the -fault by cleaning up VNFC2. - -Once the cleanup succeeds (i.e. the OK is received) it fails over the active role to -VNFC1 by setting it active. This recovers the service, the Provided NF is indeed -provided again. Thus this point marks the end of the outage caused by the failure -that need to be considered from the perspective of service availability. - -The repair of the failed VNFC2, which might have started at the same time -when VNFC1 was assigned the active state, may take longer but without further impact -on the availability of the Provided NF service. -If the HA Manager reported the interruption of the Provided NF to the VNFM, it should -clear the error condition. - -The key points in this scenario are: - -* The failure of the VNFC2 is not detectable by any other part of the system except - the consumer of the Provided NF. The VNFM only - knows about the failure because of the error report, and only the information this - report provides. I.e. it may or may not include the information on what failed. -* The Provided NF is resumed as soon as VNFC1 is assigned active regardless how long - it takes to repair VNFC2. -* The HA manager could be part of the VNFM as well. This requires an interface to - detect the failures and to manage the VNFC life-cycle and the role assignments. - -2.2 Use Case 2: VM failure in a statefull VNF with redundacy -============================================================ - -Use case 2 also represents a statefull VNF with its redundancy managed by an HA manager, -which is part of the VNF. The VNFCs of the VNF are hosted on the VMs provided by -the NFVI (Fig 3). - -The VNF consists of VNFC1, VNFC2 and the HA Manager (Fig 4). The latter managing -the role the VNFCs play in providing the service - Provided NF. -The VMs provided by the NFVI are managed by the VIM. - - -In this use case it is one of the VMs hosting the VNF fails. The failure is detected -and handled at both the NFVI and the VNF levels simultaneously. The coordination occurs -between the VIM and the VNFM. - - -.. figure:: images/Slide6.png - :alt: VM failure in a statefull VNF - :figclass: align-center - - Fig 3. VM failure in a statefull VNF with built-in HA manager - - -.. figure:: images/StatefullVNF-VMfailure.png - :alt: MSC of the VM failure in a statefull VNF - :figclass: align-center - - Fig 4. Sequence of events for use case 2 - - -Again initially VNFC2 is active and provides the Provided NF, while VNFC1 is the standby. -It is not shown in Fig 4., but it is expected that VNFC1 has some means to learn the state -of the Provided NF from the active VNFC2, so that it is able to continue providing the -service if VNFC2 fails. VNFC1 is hosted on VM1, while VNFC2 is hosted on VM2 as indicated by -the arrows between these objects in Fig 4. - -The sequence of events starts with the failure of VM2, which results in VNFC2 failing and -interrupting the Provided NF. The HA Manager detects the failure of VNFC2 somehow -and tries to handle it the same way as in use case 1. However because the VM is gone the -clean up either not initiated at all or interrupted as soon as the failure of the VM is -identified. In either case the faulty VNFC2 is considered as isolated. - -To recover the service the HA Manager fails over the active role to VNFC1 by setting it active. -This recovers the Provided NF. Thus this point marks again the end of the outage caused -by the VM failure that need to be considered from the perspective of service availability. -If the HA Manager reported the interruption of the Provided NF to the VNFM, it should -clear the error condition. - -On the other hand the failure of the VM is also detected in the NFVI and reported to the VIM. -The VIM reports the VM failure to the VNFM, which passes on this information -to the HA Manager of the VNF. This confirms for the VNF HA Manager the VM failure and that -it needs to wait with the repair of the failed VNFC2 until the VM is provided again. The -VNFM also confirms towards the VIM that it is safe to restart the VM. - -The repair of the failed VM may take some time, but since the service has been failed over -to VNFC1 in the VNF, there is no further impact on the availability of Provided NF. - -When eventually VM2 is restarted the VIM reports this to the VNFM and -the VNFC2 can be restored. - -The key points in this scenario are: - -* The failure of the VM2 is detectable at both levels VNF and NFVI, therefore both the HA - manager and the VIM reacts to it. It is essential that these reactions do not interfere, - e.g. if the VIM tries to protect the VM state at NFVI level that would conflict with the - service failover action at the VNF level. -* While the failure detection happens at both NFVI and VNF levels, the time frame within - which the VIM and the HA manager detect and react may be very different. For service - availability the VNF level detection, i.e. by the HA manager is the critical one and expected - to be faster. -* The Provided NF is resumed as soon as VNFC1 is assigned active regardless how long - it takes to repair VM2 and VNFC2. -* The HA manager could be part of the VNFM as well. - This requires an interface to detect failures in/of the VNFC and to manage its life-cycle and - role assignments. -* The VNFM may not know for sure that the VM failed until the VIM reports it, i.e. whether - the VM failure is due to host, hypervisor, host OS failure. Thus the VIM should report/alarm - and log VM, hypervisor, and physical host failures. The use cases for these failures - are similar with respect to the Provided NF. -* The VM repair also should start with the fault isolation as appropriate for the actual - failed entity, e.g. if the VM failed due to a host failure a host may be fenced first. -* The negotiation between the VNFM and the VIM may be replaced by configured repair actions. - E.g. on error restart VM in initial state, restart VM from last snapshot, or fail VM over to standby. - - -2.3 Use Case 3: VNFC failure in a statefull VNF with no redundancy -================================================================= - -Use case 3 also represents a statefull VNF, but it stores its state externally on a -virtual disk provided by the NFVI. It has a single VNFC and it is managed by the VNFM -(Fig 5). - -In this use case the VNFC fails and the failure is detected and handled by the VNFM. - - -.. figure:: images/Slide10.png - :alt: VNFC failure in a statefull VNF No-Red - :figclass: align-center - - Fig 5. VNFC failure in a statefull VNF with no redundancy - - -.. figure:: images/StatefullVNF-VNFCfailureNoRed.png - :alt: MSC of the VNFC failure in a statefull VNF No-Red - :figclass: align-center - - Fig 6. Sequence of events for use case 3 - - -The VNFC periodically checkpoints the state of the Provided NF to the external storage, -so that in case of failure the Provided NF can be resumed (Fig 6). - -When the VNFC fails the Provided NF is interrupted. The failure is detected by the VNFM -somehow, which to isolate the fault first cleans up the VNFC, then if the cleanup is -successful it restarts the VNFC. When the VNFC starts up, first it reads the last checkpoint -for the Provided NF, then resumes providing it. The service outage lasts from the VNFC failure -till this moment. - -The key points in this scenario are: - -* The service state is saved in an external storage which should be highly available too to - protect the service. -* The NFVI should provide this guarantee and also that storage and access network failures - are handled seemlessly from the VNF's perspective. -* The VNFM has means to detect VNFC failures and manage its life-cycle appropriately. This is - not required if the VNF also provides its availability management. -* The Provided NF can be resumed only after the VNFC is restarted and it has restored the - service state from the last checkpoint created before the failure. -* Having a spare VNFC can speed up the service recovery. This requires that the VNFM coordinates - the role each VNFC takes with respect to the Provided NF. I.e. the VNFCs do not act on the - stored state simultaneously potentially interfering and corrupting it. - - - -2.4 Use Case 4: VM failure in a statefull VNF with no redundancy -=============================================================== - -Use case 4 also represents a statefull VNF without redundancy, which stores its state externally on a -virtual disk provided by the NFVI. It has a single VNFC managed by the VNFM -(Fig 7) as in use case 3. - -In this use case the VM hosting the VNFC fails and the failure is detected and handled by -the VNFM and the VIM simultaneously. - - -.. figure:: images/Slide11.png - :alt: VM failure in a statefull VNF No-Red - :figclass: align-center - - Fig 7. VM failure in a statefull VNF with no redundancy - -.. figure:: images/StatefullVNF-VMfailureNoRed.png - :alt: MSC of the VM failure in a statefull VNF No-Red - :figclass: align-center - - Fig 8. Sequence of events for use case 4 - -Again, the VNFC regularly checkpoints the state of the Provided NF to the external storage, -so that it can be resumed in case of a failure (Fig 8). - -When the VM hosting the VNFC fails the Provided NF is interrupted. - -On the one hand side, the failure is detected by the VNFM somehow, which to isolate the fault tries -to clean the VNFC up which cannot be done because of the VM failure. When the absence of the VM has been -determined the VNFM has to wait with restarting the VNFC until the hosting VM is restored. The VNFM -may report the problem to the VIM, requesting a repair. - -On the other hand the failure is detected in the NFVI and reported to the VIM, which reports it -to the VNFM, if the VNFM hasn't reported it yet. -If the VNFM has requested the VM repair or if it acknowledges the repair, the VIM restarts the VM. -Once the VM is up the VIM reports it to the VNFM, which in turn can restart the VNFC. - -When the VNFC restarts first it reads the last checkpoint for the Provided NF, -to be able to resume it. -The service outage last until this is recovery completed. - -The key points in this scenario are: - - -* The service state is saved in external storage which should be highly available to - protect the service. -* The NFVI should provide such a guarantee and also that storage and access network failures - are handled seemlessly from the perspective of the VNF. -* The Provided NF can be resumed only after the VM and the VNFC are restarted and the VNFC - has restored the service state from the last checkpoint created before the failure. -* The VNFM has means to detect VNFC failures and manage its life-cycle appropriately. Alternatively - the VNF may also provide its availability management. -* The VNFM may not know for sure that the VM failed until the VIM reports this. It also cannot - distinguish host, hypervisor and host OS failures. Thus the VIM should report/alarm and log - VM, hypervisor, and physical host failures. The use cases for these failures are - similar with respect to the Provided NF. -* The VM repair also should start with the fault isolation as appropriate for the actual - failed entity, e.g. if the VM failed due to a host failure a host may be fenced first. -* The negotiation between the VNFM and the VIM may be replaced by configured repair actions. -* VM level redundancy, i.e. running a standby or spare VM in the NFVI would allow faster service - recovery for this use case, but by itself it may not protect against VNFC level failures. I.e. - VNFC level error detection is still required. - - - -2.5 Use Case 5: VNFC failure in a stateless VNF with redundancy -=============================================================== - -Use case 5 represents a stateless VNF with redundancy, i.e. it is composed of VNFC1 and VNFC2. -They are managed by an HA manager within the VNF. The HA manager assigns the active role to provide -the Provided NF to one of the VNFCs while the other remains a spare meaning that it has no state -information for the Provided NF (Fig 9) therefore it could replace any other VNFC capable of -providing the Provided NF service. - -In this use case the VNFC fails and the failure is detected and handled by the HA manager. - - -.. figure:: images/Slide13.png - :alt: VNFC failure in a stateless VNF with redundancy - :figclass: align-center - - Fig 9. VNFC failure in a stateless VNF with redundancy - - -.. figure:: images/StatelessVNF-VNFCfailure.png - :alt: MSC of the VNFC failure in a stateless VNF with redundancy - :figclass: align-center - - Fig 10. Sequence of events for use case 5 - - -Initially VNFC2 provides the Provided NF while VNFC1 is idle or might not even been instantiated -yet (Fig 10). - -When VNFC2 fails the Provided NF is interrupted. This failure is detected by the HA manager, -which as a first reaction cleans up VNFC2 (fault isolation), then it assigns the active role to -VNFC1. It may report an error to the VNFM as well. - -Since there is no state information to recover, VNFC1 can accept the active role right away -and resume providing the Provided NF service. Thus the service outage is over. If the HA manager -reported an error to the VNFM it should clear it at this point. - -The key points in this scenario are: - -* The spare VNFC may be instantiated only once the failure of active VNFC is detected. -* As a result the HA manager's role might be limited to life-cycle management, i.e. no role - assignment is needed if the VNFCs provide the service as soon as they are started up. -* Accordingly the HA management could be part of a generic VNFM provided it is capable of detecting - the VNFC failures. Besides the service users, the VNFC failure may not be detectable at any other - part of the system. -* Also there could be multiple active VNFCs sharing the load of Provided NF and the spare/standby - may protect all of them. -* Reporting the service failure to the VNFM is optional as the HA manager is in charge of recovering - the service and it is aware of the redundancy needed to do so. - - -2.6 Use Case 6: VM failure in a stateless VNF with redundancy -============================================================ - - -Similarly to use case 5, use case 6 represents a stateless VNF composed of VNFC1 and VNFC2, -which are managed by an HA manager within the VNF. The HA manager assigns the active role to -provide the Provided NF to one of the VNFCs while the other remains a spare meaning that it has -no state information for the Provided NF (Fig 11) and it could replace any other VNFC capable -of providing the Provided NF service. - -As opposed to use case 5 in this use case the VM hosting one of the VNFCs fails. This failure is -detected and handled by the HA manager as well as the VIM. - - -.. figure:: images/Slide14.png - :alt: VM failure in a stateless VNF with redundancy - :figclass: align-center - - Fig 11. VM failure in a stateless VNF with redundancy - - -.. figure:: images/StatelessVNF-VMfailure.png - :alt: MSC of the VM failure in a stateless VNF with redundancy - :figclass: align-center - - Fig 12. Sequence of events for use case 6 - - -Initially VNFC2 provides the Provided NF while VNFC1 is idle or might not have been instantiated -yet (Fig 12) as in use case 5. - -When VM2 fails VNFC2 fails with it and the Provided NF is interrupted. The failure is detected by -the HA manager and by the VIM simultaneously and independently. - -The HA manager's first reaction is trying to clean up VNFC2 to isolate the fault. This is considered to -be successful as soon as the disappearance of the VM is confirmed. -After this the HA manager assigns the active role to VNFC1. It may report the error to the VNFM as well -requesting a VM repair. - -Since there is no state information to recover, VNFC1 can accept the assignment right away -and resume the Provided NF service. Thus the service outage is over. If the HA manager reported -an error to the VNFM for the service it should clear it at this point. - -Simultaneously the VM failure is detected in the NFVI and reported to the VIM, which reports it -to the VNFM, if the VNFM hasn't requested a repair yet. If the VNFM requested the VM repair or if -it acknowledges the repair, the VIM restarts the VM. - -Once the VM is up the VIM reports it to the VNFM, which in turn may restart the VNFC if needed. - - -The key points in this scenario are: - -* The spare VNFC may be instantiated only after the detection of the failure of the active VNFC. -* As a result the HA manager's role might be limited to life-cycle management, i.e. no role - assignment is needed if the VNFC provides the service as soon as it is started up. -* Accordingly the HA management could be part of a generic VNFM provided if it is capable of detecting - failures in/of the VNFC and managing its life-cycle. -* Also there could be multiple active VNFCs sharing the load of Provided NF and the spare/standby - may protect all of them. -* The VNFM may not know for sure that the VM failed until the VIM reports this. It also cannot - distinguish host, hypervisor and host OS failures. Thus the VIM should report/alarm and log - VM, hypervisor, and physical host failures. The use cases for these failures are - similar with respect to each Provided NF. -* The VM repair also should start with the fault isolation as appropriate for the actual - failed entity, e.g. if the VM failed due to a host failure a host needs to be fenced first. -* The negotiation between the VNFM and the VIM may be replaced by configured repair actions. -* Reporting the service failure to the VNFM is optional as the HA manager is in charge recovering - the service and it is aware of the redundancy needed to do so. - - - -2.7 Use Case 7: VNFC failure in a stateless VNF with no redundancy -================================================================== - -Use case 7 represents a stateless VNF composed of a single VNFC, i.e. with no redundancy. -The VNF and in particular its VNFC is managed by the VNFM through managing its life-cycle (Fig 13). - -In this use case the VNFC fails. This failure is detected and handled by the VNFM. This use case -requires that the VNFM can detect the failures in the VNF or they are reported to the VNFM. - -The failure is only detectable at the VNFM level and it is handled by the VNFM restarting the VNFC. - - -.. figure:: images/Slide16.png - :alt: VNFC failure in a stateless VNF with no redundancy - :figclass: align-center - - Fig 13. VNFC failure in a stateless VNF with no redundancy - - -.. figure:: images/StatelessVNF-VNFCfailureNoRed.png - :alt: MSC of the VNFC failure in a stateless VNF with no redundancy - :figclass: align-center - - Fig 14. Sequence of events for use case 7 - -The VNFC is providing the Provided NF when it fails (Fig 14). This failure is detected or reported to -the VNFM, which has to clean up the VNFC to isolate the fault. After cleanup success it can proceed -with restarting the VNFC, which as soon as it is up it starts to provide the Provided NF -as there is no state to recover. - -Thus the service outage is over, but it has included the entire time needed to restart the VNFC. -Considering that the VNF is stateless this may not be significant still. - - -The key points in this scenario are: - -* The VNFM has to have the means to detect VNFC failures and manage its life-cycle appropriately. - This is not required if the VNF comes with its availability management, but this is very unlikely - for such stateless VNFs. -* The Provided NF can be resumed as soon as the VNFC is restarted, i.e. the restart time determines - the outage. -* In case multiple VNFCs are used they should not interfere with one another, they should - operate independently. - - -2.8 Use Case 8: VM failure in a stateless VNF with no redundancy -================================================================ - -Use case 8 represents the same stateless VNF composed of a single VNFC as use case 7, i.e. with -no redundancy. The VNF and in particular its VNFC is managed by the VNFM through managing its -life-cycle (Fig 15). - -In this use case the VM hosting the VNFC fails. This failure is detected and handled by the VNFM -as well as by the VIM. - - -.. figure:: images/Slide17.png - :alt: VM failure in a stateless VNF with no redundancy - :figclass: align-center - - Fig 15. VM failure in a stateless VNF with no redundancy - - -.. figure:: images/StatelessVNF-VMfailureNoRed.png - :alt: MSC of the VM failure in a stateless VNF with no redundancy - :figclass: align-center - - Fig 16. Sequence of events for use case 8 - -The VNFC is providing the Provided NF when the VM hosting the VNFC fails (Fig 16). - -This failure may be detected or reported to the VNFM as a failure of the VNFC. The VNFM may -not be aware at this point that it is a VM failure. Accordingly its first reaction as in use case 7 -is to clean up the VNFC to isolate the fault. Since the VM is gone, this cannot succeed and the VNFM -becomes aware of the VM failure through this or it is reported by the VIM. In either case it has to wait -with the repair of the VMFC until the VM becomes available again. - -Meanwhile the VIM also detects the VM failure and reports it to the VNFM unless the VNFM has already -requested the VM repair. After the VNFM confirming the VM repair the VIM restarts the VM and reports -the successful repair to the VNFM, which in turn can start the VNFC hosted on it. - - -Thus the recovery of the Provided NF includes the restart time of the VM and of the VNFC. - -The key points in this scenario are: - -* The VNFM has to have the means to detect VNFC failures and manage its life-cycle appropriately. - This is not required if the VNF comes with its availability management, but this is very unlikely - for such stateless VNFs. -* The Provided NF can be resumed only after the VNFC is restarted on the repaired VM, i.e. the - restart time of the VM and the VNFC determines the outage. -* In case multiple VNFCs are used they should not interfere with one another, they should - operate independently. -* The VNFM may not know for sure that the VM failed until the VIM reports this. It also cannot - distinguish host, hypervisor and host OS failures. Thus the VIM should report/alarm and log - VM, hypervisor, and physical host failures. The use cases for these failures are - similar with respect to each Provided NF. -* The VM repair also should start with the fault isolation as appropriate for the actual - failed entity, e.g. if the VM failed due to a host failure the host needs to be fenced first. -* The repair negotiation between the VNFM and the VIM may be replaced by configured repair actions. -* VM level redundancy, i.e. running a standby or spare VM in the NFVI would allow faster service - recovery for this use case, but by itself it may not protect against VNFC level failures. I.e. - VNFC level error detection is still required. - -2.9 Use Case 9: Repeated VNFC failure in a stateless VNF with no redundancy -=========================================================================== - -Finally use case 9 represents again a stateless VNF composed of a single VNFC as in use case 7, i.e. -with no redundancy. The VNF and in particular its VNFC is managed by the VNFM through managing its -life-cycle. - -In this use case the VNFC fails repeatedly. This failure is detected and handled by the VNFM, -but results in no resolution of the fault (Fig 17) because the VNFC is manifesting a fault, -which is not in its scope. I.e. the fault is propagating to the VNFC from a faulty VM or host, -for example. Thus the VNFM cannot resolve the problem by itself. - - -.. figure:: images/Slide19.png - :alt: Repeated VNFC failure in a stateless VNF with no redundancy - :figclass: align-center - - Fig 17. VM failure in a stateless VNF with no redundancy - - -To handle this case the failure handling needs to be escalated to the a bigger fault zone -(or fault domain), i.e. a scope within which the faults may propagate and manifest. In case of the -VNF the bigger fault zone is the VM and the facilities hosting it, all managed by the VIM. - -Thus the VNFM should request the repair from the VIM (Fig 18). - -Since the VNFM is only aware of the VM, it needs to report an error on the VM and it is the -VIM's responsibility to sort out what might be the scope of the actual fault depending on other -failures and error reports in its scope. - - -.. figure:: images/Slide20.png - :alt: Escalation of repeated VNFC failure in a stateless VNF with no redundancy - :figclass: align-center - - Fig 18. VM failure in a stateless VNF with no redundancy - - -.. figure:: images/StatelessVNF-VNFCfailureNoRed-Escalation.png - :alt: MSC of the VM failure in a stateless VNF with no redundancy - :figclass: align-center - - Fig 19. Sequence of events for use case 9 - - -This use case starts similarly to use case 7, i.e. the VNFC is providing the Provided NF when it fails -(Fig 17). -This failure is detected or reported to the VNFM, which cleans up the VNFC to isolate the fault. -After successful cleanup the VNFM proceeds with restarting the VNFC, which as soon as it is up -starts to provide the Provided NF again as in use case 7. - -However the VNFC failure occurs N times repeatedly within some Probation time for which the VNFM starts -the timer when it detects the first failure of the VNFC. When the VNFC fails once more still within the -probation time the Escalation counter maximum is exceeded and the VNFM reports an error to the VIM on -the VM hosting the VNFC as obviously cleaning up and restarting the VNFC did not solve the problem. - -When the VIM receives the error report for the VM it has to isolate the fault by cleaning up at least -the VM. After successful cleanup it can restart the VM and once it is up report the VM repair to the VNFM. -At this point the VNFM can restart the VNFC, which in turn resumes the Provided VM. - -In this scenario the VIM needs to evaluate what may be the scope of the fault to determine what entity -needs a repair. For example, if it has detected VM failures on that same host, or other VNFMs -reported errors on VMs hosted on the same host, it should consider that the entire host needs a repair. - - -The key points in this scenario are: - -* The VNFM has to have the means to detect VNFC failures and manage its life-cycle appropriately. - This is not required if the VNF comes with its availability management, but this is very unlikely - for such stateless VNFs. -* The VNFM needs to correlate VNFC failures over time to be able to detect failure of a bigger fault zone. - One way to do so is through counting the failures within a probation time. -* The VIM cannot detect all failures caused by faults in the entities under its control. It should be - able to receive error reports and correlate these error reports based on the dependencies - of the different entities. -* The VNFM does not know the source of the failure, i.e. the faulty entity. -* The VM repair should start with the fault isolation as appropriate for the actual - failed entity, e.g. if the VM failed due to a host failure the host needs to be fenced first. - -******************** -3 Concluding remarks -******************** - -This use case document outlined the model and some failure modes for NFV systems. These are an -initial list. The OPNFV HA project team is continuing to grow the list of use cases and will -issue additional documents going forward. The basic use cases and service availability considerations -help define the key considerations for each use case taking into account the impact on the end service. -The use case document along with the requirements documents and gap analysis help set context for -engagement with various upstream projects. diff --git a/UseCases/UseCases_for_Network_Nodes.rst b/UseCases/UseCases_for_Network_Nodes.rst deleted file mode 100644 index bc9266a..0000000 --- a/UseCases/UseCases_for_Network_Nodes.rst +++ /dev/null @@ -1,157 +0,0 @@ -4 High Availability Scenarios for Network Nodes -=============================================== - -4.1 Network nodes and HA deployment ------------------------------------ - -OpenStack network nodes contain: Neutron DHCP agent, Neutron L2 agent, Neutron L3 agent, Neutron LBaaS -agent and Neutron Metadata agent. The DHCP agent provides DHCP services for virtual networks. The -metadata agent provides configuration information such as credentials to instances. Note that the -L2 agent cannot be distributed and highly available. Instead, it must be installed on each data -forwarding node to control the virtual network drivers such as Open vSwitch or Linux Bridge. One L2 -agent runs per node and controls its virtual interfaces. - -A typical HA deployment of network nodes can be achieved in Fig 20. Here shows a two nodes cluster. -The number of the nodes is decided by the size of the cluster. It can be 2 or more. More details can be -achieved from each agent's part. - - -.. figure:: images_network_nodes/Network_nodes_deployment.png - :alt: HA deployment of network nodes - :figclass: align-center - - Fig 20. A typical HA deployment of network nodes - - -4.2 DHCP agent --------------- - -The DHCP agent can be natively highly available. Neutron has a scheduler which lets you run multiple -agents across nodes. You can configure the dhcp_agents_per_network parameter in the neutron.conf file -and set it to X (X >=2 for HA, default is 1). - -If the X is set to 2, as depicted in Fig 21 three tenant networks (there can be multiple tenant networks) -are used as an example, six DHCP agents are deployed in two nodes for three networks, they are -all active. Two dhcp1s serve one network, dhcp2s and dhcp3s serve other two different networks. In a -network, all DHCP traffic is broadcast, DHCP servers race to offer IP. All the servers will update the -lease tables. In Fig 22, when the agent(s) in Node1 doesn't work which can be caused by software -failure or hardware failure, the dhcp agent(s) on Node2 will continue to offer IP for the network. - - -.. figure:: images_network_nodes/DHCP_deployment.png - :alt: HA deployment of DHCP agents - :figclass: align-center - - Fig 21. Natively HA deployment of DHCP agents - - -.. figure:: images_network_nodes/DHCP_failure.png - :alt: Failure of DHCP agents - :figclass: align-center - - Fig 22. Failure of DHCP agents - - -4.3 L3 agent ------------- - -The L3 agent is also natively highly available. To achieve HA, it can be configured in the neutron.conf -file. - -.. code-block:: bash - - l3_ha = True # All routers are highly available by default - - allow_automatic_l3agent_failover = True # Set automatic L3 agent failover for routers - - max_l3_agents_per_router = 2 # Maximum number of network nodes to use for the HA router - - min_l3_agents_per_router = 2 # Minimum number of network nodes to use for the HA router. A new router - can be created only if this number of network nodes are available. - -According to the neutron.conf file, the L3 agent scheduler supports Virtual Router Redundancy -Protocol (VRRP) to distribute virtual routers across multiple nodes (e.g. 2). The scheduler will choose -a number between the maximum and the minimum number according scheduling algorithm. VRRP is implemented -by Keepalived. - -As depicted in Fig 23, both L3 agents in Node1 and Node2 host vRouter 1 and vRouter 2. In Node 1, -vRouter 1 is active and vRouter 2 is standby (hot standby). In Node2, vRouter 1 is standby and -vRouter 2 is active. For the purpose of reducing the load, two actives are deployed in two Nodes -alternatively. In Fig 24, Keepalived will be used to manage the VIP interfaces. One instance of -keepalived per virtual router, then one per namespace. 169.254.192.0/18 is a dedicated HA network -which is created in order to isolate the administrative traffic from the tenant traffic, each vRouter -will be connected to this dedicated network via an HA port. More details can be achieved from the -Reference at the bottom. - - -.. figure:: images_network_nodes/L3_deployment.png - :alt: HA deployment of L3 agents - :figclass: align-center - - Fig 23. Natively HA deployment of L3 agents - - -.. figure:: images_network_nodes/L3_ha_principle.png - :alt: HA principle of L3 agents - :figclass: align-center - - Fig 24. Natively HA principle of L3 agents - - -In Fig 25, when vRouter 1 in Node1 is down which can be caused by software failure or hardware failure, -the Keepalived will detect the failure and the standby will take over to be active. In order to keep the -TCP connection, Conntrackd is used to maintain the TCP sessions going through the router. One instance -of conntrackd per virtual router, then one per namespace. After then, a rescheduling procedure will be -triggered to respawn the failed virtual router to another l3 agent as standby. All the workflows is -depicted in Fig 26. - - -.. figure:: images_network_nodes/L3_failure.png - :alt: Failure of L3 agents - :figclass: align-center - - Fig 25. Failure of L3 agents - - -.. figure:: images_network_nodes/L3_ha_workflow.png - :alt: HA workflow of L3 agents - :figclass: align-center - - Fig 26. HA workflow of L3 agents - - -4.4 LBaaS agent and Metadata agent ----------------------------------- - -Currently, no native feature is provided to make the LBaaS agent highly available using the defaul -plug-in HAProxy. A common way to make HAProxy highly available is to use Pacemaker. - - -.. figure:: images_network_nodes/LBaaS_deployment.png - :alt: HA deployment of LBaaS agents - :figclass: align-center - - Fig 27. HA deployment of LBaaS agents using Pacemaker - - -As shown in Fig 27 HAProxy and pacemaker are deployed in both of the network nodes. The number of network -nodes can be 2 or more. It depends on your cluster. HAProxy in Node 1 is the master and the VIP is in -Node 1. Pacemaker monitors the liveness of HAProxy. - - -.. figure:: images_network_nodes/LBaaS_failure.png - :alt: Failure of LBaaS agents - :figclass: align-center - - Fig 28. Failure of LBaaS agents - - -As shown in Fig 28 when HAProxy in Node1 falls down which can be caused by software failure or hardware -failure, Pacemaker will fail over HAProxy and the VIP to Node 2. - -Note that the default plug-in HAProxy only supports TCP and HTTP. - -No native feature is available to make Metadata agent highly available. At this time, the Active/Passive -solution exists to run the neutron metadata agent in failover mode with Pacemaker. The deployment and -failover procedure can be the same as the case of LBaaS. - diff --git a/UseCases/images/Slide10.png b/UseCases/images/Slide10.png Binary files differdeleted file mode 100644 index b3545e8..0000000 --- a/UseCases/images/Slide10.png +++ /dev/null diff --git a/UseCases/images/Slide11.png b/UseCases/images/Slide11.png Binary files differdeleted file mode 100644 index 3aa5f67..0000000 --- a/UseCases/images/Slide11.png +++ /dev/null diff --git a/UseCases/images/Slide13.png b/UseCases/images/Slide13.png Binary files differdeleted file mode 100644 index 207c4a7..0000000 --- a/UseCases/images/Slide13.png +++ /dev/null diff --git a/UseCases/images/Slide14.png b/UseCases/images/Slide14.png Binary files differdeleted file mode 100644 index e6083c9..0000000 --- a/UseCases/images/Slide14.png +++ /dev/null diff --git a/UseCases/images/Slide16.png b/UseCases/images/Slide16.png Binary files differdeleted file mode 100644 index 484ffa2..0000000 --- a/UseCases/images/Slide16.png +++ /dev/null diff --git a/UseCases/images/Slide17.png b/UseCases/images/Slide17.png Binary files differdeleted file mode 100644 index 7240aaa..0000000 --- a/UseCases/images/Slide17.png +++ /dev/null diff --git a/UseCases/images/Slide19.png b/UseCases/images/Slide19.png Binary files differdeleted file mode 100644 index 7e3c10b..0000000 --- a/UseCases/images/Slide19.png +++ /dev/null diff --git a/UseCases/images/Slide20.png b/UseCases/images/Slide20.png Binary files differdeleted file mode 100644 index 2e9759b..0000000 --- a/UseCases/images/Slide20.png +++ /dev/null diff --git a/UseCases/images/Slide4.png b/UseCases/images/Slide4.png Binary files differdeleted file mode 100644 index a701f42..0000000 --- a/UseCases/images/Slide4.png +++ /dev/null diff --git a/UseCases/images/Slide6.png b/UseCases/images/Slide6.png Binary files differdeleted file mode 100644 index 04a904f..0000000 --- a/UseCases/images/Slide6.png +++ /dev/null diff --git a/UseCases/images/StatefullVNF-VMfailure.png b/UseCases/images/StatefullVNF-VMfailure.png Binary files differdeleted file mode 100644 index 2f62232..0000000 --- a/UseCases/images/StatefullVNF-VMfailure.png +++ /dev/null diff --git a/UseCases/images/StatefullVNF-VMfailureNoRed.png b/UseCases/images/StatefullVNF-VMfailureNoRed.png Binary files differdeleted file mode 100644 index 6f3058d..0000000 --- a/UseCases/images/StatefullVNF-VMfailureNoRed.png +++ /dev/null diff --git a/UseCases/images/StatefullVNF-VNFCfailure.png b/UseCases/images/StatefullVNF-VNFCfailure.png Binary files differdeleted file mode 100644 index 9021f2d..0000000 --- a/UseCases/images/StatefullVNF-VNFCfailure.png +++ /dev/null diff --git a/UseCases/images/StatefullVNF-VNFCfailureNoRed.png b/UseCases/images/StatefullVNF-VNFCfailureNoRed.png Binary files differdeleted file mode 100644 index 4fd7e2e..0000000 --- a/UseCases/images/StatefullVNF-VNFCfailureNoRed.png +++ /dev/null diff --git a/UseCases/images/StatelessVNF-VMfailure.png b/UseCases/images/StatelessVNF-VMfailure.png Binary files differdeleted file mode 100644 index 9b94183..0000000 --- a/UseCases/images/StatelessVNF-VMfailure.png +++ /dev/null diff --git a/UseCases/images/StatelessVNF-VMfailureNoRed.png b/UseCases/images/StatelessVNF-VMfailureNoRed.png Binary files differdeleted file mode 100644 index 2a14b67..0000000 --- a/UseCases/images/StatelessVNF-VMfailureNoRed.png +++ /dev/null diff --git a/UseCases/images/StatelessVNF-VNFCfailure.png b/UseCases/images/StatelessVNF-VNFCfailure.png Binary files differdeleted file mode 100644 index f2dcc3b..0000000 --- a/UseCases/images/StatelessVNF-VNFCfailure.png +++ /dev/null diff --git a/UseCases/images/StatelessVNF-VNFCfailureNoRed-Escalation.png b/UseCases/images/StatelessVNF-VNFCfailureNoRed-Escalation.png Binary files differdeleted file mode 100644 index 6719177..0000000 --- a/UseCases/images/StatelessVNF-VNFCfailureNoRed-Escalation.png +++ /dev/null diff --git a/UseCases/images/StatelessVNF-VNFCfailureNoRed.png b/UseCases/images/StatelessVNF-VNFCfailureNoRed.png Binary files differdeleted file mode 100644 index a0970fc..0000000 --- a/UseCases/images/StatelessVNF-VNFCfailureNoRed.png +++ /dev/null diff --git a/UseCases/images_network_nodes/DHCP_deployment.png b/UseCases/images_network_nodes/DHCP_deployment.png Binary files differdeleted file mode 100755 index 90bb740..0000000 --- a/UseCases/images_network_nodes/DHCP_deployment.png +++ /dev/null diff --git a/UseCases/images_network_nodes/DPCH_failure.png b/UseCases/images_network_nodes/DPCH_failure.png Binary files differdeleted file mode 100755 index 07a51f8..0000000 --- a/UseCases/images_network_nodes/DPCH_failure.png +++ /dev/null diff --git a/UseCases/images_network_nodes/L3_deployment.png b/UseCases/images_network_nodes/L3_deployment.png Binary files differdeleted file mode 100755 index ff573b6..0000000 --- a/UseCases/images_network_nodes/L3_deployment.png +++ /dev/null diff --git a/UseCases/images_network_nodes/L3_failure.png b/UseCases/images_network_nodes/L3_failure.png Binary files differdeleted file mode 100755 index 57485ad..0000000 --- a/UseCases/images_network_nodes/L3_failure.png +++ /dev/null diff --git a/UseCases/images_network_nodes/L3_ha_principle.png b/UseCases/images_network_nodes/L3_ha_principle.png Binary files differdeleted file mode 100755 index 59a3161..0000000 --- a/UseCases/images_network_nodes/L3_ha_principle.png +++ /dev/null diff --git a/UseCases/images_network_nodes/L3_ha_workflow.png b/UseCases/images_network_nodes/L3_ha_workflow.png Binary files differdeleted file mode 100755 index d923f4f..0000000 --- a/UseCases/images_network_nodes/L3_ha_workflow.png +++ /dev/null diff --git a/UseCases/images_network_nodes/LBaaS_deployment.png b/UseCases/images_network_nodes/LBaaS_deployment.png Binary files differdeleted file mode 100755 index d4e5929..0000000 --- a/UseCases/images_network_nodes/LBaaS_deployment.png +++ /dev/null diff --git a/UseCases/images_network_nodes/LBaaS_failure.png b/UseCases/images_network_nodes/LBaaS_failure.png Binary files differdeleted file mode 100755 index 5262fd0..0000000 --- a/UseCases/images_network_nodes/LBaaS_failure.png +++ /dev/null diff --git a/UseCases/images_network_nodes/Network_nodes_deployment.png b/UseCases/images_network_nodes/Network_nodes_deployment.png Binary files differdeleted file mode 100755 index bb0f3db..0000000 --- a/UseCases/images_network_nodes/Network_nodes_deployment.png +++ /dev/null |