From ba452c88a1f117b022f7d1207212a80a9447687a Mon Sep 17 00:00:00 2001 From: Kecheng_Guo <1552778@tongji.edu.cn> Date: Mon, 22 Oct 2018 20:26:16 +0800 Subject: Add Requirement Analyze for HA API in HA Analyze Document JIRA: HA-38 Add HA Scenario Analyze for HA API Design in the HA Analyze Document, in order to make links between the HA Analyze Document and other documents in HA project Change-Id: I7856452745e7c169cfd0e7dfbfba19c120804af9 Signed-off-by: Kecheng_Guo <1552778@tongji.edu.cn> --- R6_HA_Analysis/HA_Analysis.rst | 180 +++++++++++++++++++-- .../images/Heartbeating_and_Healthchecks.png | Bin 0 -> 74066 bytes R6_HA_Analysis/images/VM_HA_Analysis.png | Bin 0 -> 37675 bytes .../VM_Peer_State_Notification_and_Messaging.png | Bin 0 -> 40788 bytes 4 files changed, 163 insertions(+), 17 deletions(-) create mode 100644 R6_HA_Analysis/images/Heartbeating_and_Healthchecks.png create mode 100644 R6_HA_Analysis/images/VM_HA_Analysis.png create mode 100644 R6_HA_Analysis/images/VM_Peer_State_Notification_and_Messaging.png diff --git a/R6_HA_Analysis/HA_Analysis.rst b/R6_HA_Analysis/HA_Analysis.rst index 06c0487..9df3821 100644 --- a/R6_HA_Analysis/HA_Analysis.rst +++ b/R6_HA_Analysis/HA_Analysis.rst @@ -153,8 +153,154 @@ The next section will illustrate detailed analysis of HA requirements on these l 5.3 Virtual Infrastructure HA >>>>>>>>>>>>>>>>>> -.. TBD +The Virtual Infrastructure in Openstack contains the Guest VMs and the Host VMs. +This part describes a set of new optional capabilities where the OpenStack Cloud messages into the Guest +VMs in order to provide improved Availability of the Host VMs. + +Table 2 shows the potential faults of VMs and corresponding initial solution capabilities or methods. + +*Table 2. 
Potential Faults of VMs and the initial solution capabilities* + ++---------------------------+------------------------------------+--------------------------------------------+ +| Fault | Description | Solution capabilities | ++===========================+====================================+============================================+ +| VM faults | General internal VM faults | VM Heartbeating and Health Checking | ++---------------------------+------------------------------------+--------------------------------------------+ +| VM Server Group faults | such as split brain | VM Peer State Notification and Messaging | ++---------------------------+------------------------------------+--------------------------------------------+ + + +.. figure:: images/VM_HA_Analysis.png + :alt: VM HA + :figclass: align-center + + Fig 3. VM HA Analysis + +NOTE: A Server Group here is the OpenStack Nova Server Group concept where VMs +are grouped together for purposes of scheduling. For example, a specific Server Group +instance can specify whether the VMs within the group should be scheduled to +run on the same compute host or different compute hosts. A 'peer' VM in the +context of this section refers to a VM within the same Nova Server Group. + +The initial set of new capabilities includes enabling the +detection of and recovery from internal VM faults and providing +a simple out-of-band messaging service to prevent scenarios such +as split brain. + +A more detailed description is located in R5_HA_API/OPNFV_HA_Guest_APIs-Overview_HLD.rst in this project. + +The Host-to-Guest messaging APIs used by the services discussed +in this Virtual Infrastructure HA part use a JSON-formatted application messaging layer +on top of a virtio serial device between QEMU on the OpenStack Host +and the Guest VM. Use of the virtio serial device provides a +simple, direct communication channel between host and guest which is +independent of the Guest's L2/L3 networking. 
+ +The upper layer JSON messaging format is structured as a +hierarchy containing a Base JSON Message Layer and an +Application JSON Message Layer: + +- the Base Layer provides the ability to multiplex different groups of message types on top of a single virtio serial device +e.g. + + + heartbeating and healthchecks, + + server group messaging, + +and + +- the Application Layer provides the specific message types and fields of a particular group of message types. + + +VM Heartbeating and Health Checking +:::::::::::::::::::::::::::::::::::::: + +.. figure:: images/Heartbeating_and_Healthchecks.png + :alt: Heartbeating and Healthchecks + :figclass: align-center + + Fig 4. Heartbeating and Healthchecks + +VM Heartbeating and Health Checking provides a heartbeat service to enhance +the monitoring of the health of guest application(s) within a VM running +under the OpenStack Cloud. Loss of heartbeat or a failed health check status +will result in a fault event being reported to OPNFV's DOCTOR infrastructure +for alarm identification, impact analysis and reporting. This would then enable +VNF Managers (VNFMs) listening to OPNFV's DOCTOR External Alarm Reporting through +Telemetry's AODH to initiate any required fault recovery actions. + +Guest heartbeat works on a challenge-response model. The OpenStack Guest Heartbeat +Service on the compute node will challenge the registered Guest VM daemon with a +message at each interval. The registered Guest VM daemon must respond prior to the +next interval with a message indicating good health. If the OpenStack Host does +not receive a valid response, or if the response specifies that the VM is in ill +health, then a fault event for the Guest VM is reported to the OpenStack Guest +Heartbeat Service on the controller node, which will report the event to OPNFV's +DOCTOR (i.e. through the Doctor SouthBound (SB) APIs). 
+ +In summary, the Guest Heartbeating Messaging Specification is quite simple, +comprising the following PDUs: Init, Init-Ack, Challenge-Request, +Challenge-Response, Exit. The Challenge-Response returns a healthy / +not-healthy boolean. + +The registered Guest VM daemon's response to the challenge can be as simple +as just immediately responding with OK. This alone allows for detection of +a failed or hung QEMU/KVM instance, or a failure of the OS within the VM to +schedule the registered Guest VM's daemon or failure to route basic IO within +the Guest VM. + +However, the registered Guest VM daemon's response to the challenge can be more +complex, running anything from a quick simple sanity check of the health of +applications running in the Guest VM, to a more thorough audit of the +application state and data. In either case, returning the status of the +health check enables the OpenStack host to detect and report the event in order +to initiate recovery from application-level errors or failures within the Guest VM. + + +VM Peer State Notification and Messaging +:::::::::::::::::::::::::::::::::::::::::::: + +.. figure:: images/VM_Peer_State_Notification_and_Messaging.png + :alt: VM Peer State Notification and Messaging + :figclass: align-center + Fig 5. VM Peer State Notification and Messaging + +Server Group State Notification and Messaging is a service that provides +simple low-bandwidth datagram messaging and notifications for servers that +are part of the same server group. This messaging channel is available +regardless of whether IP networking is functional within the server, and +it requires no knowledge within the server about the other members of the group. + +This Server Group Messaging service provides three types of messaging: + +- Broadcast: this allows a server to send a datagram (size of up to 3050 bytes) + to all other servers within the server group. 
+- Notification: this provides servers with information about changes to the + (Nova) state of other servers within the server group. +- Status: this allows a server to query the current (Nova) state of all servers within + the server group (including itself). + +Server Group Messaging entities on the controller node and the compute nodes manage +the routing of VM-to-VM messages through the platform, leveraging Nova to determine +Server Group membership and compute node locations of VMs. The Server Group Messaging +entity on the controller also listens to Nova VM state change notifications and queries +VM state data from Nova, in order to provide the VM query and notification functionality +of this service. + +This service is not intended for high-bandwidth or low-latency operations. It is best-effort, +not reliable. Applications should do end-to-end acks and retries if they care about reliability. + +This service provides building block type capabilities for the Guest VMs that +contribute to higher availability of the VMs in the Guest VM Server Group. Notifications +of VM status changes potentially provide faster and more accurate detection +of failed peer VMs than traditional peer VM monitoring over Tenant Networks. The +Broadcast Messaging mechanism, in turn, provides an out-of-band channel to +monitor and control a peer VM under fault conditions, e.g. providing the ability to +avoid potential split brain scenarios between 1:1 VMs when faults in Tenant +Networking occur. + + + 5.4 VIM HA >>>>>>>>>>>>>>>>>> @@ -171,13 +317,13 @@ types: - **Subcomponents**: Components that implement VIM functions, which are called by Entry Point Components but not by users directly. -Table 2 shows the potential faults that may happen on VIM layer. Currently the main focus of +Table 3 shows the potential faults that may happen on VIM layer. Currently the main focus of VIM HA is the service crash of VIM components, which may occur on all types of VIM components. 
To prevent VIM services from being unavailable, Active/Active Redundancy, Active/Passive Redundancy and Message Queue are used for different types of VIM components, as is shown in -figure 3. +figure 6. -*Table 2. Potential Faults in VIM level* +*Table 3. Potential Faults in VIM level* +------------+------------------+-------------------------------------------------+----------------+ | Service | Fault | Description | Severity | @@ -189,7 +335,7 @@ figure 3. :alt: VIM HA Analysis :figclass: align-center - Fig 3. VIM HA Analysis + Fig 6. VIM HA Analysis Active/Active Redundancy @@ -202,13 +348,13 @@ load balancer such as HAProxy. When one of the redundant VIM component fails, the load balancer should be aware of the instance failure, and then isolate the failed instance from being called until it is recovered. -The requirement decomposition of Active/Active Redundancy is shown in Figure 4. +The requirement decomposition of Active/Active Redundancy is shown in Figure 7. .. figure:: images/Active_Active_Redundancy.png :alt: Active/Active Redundancy Requirement Decomposition :figclass: align-center - Fig 4. Active/Active Redundancy Requirement Decomposition + Fig 7. Active/Active Redundancy Requirement Decomposition The following requirements are elicited for VIM Active/Active Redundancy: @@ -226,7 +372,7 @@ recovered. Table 3 shows the current VIM components using Active/Active Redundancy and the corresponding HA test cases to verify them. -*Table 3. VIM Components using Active/Active Redundancy* +*Table 4. VIM Components using Active/Active Redundancy* +-------------------+-------------------------------------------------------+----------------------+ | Component | Description | Related HA Test Case | @@ -273,13 +419,13 @@ Pacemaker or Corosync) monitors these components, bringing the backup online as When the main instance of a VIM component is failed, the cluster manager should be aware of the failure and switch the backup instance online. 
And the failed instance should also be recovered to another backup instance. The requirement decomposition of Active/Passive Redundancy is shown -in Figure 5. +in Figure 8. .. figure:: images/Active_Passive_Redundancy.png :alt: Active/Passive Redundancy Requirement Decomposition :figclass: align-center - Fig 5. Active/Passive Redundancy Requirement Decomposition + Fig 8. Active/Passive Redundancy Requirement Decomposition The following requirements are elicited for VIM Active/Passive Redundancy: @@ -296,7 +442,7 @@ a backup instance. Table 4 shows the current VIM components using Active/Passive Redundancy and the corresponding HA test cases to verify them. -*Table 4. VIM Components using Active/Passive Redundancy* +*Table 5. VIM Components using Active/Passive Redundancy* +-------------------+-------------------------------------------------------+----------------------+ | Component | Description | Related HA Test Case | @@ -312,18 +458,18 @@ Message Queue :::::::::::::::::::::::::::: Message Queue provides an asynchronous communication protocol. In Openstack, some projects ( like Nova, Cinder) use Message Queue to call their sub components. Although Message Queue -itself is not an HA mechanism, how it works ensures the high availability when redundant -components subscribe to the Message Queue. When a VIM sub component fails, since there are +itself is not an HA mechanism, how it works ensures the high availability when redundant +components subscribe to the Message Queue. When a VIM sub component fails, since there are other redundant components are subscribing to the Message Queue, requests still can be processed. And fault isolation can also be archived since failed components won't fetch requests actively. -Also, the recovery of failed components is required. Figure 6 shows the requirement +Also, the recovery of failed components is required. Figure 9 shows the requirement decomposition of Message Queue. .. 
figure:: images/Message_Queue.png :alt: Message Queue Requirement Decomposition :figclass: align-center - Fig 6. Message Queue Redundancy Requirement Decomposition + Fig 9. Message Queue Redundancy Requirement Decomposition The following requirements are elicited for Message Queue: @@ -337,7 +483,7 @@ implemented by the installer. Table 5 shows the current VIM components using Message Queue and the corresponding HA test cases to verify them. -*Table 5. VIM Components using Messaging Queue* +*Table 6. VIM Components using Messaging Queue* +-------------------+-------------------------------------------------------+----------------------+ | Component | Description | Related HA Test Case | @@ -403,4 +549,4 @@ to verify them. - Openstack High Availability Guide: https://docs.openstack.org/ha-guide/ -- Highly Available (Mirrored) Queues: https://www.rabbitmq.com/ha.html \ No newline at end of file +- Highly Available (Mirrored) Queues: https://www.rabbitmq.com/ha.html diff --git a/R6_HA_Analysis/images/Heartbeating_and_Healthchecks.png b/R6_HA_Analysis/images/Heartbeating_and_Healthchecks.png new file mode 100644 index 0000000..cd7a551 Binary files /dev/null and b/R6_HA_Analysis/images/Heartbeating_and_Healthchecks.png differ diff --git a/R6_HA_Analysis/images/VM_HA_Analysis.png b/R6_HA_Analysis/images/VM_HA_Analysis.png new file mode 100644 index 0000000..e263e60 Binary files /dev/null and b/R6_HA_Analysis/images/VM_HA_Analysis.png differ diff --git a/R6_HA_Analysis/images/VM_Peer_State_Notification_and_Messaging.png b/R6_HA_Analysis/images/VM_Peer_State_Notification_and_Messaging.png new file mode 100644 index 0000000..7614e19 Binary files /dev/null and b/R6_HA_Analysis/images/VM_Peer_State_Notification_and_Messaging.png differ -- cgit 1.2.3-korg
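
Reviewer note (not part of the patch): the challenge-response heartbeat and the two-layer JSON messaging described in the added "Virtual Infrastructure HA" text can be illustrated with a minimal Python sketch. The PDU names Challenge-Request and Challenge-Response come from the document; everything else is an assumption made for illustration only, including the envelope keys `group` and `data`, the group name `heartbeat`, the `challenge` echo token, and the function names. The actual message fields are defined in R5_HA_API/OPNFV_HA_Guest_APIs-Overview_HLD.rst and may differ.

```python
import json

# Hypothetical base-layer envelope: the document only states that a Base JSON
# layer multiplexes groups of message types and that an Application JSON layer
# carries the fields of one group. The keys "group" and "data" are assumptions.
def wrap(group, app_msg):
    """Wrap an application-layer message in a base-layer envelope."""
    return json.dumps({"group": group, "data": app_msg})


def unwrap(raw):
    """Return (group, application message) from a raw base-layer string."""
    envelope = json.loads(raw)
    return envelope["group"], envelope["data"]


def guest_respond(raw, health_check=lambda: True):
    """Guest daemon side: answer a Challenge-Request with a Challenge-Response.

    health_check is any callable returning a truthy/falsy health status; the
    default models the simplest case of immediately responding with OK.
    """
    group, msg = unwrap(raw)
    if group != "heartbeat" or msg.get("type") != "Challenge-Request":
        return None  # not a heartbeat challenge, ignore it
    return wrap("heartbeat", {
        "type": "Challenge-Response",
        "challenge": msg.get("challenge"),  # hypothetical token echoed back
        "healthy": bool(health_check()),
    })


def host_evaluate(challenge, raw_reply):
    """Host side: True if the guest is healthy, False if a fault event
    should be reported (no reply, invalid reply, or ill health)."""
    if raw_reply is None:  # no response before the next interval
        return False
    try:
        group, msg = unwrap(raw_reply)
    except (ValueError, KeyError, TypeError):
        return False  # malformed reply counts as an invalid response
    if not isinstance(msg, dict):
        return False
    return (group == "heartbeat"
            and msg.get("type") == "Challenge-Response"
            and msg.get("challenge") == challenge
            and msg.get("healthy") is True)


if __name__ == "__main__":
    request = wrap("heartbeat", {"type": "Challenge-Request", "challenge": 7})
    print(host_evaluate(7, guest_respond(request)))
    print(host_evaluate(7, guest_respond(request, health_check=lambda: False)))
```

In this sketch a missing reply, a malformed reply, and a reply reporting ill health all lead to the same outcome on the host side, which mirrors the document's statement that any of these results in a fault event being reported.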