Storage and High Availability Scenarios
=======================================

5.1 Elements of HA Storage Management and Delivery
---------------------------------------------------

Storage infrastructure, in any environment, can be broken down into two
domains: the Data Path and the Control Path. The High Availability of the
storage infrastructure is generally measured by the occurrence of Data
Unavailability and Data Loss (DU/DL) events. While that meaning is obvious
for the Data Path, it applies to the Control Path as well. The inability to
attach a volume that holds data to a host, for example, can be considered a
Data Unavailability event. Likewise, the inability to create a volume to
store data can be considered Data Loss, since it may result in the
inability to store critical data.

Storage HA mechanisms are an integral part of most High Availability
solutions today. In the first two sections below, we define the mechanisms
of redundancy and protection required in the infrastructure for storage
delivery in both the Data and Control Paths. Storage services that provide
these mechanisms can be used in HA environments that are built on a highly
available storage infrastructure.

In the third section below, we examine HA implementations that rely on
highly available storage infrastructure. Note that the scope throughout
this section is limited to local HA solutions. It does not address rapid
remote Disaster Recovery scenarios that may be provided by storage, nor
does it address metro active/active environments that implement stretched
clusters of hosts across multiple sites for workload migration and
availability.

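Since storage HA is measured here by the occurrence of DU/DL events, the
following small worked example shows how a year's worth of Data
Unavailability event durations translates into an availability percentage.
The event durations are invented purely for illustration.

.. code-block:: python

   # Convert Data Unavailability (DU) event durations into yearly availability.
   MINUTES_PER_YEAR = 365 * 24 * 60

   # Assumed DU events observed over one year, in minutes of unavailability.
   du_events_minutes = [12.0, 3.5, 0.5]

   downtime = sum(du_events_minutes)
   availability = 100.0 * (MINUTES_PER_YEAR - downtime) / MINUTES_PER_YEAR

   print(f"downtime: {downtime:.1f} min/year -> "
         f"availability: {availability:.4f}%")
   # For comparison, "five nines" (99.999%) availability allows only about
   # 5.26 minutes of downtime per year.
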
5.2 Storage Failure & Recovery Scenarios: Storage Data Path
-----------------------------------------------------------

In the failure and recovery scenarios described below, a redundant network
infrastructure provides HA across network-related device failures, while a
variety of strategies are used to reduce or minimize DU/DL events caused by
storage system failures. This starts with redundant storage network paths,
as shown in Figure 29.

.. figure:: StorageImages/RedundantStoragePaths.png
   :alt: HA Storage Infrastructure
   :figclass: align-center

   Figure 29: Typical Highly Available Storage Infrastructure

Storage implementations vary tremendously, and the recovery mechanisms for
each implementation vary with them. Because it is unpredictable which
storage implementations may be used for NFVI, the scenarios described below
are limited to:

1. high-level descriptions of the most common implementations;

2. HW- and SW-related failures (and recovery) of the storage data path,
   excluding the user configuration and operational issues that typically
   cause the most common storage failures;

3. non-LVM/DAS-based storage implementations (managing failure and recovery
   in LVM-based storage for OpenStack is a very different scenario with
   less of a reliable track record); and

4. block storage only, not object storage, which is often used for
   stateless applications (at a high level, object stores may include a
   subset of the block scenarios under the covers).

To define the requirements for the data path, we start at the compute node
and work down the storage IO stack, touching on both HW and SW
failure/recovery scenarios for HA along the way, using Figure 29 as a
reference. (A simple path-redundancy check is sketched after this list.)

1. Compute IO driver: Assuming iSCSI for connectivity between compute and
   storage, an iSCSI initiator on the compute node maintains redundant
   connections to multiple iSCSI targets for the same storage service.
   These redundant connections may be aggregated for greater throughput or
   run independently. This redundancy allows the iSCSI initiator to handle
   failures in network connectivity from compute to storage infrastructure.
   (Fibre Channel works largely the same way, as do proprietary drivers
   that connect a host's IO stack to storage systems.)

2. Compute node network interface controller (NIC): This device may fail,
   with the failure reported via whatever means is in place for such
   reporting from the host. The redundant paths between iSCSI initiators
   and targets allow connectivity from compute to storage to remain up,
   though operating at reduced capacity.

3. Network switch failure for the storage network: Assuming there are
   redundant switches in place, and everything is properly configured so
   that two compute NICs go to two separate switches, which in turn go to
   two different storage controllers, then a switch may fail and the
   redundant paths between iSCSI initiators and targets keep connectivity
   from compute to storage operational, though at reduced capacity.

4. Storage system network interface failure: Assuming there are redundant
   storage system network interfaces (on separate storage controllers),
   then one may fail and the redundant paths between iSCSI initiators and
   targets keep connectivity from compute to storage operational, though at
   reduced performance. The extent of the reduced performance depends on
   the storage architecture; see Section 3.5 for more.

5. Storage controller failure: A storage system can, at a very high level,
   be described as composed of network interfaces, one or more storage
   controllers that manage access to data, and a shared Data Path to the
   HDD/SSD subsystem. Network interface failure is covered in item 4, and
   the HDD/SSD subsystem in item 6. All modern storage architectures have
   either redundant or distributed storage controllers. In **dual storage
   controller architectures**, high availability is maintained through the
   ALUA protocol, which preserves access to primary and secondary paths to
   iSCSI targets. Once a storage controller fails, the array operates in a
   (potentially) degraded performance mode until the failed storage
   controller is replaced. The degree of reduced performance depends on the
   overall original load on the array. Dual storage controller arrays also
   remain at risk of a Data Unavailability event should the second storage
   controller fail. This is rare, but should be accounted for in planning
   support and maintenance contracts.

   **Distributed storage controller architectures** are generally
   server-based, and may or may not run on the compute servers in Converged
   Infrastructure environments. The concept of a "storage controller" is
   therefore abstract, in that it may involve a distribution of software
   components across multiple servers; Ceph and ScaleIO are examples. In
   these environments, the data may be stored redundantly, and the metadata
   for accessing the data in these redundant locations is available to
   whichever compute node needs the data (with authorization, of course).
   Data may also be stored using erasure coding (EC) for greater
   efficiency. The loss of a storage controller in this context amounts to
   the loss of a server in the distributed architecture. In the event of
   such a loss, if data is held in duplicate or triplicate on other
   servers, access is simply redirected to maintain data availability; with
   EC-based protection, the data is rebuilt on the fly. The performance and
   increased-risk impact in this case depends on the time required to
   rebalance the storage distribution across the other servers in the
   environment. Depending on configuration and implementation, it could
   also impact storage access performance for VNFs.

6. HDD/SSD subsystem: This subsystem contains any RAID controllers,
   spinning hard disk drives, and Solid State Drives. The failure of a RAID
   controller is equivalent to the failure of a storage controller, as
   described in item 5. The failure of one or more storage devices is
   protected against by RAID parity-based protection, Erasure Coding, or
   duplicate/triplicate storage of the data. RAID and Erasure Coding are
   typically more space-efficient, while duplicate/triplicate storage
   provides better performance; this tradeoff is a common point of
   contention among implementations, and here we simply assume that failed
   devices do not cause Data Loss events thanks to these protection
   algorithms. Multiple device failures can potentially cause Data Loss
   events, and the risk of each method must be weighed against the HA
   requirements of the desired deployment.

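As a minimal illustration of the path redundancy assumed in items 1-4, the
sketch below counts the active iSCSI sessions per target on a compute node
by parsing the output of ``iscsiadm -m session`` (open-iscsi). The
two-sessions-per-target threshold and the output parsing are illustrative
assumptions; a production check would normally also query the multipath
layer.

.. code-block:: python

   """Warn when a compute node has fewer than two iSCSI sessions per target.

   Minimal sketch: parses `iscsiadm -m session` output, whose lines normally
   look like "tcp: [1] 192.168.10.5:3260,1 iqn.2010-10.org.example:vol-1".
   The expected-session threshold is an assumption for illustration.
   """
   import collections
   import subprocess
   import sys

   MIN_SESSIONS_PER_TARGET = 2  # assumed redundancy requirement


   def sessions_per_target():
       try:
           out = subprocess.run(["iscsiadm", "-m", "session"],
                                capture_output=True, text=True,
                                check=True).stdout
       except subprocess.CalledProcessError:
           return collections.Counter()  # iscsiadm exits non-zero if no sessions
       counts = collections.Counter()
       for line in out.splitlines():
           parts = line.split()
           if len(parts) >= 4 and parts[0].startswith("tcp"):
               counts[parts[3]] += 1     # parts[3] is the target IQN
       return counts


   degraded = False
   for iqn, count in sessions_per_target().items():
       if count < MIN_SESSIONS_PER_TARGET:
           print(f"WARNING: {iqn} has {count} active session(s); "
             f"path redundancy is degraded")
           degraded = True
   sys.exit(1 if degraded else 0)
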
5.3 Storage Failure & Recovery Scenarios: Storage Control Path
---------------------------------------------------------------

As it relates to an NFVI environment, as proposed by OPNFV, there are two
parts to the storage control path:

* the storage system-specific control path to the storage controller, and

* the OpenStack-specific cloud management framework for managing the
  different storage elements.


5.3.1 Storage System Control Paths
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

High Availability of a storage controller is storage system-specific, so it
is best broken down by implementation variant. Both variants, however,
assume an IP-based management API in order to leverage network redundancy
mechanisms for ubiquitous management access.

An appliance-style storage array with dual storage controllers must
implement IP address failover for the management API's IP endpoint, in
either an active/active or active/passive configuration. Likewise, a
storage array with more than two storage controllers would bring up a
management endpoint on another storage controller in such an event.
Cluster-style IP address load balancing is also a viable implementation in
these scenarios.

In distributed storage controller architectures, the storage system itself
provides redundant storage controller interfaces. For example, Ceph's RADOS
provides redundant paths to access an OSD for volume creation or access,
and EMC's ScaleIO provides redundant MetaData Managers for managing volume
creation and access. In the former case access is via a proprietary
protocol; in the latter it is via an HTTP-based REST API. Other storage
implementations may provide alternative methods, but any enterprise-class
storage system will have built-in HA for management API access (a
client-side failover sketch follows at the end of this subsection).

Finally, note that single-server storage solutions, such as LVM, do not
have HA solutions for their control paths. If the server has failed, the
management of that server's storage is not available.

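To make the management-path redundancy concrete, here is a minimal
client-side sketch that issues a management REST call against an ordered
list of redundant API endpoints, failing over to the next endpoint on a
connection error or timeout. The endpoint addresses, the ``/api/volumes``
path, and the timeout are illustrative assumptions, not any particular
vendor's API.

.. code-block:: python

   import requests

   # Hypothetical redundant management endpoints (e.g., one per storage
   # controller, or a floating VIP plus a fallback address).
   MGMT_ENDPOINTS = [
       "https://192.168.20.11:8443",
       "https://192.168.20.12:8443",
   ]


   def mgmt_get(path, timeout=5.0):
       """GET a management resource, failing over across redundant endpoints."""
       last_error = None
       for endpoint in MGMT_ENDPOINTS:
           try:
               resp = requests.get(endpoint + path, timeout=timeout)
               resp.raise_for_status()
               return resp.json()
           except (requests.ConnectionError, requests.Timeout) as exc:
               last_error = exc  # endpoint unreachable; try the next one
       raise RuntimeError(f"all management endpoints failed: {last_error}")


   # Example: list volumes via an assumed /api/volumes resource.
   # volumes = mgmt_get("/api/volumes")
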
5.3.2 OpenStack Controller Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OpenStack cloud management is composed of a number of function-specific
management modules, such as Keystone for Identity and Access Management
(IAM), Nova for compute management, Cinder for block storage management,
Swift for object storage delivery, Neutron for network management, and
Glance as an image repository. In smaller single-cloud environments, these
management services are managed in concert for High Availability; in larger
multi-cloud environments, the Keystone IAM may logically stand alone in its
own HA delivery across the multiple clouds, as might Swift as a common
object store. Nova, Cinder, and Glance may have separate scopes of
management, but they are more typically managed together as a logical cloud
deployment.

The OpenStack deployment mechanisms are responsible for the HA deployment
of these management infrastructures. These tools, such as Fuel, RDO, and
others, have matured to include highly available implementations of the
database, the API, and each of the manager modules associated with the
cloud management domains.

There are many interdependencies among these modules that affect Cinder
high availability. For example (a service-health sketch appears at the end
of this subsection):

* Cinder is implemented as an active/standby failover implementation, since
  it requires a single point of control at any one time for the Cinder
  manager/driver. The Cinder manager/driver is deployed on two of the three
  OpenStack controller nodes; one is made active while the other is
  passive. This may be improved to active/active in a future release.

* A highly available database implementation must be delivered, using
  something like MySQL/Galera replication across the three OpenStack
  controller nodes; Cinder requires an HA database in order to be HA
  itself.

* A redundant RabbitMQ messaging implementation is required across the same
  three OpenStack controller nodes; likewise, Cinder requires an HA
  messaging system.

* A redundant OpenStack API is required to ensure that Cinder requests can
  be delivered.

* An HA cluster manager, such as Pacemaker, monitors each of the deployed
  manager elements on the OpenStack controllers and provides restart
  capability. Keepalived is an alternative implementation for monitoring
  processes and restarting them on alternate OpenStack controller nodes.
  While statistics are lacking, the Pacemaker implementation is generally
  believed to be the more frequently deployed in HA environments.

For more information on current thinking about OpenStack and Cinder HA, see
http://docs.openstack.org/ha-guide.

While the specific combination of management functions hosted on these
redundant OpenStack controllers may vary with the deployment requirements
of a specific small or large environment, the basic implementation of
three-controller redundancy remains relatively common. In these
implementations, the highly available OpenStack controller environment
provides HA access to the highly available storage controllers via the
highly available IP network.

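As a small example of monitoring this control path, the sketch below lists
the Block Storage services and flags any enabled service whose state is not
``up``, using the ``openstack volume service list`` command from
python-openstackclient. It assumes admin credentials are already exported
in the environment, and the health criterion (every enabled service up) is
an illustrative choice.

.. code-block:: python

   import json
   import subprocess


   def volume_service_health():
       """Return (healthy, down_services) as reported by the Cinder API.

       Relies on `openstack volume service list -f json`, which reports each
       Block Storage binary (cinder-volume, cinder-scheduler, ...) with its
       Host, Status (enabled/disabled) and State (up/down). Assumes OS_*
       authentication variables are already set in the environment.
       """
       out = subprocess.run(
           ["openstack", "volume", "service", "list", "-f", "json"],
           capture_output=True, text=True, check=True).stdout
       services = json.loads(out)
       down = [s for s in services
               if s.get("Status") == "enabled" and s.get("State") != "up"]
       return (len(down) == 0, down)


   healthy, down = volume_service_health()
   if healthy:
       print("all enabled Block Storage services are up")
   else:
       for svc in down:
           print(f"DOWN: {svc.get('Binary')} on {svc.get('Host')}")
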
5.4 The Role of Storage in HA
-----------------------------

The sections above describe data and control path requirements, and example
implementations, for the delivery of highly available storage
infrastructure. In summary:

* Most modern storage infrastructure implementations are inherently highly
  available. Exceptions certainly apply; for example, simply using LVM for
  storage presentation at each server does not satisfy HA requirements.
  However, modern storage systems such as Ceph, ScaleIO, XIV, VNX, and many
  others with OpenStack integrations certainly do have such HA
  capabilities.

* This is delivered predominantly through network-accessible shared storage
  systems, in tightly coupled configurations such as clustered hosts, or in
  loosely coupled configurations such as global object stores.

Storage is an integral part of HA delivery today for applications,
including VNFs. This is examined below in terms of using storage as a key
part of HA delivery, the possible scope and limitations of that delivery,
and example implementations of such service. We examine this for both block
and object storage infrastructures.

5.4.1 VNF, VNFC, and VM HA in a Block Storage HA Context
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Several scenarios were described in another section with regard to managing
HA at the VNFC level, with variants of recovery based on either VIM- or
VNFM-based reporting/detection/recovery mechanisms. In a block storage
context these distinctions are abstract: the storage service is unaware of
them, whether or not the workload is intended to be HA.

In a block storage context, HA is delivered via a logical block device
(sometimes called a Logical Unit, or LUN) or, in some cases, to a VM. VMs
and logical block devices are the units of currency.

.. figure:: StorageImages/HostStorageCluster.png
   :alt: Host Storage Cluster
   :figclass: align-center

   Figure 30: Typical HA Cluster With Shared Storage

In Figure 30, several hosts all share access, via an IP network or via
Fibre Channel, to a common set of logical storage devices. In an ESX
cluster implementation, these hosts all access all devices, with
coordination provided by the SCSI Reservation mechanism. In the particular
ESX case, the logical storage devices provided by the storage service
actually aggregate the volumes (VMDKs) utilized by VMs. As a result,
multiple-host access to the same storage service logical device is dynamic.
The vSphere management layer provides the host cluster management.

In other cases, such as KVM, cluster management is not formally required,
per se, because each logical block device presented by the storage service
is uniquely allocated to one particular VM, which can only execute on a
single host at a time. In this case, any host that can access the same
storage service is potentially a part of the "cluster". While *potential*
access from another host to the same logical block device is necessary, the
actual connectivity is restricted to one host at a time. This is a loosely
coupled cluster implementation, rather than the tightly coupled cluster
implementation of ESX.

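The loosely coupled KVM case can be sketched with libvirt: when a VM's
persistent disk resides on shared storage reachable from both hosts, an HA
manager can restart the VM on a standby host once the active host is
declared failed. This is a conceptual sketch only, not a Pacemaker
implementation; the host URIs, domain name, and saved domain XML path are
assumptions, and a real deployment must also fence the failed host so the
logical block device is only in use by one running instance at a time.

.. code-block:: python

   import libvirt  # libvirt-python

   ACTIVE_URI = "qemu+ssh://compute-a/system"    # assumed host URIs
   STANDBY_URI = "qemu+ssh://compute-b/system"
   DOMAIN_NAME = "vnf-vm-01"                     # assumed domain name
   DOMAIN_XML = "/shared/config/vnf-vm-01.xml"   # XML referencing the shared volume


   def domain_is_active(uri, name):
       """True if the domain exists and is running on the given host."""
       try:
           conn = libvirt.open(uri)
           try:
               return bool(conn.lookupByName(name).isActive())
           finally:
               conn.close()
       except libvirt.libvirtError:
           return False  # host unreachable or domain not defined there


   def restart_on_standby():
       """Start the VM on the standby host from its saved definition."""
       with open(DOMAIN_XML) as f:
           xml = f.read()
       conn = libvirt.open(STANDBY_URI)
       try:
           conn.createXML(xml, 0)  # boots a transient domain from the XML
       finally:
           conn.close()


   if not domain_is_active(ACTIVE_URI, DOMAIN_NAME):
       # A real HA manager must fence the failed host before restarting here.
       restart_on_standby()
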
So, if a single VNF is implemented as a single VM, HA is provided by
allowing that VM to execute on a different host, with access to the same
logical block device, and therefore the same persistent data, located on
the storage service. This also applies to multiple VNFs implemented within
a single VM, though a failure then impacts all of those VNFs together.

If a single VNF is implemented across multiple VMs as multiple VNFCs, then
all of the VMs that comprise the VNF may need to be protected in a
consistent fashion. The storage service is not aware of this distinction
from the previous example. However, a higher-level implementation, such as
an HA Manager (perhaps implemented in a VNFM), may monitor and restart the
collection of VMs on alternate hosts. In an ESX environment, VM restarts
are most expeditiously handled by using vSphere-level HA mechanisms within
an HA cluster, for individual VMs or collections of them. In KVM
environments, a separate HA monitoring service, such as Pacemaker, can be
used to monitor individual VMs, or entire multi-VM applications, and
provide restart capabilities on separately configured hosts that also have
access to the same logical storage devices.

VM restart times, however, are measured in tens of seconds. This may
sometimes meet the SAL-3 recovery requirements for General Consumer,
Public, and ISP Traffic, but it will never meet the 5-6 seconds required
for SAL-1 Network Operator Control and Emergency Services. For this,
additional capabilities are necessary.

In order to meet SAL-1 restart times, it is necessary to have:

1. a hot spare VM already up and running in an active/passive
   configuration; and

2. little to no state-update requirement for the passive VM to take over.

Having a spare VM up and running is easy enough; putting that VM in an
appropriate state to take over execution is the difficult part. In shared
storage implementations for Fault Tolerance, which can achieve SAL-1
requirements, the VMs share access to the same storage device, and a
wrapper function is used to update internal memory state for every
interaction with the active VM.

This may be done in one of two ways, as illustrated in Figure 31. In the
first, the hypervisor sends all interface interactions to the passive as
well as the active VM. The interaction is handled completely by
hypervisor-to-hypervisor wrappers, represented by the purple box
encapsulating the VM in Figure 31, and is completely transparent to the VM.
This is available with the vSphere Fault Tolerant option, but not with KVM
at this time.

.. figure:: StorageImages/FTCluster.png
   :alt: FT host and storage cluster
   :figclass: align-center

   Figure 31: A Fault Tolerant Host/Storage Configuration

In the second, a VM-level wrapper is used to capture checkpoints of state
from the active VM and transfer them to the passive VM, similarly
represented by the purple box encapsulating the VM in Figure 31. Various
levels of application-specific integration are required for this wrapper to
capture and transfer checkpoints of state, depending on the level of state
consistency required. OpenSAF is an example of an application wrapper that
can be used for this purpose. Both techniques have significant network
bandwidth requirements and may have certain limitations and requirements
for implementation.

In both cases, the active and passive VMs share the same storage
infrastructure, although the OpenSAF-style implementation may also make use
of separate storage infrastructure (not shown in Figure 31).

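The checkpoint-based (second) technique can be reduced to a toy sketch: the
active instance appends sequence-numbered state checkpoints to a file on
the shared storage device, and the passive instance resumes from the last
complete checkpoint on takeover. The file path and JSON encoding are
assumptions for illustration; real wrappers such as OpenSAF checkpoint
through dedicated services with failure detection and stronger consistency
guarantees, not a shared file.

.. code-block:: python

   import json
   import os

   CHECKPOINT_PATH = "/mnt/shared/vnf-state.ckpt"  # assumed shared-storage mount


   def write_checkpoint(seq, state):
       """Active side: append a durable, sequence-numbered checkpoint."""
       record = json.dumps({"seq": seq, "state": state})
       with open(CHECKPOINT_PATH, "a") as f:
           f.write(record + "\n")
           f.flush()
           os.fsync(f.fileno())  # make the checkpoint durable before ACKing work


   def recover_latest():
       """Passive side (on takeover): resume from the last complete checkpoint."""
       latest = None
       try:
           with open(CHECKPOINT_PATH) as f:
               for line in f:
                   try:
                       latest = json.loads(line)
                   except ValueError:
                       break      # torn final record: stop at the last good one
       except FileNotFoundError:
           pass
       return latest              # None means a cold start


   write_checkpoint(1, {"sessions": 42})
   write_checkpoint(2, {"sessions": 43})
   print(recover_latest())        # {'seq': 2, 'state': {'sessions': 43}}
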
Looking forward to the longer term, both of these approaches may be made
obsolete. As soon as 2016, PCIe fabrics will start to become available that
enable shared NVMe-based storage systems. While these storage systems may
be used with traditional protocols like SCSI, they will also be usable by
true NVMe-oriented applications whose memory state is persisted, and can be
shared, in an active/passive mode across hosts. The HA mechanisms here are
yet to be defined, but they will be far superior to either of the
mechanisms described above. This, however, is still in the future.


5.4.2 HA and Object Stores in Loosely Coupled Compute Environments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whereas block storage services require tight coupling of hosts to storage
services via SCSI protocols, the interaction of applications with
HTTP-based object stores is a very loosely coupled relationship. This means
that VMs can come and go, or be organized as an N+1 redundant deployment of
VMs for a given VNF. Each individual object transaction constitutes the
duration of the coupling, whereas with SCSI-based logical block devices the
coupling is active for the duration of the VM's mounting of the device.

However, the requirement for implementation here is that the state of a
transaction being performed is made persistent to the object store by the
VM, as the restartable checkpoint for high availability. Multiple VMs may
access the object store essentially simultaneously, and each object
transaction must be made idempotent by the application.

HA restart of a transaction in this environment depends on failure
detection and on the transaction timeout values of the applications calling
the VNFs, which may be rather high, and even incompatible with the SAL
requirements. For example, while the General Consumer, Public, and ISP
Traffic recovery time for SAL-3 is 20-25 seconds, common default timeouts
for applications using HTTP are typically around 10 seconds or higher, and
default browser timeouts are upwards of 120 seconds. This puts a
requirement on the load balancers to manage and restart transactions in a
timeframe that may be a challenge even for SAL-3 requirements.

Despite these performance issues, the use of object storage for highly
available solutions in native cloud applications is very powerful. Object
storage services are generally globally distributed and replicated using
eventual-consistency techniques, though transaction-level consistency can
also be achieved in some cases (at the cost of performance). For an
interesting discussion of this, look up the CAP Theorem.


5.5 Summary
-----------

This section addressed several points:

* Modern storage systems are inherently Highly Available when based on
  modern and reasonable implementations and deployments.

* Storage is typically a central component in offering highly available
  infrastructures, whether through block storage services for traditional
  applications or through object storage services that may be shared
  globally with eventual consistency.

* Cinder HA management capabilities are defined and available through the
  use of OpenStack deployment tools, making the entire storage control and
  data path highly available.