Storage and High Availability Scenarios
=======================================

5.1 Elements of HA Storage Management and Delivery
--------------------------------------------------

Storage infrastructure, in any environment, can be broken down into two domains: the Data Path and the Control Path. Generally, High Availability of the storage infrastructure is measured by the occurrence of Data Unavailability and Data Loss (DU/DL) events. While that meaning is obvious as it relates to the Data Path, it applies to the Control Path as well. The inability to attach a volume that holds data to a host, for example, can be considered a Data Unavailability event. Likewise, the inability to create a volume to store data could be considered Data Loss, since it may result in the inability to store critical data.

Storage HA mechanisms are an integral part of most High Availability solutions today. In the first two sections below, we define the mechanisms of redundancy and protection required in the infrastructure for storage delivery in both the Data and Control Paths. Storage services that provide these mechanisms can be used in HA environments that are based on a highly available storage infrastructure. In the third section below, we examine HA implementations that rely on highly available storage infrastructure.

Note that the scope throughout this section is limited to local HA solutions. It does not address rapid remote Disaster Recovery scenarios that may be provided by storage, nor does it address metro active/active environments that implement stretched clusters of hosts across multiple sites for workload migration and availability.

5.2 Storage Failure & Recovery Scenarios: Storage Data Path
-----------------------------------------------------------

In the failure and recovery scenarios described below, a redundant network infrastructure provides HA through network-related device failures, while a variety of strategies are used to reduce or minimize DU/DL events caused by storage system failures. This starts with redundant storage network paths, as shown in Figure 29.

.. figure:: StorageImages/RedundantStoragePaths.png
    :alt: HA Storage Infrastructure
    :figclass: align-center

    Figure 29: Typical Highly Available Storage Infrastructure

Storage implementations vary tremendously, and the recovery mechanisms for each implementation will vary as well. The scenarios described below are limited to: 1) high-level descriptions of the most common implementations, since it is unpredictable which storage implementations may be used for NFVI; 2) HW- and SW-related failures (and recovery) of the storage data path, excluding the user configuration and operational issues that are, in practice, the most common causes of storage failures; 3) non-LVM/DAS-based storage implementations (managing failure and recovery in LVM-based storage for OpenStack is a very different scenario with a less reliable track record); and 4) block storage only, not object storage, which is often used for stateless applications (at a high level, object stores may include a subset of the block scenarios under the covers).

To define the requirements for the data path, we start at the compute node and work down the storage IO stack, touching on both HW and SW failure/recovery scenarios for HA along the way, using Figure 29 as a reference.

1. Compute IO driver: Assuming iSCSI for connectivity between compute and storage, an iSCSI initiator on the compute node maintains redundant connections to multiple iSCSI targets for the same storage service. These redundant connections may be aggregated for greater throughput, or run independently. This redundancy allows the iSCSI initiator to handle failures in network connectivity from compute to storage infrastructure. (Fibre Channel works largely the same way, as do proprietary drivers that connect a host's IO stack to storage systems.) A sketch of a path-redundancy check for this layer follows this list.

2. Compute node network interface controller (NIC): This device may fail, and such a failure is reported via whatever means is in place for reporting from the host. The redundant paths between iSCSI initiators and targets allow connectivity from compute to storage to remain up, though operating at reduced capacity.

3. Network switch failure for the storage network: Assuming there are redundant switches in place, and everything is properly configured so that two compute NICs go to two separate switches, which in turn go to two different storage controllers, then a switch may fail and the redundant paths between iSCSI initiators and targets keep connectivity from compute to storage operational, though at reduced capacity.

4. Storage system network interface failure: Assuming there are redundant storage system network interfaces (on separate storage controllers), one may fail and the redundant paths between iSCSI initiators and targets keep connectivity from compute to storage operational, though at reduced performance. The extent of the reduced performance depends on the storage architecture. See 3.5 for more.

5. Storage controller failure: A storage system can, at a very high level, be described as composed of network interfaces, one or more storage controllers that manage access to data, and a shared Data Path to the HDD/SSD subsystem. Network interface failure is described in #4, and the HDD/SSD subsystem is described in #6. All modern storage architectures have either redundant or distributed storage controller architectures.

   In **dual storage controller architectures**, high availability is maintained through the ALUA protocol, which maintains access to primary and secondary paths to iSCSI targets. Once a storage controller fails, the array operates in a (potentially) degraded performance mode until the failed storage controller is replaced. The degree of reduced performance depends on the overall original load on the array. Dual storage controller arrays also remain at risk of a Data Unavailability event should the second storage controller fail. This is rare, but should be accounted for in planning support and maintenance contracts.

   **Distributed storage controller architectures** are generally server-based, and may or may not operate on the compute servers in Converged Infrastructure environments. Hence the concept of "storage controller" is abstract, in that it may involve a distribution of software components across multiple servers. Examples include Ceph and ScaleIO. In these environments, the data may be stored redundantly, and the metadata for accessing the data in these redundant locations is available to whichever compute node needs the data (with authorization, of course). Data may also be stored using erasure coding (EC) for greater efficiency.

   The loss of a storage controller in this context amounts to the loss of a server in the distributed storage controller architecture. In the event of such a loss, if the data is held in duplicate or triplicate on other servers, access is simply redirected to maintain data availability. In the case of EC-based protection, the data is rebuilt on the fly. The performance and increased-risk impact in this case depends on the time required to rebalance storage distribution across the remaining servers in the environment. Depending on configuration and implementation, it could also impact storage access performance for VNFs.

6. HDD/SSD subsystem: This subsystem contains any RAID controllers, spinning hard disk drives, and Solid State Drives. The failure of a RAID controller is equivalent to the failure of a storage controller, as described in #5 above. The failure of one or more storage devices is protected against by RAID parity-based protection, Erasure Coding protection, or duplicate/triplicate storage of the data. RAID and Erasure Coding are typically more space-efficient, while duplicate/triplicate storage provides better performance. This tradeoff is a common point of contention among implementations, and we will not go into greater detail beyond assuming that failed devices do not cause Data Loss events, thanks to these protection algorithms. Multiple device failures can potentially cause Data Loss events, and the risk of each method must be weighed against the HA requirements of the desired deployment.
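
To make the path redundancy discussed in items 1-4 concrete, the following is a minimal sketch of how a compute node could verify that a multipath block device still has more than one healthy path. It assumes a Linux host using dm-multipath over iSCSI, that ``multipath -ll`` reports each healthy path with "active ready running", and a hypothetical device alias ``mpatha``; none of these specifics come from this document, and a real deployment would rely on multipathd events and the platform's fault-management stack rather than a script like this.

.. code-block:: python

    #!/usr/bin/env python3
    """Illustrative sketch: warn when a multipath device loses path redundancy.

    Assumptions (not from the source text): Linux dm-multipath over iSCSI,
    and `multipath -ll` output listing healthy paths as "active ready running".
    """

    import subprocess
    import sys

    MIN_PATHS = 2  # redundancy threshold: at least two independent paths


    def active_path_count(device: str) -> int:
        """Count paths reported healthy by `multipath -ll` for one device."""
        out = subprocess.run(
            ["multipath", "-ll", device],
            capture_output=True, text=True, check=True
        ).stdout
        # Each healthy path line typically contains "active ready".
        return sum(1 for line in out.splitlines() if "active ready" in line)


    if __name__ == "__main__":
        dev = sys.argv[1] if len(sys.argv) > 1 else "mpatha"  # hypothetical alias
        paths = active_path_count(dev)
        if paths < MIN_PATHS:
            print(f"WARNING: {dev} has only {paths} active path(s); "
                  "redundancy is degraded", file=sys.stderr)
            sys.exit(1)
        print(f"{dev}: {paths} active paths, redundancy OK")

The same pattern applies conceptually to Fibre Channel or proprietary multipath drivers: the host should always see at least two independent paths, and a drop to one path is a warning condition even though I/O continues.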
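
To illustrate how a client can take advantage of the redundant management endpoints described above, the following is a minimal sketch that probes a list of management API addresses and uses the first one that answers. The endpoint URLs and the ``/health`` path are hypothetical placeholders; every array or SDS product defines its own management API, and many hide the redundancy behind a single floating/virtual IP instead.

.. code-block:: python

    """Illustrative sketch: reach a storage management API exposed on
    redundant endpoints (e.g., one per storage controller, or a VIP plus a
    backup address). URLs and health path are hypothetical."""

    import requests

    # Hypothetical management endpoints on two different storage controllers.
    ENDPOINTS = [
        "https://ctrl-a.storage.example.net/api",
        "https://ctrl-b.storage.example.net/api",
    ]


    def first_reachable(endpoints, timeout=3.0):
        """Return the first endpoint answering a health probe, or None."""
        for url in endpoints:
            try:
                resp = requests.get(f"{url}/health", timeout=timeout)
                if resp.ok:
                    return url
            except requests.RequestException:
                continue  # controller or its interface is down; try the next one
        return None


    if __name__ == "__main__":
        api = first_reachable(ENDPOINTS)
        if api is None:
            raise SystemExit("No management endpoint reachable: control-path DU event")
        print(f"Using management endpoint {api}")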

5.3.2 OpenStack Controller Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OpenStack cloud management comprises a number of function-specific management modules, such as Keystone for Identity and Access Management (IAM), Nova for compute management, Cinder for block storage management, Swift for Object Storage delivery, Neutron for network management, and Glance as an image repository. In smaller single-cloud environments, these management systems are managed in concert for High Availability; in larger multi-cloud environments, Keystone IAM may logically stand alone in its own HA delivery across the multiple clouds, as might Swift as a common Object Store. Nova, Cinder, and Glance may have separate scopes of management, but they are more typically managed together as a logical cloud deployment.

It is the OpenStack deployment mechanisms that are responsible for HA deployment of these management infrastructures. These tools, such as Fuel, RDO, and others, have matured to include highly available implementations of the database, the API, and each of the manager modules associated with the scope of the cloud management domains. There are many interdependencies among these modules that impact Cinder high availability. For example:

* Cinder is implemented as an active/standby failover service, since it requires a single point of control at any one time for the Cinder manager/driver. The Cinder manager/driver is deployed on two of the three OpenStack controller nodes; one is made active while the other is passive. This may be improved to active/active in a future release.
* A highly available database implementation must be delivered, for example using MySQL/Galera replication across the three OpenStack controller nodes. Cinder requires an HA database in order for it to be HA.
* A redundant RabbitMQ messaging implementation must run across the same three OpenStack controller nodes; likewise, Cinder requires an HA messaging system.
* A redundant OpenStack API is needed to ensure Cinder requests can be delivered.
* An HA cluster manager, such as Pacemaker, monitors each of the deployed manager elements on the OpenStack controllers and provides restart capability. Keepalived is an alternative implementation for monitoring processes and restarting them on alternate OpenStack controller nodes. While statistics are lacking, it is generally believed that Pacemaker is the more frequently used implementation in HA environments.

For more information on OpenStack and Cinder HA, see http://docs.openstack.org/ha-guide for current thinking.

While the specific combinations of management functions in these redundant OpenStack controllers may vary with the specific small/large environment deployment requirements, the basic implementation of three-way OpenStack controller redundancy remains relatively common. In these implementations, the highly available OpenStack controller environment provides HA access to the highly available storage controllers via the highly available IP network.
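
As a concrete illustration of this control path, the following is a minimal sketch that creates a Cinder volume through the HA API endpoint (typically a virtual IP fronted by the API proxies on the three controllers) using the openstacksdk client library. The cloud name ``mycloud`` and the volume parameters are placeholders, not values taken from this document; the point is only that a volume-create request succeeds as long as one controller node is serving the HA Cinder API, database, and messaging layers.

.. code-block:: python

    """Illustrative sketch: exercise the storage control path via the HA
    OpenStack endpoint. Cloud name and volume parameters are placeholders."""

    import time

    import openstack

    # Credentials/region for "mycloud" come from clouds.yaml or environment vars.
    conn = openstack.connect(cloud="mycloud")

    # A volume-create request is a control-path operation handled by the
    # active Cinder manager behind the HA API endpoint.
    volume = conn.block_storage.create_volume(size=10, name="ha-demo-volume")

    # Poll until the volume settles; a healthy HA control plane should reach
    # "available" even if one controller node is down.
    for _ in range(60):
        volume = conn.block_storage.get_volume(volume.id)
        if volume.status in ("available", "error"):
            break
        time.sleep(2)

    print(f"Volume {volume.id} ended in status {volume.status}")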

5.4 The Role of Storage in HA
-----------------------------

In the sections above, we described the data and control path requirements and example implementations for delivering highly available storage infrastructure. In summary:

* Most modern storage infrastructure implementations are inherently highly available. Exceptions certainly apply; for example, simply using LVM for storage presentation at each server does not satisfy HA requirements. However, modern storage systems such as Ceph, ScaleIO, XIV, VNX, and many others with OpenStack integrations certainly do have such HA capabilities.
* This is delivered predominantly through network-accessible shared storage systems, in tightly coupled configurations such as clustered hosts, or in loosely coupled configurations such as global object stores.

Storage is an integral part of HA delivery today for applications, including VNFs. This is examined below in terms of using storage as a key part of HA delivery, the possible scope and limitations of that delivery, and example implementations, for both block and object storage infrastructures.

5.4.1 VNF, VNFC, and VM HA in a Block Storage HA Context
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Several scenarios were described in another section with regard to managing HA at the VNFC level, with variants of recovery based on either VIM- or VNFM-based reporting/detection/recovery mechanisms. In a block storage context, these distinctions are abstract and largely irrelevant: HA is delivered via a logical block device (sometimes called a Logical Unit, or LUN), or in some cases to a VM, regardless of whether the workload is intended to be HA. VMs and logical block devices are the units of currency.

.. figure:: StorageImages/HostStorageCluster.png
    :alt: Host Storage Cluster
    :figclass: align-center

    Figure 30: Typical HA Cluster With Shared Storage

In Figure 30, several hosts share access, via an IP network or via Fibre Channel, to a common set of logical storage devices. In an ESX cluster implementation, these hosts all access all devices, with coordination provided by the SCSI Reservation mechanism. In the particular ESX case, the logical storage devices provided by the storage service actually aggregate the volumes (VMDKs) utilized by VMs; as a result, multiple-host access to the same storage service logical device is dynamic. The vSphere management layer provides the host cluster management.

In other cases, such as KVM, cluster management is not formally required, per se, because each logical block device presented by the storage service is uniquely allocated to one particular VM, which can only execute on a single host at a time. In this case, any host that can access the same storage service is potentially part of the "cluster". While *potential* access from another host to the same logical block device is necessary, the actual connectivity is restricted to one host at a time. This is a loosely coupled cluster implementation, rather than the tightly coupled cluster implementation of ESX.

So, if a single VNF is implemented as a single VM, HA is provided by allowing that VM to execute on a different host, with access to the same logical block device, and hence the same persistent data, located on the storage service. This also applies to multiple VNFs implemented within a single VM, though it impacts all of those VNFs together. If a single VNF is implemented across multiple VMs as multiple VNFCs, then all of the VMs that comprise the VNF may need to be protected in a consistent fashion; the storage service is not aware of the distinction from the previous example. However, a higher-level implementation, such as an HA Manager (perhaps implemented in a VNFM), may monitor and restart a collection of VMs on alternate hosts.
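
As an illustration of that restart pattern, the following is a minimal sketch of an HA-manager-style loop for a KVM/libvirt environment in which both hosts can reach the same shared logical block device and the VM's domain definition is available to the standby host. The host URIs, VM name, and XML file are hypothetical, and a production HA manager (Pacemaker, or VNFM logic) would also handle fencing, split-brain avoidance, and state consistency, which this sketch deliberately omits.

.. code-block:: python

    """Illustrative sketch of a shared-storage VM restart (KVM/libvirt).

    Assumptions not taken from the source text: both hosts see the same shared
    block device backing the VM, the domain XML is available to the standby
    host, and fencing of the failed host is handled elsewhere.
    """

    import time

    import libvirt  # libvirt-python bindings

    ACTIVE_URI = "qemu+ssh://host-a/system"    # hypothetical active host
    STANDBY_URI = "qemu+ssh://host-b/system"   # hypothetical standby host
    VM_NAME = "vnf-vm-01"                      # hypothetical VM name
    DOMAIN_XML = open("vnf-vm-01.xml").read()  # same definition, shared storage


    def vm_is_running(uri: str, name: str) -> bool:
        try:
            conn = libvirt.open(uri)
            dom = conn.lookupByName(name)
            return bool(dom.isActive())
        except libvirt.libvirtError:
            return False  # host unreachable or VM undefined: treat as failed


    while True:
        if not vm_is_running(ACTIVE_URI, VM_NAME):
            # Restart the VM on the standby host; its persistent data is intact
            # because it lives on the shared storage service, not the failed host.
            standby = libvirt.open(STANDBY_URI)
            dom = standby.defineXML(DOMAIN_XML)
            dom.create()
            print(f"{VM_NAME} restarted on {STANDBY_URI}")
            break
        time.sleep(5)  # the detection interval dominates recovery time

The detection interval and VM boot time are why this style of recovery is measured in tens of seconds, as discussed next.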

In an ESX environment, VM restarts are most expeditiously handled by using vSphere-level HA mechanisms within an HA cluster, for individual VMs or collections of VMs. In KVM environments, a separate HA monitoring service, such as Pacemaker, can be used to monitor individual VMs, or entire multi-VM applications, and provide restart capabilities on separately configured hosts that also have access to the same logical storage devices.

VM restart times, however, are measured in tens of seconds. This may sometimes meet the SAL-3 recovery requirements for General Consumer, Public, and ISP Traffic, but will never meet the 5-6 seconds required for SAL-1 Network Operator Control and Emergency Services. For that, additional capabilities are necessary. In order to meet SAL-1 restart times, it is necessary to have:

1. A hot spare VM already up and running in an active/passive configuration
2. Little to no state update required for the passive VM to take over

Having a spare VM up and running is easy enough, but putting that VM in an appropriate state to take over execution is the difficult part. In shared storage implementations for Fault Tolerance, which can achieve SAL-1 requirements, the VMs share access to the same storage device, and a wrapper function is used to update internal memory state for every interaction with the active VM. This may be done in one of two ways, as illustrated in Figure 31.

In the first way, the hypervisor sends all interface interactions to the passive as well as the active VM. The interaction is handled completely by hypervisor-to-hypervisor wrappers, represented by the purple box encapsulating the VM in Figure 31, and is completely transparent to the VM. This is available with the vSphere Fault Tolerance option, but not with KVM at this time.

.. figure:: StorageImages/FTCluster.png
    :alt: FT host and storage cluster
    :figclass: align-center

    Figure 31: A Fault Tolerant Host/Storage Configuration

In the second way, a VM-level wrapper is used to capture checkpoints of state from the active VM and transfer them to the passive VM, similarly represented by the purple box encapsulating the VM in Figure 31. Various levels of application-specific integration are required for this wrapper to capture and transfer checkpoints of state, depending on the level of state consistency required. OpenSAF is an example of an application wrapper that can be used for this purpose. Both techniques have significant network bandwidth requirements and may have certain limitations and requirements for implementation. In both cases, the active and passive VMs share the same storage infrastructure, although the OpenSAF implementation may also utilize separate storage infrastructure (not shown in Figure 31).

Looking forward to the long term, both of these approaches may be made obsolete. As soon as 2016, PCIe fabrics will start to become available that enable shared NVMe-based storage systems. While these storage systems may be used with traditional protocols like SCSI, they will also be usable by true NVMe-oriented applications whose memory state is persisted, and can be shared, in an active/passive mode across hosts. The HA mechanisms here are yet to be defined, but they will be far superior to either of the mechanisms described above. This is still in the future.

5.4.2 HA and Object Stores in Loosely Coupled Compute Environments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Whereas block storage services require tight coupling of hosts to storage services via SCSI protocols, the interaction of applications with HTTP-based object stores is very loosely coupled. This means that VMs can come and go, or be organized as an N+1 redundant deployment of VMs for a given VNF. Each individual object transaction constitutes the duration of the coupling, whereas with SCSI-based logical block devices the coupling is active for the duration of the VM's mounting of the device. The requirement for implementation here is that the state of a transaction being performed is made persistent to the object store by the VM, as the restartable checkpoint for high availability. Multiple VMs may access the object store more or less simultaneously, and each object transaction must be made idempotent by the application (see the sketch at the end of this section).

HA restart of a transaction in this environment depends on failure detection and on the transaction timeout values of the applications calling the VNFs. These may be rather high, and even unachievable for the SAL requirements. For example, while the General Consumer, Public, and ISP Traffic recovery time for SAL-3 is 20-25 seconds, common default timeouts for applications using HTTP are typically around 10 seconds or higher, and default browser timeouts are upwards of 120 seconds. This places a requirement on the load balancers to manage and restart transactions in a timeframe that may be a challenge even for meeting SAL-3 requirements.

Despite these performance issues, the use of object storage for highly available solutions in native cloud applications is very powerful. Object storage services are generally globally distributed and replicated using eventual consistency techniques, though transaction-level consistency can also be achieved in some cases (at the cost of performance). (For an interesting discussion of this, look up the CAP Theorem.)

5.5 Summary
-----------

This section addressed several points:

* Modern storage systems are inherently Highly Available when based on modern, reasonable implementations and deployments.
* Storage is typically a central component in offering highly available infrastructures, whether through block storage services for traditional applications or through object storage services that may be shared globally with eventual consistency.
* Cinder HA management capabilities are defined and available through the use of OpenStack deployment tools, making the entire storage control and data paths highly available.
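
The sketch below illustrates the idempotent, restartable checkpoint pattern described in this subsection: transaction state is written to a deterministically named object, so any VM in the N+1 pool can safely retry the write after a timeout or a failed peer. The object store URL, container name, and auth token are placeholders assuming a Swift-style HTTP API; a real application would normally use a Swift or S3 client library rather than raw HTTP.

.. code-block:: python

    """Illustrative sketch: persist transaction state to an object store as an
    idempotent, restartable checkpoint. Endpoint, container, and token are
    hypothetical placeholders for a Swift-style API."""

    import json

    import requests

    OBJECT_STORE = "https://objects.example.net/v1/AUTH_demo"  # hypothetical
    CONTAINER = "vnf-checkpoints"                               # hypothetical
    TOKEN = "example-token"                                     # hypothetical


    def checkpoint(txn_id: str, state: dict, timeout: float = 5.0) -> None:
        """PUT the transaction state under a name derived from the transaction ID.

        Because the object name is deterministic and PUT replaces the whole
        object, retrying after a timeout or a failed VM is safe (idempotent).
        """
        url = f"{OBJECT_STORE}/{CONTAINER}/txn-{txn_id}"
        resp = requests.put(
            url,
            data=json.dumps(state),
            headers={"X-Auth-Token": TOKEN, "Content-Type": "application/json"},
            timeout=timeout,  # keep well under the caller's recovery budget
        )
        resp.raise_for_status()


    # Any VM in the N+1 pool can re-run the same checkpoint after a failure:
    checkpoint("000123", {"step": "charging-record-written", "attempt": 2})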