Added section and figure numbers.

Add line wrap Updated per review comments. Including new pictures. Change-Id: Ib7a23111a6daba300b6e6afca47a44c99c8dbf65
author: Edgar StPierre <edgar.stpierre@emc.com> 2015-12-15 11:34:30 -0500
committer: Edgar StPierre <edgar.stpierre@emc.com> 2016-01-19 21:23:14 -0500
commit: b3fcb5c186d0a008d70197bb77f3ea370f78807a (patch)
tree: b4e3a87c29f25fdf2255304c1e61afb0d50f0c4e /Storage-HA-Scenarios.rst
parent: 2ad2498270014e24e7322e98f2a4947a2384e197 (diff)
1 files changed, 442 insertions, 0 deletions
diff --git a/Storage-HA-Scenarios.rst b/Storage-HA-Scenarios.rst
new file mode 100644
index 0000000..b8b37a3
--- /dev/null
+++ b/Storage-HA-Scenarios.rst
@@ -0,0 +1,442 @@
+Storage and High Availability Scenarios
+=======================================
+
+5.1 Elements of HA Storage Management and Delivery
+--------------------------------------------------
+
+Storage infrastructure, in any environment, can be broken down into two
+domains: Data Path and Control Path. Generally, High Availability of the
+storage infrastructure is measured by the occurence of Data
+Unavailability and Data Loss (DU/DL) events. While that meaning is
+obvious as it relates to the Data Path, it is also applicable to Control
+Path as well. The inability to attach a volume that has data to a host,
+for example, can be considered a Data Unavailability event. Likewise,
+the inability to create a volume to store data could be considered Data
+Loss since it may result in the inability to store critical data.
+
+Storage HA mechanisms are an integral part of most High Availability
+solutions today. In the first two sections below, we define the
+mechanisms of redundancy and protection required in the infrastructure
+for storage delivery in both the Data and Control Paths. Storage
+services that have these mechanisms can be used in HA environments that
+are based on a highly available storage infrastructure.
+
+In the third section below, we examine HA implementations that rely on
+highly available storage infrastructure. Note that the scope throughout this
+section is focused on local HA solutions. This does not address rapid remote
+Disaster Recovery scenarios that may be provided by storage, nor
+does it address metro active/active environments that implement stretched 
+clusters of hosts across multiple sites for workload migration and availability.
+
+
+5.2 Storage Failure & Recovery Scenarios: Storage Data Path
+-----------------------------------------------------------
+
+In the failure and recovery scenarios described below, a redundant
+network infrastructure provides HA through network-related device
+failures, while a variety of strategies are used to reduce or minimize
+DU/DL events based on storage system failures. This starts with redundant
+storage network paths, as shown in Figure 29.
+
+.. figure:: StorageImages/RedundantStoragePaths.png
+     :alt: HA Storage Infrastructure
+     :figclass: align-center
+     
+     Figure 29: Typical Highly Available Storage Infrastructure
+     
+Storage implementations vary tremendously, and the recovery mechanisms
+for each implementation will vary. These scenarios described below are
+limited to 1) high level descriptions of the most common implementations 
+since it is unpredictable as to
+which storage implementations may be used for NFVI; 2) HW- and
+SW-related failures (and recovery) of the storage data path, and not
+anything associated with user configuration and operational issues which
+typically create the most common storage failure scenarios; 3)
+non-LVM/DAS based storage implementations(managing failure and recovery
+in LVM-based storage for OpenStack is a very different scenario with
+less of a reliable track record); and 4) I will assume block storage
+only, and not object storage, which is often used for stateless
+applications (at a high level, object stores may include a
+subset of the block scenarios under the covers).
+
+To define the requirements for the data path, I will start at the
+compute node and work my way down the storage IO stack and touch on both
+HW and SW failure/recovery scenarios for HA along the way. I will use Figure 1 as a reference.
+
+1. Compute IO driver: Assuming iSCSI for connectivity between the
+compute and storage, an iSCSI initiator on the compute node maintains
+redundant connections to multiple iSCSI targets for the same storage
+service. These redundant connections may be aggregated for greater
+throughput, or run independently. This redundancy allows the iSCSI
+Initiator to handle failures in network connectivity from compute to
+storage infrastructure. (Fibre Channel works largely the same way, as do
+proprietary drivers that connect a host's IO stack to storage systems).
+
+2. Compute node network interface controller (NIC): This device may
+fail, and said failure reported via whatever means is in place for such
+reporting from the host.The redundant paths between iSCSI initiators and
+targets will allow connectivity from compute to storage to remain up,
+though operating at reduced capacity.
+
+3. Network Switch failure for storage network: Assuming there are
+redundant switches in place, and everything is properly configured so
+that two compute NICs go to two separate switches, which in turn go to
+two different storage controllers, then a switch may fail and the
+redundant paths between iSCSI initiators and targets allows connectivity
+from compute to storage to operational, though operating at reduced
+capacity.
+
+4. Storage system network interface failure: Assuming there are
+redundant storage system network interfaces (on separate storage
+controllers), then one may fail and the redundant paths between iSCSI
+initiators and targets allows connectivity from compute to storage to
+remain operational, though operating at reduced performance. The extent
+of the reduced performance is dependent upon the storage architecture.
+See 3.5 for more.
+
+5. Storage controller failure: A storage system can, at a very high
+level, be described as composed of network interfaces, one or more
+storage controllers that manage access to data, and a shared Data Path
+access to the HDD/SSD subsystem. The network interface failure is
+described in #4, and the HDD/SSD subsystem is described in #6. All
+modern storage architectures have either redundant or distributed
+storage controller architectures. In **dual storage controller
+architectures**, high availability is maintained through the ALUA
+protocol maintaining access to primary and secondary paths to iSCSI
+targets. Once a storage controller fails, the array operates in
+(potentially) degraded performance mode until the failed storage controller is
+replaced. The degree of reduced performance is dependent on the overall
+original load on the array. Dual storage controller arrays also remain at risk
+of a Data Unavailability event if the second storage controller should fail.
+This is rare, but should be accounted for in planning support and
+maintenance contracts.
+
+**Distributed storage controller architectures** are generally server-based,
+which may or may not operate on the compute servers in Converged
+Infrastructure environments. Hence the concept of “storage controller”
+is abstract in that it may involve a distribution of software components
+across multiple servers. Examples: Ceph and ScaleIO. In these environments, 
+the data may be stored
+redundantly, and metadata for accessing the data in these redundant
+locations is available for whichever compute node needs the data (with
+authorization, of course). Data may also be stored using erasure coding
+(EC) for greater efficiency. The loss of a storage controller in this
+context leads to a discussion of impact caused by loss of a server in
+this distributed storage controller architecture. In the event of such a loss,
+if data is held in duplicate or triplicate on other servers, then access
+is simply redirected to maintain data availability. In the case of
+EC-based protection, then the data is simply re-built on the fly. The
+performance and increased risk impact in this case is dependent on the
+time required to rebalance storage distribution across other servers in
+the environment. Depending on configuration and implementation, it could
+impact storage access performance to VNFs as well.
+
+6. HDD/SSD subsystem: This subsystem contains any RAID controllers,
+spinning hard disk drives, and Solid State Drives. The failure of a RAID
+controller is equivalent to failure of a storage controller, as
+described in 5 above. The failure of one or more storage devices is
+protected by either RAID parity-based protection, Erasure Coding
+protection, or duplicate/triplicate storage of the data. RAID and
+Erasure Coding are typically more efficient in terms of space
+efficiency, but duplicate/triplicate provides better performance. This
+tradeoff is a common point of contention among implementations, and this
+will not go into greater detail than to assume that failed devices do
+not cause Data Loss events due to these protection algorithms. Multiple
+device failures can potentially cause Data Loss events, and the risk of
+each method must be taken into consideration for the HA requirements of
+the desired deployment.
+
+5.3 Storage Failure & Recovery Scenarios: Storage Control Path
+--------------------------------------------------------------
+
+As it relates to an NFVI environment, as proposed by OPNFV, there are
+two parts to the storage control path.
+
+* The storage system-specific control path to the storage controller 
+
+* The OpenStack-specific cloud management framework for managing different
+storage elements
+
+
+5.3.1 Storage System Control Paths 
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+High Availability of a storage controller is storage
+system-specific. Breaking it down to implementation variants is the best
+approach. However, both variants assume an IP-based management API in
+order to leverage network redundancy mechanisms for ubiquitous
+management access.
+
+An appliance style storage array with dual storage controllers must implement IP
+address failover for the management API's IP endpoint in either an
+active/active or active/passive configuration. Likewise, a storage array
+with >2 storage controllers would bring up a management endpoint on
+another storage controller in such an event. Cluster-style IP address load
+balancing is also a viable implementation in these scenarios.
+
+In the case of distributed storage controller architectures, the storage system
+provides redundant storage controller interfaces. E.g., Ceph's RADOS provides
+redundant paths to access an OSD for volume creation or access. In EMC's
+ScaleIO, there are redundant MetaData Managers for managing volume
+creation and access. In the case of the former, the access is via
+proprietary protocol, in the case of the latter, it is via HTTP-based
+REST API. Other storage implementations may also provide alternative
+methods, but any enterprise-class storage system will have built-in HA
+for management API access.
+
+Finally, note that single server-based storage solutions, such as LVM,
+do not have HA solutions for control paths. If the server is failed, the
+management of that server's storage is not available.
+
+5.3.2 OpenStack Controller Management 
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+OpenStack cloud management is comprised of a number of different
+function-specific management modules such as Keystone for Identity and
+Access management (IAM), Nova for compute management, Cinder for block
+storage management, Swift for Object Storage delivery, Neutron for
+Network management, and Glance as an image repository. In smaller
+single-cloud environments, these management systems are managed in
+concert for High Availability; in larger multi-cloud environments, the
+Keystone IAM may logically stand alone in its own HA delivery across the
+multiple clouds, as might Swift as a common Object Store. Nova, Cinder,
+and Glance may have separate scopes of management, but they are more
+typically managed together as a logical cloud deployment.
+
+It is the OpenStack deployment mechanisms that are responsible for HA
+deployment of these HA management infrastructures. These tools, such as
+Fuel, RDO, and others, have matured to include highly available
+implementations for the database, the API, and each of the manager
+modules associated with the scope of cloud management domains.
+
+There are many interdependencies among these modules that impact Cinder high availability. 
+For example: 
+
+* Cinder is implemented as an Active/Standby failover implementation since it 
+requires a single point of control at one time for the Cinder manager/driver implementation.
+The Cinder manager/driver is deployed on two of the three OpenStack controller nodes, and
+one is made active while the other is passive. This may be improved to active/active 
+in a future release.
+
+* A highly available database implementation must be delivered
+using something like  MySQL/Galera replication across the 3 OpenStack controller
+nodes. Cinder requires an HA database in order for it to be HA.
+
+* A redundant RabbitMQ messaging implementation across the same
+three OpenStack controller nodes. Likewise, Cinder requires an HA messaging system.
+
+* A redundant OpenStack API to ensure Cinder requests can be delivered.
+
+* An HA Cluster Manager, like PaceMaker for monitoring each of the
+deployed manager elements on the OpenStack controllers, with restart capability. 
+Keepalived is an alternative implementation for monitoring processes and restarting on
+alternate OpenStack controller nodes. While statistics are lacking, it is generally 
+believed that the PaceMaker implementation is more frequently implemented
+in HA environments.
+
+
+For more information on OpenStack and Cinder HA, see http://docs.openstack.org/ha-guide 
+for current thinking.
+
+While the specific combinations of management functions in these
+redundant OpenStack controllers may vary with the specific small/large environment
+deployment requirements, the basic implementation of three OpenStack controller
+redundancy remains relatively common. In these implementations, the
+highly available OpenStack controller environment provides HA access to
+the highly available storage controllers via the highly available IP
+network.
+
+
+5.4 The Role of Storage in HA 
+-----------------------------
+
+In the sections above, we describe data and control path requirements
+and example implementations for delivery of highly available storage
+infrastructure. In summary:
+
+* Most modern storage infrastructure implementations are inherently
+highly available. Exceptions certainly apply; e.g., simply using LVM for
+storage presentation at each server does not satisfy HA requirements.
+However, modern storage systems such as Ceph, ScaleIO, XIV, VNX, and
+many others with OpenStack integrations, certainly do have such HA
+capabilities.
+
+* This is predominantly through network-accessible shared storage
+systems in tightly coupled configurations such as clustered hosts, or in
+loosely coupled configurations such as with global object stores.
+
+
+Storage is an integral part of HA delivery today for applications,
+including VNFs. This is examined below in terms of using storage as a
+key part of HA delivery, the possible scope and limitations of that
+delivery, and example implementations for delivery of such service. We
+will examine this for both block and object storage infrastructures below.
+
+5.4.1 VNF, VNFC, and VM HA in a Block Storage HA Context
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Several scenarios were described in another section with regard to
+managing HA at the VNFC level, with variants of recovery based on either
+VIM- or VNFM-based reporting/detection/recovery mechanisms. In a block
+storage environment, these differentiations are abstract and
+meaningless, regardless of whether it is or is not intended to be HA.
+
+In a block storage context, HA is delivered via a logical block device
+(sometimes called a Logical Unit, or LUN), or in some cases, to a VM.
+VM and logical block devices are the units of currency.
+
+.. figure:: StorageImages/HostStorageCluster.png
+     :alt: Host Storage Cluster
+     :figclass: align-center
+     
+     Figure 30: Typical HA Cluster With Shared Storage
+     
+In Figure 30, several hosts all share access, via an IP network
+or via Fibre Channel, to a common set of logical storage devices. In an
+ESX cluster implementation, these hosts all access all devices with
+coordination provided with the SCSI Reservation mechanism. In the
+particular ESX case, the logical storage devices provided by the storage
+service actually aggregate volumes (VMDKs) utilized by VMs. As a result,
+multiple host access to the same storage service logical device is
+dynamic. The vSphere management layer provides for host cluster
+management.
+
+In other cases, such as for KVM, cluster management is not formally
+required, per se, because each logical block device presented by the
+storage service is uniquely allocated for one particular VM which can
+only execute on a single host at a time. In this case, any host that can
+access the same storage service is potentially a part of the "cluster".
+While *potential* access from another host to the same logical block
+device is necessary, the actual connectivity is restricted to one host
+at a time. This is more of a loosely coupled cluster implementation,
+rather than the tightly coupled cluster implementation of ESX.
+
+So, if a single VNF is implemented as a single VM, then HA is provided
+by allowing that VM to execute on a different host, with access to the
+same logical block device and persistent data for that VM, located on
+the storage service. This also applies to multiple VNFs implemented
+within a single VM, though it impacts all VNFs together.
+
+If a single VNF is implemented across multiple VMs as multiple VNFCs, 
+then all of the VMs that comprise the VNF may need to be protected in a consistent 
+fashion.  The storage service is not aware of the
+distinction from the previous example. However, a higher level
+implementation, such as an HA Manager (perhaps implemented in a VNFM)
+may monitor and restart a collection of VMs on alternate hosts. In an ESX environment,
+VM restarts are most expeditiously handled by using vSphere-level HA
+mechanisms within an HA cluster for individual or collections of VMs. 
+In KVM environments, a separate HA
+monitoring service, such as Pacemaker, can be used to monitor individual
+VMs, or entire multi-VM applications, and provide restart capabilities
+on separately configured hosts that also have access to the same logical
+storage devices.
+
+VM restart times, however, are measured in 10's of seconds. This may
+sometimes meet the SAL-3 recovery requirements for General Consumer,
+Public, and ISP Traffic, but will  never meet the 5-6 seconds required
+for SAL-1 Network Operator Control and Emergency Services. For this,
+additional capabilities are necessary.
+
+In order to meet SAL-1 restart times, it is necessary to have: 1. A hot
+spare VM already up and running in an active/passive configuration 2.
+Little-to-no-state update requirements for the passive VM to takeover.
+
+Having a spare VM up and running is easy enough, but putting that VM in
+an appropriate state to take over execution is the difficult part. In shared storage
+implementations for Fault Tolerance, which can achieve SAL-1 requirements, 
+the VMs share access to the same storage device, and another wrapper function
+is used to update internal memory state for every interaction to the active
+VM. 
+
+This may be done in one of two ways, as illustrated in Figure 31. In the first way,
+the hypervisor sends all interface interactions to the passive as well
+as the active VM. The interaction is handled completely by
+hypervisor-to-hypervisor wrappers, as represented by the purple box encapsulating 
+the VM in Figure 31, and is completely transparent to the VM.
+This is available with the vSphere Fault Tolerant option, but not with
+KVM at this time.
+
+.. figure:: StorageImages/FTCluster.png
+     :alt: FT host and storage cluster
+     :figclass: align-center
+     
+     Figure 31: A Fault Tolerant Host/Storage Configuration
+     
+In the second way, a VM-level wrapper is used to capture checkpoints of
+state from the active VM and transfers these to the passive VM, similarly represented 
+as the purple box encapsulating the VM in Figure 3. There
+are various levels of application-specific integration required for this
+wrapper to capture and transfer checkpoints of state, depending on the
+level of state consistency required. OpenSAF is an example of an
+application wrapper that can be used for this purpose. Both techniques
+have significant network bandwidth requirements and may have certain
+limitations and requirements for implementation.
+
+In both cases, the active and passive VMs share the same storage infrastructure. 
+Although the OpenSAF implementation may also utilize separate storage infrastructure 
+as well (not shown in Figure 3).
+
+Looking forward to the long term, both of these may be made obsolete. As soon as 2016,
+PCIe fabrics will start to be available that enable shared NVMe-based
+storage systems. While these storage systems may be used with
+traditional protocols like SCSI, they will also be usable with true
+NVMe-oriented applications whose memory state are persisted, and can be
+shared, in an active/passive mode across hosts. The HA mechanisms here
+are yet to be defined, but will be far superior to either of the
+mechanisms described above. This is still a future.
+
+
+5.4.2 HA and Object stores in loosely coupled compute environments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Whereas block storage services require tight coupling of hosts to
+storage services via SCSI protocols, the interaction of applications
+with HTTP-based object stores utilizes a very loosely coupled
+relationship. This means that VMs can come and go, or be organized as an
+N+1 redundant deployment of VMs for a given VNF. Each individual object
+transaction constitutes the duration of the coupling, whereas with
+SCSI-based logical block devices, the coupling is active for the
+duration of the VM's mounting of the device.
+
+However, the requirement for implementation here is that the state of a
+transaction being performed is made persistent to the object store by
+the VM, as the restartable checkpoint for high availability. Multiple
+VMs may access the object store somewhat simultaneously, and it is
+required that each object transaction is made idempotent by the
+application.
+
+HA restart of a transaction in this environment is dependent on failure
+detection and transaction timeout values for applications calling the
+VNFs. These may be rather high and even unachievable for the SAL
+requirements. For example, while the General Consumer, Public, and ISP
+Traffic recovery time for SAL-3 is 20-25 seconds, default browser
+timeouts are upwards of 120 seconds. Common default timeouts for
+applications using HTTP are typically around 10 seconds or higher
+(browsers are upward of 120 seconds), so this puts a requirement on the
+load balancers to manage and restart transactions in a timeframe that
+may be a challenge to meeting even SAL-3 requirements.
+
+Despite these issues of performance, the use of object storage for highly 
+available solutions in native cloud applications is very powerful. Object
+storage services are generally globally distributed and replicated using 
+eventual consistency techniques, though transaction-level consistency can
+also be achieved in some cases (at the cost of performance). (For an interesting
+discussion of this, lookup the CAP Theorem.)
+
+
+5.5 Summary
+-----------
+
+This section addressed several points:
+
+* Modern storage systems are inherently Highly Available based on modern and reasonable
+implementations and deployments.
+
+* Storage is typically a central component in offering highly available infrastructures, 
+whether for block storage services for traditional applications, or through object
+storage services that may be shared globally with eventual consistency.
+
+* Cinder HA management capabilities are defined and available through the use of 
+OpenStack deployment tools, making the entire storage control and data paths 
+highly available.
+
author	Edgar StPierre <edgar.stpierre@emc.com>	2015-12-15 11:34:30 -0500
committer	Edgar StPierre <edgar.stpierre@emc.com>	2016-01-19 21:23:14 -0500
commit	b3fcb5c186d0a008d70197bb77f3ea370f78807a (patch)
tree	b4e3a87c29f25fdf2255304c1e61afb0d50f0c4e /Storage-HA-Scenarios.rst
parent	2ad2498270014e24e7322e98f2a4947a2384e197 (diff)