Add doc and update output according OpenStack spec's update brahmaputra.1.0

Add admin-user-guide and configuration-guide for multisite, and update the output requirement to sync. the change in OpenStack spec, which was reviewed in OpenStack community

Change-Id: Icff3dda7e204404f8003d6e06cde45151eb03446
Signed-off-by: Chaoyi Huang <joehuang@huawei.com>
author: Chaoyi Huang <joehuang@huawei.com> 2015-12-22 13:24:10 +0800
committer: Chaoyi Huang <joehuang@huawei.com> 2016-01-04 11:01:31 +0800
commit: 8e3d1f151aa2ba629c9ed4ad61862ac147fff3ec (patch)
tree: cc042c0c9934f13c8a6d3f868421ec38550111d8 /docs/requirements
parent: 1a148540057b9bcbc32865217319021ba09ae07b (diff)
3 files changed, 777 insertions, 0 deletions
diff --git a/docs/requirements/VNF_high_availability_across_VIM.rst b/docs/requirements/VNF_high_availability_across_VIM.rst
new file mode 100644
index 0000000..1a7d41b
--- /dev/null
+++ b/docs/requirements/VNF_high_availability_across_VIM.rst
@@ -0,0 +1,160 @@
+This work is licensed under a Creative Commons Attribution 3.0 Unported License.
+http://creativecommons.org/licenses/by/3.0/legalcode
+
+
+=======================================
+VNF high availability across VIM
+=======================================
+
+Problem description
+===================
+
+Abstract
+------------
+
+A VNF (telecom application) should be able to realize high availability
+deployment across OpenStack instances.
+
+Description
+------------
+A VNF (telecom application running over cloud) may (already) be designed as
+Active-Standby/Active-Active/N-Way to achieve high availability.
+
+With a telecoms focus, this generally refers both to availability of service
+(i.e. the ability to make new calls), and to maintenance of ongoing control
+plane state and active media processing (i.e. “keeping up” existing calls).
+
+Traditionally telecoms systems are designed to maintain state and calls across
+pretty much the full range of single-point failures.  As listed this includes
+power supply, hard drive, physical server or network switch, but also covers
+software failure, and maintenance operations such as software upgrade.
+
+Providing this support typically requires state replication between
+application instances (directly or via replicated database services, or via
+privately designed message formats).  It may also require special case handling
+of media endpoints, to allow transfer of media in short time scales (<1s)
+without requiring end-to-end resignalling (e.g. RTP redirection via IP / MAC
+address transfers, cf. VRRP).
+
+With a migration to NFV, a commonly expressed desire by carriers is to provide
+the same resilience to any single point(s) of failure in the cloud
+infrastructure.
+
+This could be done by making each cloud instance fully HA (a non-trivial task to
+do right and to prove it has been done right), but the preferred approach
+appears to be to accept the currently limited availability of a given cloud
+instance (no desire to radically rework this for telecoms), and instead to
+provide solution availability by spreading function across multiple cloud
+instances (i.e. the same approach used today to deal with hardware and software
+failures).
+
+A further advantage of this approach is that it provides a good basis for
+seamless upgrade of infrastructure software revision, where you can spin up an
+additional up-level cloud, gradually transfer over resources / app instances
+from one of your other clouds, before finally turning down the old cloud
+instance when no longer required.
+
+If fast media / control failover is still required (which many/most carriers
+still seem to believe it is) there are some interesting/hard requirements on the
+networking between cloud instances. To help with this, many people appear
+willing to provide multiple “independent” cloud instances in a single geographic
+site, with special networking between clouds in that physical site.
+"Independent" is in quotes because some coordination between cloud instances is
+obviously required, but this has to be implemented in a fashion which reduces
+the potential for correlated failure to very low levels (at least as low as the
+required overall application availability).
+
+Analysis of requirements to OpenStack
+=====================================
+The VNF often has different networking planes for different purposes:
+
+- external network plane: used for communication with other VNFs
+- component inter-communication plane: one VNF often consists of several
+  components, and this plane is designed for communication between those
+  components
+- backup plane: this plane is used for the heartbeat or state replication
+  between the component's active/standby, active/active or N-way cluster
+- management plane: this plane is mainly for management purposes
+
+Generally these planes are separated from each other. For legacy telecom
+applications, each internal plane will have its own fixed or flexible IP
+addressing plan.
+
+To make the VNF work in HA mode across different OpenStack instances in
+one site (but not limited to one site), at least the backup plane must be
+supported across different OpenStack instances:
+
+1) Overlay L2 networking or shared L2 provider networks as the backup plane for
+heartbeat or state replication. Overlay L2 networking is preferred, because:
+a. Legacy compatibility: some telecom apps have a built-in internal L2
+network, so providing an L2 network makes it easier to move them to VNFs.
+b. IP overlapping: multiple VNFs may have overlapping IP addresses
+for cross OpenStack instance networking.
+Therefore, overlay L2 networking across Neutron is required in OpenStack.
+
+2) L3 networking across OpenStack instances for heartbeat or state replication.
+For L3 networking, we can leverage the floating IP provided in current Neutron,
+so there is no new feature requirement for OpenStack.
+
+3) The IP address used by a VNF to connect with other VNFs should be able to
+float across OpenStack instances. For example, if the master fails, the IP
+address should be used by the standby which is running in another OpenStack
+instance. Methods such as VRRP/GARP can help with the movement of the
+external IP, so no new feature needs to be added to OpenStack.
+
+
+Prototype
+-----------
+    None.
+
+Proposed solution
+-----------------
+
+    From the requirements perspective, it's up to the application to decide
+whether to use L2 or L3 networking across Neutron.
+
+    For Neutron, an L2 network consists of lots of ports. To make cross
+Neutron L2 networking workable, we need some fake remote ports in the local
+Neutron to represent VMs in the remote site (remote OpenStack).
+
+    The fake remote port will reside on some VTEP (for VxLAN). The tunneling
+IP address of the VTEP should be an attribute of the fake remote port, so that
+the local port can forward packets to the correct tunneling endpoint.
+
+    The idea is to add one more ML2 mechanism driver to capture the fake remote
+port CRUD (creation, retrieval, update, delete).
+
+    When a fake remote port is added/updated/deleted, the ML2 mechanism
+driver for these fake ports will activate L2 population, so that the VTEP
+tunneling endpoint information can be understood by other local ports.
+
+    It's also required to be able to query the port's VTEP tunneling endpoint
+information through the Neutron API, in order to use this information to create
+the fake remote port in another Neutron.
+
+    In the past, the port's VTEP IP address was the IP of the host where the VM
+resides. But this BP https://review.openstack.org/#/c/215409/ will free the port
+from binding to the host IP as the tunneling endpoint; you can even specify an
+L2GW IP address as the tunneling endpoint.
+
+    Therefore a new BP will be registered to process the fake remote port, in
+order to make cross Neutron L2 networking feasible. An RFE is registered first:
+https://bugs.launchpad.net/neutron/+bug/1484005
+
+
+Gaps
+====
+    1) fake remote port for cross Neutron L2 networking
+
+
+**NAME-THE-MODULE issues:**
+
+* Neutron
+
+Affected By
+-----------
+    OPNFV multisite cloud.
+
+References
+==========
+
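The fake remote port idea in the "Proposed solution" section above can be illustrated with a minimal Python sketch. It is not Neutron or ML2 code: `FakeRemotePort`, `build_fdb_entries` and `lookup_vtep` are hypothetical names, and the addresses are example values. It only shows the data the text asks for, namely a port record carrying the remote VTEP tunneling IP, grouped per network so a local port can find the right tunnel endpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeRemotePort:
    """Hypothetical stand-in for a VM port that lives in a remote Neutron."""
    network_id: str
    mac_address: str
    ip_address: str
    vtep_ip: str  # tunneling endpoint (host or L2GW) in the remote site


def build_fdb_entries(ports):
    """Group remote ports by network and VTEP, similar in spirit to the
    forwarding information that L2 population hands to local ports."""
    fdb = {}
    for port in ports:
        by_vtep = fdb.setdefault(port.network_id, {})
        by_vtep.setdefault(port.vtep_ip, []).append(
            (port.mac_address, port.ip_address))
    return fdb


def lookup_vtep(fdb, network_id, mac_address):
    """Return the VTEP IP a local port should tunnel to, or None."""
    for vtep_ip, entries in fdb.get(network_id, {}).items():
        if any(mac == mac_address for mac, _ in entries):
            return vtep_ip
    return None


# Example values only: two remote VMs on the backup plane, behind two VTEPs.
ports = [
    FakeRemotePort("net-backup", "fa:16:3e:00:00:01", "10.0.0.11", "192.0.2.10"),
    FakeRemotePort("net-backup", "fa:16:3e:00:00:02", "10.0.0.12", "192.0.2.20"),
]
fdb = build_fdb_entries(ports)
```

In a real deployment the VTEP IP would be obtained by querying the remote Neutron, which is the gap the document registers against Neutron.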
diff --git a/docs/requirements/multisite-identity-service-management.rst b/docs/requirements/multisite-identity-service-management.rst
new file mode 100644
index 0000000..b411c28
--- /dev/null
+++ b/docs/requirements/multisite-identity-service-management.rst
@@ -0,0 +1,376 @@
+This work is licensed under a Creative Commons Attribution 3.0 Unported
+License.
+http://creativecommons.org/licenses/by/3.0/legalcode
+
+
+=======================================
+ Multisite identity service management
+=======================================
+
+Glossary
+========
+
+There are 3 types of tokens supported by OpenStack Keystone:
+    **UUID**
+
+    **PKI/PKIZ**
+
+    **FERNET**
+
+Please refer to the reference section for these token formats, benchmarks and
+comparison.
+
+
+Problem description
+===================
+
+Abstract
+------------
+
+A user should, using a single authentication point, be able to manage virtual
+resources spread over multiple OpenStack regions.
+
+Description
+------------
+
+- User/Group Management: e.g. use of LDAP, should OPNFV be agnostic to this?
+  Reusing the LDAP infrastructure that is mature and has features lacking in
+Keystone (e.g. password aging and policies). Keystone can use an external
+system to do the user authentication, and user/group management could be the
+job of the external system, so that Keystone can reuse/co-work with enterprise
+identity management. Keystone's main role in OpenStack is to provide service
+(Nova, Cinder...) aware tokens and do the authorization. You can refer to
+this post https://blog-nkinder.rhcloud.com/?p=130. Therefore, LDAP itself
+should be a topic out of our scope.
+
+- Role assignment: In case of federation (and perhaps other solutions) it is not
+  feasible/scalable to do role assignment to users. Role assignment to groups
+  is better. Role assignment will usually be done based on groups. Keystone
+  supports this.
+
+- Amount of inter region traffic: should be kept as little as possible,
+  consider CERN's Ceilometer issue as described in
+http://openstack-in-production.blogspot.se/2014/03/cern-cloud-architecture-update-for.html
+
+Requirement analysis
+===========================
+
+- A user is provided with a single authentication URL to the Identity
+  (Keystone) service. Using that URL, the user authenticates with Keystone by
+requesting a token typically using username/password credentials. The Keystone
+server validates the credentials, possibly with an external LDAP/AD server and
+returns a token to the user. With token type UUID/Fernet, the user requests the
+service catalog. With PKI tokens the service catalog is included in the token.
+The user sends a request to a service in a selected region including the token.
+Now the service in the region, say Nova, needs to validate the token. Nova uses
+its configured Keystone endpoint and service credentials to request token
+validation from Keystone. The Keystone token validation should preferably be
+done in the same region as Nova itself. Now Keystone has to validate the token
+that also (always?) includes a project ID in order to make sure the user is
+authorized to use Nova. The project ID is stored in the assignment backend -
+tables in the Keystone SQL database. For this project ID validation the
+assignment backend database needs to have the same content as the Keystone
+instance that issued the token.
+
+- So either 1) services in all regions are configured with a central Keystone
+  endpoint through which all token validations will happen, or 2) the Keystone
+assignment backend database is replicated and thus available to Keystone
+instances locally in each region.
+
+  Alt 2) is obviously the only scalable solution that produces no inter region
+traffic for normal service usage. Only when data in the assignment backend is
+changed, replication traffic will be sent between regions. Assignment data
+includes domains, projects, roles and role assignments.
+
+Keystone deployment:
+
+    - Centralized: a single Keystone service installed in some location, either
+      in a "master" region or totally external as a service to OpenStack
+      regions.
+    - Distributed: a Keystone service is deployed in each region
+
+Token types:
+
+    - UUID: tokens are persistently stored and create a lot of database
+      traffic; the persistence of tokens is for the revoke purpose. UUID tokens
+are validated online by Keystone: each API call to a service will request token
+validation from Keystone. Keystone can become a bottleneck in a large system
+due to this. UUID token type is not suitable for use in multi region clouds at
+all, no matter the solution used for the Keystone database replication (or
+not). UUID tokens have a fixed size.
+
+    - PKI: tokens are non persistent cryptographic based tokens and offline
+      validated (not by the Keystone service) by Keystone middleware
+which is part of other services such as Nova. Since PKI tokens include endpoints
+for all services in all regions, the token size can become big. There are
+several ways to reduce the token size: no catalog policy, endpoint filter to
+bind a project to limited endpoints, and compressed PKI token - PKIZ, but the
+size of the token is still unpredictable, making it difficult to manage. If no
+catalog is applied, the user can access all regions, which is not allowed in
+some scenarios.
+
+    - Fernet: tokens are non persistent cryptographic based tokens and online
+      validated by the Keystone service. Fernet tokens are more lightweight
+than PKI tokens and have a fixed size.
+
+    PKI tokens (offline validated) are needed with a centralized Keystone to
+avoid inter region traffic. PKI tokens do produce Keystone traffic for
+revocation lists.
+
+    Fernet tokens require Keystone deployed in a distributed manner, again to
+avoid inter region traffic.
+
+    Cryptographic tokens bring new (compared to UUID tokens) issues/use-cases
+like key rotation, certificate revocation. Key management is out of scope of
+this use case.
+
+Database deployment:
+
+    Database replication:
+    - Master/slave asynchronous: supported by the database server itself
+(mysql/mariadb etc), works over WAN, it's more scalable
+    - Multi master synchronous: Galera (others like percona), not so scalable,
+for multi-master writing, and needs more parameter tuning for WAN latency.
+    - Symmetrical/asymmetrical: data replicated to all regions or a subset,
+in the latter case it means some regions need to access Keystone in another
+region.
+
+    Database server sharing:
+    In an OpenStack controller normally many databases from different
+services are provided from the same database server instance. For HA reasons,
+the database server is usually synchronously replicated to a few other nodes
+(controllers) to form a cluster. Note that _all_ databases are replicated in
+this case, for example when Galera sync repl is used.
+
+    Only the Keystone database can be replicated to other sites. Replicating
+databases for other services will cause those services to get out of sync and
+malfunction.
+
+    Since only the Keystone database is to be sync replicated to another
+region/site, it's better to deploy the Keystone database into its own
+database server with extra networking requirement, cluster or replication
+configuration. How to support this by installer is out of scope.
+
+    The database server can be shared when async master/slave repl is used, if
+global transaction identifiers (GTID) are enabled.
+
+
+Candidate solution analysis
+------------------------------------
+
+-  KeyStone service (Distributed) with Fernet token
+
+    Fernet token is a very new format, introduced only recently. The biggest
+gains of this token format are: 1) lightweight, the size is small enough to be
+carried in the API request, unlike the PKI token (as the sites increase, the
+endpoint list grows and the token becomes too long for the API request)
+2) no token persistence, which also means the DB does not change much and keeps
+a light weight data size (just project, user, domain, endpoint etc). The
+drawback of the Fernet token is that the token has to be validated by Keystone
+for each API request.
+
+    This allows the Keystone DB to work as a cluster across multiple sites (for
+example, using MySQL Galera cluster). That means installing a Keystone API
+server in each site, all sharing the same backend DB cluster. Because the DB
+cluster synchronizes data in real time across sites, all Keystone servers can
+see the same data.
+
+    Because Keystone is installed in each site and all data is kept the same,
+all token validation can be done locally in the same site.
+
+    The challenge for this solution is how many sites the DB cluster can
+support. The question was asked to MySQL Galera developers, and their answer is
+that there is no number/distance/network latency limitation in the code. But in
+practice, they have seen a case of a MySQL cluster used in 5 data centers,
+each data center with 3 nodes.
+
+    This solution will be very good for a limited number of sites which the DB
+cluster can cover very well.
+
+-  KeyStone service(Distributed) with Fernet token + Async replication (
+   multi-cluster mode).
+
+    We may have several Keystone clusters with Fernet token, for example,
+cluster1 ( site1, site2, … site 10 ), cluster 2 ( site11, site 12,..,site 20).
+Then do the DB replication among the different clusters asynchronously.
+
+    A prototype of this has been done. In some blogs they call it
+"hybrid replication". Architecturally you have a master region where you do
+Keystone writes. The other regions are read-only.
+http://severalnines.com/blog/deploy-asynchronous-slave-galera-mysql-easy-way
+http://severalnines.com/blog/replicate-mysql-server-galera-cluster
+
+    Only one DB cluster (the master DB cluster) is allowed to write (but still
+multisite, not all sites); other clusters wait for replication. Inside the
+master cluster, "write" is allowed in multiple regions because of the DB's
+distributed lock. But please note the challenge of key distribution and
+rotation for Fernet token, you can refer to these two blogs:
+http://lbragstad.com/?p=133, http://lbragstad.com/?p=156
+
+-  KeyStone service(Distributed) with Fernet token + Async replication (
+   star-mode).
+
+    One master Keystone cluster with Fernet token in two sites (for site level
+high availability purpose), other sites will be installed with at least 2 slave
+nodes where each node is configured with DB async replication from the master
+cluster members, with one slave’s master node in site1 and the other slave’s
+master node in site 2.
+
+    Only the master cluster nodes are allowed to write, the other slave nodes
+wait for replication from a master cluster member (very little delay).
+But the challenge of key distribution and rotation for Fernet token should be
+settled, you can refer to these two blogs: http://lbragstad.com/?p=133,
+http://lbragstad.com/?p=156
+
+    Pros.
+    Why a cluster in the master sites? There are lots of master nodes in the
+cluster, so that more slaves can be served with async replication in
+parallel.  Why two sites for the master cluster? To provide higher
+reliability (site level) for write requests.
+    Why use multiple slaves in other sites? A slave has no knowledge of other
+slaves, so multiple slaves in one site are easier to manage than a cluster, and
+the slaves work independently but provide multi-instance redundancy (like a
+cluster, but independent).
+
+    Cons. The distribution/rotation of key management.
+
+-  KeyStone service(Distributed) with PKI token
+
+    The PKI token has one great advantage: the token validation can be done
+locally, without sending a token validation request to the Keystone server.
+The drawbacks of the PKI token are: 1) the endpoint list size in the token. If a
+project is only spread over a very limited number of sites (regions), then we
+can use the endpoint filter to reduce the token size, making it workable even
+with a lot of sites in the cloud. 2) Keystone middleware (the old Keystone
+client, which is co-located in Nova/xxx-API) will have to send requests to the
+Keystone server frequently for the revoke-list, in order to reject some
+malicious API requests, for example, a user has been deactivated, but uses an
+old token to access OpenStack services.
+
+    For this solution, besides the above issues, we also need to provide
+Keystone Active-Active mode across sites to reduce the impact of site failure.
+And the revoke-list is requested very frequently, so the performance of the
+Keystone server also needs to be taken care of.
+
+    Site level Keystone load balancing is required to provide site level
+redundancy. Otherwise the Keystone middleware will not switch requests to the
+healthy Keystone server in time.
+
+    This solution can be used for some scenarios, especially a project only
+spread over limited sites (regions).
+
+    Also, the cert distribution/revocation to each site / API server for token
+validation is required.
+
+-  KeyStone service(Distributed) with UUID token
+
+    Because each token validation will be sent to the Keystone server, and the
+token persistence also makes the DB size larger than with Fernet token, it is
+not as good as the Fernet token for providing a distributed Keystone service.
+UUID is a solution better suited for small scale and inside one site.
+
+    Cons: UUID tokens are persistently stored so will cause a lot of inter
+region replication traffic, tokens will be persisted for authorization and
+revoke purpose, the frequently changed database leads to a lot of inter region
+replication traffic.
+
+-  KeyStone service(Distributed) with Fernet token + KeyStone federation
+    You have to accept the drawback of KeyStone federation if you have a lot of
+sites/regions. Please refer to KeyStone federation section
+
+-  KeyStone federation
+    In this solution, we can install the Keystone service in each site, each
+with its own database. Because we have to make the Keystone IdP and SP know
+each other, the configuration needs to be done accordingly, setting up the
+role/domain/group mapping and creating the corresponding region in the pair. As
+sites increase, if each user is able to access all sites, then full-meshed
+mapping/configuration has to be done. Whenever you add one more site, you have
+to do n*(n-1) site configurations/mappings. The complexity grows quickly as
+the number of sites increases.
+
+    Keystone federation is mainly for different cloud admins to borrow/rent
+resources, for example, company A and company B, private cloud A and public
+cloud B, both of them using OpenStack based clouds. Therefore a lot of mapping
+and configuration has to be done to make it work.
+
+-  KeyStone service (Centralized)with Fernet token
+
+    cons: inter region traffic for token validation, token validation requests
+from all other sites have to be sent to the centralized site. Too frequent inter
+region traffic.
+
+-  KeyStone service(Centralized) with PKI token
+
+    cons: inter region traffic for token revocation list management, the token
+revocation list request from all other sites has to be sent to the centralized
+site. Too frequent inter region traffic.
+
+-  KeyStone service(Centralized) with UUID token
+
+    cons: inter region traffic for token validation, the token validation
+request from all other sites has to be sent to the centralized site. Too
+frequent inter region traffic.
+
+Prototype
+-----------
+    A prototype of the candidate solution "KeyStone service(Distributed) with
+Fernet token + Async replication ( multi-cluster mode)" has been executed by
+Hans Feldt and Chaoyi Huang, please refer to https://github.com/hafe/dockers/ .
+One issue was found: "Can't specify identity endpoint for token validation
+among several keystone servers in keystonemiddleware", please refer to the Gaps
+section.
+
+Gaps
+====
+    Can't specify identity endpoint for token validation among several keystone
+servers in keystonemiddleware.
+
+
+**NAME-THE-MODULE issues:**
+
+* keystonemiddleware
+
+  * Can't specify identity endpoint for token validation among several keystone
+    servers in keystonemiddleware:
+    https://bugs.launchpad.net/keystone/+bug/1488347
+
+Affected By
+-----------
+    OPNFV multisite cloud.
+
+Conclusion
+-----------
+
+    The prototype demonstrates that the cluster level async replication
+capability and Fernet token validation in the local site are feasible. The
+candidate solution "KeyStone service(Distributed) with Fernet token + Async
+replication ( star-mode)" is a simplified version of the prototyped one, it's
+much easier in deployment and maintenance, with better scalability.
+
+    Therefore the candidate solution "KeyStone service(Distributed) with Fernet
+token + Async replication ( star-mode)" for multisite OPNFV cloud is
+recommended.
+
+References
+==========
+
+    There are 3 token formats (UUID, PKI/PKIZ, Fernet) provided by Keystone,
+these blogs give a very good description, benchmark and comparison:
+    http://dolphm.com/the-anatomy-of-openstack-keystone-token-formats/
+    http://dolphm.com/benchmarking-openstack-keystone-token-formats/
+
+    To understand the pros and cons of PKI/PKIZ tokens, please refer to:
+    https://www.mirantis.com/blog/understanding-openstack-authentication-keystone-pk
+
+    To understand KeyStone federation and how to use it:
+    http://blog.rodrigods.com/playing-with-keystone-to-keystone-federation/
+
+    To integrate KeyStone with external enterprise ready authentication system
+    https://blog-nkinder.rhcloud.com/?p=130.
+
+    Key replication used in Keystone Fernet token
+    http://lbragstad.com/?p=133,
+    http://lbragstad.com/?p=156
+
+    KeyStone revoke
+    http://specs.openstack.org/openstack/keystone-specs/api/v3/identity-api-v3-os-revoke-ext.html
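The scaling argument in the candidate solution analysis above (full-mesh Keystone federation versus star-mode async replication) can be made concrete with a small Python sketch. The function names are hypothetical. The star-mode count assumes exactly what the text describes: a master cluster spanning 2 sites, and 2 slave nodes in every other site, each slave replicating from one master cluster member. Treat the numbers as a rough illustration, not a sizing rule.

```python
def federation_mappings(sites: int) -> int:
    """Directed IdP -> SP configurations/mappings needed for a full mesh
    of `sites` federated Keystone instances: n * (n - 1)."""
    return sites * (sites - 1)


def star_replication_links(sites: int, master_sites: int = 2,
                           slaves_per_site: int = 2) -> int:
    """Async replication links in star-mode, assuming each slave node in a
    non-master site replicates from one master cluster member."""
    return max(sites - master_sites, 0) * slaves_per_site


def compare(site_counts):
    """Return (sites, federation mappings, star-mode links) per site count."""
    return [(n, federation_mappings(n), star_replication_links(n))
            for n in site_counts]


for n, mappings, links in compare([3, 10, 20]):
    print(f"{n} sites: {mappings} federation mappings, "
          f"{links} star-mode replication links")
```

The federation count grows quadratically with the number of sites, while the star-mode link count grows linearly, which matches the document's recommendation of star-mode for a multisite cloud.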
diff --git a/docs/requirements/multisite-vnf-gr-requirement.rst b/docs/requirements/multisite-vnf-gr-requirement.rst
new file mode 100644
index 0000000..7e67cd0
--- /dev/null
+++ b/docs/requirements/multisite-vnf-gr-requirement.rst
@@ -0,0 +1,241 @@
+This work is licensed under a Creative Commons Attribution 3.0 Unported License.
+http://creativecommons.org/licenses/by/3.0/legalcode
+
+
+=========================================
+ Multisite VNF Geo site disaster recovery
+=========================================
+
+Glossary
+========
+
+
+There are several concepts that need to be understood first
+    **Volume Snapshot**
+
+    **Volume Backup**
+
+    **Volume Replication**
+
+    **VM Snapshot**
+
+Please refer to the reference section for these concepts and comparison.
+
+
+Problem description
+===================
+
+Abstract
+------------
+
+A VNF (telecom application) should be able to be restored in another site
+after a catastrophic failure has happened.
+
+Description
+------------
+GR is to deal with more catastrophic failures (flood, earthquake, propagating
+software fault), where loss of calls, or even temporary loss of service,
+is acceptable. It also seems more common to accept/expect manual /
+administrator intervention to drive the process, not least because you don’t
+want to trigger the transfer by mistake.
+
+In terms of coordination/replication or backup/restore between geographic
+sites, discussion often (but not always) seems to focus on limited application
+level data/config replication, as opposed to replication or backup/restore
+of cloud infrastructure between different sites.
+
+And finally, the lack of a requirement to do fast media transfer (without
+resignalling) generally removes the need for special networking behavior, with
+slower DNS-style redirection being acceptable.
+
+This use case is more concerned with cloud infrastructure level capability to
+support VNF geo site redundancy.
+
+Requirement and candidate solutions analysis
+============================================
+
+For VNF to be restored from the backup site for catastrophic failures,
+the VNF's bootable volume and data volumes must be restorable.
+
+There are three ways to make boot and data volumes restorable. Choosing the
+right one largely depends on the underlying characteristics and requirements
+of a VNF.
+
+1. Nova Quiesce + Cinder Consistency volume snapshot + Cinder backup
+   1).GR (Geo site disaster recovery) software gets the volumes for each VM
+   in the VNF from Nova
+   2).GR software calls the Nova quiesce API to guarantee quiescing VMs in the
+   desired order
+   3).GR software takes snapshots of these volumes in Cinder (NOTE: Because
+   storage often provides fast snapshot, the duration between quiesce and
+   unquiesce is a short interval)
+   4).GR software calls the Nova unquiesce API to unquiesce VMs of the VNF in
+   reverse order
+   5).GR software creates volumes from the snapshots just taken in Cinder
+   6).GR software creates backups (incremental) for these volumes to remote
+   backup storage (swift or ceph, or..) in Cinder
+   7).if this site failed,
+   7.1)GR software restores these backup volumes in remote Cinder in the
+   backup site.
+   7.2)GR software boots VMs from bootable volumes from the remote Cinder in
+   the backup site and attaches the corresponding data volumes.
+
+Pros: The quiesce / unquiesce API from Nova makes a transactional snapshot
+of a group of VMs possible, for example, quiesce VM1, quiesce VM2,
+quiesce VM3, snapshot VM1's volumes, snapshot VM2's volumes, snapshot
+VM3's volumes, unquiesce VM3, unquiesce VM2, unquiesce VM1. For some
+telecom applications, the order is very important for a group of VMs
+with strong relationship.
+
+Cons: Nova needs to expose quiesce / unquiesce; fortunately it's already
+there in Nova-compute, only an API layer is needed to expose the functionality.
+NOTE: It's up to the DR policy and VNF character. Some VNFs may afford short
+unavailability for DR purpose, and some others may use the standby of the VNF
+or a member of the cluster to do disaster recovery replication to not interfere
+with the service provided by the VNF. VNFs which can't be quiesced/unquiesced
+should use option 3 (VNF aware) to do the backup/replication.
+
+Requirement to OpenStack: Nova needs to expose the quiesce / unquiesce API,
+which is lacking in Nova now.
+
+Example characteristics and requirements of a VNF:
+    - VNF requires full data consistency during backup/restore process -
+      entire data should be replicated.
+    - VNF's data changes infrequently, which results in a smaller number of
+      volume snapshots during a given time interval (hour, day, etc.);
+    - VNF is not highly dynamic, e.g. the number of scaling (in/out) operations
+      is small.
+    - VNF is not geo-redundant, is not aware of available cloud replication
+      mechanisms, has no built-in logic for replication: doesn't pre-select the
+      minimum replication data required for restarting the VNF in a different
+      site.
+      (NOTE: A VNF that can perform such data cherry picking should consider
+      case 3)
+
+2. Nova Snapshot + Glance Image + Cinder Snapshot + Cinder Backup
+    - GR software creates VM snapshot in Nova
+    - Nova quiesces the VM internally
+      (NOTE: The upper level application or GR software should take care of
+      avoiding infra level outage induced VNF outage)
+    - Nova creates image in Glance
+    - Nova creates a snapshot of the VM, including volumes
+    - If the VM is a volume backed VM, then create volume snapshot in Cinder
+    - No image is uploaded to Glance, but the snapshot is added to the metadata
+      of the image in Glance
+    - GR software gets the snapshot information from Glance
+    - GR software creates volumes from these snapshots
+    - GR software creates backups (incremental) for these volumes to backup
+      storage (swift or ceph, or..) in Cinder. If this site failed,
+    - GR software restores these backup volumes to Cinder in the backup site.
+    - GR software boots VM from bootable volume from Cinder in the backup site
+      and attaches the data volumes.
+
+Pros: 1) Automatic quiesce/unquiesce and snapshot of the volumes of one VM.
+
+Cons: 1) Impossible to form a transactional backup of a group of VMs, for
+         example: quiesce VM1, quiesce VM2, quiesce VM3, snapshot VM1,
+         snapshot VM2, snapshot VM3, unquiesce VM3, unquiesce VM2, unquiesce
+         VM1. This is quite important for telecom applications in some
+         scenarios.
+      2) Does not leverage the Cinder consistency group.
+      3) One more service, Glance, is involved in the backup. Not only the
+         increased number of snapshots in Cinder must be managed, but also
+         the corresponding temporary images in Glance.
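+
+The ordering described in Cons 1) can be sketched as follows. This is only an
+illustration under stated assumptions: ``quiesce``, ``snapshot`` and
+``unquiesce`` are hypothetical callables supplied by the caller, not existing
+Nova APIs (Nova does not expose quiesce / unquiesce today), and
+``backup_vm_group`` is a name introduced here for the example.
+

```python
from typing import Callable, Dict, List


def backup_vm_group(
    vms: List[str],
    quiesce: Callable[[str], None],
    snapshot: Callable[[str], str],
    unquiesce: Callable[[str], None],
) -> Dict[str, str]:
    """Snapshot a group of VMs while all of them are quiesced.

    The three callables are hypothetical placeholders for whatever
    mechanism the GR software uses; they are not real Nova calls.
    """
    quiesced: List[str] = []
    snapshots: Dict[str, str] = {}
    try:
        # Phase 1: quiesce every VM before taking any snapshot.
        for vm in vms:
            quiesce(vm)
            quiesced.append(vm)
        # Phase 2: snapshot all VMs while the whole group is frozen.
        for vm in vms:
            snapshots[vm] = snapshot(vm)
    finally:
        # Phase 3: always resume the VMs, in reverse order of quiescing,
        # even if a quiesce or snapshot call failed part-way through.
        for vm in reversed(quiesced):
            unquiesce(vm)
    return snapshots


if __name__ == "__main__":
    log: List[str] = []

    def fake_snapshot(vm: str) -> str:
        log.append(f"snapshot {vm}")
        return f"snap-{vm}"

    backup_vm_group(
        ["VM1", "VM2", "VM3"],
        quiesce=lambda vm: log.append(f"quiesce {vm}"),
        snapshot=fake_snapshot,
        unquiesce=lambda vm: log.append(f"unquiesce {vm}"),
    )
    print("\n".join(log))
```

+
+Option 2 cannot provide this sequence, because Nova quiesces and snapshots
+each VM internally as one operation per VM.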
+
+Requirement to OpenStack: None.
+
+Example: This option is suitable for single-VM backup/restore, for example for
+a small-scale configuration database virtual machine running in active/standby
+mode. Use cases in which an application needs a snapshot of only one VM for
+backup are very rare.
+
+3. Selective Replication of Persistent Data
+    - GR software creates a datastore (Block/Cinder, Object/Swift, App Custom
+      storage) with replication enabled at the relevant scope, used to
+      selectively back up/replicate the desired data to the GR backup site
+       - Cinder : Various work is underway to provide async replication of
+         Cinder volumes for disaster recovery use, including this presentation
+         from the Vancouver summit:
+         http://www.slideshare.net/SeanCohen/dude-wheres-my-volume-open-stack-summit-vancouver-2015
+       - Swift : A range of options, from using native Swift replicas (at the
+         expense of tighter coupling) to replication using backend plugins or
+         volume replication
+       - Custom : A wide range of open source technologies including Cassandra
+         and Ceph, with fully application-level solutions also possible
+    - GR software gets the reference to the storage in the remote site
+    - If the primary site fails,
+       - GR software managing recovery in the backup site gets references to
+         the relevant storage and passes them to new software instances
+       - Software attaches (or has attached) the replicated storage, promoting
+         it to writable in the case of volumes.
+
+Pros:  1) Replication is done at the storage level automatically; there is no
+          need to create backups regularly, for example daily.
+       2) Application selection of a limited amount of data to replicate
+          reduces the risk of replicating failed state and generates less
+          overhead.
+       3) Type of replication and model (active/backup, active/active, etc) can
+          be tailored to application needs.
+
+Cons:  1) Applications need to be designed with this support in mind, including
+          both selection of data to be replicated and consideration of
+          consistency.
+       2) "Standard" support in OpenStack for disaster recovery is currently
+          fairly limited, though there is active work in this area.
+
+Requirement to OpenStack: Save the real ref to volume admin_metadata after it
+has been managed by the driver: https://review.openstack.org/#/c/182150/
+
+Prototype
+-----------
+    None.
+
+Proposed solution
+-----------------
+
+    From a requirements perspective, we could recommend all three options for
+    different scenarios; it is an operator's choice.
+    Options 1 & 2 seem to be more about replicating/backing up any VNF, whereas
+    option 3 is about providing a service to a replication-aware application.
+    It should be noted that the HA requirement is not a priority here; the HA
+    for VNF project will handle the specific HA requirements. It should also be
+    noted that it is up to the specific application how to do HA (out of scope
+    here). For the 3rd option, the app should know which volume has replication
+    capability, write the relevant data to this volume, and guarantee
+    consistency itself. Option 3 is preferable in HA scenarios.
+
+
+Gaps
+====
+    1) Nova to expose quiesce / unquiesce API:
+       https://blueprints.launchpad.net/nova/+spec/expose-quiesce-unquiesce-api
+    2) Get the real ref to volume admin_metadata in Cinder:
+       https://review.openstack.org/#/c/182150/
+
+
+**NAME-THE-MODULE issues:**
+
+* Nova
+
+Affected By
+-----------
+    OPNFV multisite cloud.
+
+References
+==========
+
+   Cinder snapshot (no material/BP about snapshots themselves available on the web)
+   http://docs.openstack.org/cli-reference/content/cinderclient_commands.html
+
+
+   Cinder volume backup
+   https://blueprints.launchpad.net/cinder/+spec/volume-backups
+
+   Cinder incremental backup
+   https://blueprints.launchpad.net/cinder/+spec/incremental-backup
+
+   Cinder volume replication
+   https://blueprints.launchpad.net/cinder/+spec/volume-replication
+
+    Create snapshot of a volume-backed VM (no better material was found to
+    explain volume-backed VM snapshots; only the code tells)
+    https://bugs.launchpad.net/nova/+bug/1322195
+
+    Cinder consistency group
+    https://github.com/openstack/cinder-specs/blob/master/specs/juno/consistency-groups.rst