Introduction:
^^^^^^^^^^^^^

During the Colorado release the OPNFV availability team reviewed a number of gaps
in support for high availability in various areas of OPNFV.  The focus and goal was
to find gaps and work with the various open source communities (OpenStack, for
example) to develop solutions and blueprints.  This work will enhance the overall
system availability and reliability of OPNFV going forward.  We also worked with
the OPNFV Doctor team to ensure our activities were coordinated.  In the next
releases of OPNFV the availability team will update the status of open gaps and
continue to look for additional gaps.

Summary of findings:
^^^^^^^^^^^^^^^^^^^^

1. Publish health status of compute node - this gap is now closed through an
OpenStack blueprint in Mitaka.

2. Health status of compute node - some good work is underway in OpenStack and
with the Doctor team; we will continue to monitor this work.

3. Store consoleauth tokens to the database - this gap can be addressed by
changing OpenStack configurations.

4. Active/Active HA of cinder-volume - active work is underway in Newton; we will
monitor it closely.

5. Cinder volume multi-attachment - this work is still in progress in OpenStack
and requires coordination with Nova; we will continue to monitor this gap.

6. Add HA tests into Fuel - the Availability team has been working with the
Yardstick team to create additional test cases for the Colorado release.  Some of
these test cases would be good additions to installers like Fuel.

Detailed explanation of the gaps and findings:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

GAP 1: Publish the health status of compute node
================================================

* Type: 'reliability'
* Description:

   Current compute node status is only kept within nova. However, the NFVO and
   VNFM may also need this information. For example, the NFVO may trigger scale
   up/down based on the status, and the VNFM may trigger evacuation. In the
   meantime, in high availability scenarios, the VNFM may need the host status
   info from the VIM so that it can figure out where exactly the failure is
   located. Therefore, this info needs to be published outwards to the NFVO and
   VNFM.

 + Desired state

   - Be able to have the health status of compute nodes published.

 + Current behaviour

   - Nova queries the ServiceGroup API to get the node liveness information.

 + Gap

   - Currently the ServiceGroup API keeps the health status of compute nodes
   internal to nova; those statuses could be published to the NFV MANO plane.

Findings:

A blueprint from the OPNFV Doctor team covers this gap: add notifications for
service status changes.

Status: Merged (Mitaka release)

 + Owner: Balazs

 + BP: https://blueprints.launchpad.net/nova/+spec/service-status-notification

 + Spec: https://review.openstack.org/182350

 + Code: https://review.openstack.org/#/c/245678/

 + Merged Jan 2016 - Mitaka
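
A minimal sketch of how a MANO-side component could consume these service status
notifications is given below.  This is illustrative only and not part of the
blueprint: the transport URL, topic, event type and payload layout are assumptions
based on the spec linked above.

.. code-block:: python

    # Illustrative consumer of nova service status notifications via
    # oslo.messaging; event type and payload layout are assumptions.
    import oslo_messaging
    from oslo_config import cfg


    class ServiceStatusEndpoint(object):
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # The blueprint emits a notification when a service's state
            # (e.g. disabled/forced_down) changes; filter on the assumed
            # 'service.update' event type and hand the payload to MANO.
            if event_type == 'service.update':
                print('service notification from %s: %s'
                      % (publisher_id, payload))


    transport = oslo_messaging.get_notification_transport(
        cfg.CONF, url='rabbit://guest:guest@controller:5672/')  # assumed URL
    targets = [oslo_messaging.Target(topic='versioned_notifications')]
    listener = oslo_messaging.get_notification_listener(
        transport, targets, [ServiceStatusEndpoint()])
    listener.start()
    listener.wait()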

GAP 2: Health status of compute node
====================================

* Type: 'reliability'
* Description:

 + Desired state:

   - Provide the health status of compute nodes.

 + Current Behaviour

   - Currently, while performing some actions like evacuation, Nova checks for
   the compute service. If the service is down, it is assumed that the host is
   down. This is not exactly true, since it is possible for only the compute
   service to be down while all VMs running on the host are actually up. There
   is no way to distinguish between two really different things: the status of
   the host and the status of the nova-compute service deployed on it.
   - Also, the host information provided by the API and commands is service
   centric, i.e. "nova host-list" is just another wrapper for "nova
   service-list" with a different format (in fact "service-list" is a superset
   of "host-list").


 + Gap

   - Not all the health information of compute nodes can be provided. Nova
   treats the *host* term as equivalent to *compute-host*, which can be
   misleading and error prone in cases where host evacuation needs to be
   performed.


Related BP:

Pacemaker and Corosync can provide info about the host. Therefore, there is a
requirement for nova to support a pacemaker service group driver. Another option
would be to add a tooz servicegroup driver to nova and then have tooz support a
corosync driver.

  + https://blueprints.launchpad.net/nova/+spec/tooz-for-service-groups

The Doctor team is not working on this blueprint.

NOTE: This bp is active. A suggestion is to adopt this bp and add a corosync
driver to tooz, which could be a solution.

We should keep following this bp and, when it is finished, see if we could add a
corosync driver for tooz to close this gap.

The drivers currently supported in tooz are listed here:
https://github.com/openstack/tooz/blob/master/doc/source/drivers.rst
Meanwhile, we should also look into the Doctor project and see if this could be
solved there.

This work is still underway but does not directly map to the gap identified
above.  The Doctor team is looking to get faster updates on node status and
failure status; these are covered by other blueprints and are good problems to
solve.  A small sketch of the tooz primitives such a driver would build on is
given below.
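
The snippet below is a minimal sketch (illustrative only) of the group-membership
primitives tooz exposes; a nova servicegroup driver built on tooz, as proposed in
the blueprint, would rely on calls like these.  The backend URL, member id and
group name are assumptions for illustration.

.. code-block:: python

    # Illustrative use of tooz group membership for compute-node liveness.
    from tooz import coordination

    # A corosync backend is what this gap ultimately asks for; memcached is
    # used here only because it is one of the drivers tooz already ships.
    coordinator = coordination.get_coordinator(
        'memcached://127.0.0.1:11211', b'compute-host-1')
    coordinator.start(start_heart=True)

    group = b'nova-compute'
    try:
        coordinator.create_group(group).get()
    except coordination.GroupAlreadyExist:
        pass
    coordinator.join_group(group).get()

    # A member that stops heartbeating drops out of the group, so group
    # membership doubles as a liveness/health signal for the host.
    print(coordinator.get_members(group).get())

    coordinator.stop()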

GAP 3: Store consoleauth tokens to the database
===============================================

* Type: 'performance'
* Description:

+ Desired state

   - Change the consoleauth service to store the tokens in the database and, optionally,
   cache them in memory as it does now for fast access.

+ Current State

   - Currently the consoleauth service is storing the tokens and the connection data
   only in memory. This behavior makes it impossible to have multiple instances of
   this service in a cluster, as there is no way for one of the instances to know
   the tokens issued by the other.

   - The consoleauth service can use a memcached server to store those tokens, but
   again, if we want to share them among different instances we would be relying on
   one memcached server, which makes this solution unsuitable for a highly available
   architecture where we should be able to replicate all of the services in our cluster.

+ Gap

   - The consoleauth service is storing the tokens and the connection data only in memory.
   This behavior makes it impossible to have multiple instances of this service in a cluster,
   as there is no way for one of the instances to know the tokens issued by the other.

* Related BP

 + https://blueprints.launchpad.net/nova/+spec/consoleauth-tokens-in-db

 The advice in the blueprint is to use memcached as a backend. Looking at the
 documentation, memcached is not able to replicate data, so this is not a
 complete solution. But maybe redis (http://redis.io/) is a suitable backend
 to store tokens that survive node failures.  This blueprint is not
 directly needed for this gap.

Findings:

This bp has been rejected since the community feedback is that A/A can be
supported by memcached. The use case for this bp is not quite clear, since when
the consoleauth service is down and the token is lost, the other service can
retrieve the token again after it recovers.  This can be accomplished through a
different OpenStack configuration, and is therefore not a gap.  The
recommendation of the team is to verify the redis approach (a small sketch of
the idea is given below).
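
The snippet below is a minimal, illustrative sketch of the redis approach to be
verified; it is not nova code.  Tokens written to a shared redis instance with a
TTL can be read back by any consoleauth instance, unlike tokens held only in
process memory.  The key prefix, TTL and server address are assumptions.

.. code-block:: python

    # Illustrative shared token store for console auth tokens using redis.
    import json
    import redis

    r = redis.StrictRedis(host='192.0.2.10', port=6379)  # assumed address

    def store_token(token, connect_info, ttl=600):
        # Any consoleauth instance can write a token with an expiry ...
        r.setex('consoleauth:%s' % token, ttl, json.dumps(connect_info))

    def check_token(token):
        # ... and any other instance can validate it later, so losing one
        # consoleauth instance no longer invalidates outstanding tokens.
        data = r.get('consoleauth:%s' % token)
        return json.loads(data) if data else None

    store_token('abc123', {'host': 'compute-1', 'port': 5900})
    print(check_token('abc123'))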


GAP 4: Active/Active HA of cinder-volume
========================================

* Type: 'reliability/scalability'

* Description:

 + Desired State:

   - Cinder-volume can run in an active/active configuration.

 + Current State:

   - Only one cinder-volume instance can be active. Failover has to be handled by
   an external mechanism such as pacemaker/corosync.

 + Gap

   - Cinder-volume doesn't support an active/active configuration.

* Related BP

  + https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support

* Findings:

  + This blueprint is underway for Newton - as of July 6, 2016 great progress has
  been made, and we will continue to monitor it.

GAP 5: Cinder volume multi-attachment
=====================================

* Type: 'reliability'
* Description:

 + Desired State

   - Cinder volumes can be attached to multiple VMs at the same time, so that
   active/standby stateful VNFs can share the same Cinder volume.

 + Current State

   - Cinder volumes can only be attached to one VM at a time.

 + Gap

   - Nova and cinder do not allow for multiple simultaneous attachments.

* Related BP

  + https://blueprints.launchpad.net/openstack/?searchtext=multi-attach-volume

* Findings

  + Multi-attach volume is still WIP in OpenStack.  There is coordination work required with Nova.
  + At risk for Newton
  + Recommend adding a Yardstick test case (a rough sketch of what such a test
  could exercise is given below).
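
The snippet below is a rough, illustrative sketch of what such a check could
exercise once multi-attach lands: create a volume flagged for multi-attach and
attach it to two servers.  The client constructor arguments, the ``multiattach``
flag and the server IDs are assumptions; today the second attachment is expected
to fail, which is exactly the gap described above.

.. code-block:: python

    # Illustrative multi-attach check using the python clients.
    from cinderclient import client as cinder_client
    from novaclient import client as nova_client

    AUTH = ('user', 'password', 'project', 'http://192.0.2.1:5000/v2.0')

    cinder = cinder_client.Client('2', *AUTH)
    nova = nova_client.Client('2', *AUTH)

    # Ask cinder for a volume that may be attached to more than one server.
    volume = cinder.volumes.create(size=1, name='shared-vol', multiattach=True)

    # Attach the same volume to two servers; with the gap closed both
    # attachments succeed and an active/standby VNF pair can share the volume.
    nova.volumes.create_server_volume('server-id-1', volume.id)
    nova.volumes.create_server_volume('server-id-2', volume.id)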

General comment for the next release: remote volume replication is another
important project for storage HA.  The HA team will monitor this multi-blueprint
activity, which will span multiple OpenStack releases.  The blueprints aren't
approved yet and there are dependencies on generic-volume-group.



GAP 6: HA test improvements in Fuel
====================================

* Type: 'robustness'
* Description:

  + Desired State

    - Increased test coverage for HA during install

  + Current State

    - A few test cases are available

* Related BP

  + https://blueprints.launchpad.net/fuel/+spec/ha-test-improvements
  + Tie in with the test plans we have discussed previously.
  + Look at Yardstick tests that could be proposed back to OpenStack.
  + Discussions are planned with the Yardstick team to engage with the OpenStack
    community to enhance Fuel or Tempest as appropriate.


Next Steps:
^^^^^^^^^^^

The six gaps above demonstrate that ongoing progress is being made in the various
OPNFV and OpenStack communities.  The OPNFV-HA team will work to suggest
blueprints for the next OpenStack Summit to help continue the progress of high
availability in the community.