summaryrefslogtreecommitdiffstats
path: root/requirements/04-gaps.rst
blob: 38dcbd27b2bb184ad5d2267b3a89cdd003cc2815 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
Gap analysis in upstream projects
=================================

This section presents the findings of gaps on existing VIM platforms. The focus
was to identify gaps based on the features and requirements specified in Section
3.3. The analysis work determined gaps that are presented here.

VIM Northbound Interface
------------------------

Immediate Notification
^^^^^^^^^^^^^^^^^^^^^^

* Type: 'deficiency in performance'
* Description

  + To-be

    - VIM has to notify unavailability of virtual resource (fault) to VIM user
      immediately.
    - Notification should be passed in '1 second' after fault detected/notified
      by VIM.
    - Also, the following conditions/requirement have to be met:

      - Only the owning user can receive notification of fault related to owned
        virtual resource(s).

  + As-is

    - OpenStack Metering 'Ceilometer' can notify unavailability of virtual
      resource (fault) to the owner of virtual resource based on alarm
      configuration by the user.

      - Ceilometer Alarm API:
        http://docs.openstack.org/developer/ceilometer/webapi/v2.html#alarms

    - Alarm notifications are triggered by alarm evaluator instead of
      notification agents that might receive faults

      - Ceilometer Architecture:
        http://docs.openstack.org/developer/ceilometer/architecture.html#id1

    - Evaluation interval should be equal to or larger than configured pipeline
      interval for collection of underlying metrics.

      - https://github.com/openstack/ceilometer/blob/stable/juno/ceilometer/alarm/service.py#L38-42

    - The interval for collection has to be set large enough which depends on
      the size of the deployment and the number of metrics to be collected.
    - The interval may not be less than one second in even small deployments.
      The default value is 60 seconds.
    - Alternative: OpenStack has a message bus to publish system events.
      The operator can allow the user to connect this, but there are no
      functions to filter out other events that should not be passed to the user
      or which were not requested by the user.

  + Gap

    - Fault notifications cannot be received immediately by Ceilometer.

Maintenance Notification
^^^^^^^^^^^^^^^^^^^^^^^^

* Type: 'missing'
* Description

  + To-be

    - VIM has to notify unavailability of virtual resource triggered by NFVI
      maintenance to VIM user.
    - Also, the following conditions/requirements have to be met:

      - VIM should accept maintenance message from administrator and mark target
        physical resource "in maintenance".
      - Only the owner of virtual resource hosted by target physical resource
        can receive the notification that can trigger some process for
        applications which are running on the virtual resource (e.g. cut off
        VM).

  + As-is

    - OpenStack: None
    - AWS (just for study)

      - AWS provides API and CLI to view status of resource (VM) and to create
        instance status and system status alarms to notify you when an instance
        has a failed status check.
        http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html
      - AWS provides API and CLI to view scheduled events, such as a reboot or
        retirement, for your instances. Also, those events will be notified
        via e-mail.
        http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html

  + Gap

    - VIM user cannot receive maintenance notifications.

VIM Southbound interface
------------------------

Normalization of data collection models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Type: 'missing'
* Description

  + To-be

    - A normalized data format needs to be created to cope with the many data
      models from different monitoring solutions.

  + As-is

    - Data can be collected from many places (e.g. Zabbix, Nagios, Cacti,
      Zenoss). Although each solution establishes its own data models, no common
      data abstraction models exist in OpenStack.

  + Gap

    - Normalized data format does not exist.

OpenStack
---------

Ceilometer
^^^^^^^^^^

OpenStack offers a telemetry service, Ceilometer, for collecting measurements of
the utilization of physical and virtual resources [CEIL]_. Ceilometer can
collect a number of metrics across multiple OpenStack components and watch for
variations and trigger alarms based upon the collected data.

Scalability of fault aggregation
________________________________

* Type: 'scalability issue'
* Description

  + To-be

    - Be able to scale to a large deployment, where thousands of monitoring
      events per second need to be analyzed.

  + As-is

    - Performance issue when scaling to medium-sized deployments.

  + Gap

    - Ceilometer seems to be unsuitable for monitoring medium and large scale
      NFVI deployments.

* Related blueprints

  + Usage of Zabbix for fault aggregation [ZABB]_. Zabbix can support a much
    higher number of fault events (up to 15 thousand events per second, but
    obviously also has some upper bound:
    http://blog.zabbix.com/scalable-zabbix-lessons-on-hitting-9400-nvps/2615/

  + Decentralized/hierarchical deployment with multiple instances, where one
    instance is only responsible for a small NFVI.

Monitoring of hardware and software
___________________________________

* Type: 'missing (lack of functionality)'
* Description

  + To-be

    - OpenStack (as VIM) should monitor various hardware and software in NFVI to
      handle faults on them by Ceilometer.
    - OpenStack may have monitoring functionality in itself and can be
      integrated with third party monitoring tools.
    - OpenStack need to be able to detect the faults listed in Section 3.5.

  + As-is

    - For each deployment of OpenStack, an operator has responsibility to
      configure monitoring tools with relevant scripts or plugins in order to
      monitor hardware and software.
    - OpenStack Ceilometer does not monitor hardware and software to capture
      faults.

   + Gap

     - Ceilometer is not able to detect and handle all faults listed in Section
       3.5.

* Related blueprints / workarounds

  - Use other dedicated monitoring tools like Zabbix or Monasca

Nova
^^^^

OpenStack Nova [NOVA]_ is a mature and widely known and used component in
OpenStack cloud deployments. It is the main part of an
"infrastructure-as-a-service" system providing a cloud computing fabric
controller, supporting a wide diversity of virtualization and container
technologies.

Nova has proven throughout these past years to be highly available and
fault-tolerant. Featuring its own API, it also provides a compatibility API with
Amazon EC2 APIs.

Correct states when compute host is down
________________________________________

* Type: 'missing (lack of functionality)'
* Description

  + To-be

    - There needs to be API to change VM power_State in case host has failed.
    - There needs to be API to change nova-compute state.
    - There could be single API to change different VM states for all VMs
      belonging to specific host.
    - As external system monitoring the infra calls these APIs change can be
      fast and reliable.
    - Correlation actions can be faster and automated as states are reliable.
    - User will be able to read states from OpenStack and trust they are
      correct.

  + As-is

    - When a VM goes down due to a host HW, host OS or hypervisor failure,
      nothing happens in OpenStack. The VMs of a crashed host/hypervisor are
      reported to be live and OK through the OpenStack API.
    - nova-compute state might change too slowly or the state is not reliable
      if expecting also VMs to be down. This leads to ability to schedule VMs
      to a failed host and slowness blocks evacuation.

  + Gap

    - OpenStack does not change its states fast and reliably enough.
    - There is API missing to have external system to change states and to
      trust the states are then reliable (external system has fenced failed
      host).
    - User cannot read all the states from OpenStack nor trust they are right.

* Related blueprints

  + https://blueprints.launchpad.net/nova/+spec/mark-host-down
  + https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service

Evacuate VMs in Maintenance mode
________________________________

* Type: 'missing'
* Description

  + To-be

    - When maintenance mode for a compute host is set, trigger VM evacuation to
      available compute nodes before bringing the host down for maintenance.

  + As-is

    - If setting a compute node to a maintenance mode, OpenStack only schedules
      evacuation of all VMs to available compute nodes if in-maintenance compute
      node runs the XenAPI and VMware ESX hypervisors. Other hypervisors (e.g.
      KVM) are not supported and, hence, guest VMs will likely stop running due
      to maintenance actions administrator may perform (e.g. hardware upgrades,
      OS updates).

  + Gap

    - Nova libvirt hypervisor driver does not implement automatic guest VMs
      evacuation when compute nodes are set to maintenance mode (``$ nova
      host-update --maintenance enable <hostname>``).

Monasca
^^^^^^^

Monasca is an open-source monitoring-as-a-service (MONaaS) solution that
integrates with OpenStack. Even though it is still in its early days, it is the
interest of the community that the platform be multi-tenant, highly scalable,
performant and fault-tolerant. It provides a streaming alarm engine, a
notification engine, and a northbound REST API users can use to interact with
Monasca. Hundreds of thousands of metrics per second can be processed
[MONA]_.

Anomaly detection
_________________


* Type: 'missing (lack of functionality)'
* Description

  + To-be

    - Detect the failure and perform a root cause analysis to filter out other
      alarms that may be triggered due to their cascading relation.

  + As-is

    - A mechanism to detect root causes of failures is not available.

  + Gap

    - Certain failures can trigger many alarms due to their dependency on the
      underlying root cause of failure. Knowing the root cause can help filter
      out unnecessary and overwhelming alarms.

* Related blueprints / workarounds

  + Monasca as of now lacks this feature, although the community is aware and
    working toward supporting it.

Sensor monitoring
_________________

* Type: 'missing (lack of functionality)'
* Description

  + To-be

    - It should support monitoring sensor data retrieval, for instance, from
      IPMI.

  + As-is

    - Monasca does not monitor sensor data

  + Gap

    - Sensor monitoring is very important. It provides operators status
      on the state of the physical infrastructure (e.g. temperature, fans).

* Related blueprints / workarounds

  + Monasca can be configured to use third-party monitoring solutions (e.g.
    Nagios, Cacti) for retrieving additional data.

Hardware monitoring tools
-------------------------

Zabbix
^^^^^^

Zabbix is an open-source solution for monitoring availability and performance of
infrastructure components (i.e. servers and network devices), as well as
applications [ZABB]_. It can be customized for use with OpenStack. It is a
mature tool and has been proven to be able to scale to large systems with
100,000s of devices.

Delay in execution of actions
_____________________________


* Type: 'deficiency in performance'
* Description

  + To-be

    - After detecting a fault, the monitoring tool should immediately execute
      the appropriate action, e.g. inform the manager through the NB I/F

  + As-is

    - A delay of around 10 seconds was measured in two independent testbed
      deployments

  + Gap

    - Cause of the delay needs to be identified and fixed

..
 vim: set tabstop=4 expandtab textwidth=80: