..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

============================
Notification Alarm Evaluator
============================

.. NOTE::
   This is a draft spec of a blueprint for OpenStack Ceilometer Liberty.
   To see the current version: https://review.openstack.org/172893
   To track development activity:
   https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator

https://blueprints.launchpad.net/ceilometer/+spec/notification-alarm-evaluator

This blueprint proposes a new alarm evaluator that handles alarms on events
passed from other OpenStack services. It provides event-driven alarm
evaluation, which introduces a new sequence in Ceilometer in place of the
polling-based approach of the existing Alarm Evaluator, and enables immediate
alarm notification to end users.

Problem description
===================

As an end user, I need to receive an alarm notification immediately once
Ceilometer captures an event that would fire an alarm, so that I can
perform recovery actions promptly and shorten the downtime of my service.
The typical use case is that an end user sets an alarm on
"compute.instance.update" in order to trigger recovery actions once the
instance status has changed to 'shutdown' or 'error'. Ideally, an end user
should receive the notification within 1 second after the fault is observed,
as other health-check mechanisms can do in some cases.

The existing Alarm Evaluator periodically queries (polls) the databases
in order to check all alarms independently of other processes. This is a good
approach for evaluating an alarm on samples stored over a certain period.
However, it is not efficient for evaluating an alarm on events, which are
emitted by other OpenStack services only once in a while.

Periodic evaluation delays the alarm notification sent to users.
The default period of the evaluation cycle is 60 seconds. It is recommended
that an operator set an interval longer than the configured pipeline interval
for the underlying metrics, and also long enough to evaluate all defined
alarms within that period, taking into account the number of resources, users
and alarms.

Proposed change
===============

The proposal is to add a new event-driven alarm evaluator which receives
messages from the Notification Agent, finds the related alarms, and then
evaluates each of them:

* The new alarm evaluator receives event notifications from the Notification
  Agent by adding a dedicated notifier as a publisher in pipeline.yaml
  (e.g. notifier://?topic=event_eval).

* When the new alarm evaluator receives an event notification, it queries the
  alarm database by the Project ID and Resource ID written in the event
  notification.

* The alarms found are evaluated against the event notification.

* Depending on the result of the evaluation, those alarms are fired through
  the Alarm Notifier, in the same way as the existing Alarm Evaluator does.
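
The flow in the list above could be sketched roughly as follows. This is only
an illustration under stated assumptions, not the proposed implementation:
the ``Alarm`` class, the flat ``event`` dict, and the ``find_alarms`` and
``handle_event`` helpers are hypothetical names, and an in-memory list stands
in for the alarm database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Alarm:
    """Hypothetical, simplified alarm record (not the Ceilometer model)."""
    name: str
    project_id: str
    event_type: str
    resource_id: Optional[str] = None  # optional, as proposed in this spec
    state: str = "ok"


def find_alarms(alarms: List[Alarm], project_id: str,
                resource_id: str) -> List[Alarm]:
    """Stand-in for the alarm database query by Project ID and Resource ID.

    Alarms without a resource_id match every resource in the project.
    """
    return [
        a for a in alarms
        if a.project_id == project_id
        and a.resource_id in (None, resource_id)
    ]


def handle_event(event: Dict[str, str], alarms: List[Alarm],
                 notify: Callable[[Alarm, Dict[str, str]], None]) -> List[str]:
    """Evaluate one event notification and fire the matching alarms."""
    fired = []
    related = find_alarms(alarms, event["project_id"], event["resource_id"])
    for alarm in related:
        # Assumption: the only condition here is the event type; the real
        # rule would also carry a query on traits.
        if alarm.event_type == event["event_type"]:
            alarm.state = "alarm"
            notify(alarm, event)  # stands in for the Alarm Notifier
            fired.append(alarm.name)
    return fired
```

Each event triggers one query and one pass over the related alarms, so the
time between the event and the notification no longer depends on a polling
interval.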

This proposal also adds a new alarm type "notification" and a new
"notification_rule". These enable users to create alarms on events. The
separation from other alarm types (such as the "threshold" type) is intended
to reflect the different timing of evaluation and the different format of the
condition: the new evaluator checks each event notification as soon as it is
received, whereas a "threshold" alarm can evaluate the average of values over
a certain period calculated from multiple samples.

The new alarm evaluator handles "notification" type alarms, so the existing
alarm evaluator has to be changed to exclude "notification" type alarms from
its evaluation targets.

Alternatives
------------

There was a similar blueprint proposal, "Alarm type based on notification",
but its approach is different. The old proposal was to add a new step (alarm
evaluation) in the Notification Agent every time it receives an event from
other OpenStack services, whereas this proposal executes alarm evaluation
in a separate component, which minimizes the impact on existing pipeline
processing.

Another approach is to enhance the existing alarm evaluator by adding a
notification listener. However, there are two issues: 1) this approach could
stall periodic evaluations when the evaluator receives a burst of
notifications, and 2) it could break alarm partitioning, i.e. when an alarm
evaluator receives a notification, it might have to evaluate some alarms which
are not assigned to it.

Data model impact
-----------------

Resource ID will be added to the Alarm model as an optional attribute.
This helps the new alarm evaluator filter out unrelated alarms
while querying alarms; otherwise it has to evaluate all alarms in the project.

REST API impact
---------------

The Alarm API will be extended as follows:

* Add "notification" type to the alarm type list
* Add "resource_id" to "alarm"
* Add "notification_rule" to "alarm"

Sample data of Notification-type alarm::

  {
      "alarm_actions": [
          "http://site:8000/alarm"
      ],
      "alarm_id": null,
      "description": "An alarm",
      "enabled": true,
      "insufficient_data_actions": [
          "http://site:8000/nodata"
      ],
      "name": "InstanceStatusAlarm",
      "notification_rule": {
          "event_type": "compute.instance.update",
          "query" : [
              {
                  "field" : "traits.state",
                  "type" : "string",
                  "value" : "error",
                  "op" : "eq"
              }
          ]
      },
      "ok_actions": [],
      "project_id": "c96c887c216949acbdfbd8b494863567",
      "repeat_actions": false,
      "resource_id": "153462d0-a9b8-4b5b-8175-9e4b05e9b856",
      "severity": "moderate",
      "state": "ok",
      "state_timestamp": "2015-04-03T17:49:38.406845",
      "timestamp": "2015-04-03T17:49:38.406839",
      "type": "notification",
      "user_id": "c96c887c216949acbdfbd8b494863567"
  }

"resource_id" is used only to look up alarms; no check is made on permissions
for the resource or on whether it belongs to the project.
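
As an illustration of how a "notification_rule" like the one in the sample
might be applied to an incoming event, here is a minimal sketch. The nested
dict layout of ``event``, the set of operators other than "eq", the type
names other than "string", and the helper names ``lookup`` and
``rule_matches`` are all assumptions made for this example; the spec itself
only shows the "eq" operator on a "string" field.

```python
import operator

# Assumed operator and type tables; only "eq" and "string" appear in the spec.
OPS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "le": operator.le,
    "gt": operator.gt, "ge": operator.ge,
}
CASTS = {"string": str, "integer": int, "float": float}


def lookup(event, field):
    """Resolve a dotted field such as "traits.state" in a nested dict."""
    value = event
    for key in field.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def rule_matches(rule, event):
    """Return True if the event satisfies every condition of the rule."""
    if event.get("event_type") != rule["event_type"]:
        return False
    for cond in rule.get("query", []):
        actual = lookup(event, cond["field"])
        if actual is None:
            return False  # a missing trait never satisfies a condition
        cast = CASTS[cond.get("type", "string")]
        if not OPS[cond["op"]](cast(actual), cast(cond["value"])):
            return False
    return True
```

With the sample rule above, an event of type "compute.instance.update" whose
``traits.state`` is "error" matches, while the same event with state "active"
does not.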

Security impact
---------------

None

Pipeline impact
---------------

None

Other end user impact
---------------------

None

Performance/Scalability Impacts
-------------------------------

When Ceilometer receives a large number of events from other OpenStack
services in a short period, this alarm evaluator can keep working since events
are queued in a messaging queue system, but this can delay alarm notification
to users and increase the number of read and write accesses to the alarm
database.

"resource_id" can be optional, but making it mandatory could reduce the
performance impact. If a user creates "notification" alarms without
"resource_id", those alarms will be evaluated every time an event occurs in
the project. That may put a heavy load on the new evaluator.

Other deployer impact
---------------------

A new service process has to be run.

Developer impact
----------------

Developers should be aware that events may be delivered to end users, and
should avoid exposing raw infrastructure information to end users when
defining events and traits.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  r-mibu

Other contributors:
  None

Ongoing maintainer:
  None

Work Items
----------

* New event-driven alarm evaluator

* Add new alarm type "notification" as well as AlarmNotificationRule

* Add "resource_id" to Alarm model

* Modify existing alarm evaluator to filter out "notification" alarms

* Add a new config parameter for the alarm request check that controls whether
  alarms without a "resource_id" are accepted
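
The last work item, the request check driven by a config parameter, might
look something like the sketch below. The option is passed in as a plain
boolean named ``require_resource_id`` and the exception class is called
``InvalidAlarmRequest``; both names are hypothetical, since this spec does
not define them.

```python
class InvalidAlarmRequest(ValueError):
    """Hypothetical error raised when an alarm request is rejected."""


def validate_alarm_request(alarm, require_resource_id):
    """Check a "notification" alarm request before it is stored.

    ``require_resource_id`` stands in for the proposed config parameter.
    Alarms of other types are passed through unchanged.
    """
    if alarm.get("type") != "notification":
        return alarm
    if "notification_rule" not in alarm:
        raise InvalidAlarmRequest(
            "notification alarms need a notification_rule")
    if require_resource_id and not alarm.get("resource_id"):
        raise InvalidAlarmRequest(
            "resource_id is required for notification alarms")
    return alarm
```

A deployment that expects many events per project could enable the check so
that every "notification" alarm is bound to a single resource.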

Future lifecycle
================

This proposal is a key feature for providing information about cloud resources
to end users in real time, which enables efficient integration with a
user-side manager or Orchestrator, whereas currently such information is
considered to be consumed by admin-side tools or services.
Based on this change, we will explore orchestration scenarios including fault
recovery, and add useful event definitions as well as additional traits.

Dependencies
============

None

Testing
=======

New unit/scenario tests are required for this change.

Documentation Impact
====================

* The proposed evaluator will be described in the developer documentation.

* The new alarm type and how to use it will be explained in the user guide.

References
==========

* OPNFV Doctor project: https://wiki.opnfv.org/doctor

* Blueprint "Alarm type based on notification":
  https://blueprints.launchpad.net/ceilometer/+spec/alarm-on-notification