.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0


====================
Performance Profiler
====================

https://goo.gl/98Osig

This blueprint proposes creating a performance profiler for Doctor scenarios.

Problem Description
===================

In the verification job for notification time, we have encountered some
performance issues, such as:

1. In an environment deployed by APEX, the notification time meets the
   criteria, while in one deployed by Fuel the performance is much poorer.
2. Significant performance degradation was observed when we increased the
   total number of VMs.

It takes time to dig through the logs and analyse the cause. People have to
collect timestamps at each checkpoint manually to find the bottleneck. A
performance profiler will automate this process.

Proposed Change
===============

The current Doctor scenario covers the inspector and notifier within the whole
fault management cycle::

  start                                          end
    +       +         +        +       +          +
    |       |         |        |       |          |
    |monitor|inspector|notifier|manager|controller|
    +------>+         |        |       |          |
  occurred  +-------->+        |       |          |
    |     detected    +------->+       |          |
    |       |     identified   +-------+          |
    |       |               notified   +--------->+
    |       |                  |    processed  resolved
    |       |                  |                  |
    |       +<-----doctor----->+                  |
    |                                             |
    |                                             |
    +<---------------fault management------------>+

The notification time can be split into several parts and visualized as a
timeline::

  start                                         end
    0----5---10---15---20---25---30---35---40---45--> (x 10ms)
    +    +   +   +   +    +      +   +   +   +   +
  0-hostdown |   |   |    |      |   |   |   |   |
    +--->+   |   |   |    |      |   |   |   |   |
    |  1-raw failure |    |      |   |   |   |   |
    |    +-->+   |   |    |      |   |   |   |   |
    |    | 2-found affected      |   |   |   |   |
    |    |   +-->+   |    |      |   |   |   |   |
    |    |     3-marked host down|   |   |   |   |
    |    |       +-->+    |      |   |   |   |   |
    |    |         4-set VM error|   |   |   |   |
    |    |           +--->+      |   |   |   |   |
    |    |           |  5-notified VM error  |   |
    |    |           |    +----->|   |   |   |   |
    |    |           |    |    6-transformed event
    |    |           |    |      +-->+   |   |   |
    |    |           |    |      | 7-evaluated event
    |    |           |    |      |   +-->+   |   |
    |    |           |    |      |     8-fired alarm
    |    |           |    |      |       +-->+   |
    |    |           |    |      |         9-received alarm
    |    |           |    |      |           +-->+
  sample | sample    |    |      |           |10-handled alarm
  monitor| inspector |nova| c/m  |    aodh   |
    |                                        |
    +<-----------------doctor--------------->+

Note: c/m = ceilometer
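
As a minimal sketch of how such a timeline could be derived, the Python snippet
below takes an ordered list of checkpoint timestamps and computes the time spent
between each pair of adjacent checkpoints. The checkpoint names follow the
timeline above; the timestamp values, the function names ``stage_durations`` and
``render_timeline``, and the text rendering are assumptions made for
illustration, not part of the existing Doctor code.

```python
# Hypothetical checkpoint timestamps in milliseconds since the fault occurred.
# The names follow the timeline above; the values are made up for illustration.
CHECKPOINTS = [
    ("hostdown", 0),
    ("raw failure", 50),
    ("found affected", 90),
    ("marked host down", 130),
    ("set VM error", 170),
    ("notified VM error", 220),
    ("transformed event", 290),
    ("evaluated event", 330),
    ("fired alarm", 370),
    ("received alarm", 410),
    ("handled alarm", 450),
]


def stage_durations(checkpoints):
    """Return (stage_name, start_ms, duration_ms) for adjacent checkpoints.

    Each stage is named after the checkpoint that ends it.
    """
    stages = []
    for (_, start), (name, end) in zip(checkpoints, checkpoints[1:]):
        if end < start:
            raise ValueError("checkpoint %r precedes its predecessor" % name)
        stages.append((name, start, end - start))
    return stages


def render_timeline(checkpoints, ms_per_char=10):
    """Render one text row per stage; one character covers ms_per_char ms."""
    rows = []
    for name, start, duration in stage_durations(checkpoints):
        offset = " " * (start // ms_per_char)
        bar = "=" * max(1, duration // ms_per_char)
        rows.append("%-18s|%s%s" % (name, offset, bar))
    return "\n".join(rows)


if __name__ == "__main__":
    print(render_timeline(CHECKPOINTS))
```

Naming each stage after the checkpoint that ends it keeps the output aligned
with the numbered labels in the timeline.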

The result can also be presented as a table of components sorted by time cost,
from most to least:

+----------+---------+----------+
|Component |Time Cost|Percentage|
+==========+=========+==========+
|inspector |160ms    | 40%      |
+----------+---------+----------+
|aodh      |110ms    | 30%      |
+----------+---------+----------+
|monitor   |50ms     | 14%      |
+----------+---------+----------+
|...       |         |          |
+----------+---------+----------+
|...       |         |          |
+----------+---------+----------+

Note: the data in the table is for demonstration only and is not an actual
measurement.
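
The per-component table could be produced along the following lines. This is
only a sketch: the stage durations are invented, the mapping of stages to
components is one reading of the bottom row of the timeline, and the names
``cost_by_component`` and ``format_table`` are hypothetical.

```python
from collections import defaultdict

# Assumed mapping of timeline stages to components (one reading of the
# timeline's bottom row; adjust to the real deployment).
STAGE_TO_COMPONENT = {
    "raw failure": "monitor",
    "found affected": "inspector",
    "marked host down": "inspector",
    "set VM error": "inspector",
    "notified VM error": "nova",
    "transformed event": "ceilometer",
    "evaluated event": "aodh",
    "fired alarm": "aodh",
    "received alarm": "aodh",
}

# Made-up stage durations in milliseconds, for demonstration only.
STAGE_DURATIONS_MS = {
    "raw failure": 50,
    "found affected": 60,
    "marked host down": 40,
    "set VM error": 40,
    "notified VM error": 30,
    "transformed event": 70,
    "evaluated event": 40,
    "fired alarm": 40,
    "received alarm": 40,
}


def cost_by_component(stage_durations, stage_to_component):
    """Return (component, cost_ms, percent) rows, most expensive first."""
    totals = defaultdict(int)
    for stage, duration in stage_durations.items():
        totals[stage_to_component[stage]] += duration
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    rows = [(name, cost, 100.0 * cost / grand_total)
            for name, cost in totals.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


def format_table(rows):
    """Format the rows as a plain-text table for console output."""
    lines = ["%-10s %9s %10s" % ("Component", "Time Cost", "Percentage")]
    for name, cost, percent in rows:
        lines.append("%-10s %7dms %9.1f%%" % (name, cost, percent))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table(cost_by_component(STAGE_DURATIONS_MS,
                                         STAGE_TO_COMPONENT)))
```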

Timestamps can be collected from various sources:

1. log files
2. trace points in code
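
Both sources can be sketched in a few lines of Python. The log line layout, the
``checkpoint=<name>`` marker, and the names ``parse_log`` and ``TraceRecorder``
are assumptions made for this example; real log formats differ per component
and would each need their own pattern.

```python
import re
import time
from datetime import datetime

# Assumed log layout: "<YYYY-MM-DD HH:MM:SS.mmm> ... checkpoint=<name>".
# The "checkpoint=" marker is hypothetical, not an existing Doctor convention.
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,6}) "
    r".*checkpoint=(?P<name>[\w-]+)"
)
EPOCH = datetime(1970, 1, 1)


def parse_log(lines):
    """Map checkpoint name to seconds since the epoch (timestamps as UTC)."""
    checkpoints = {}
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # not a checkpoint line
        stamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        checkpoints[match.group("name")] = (stamp - EPOCH).total_seconds()
    return checkpoints


class TraceRecorder(object):
    """Trace point helper: call checkpoint() where the code reaches a stage."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.checkpoints = {}

    def checkpoint(self, name):
        self.checkpoints[name] = self._clock()


if __name__ == "__main__":
    sample = [
        "2016-08-01 12:00:00.100 INFO inspector checkpoint=raw-failure",
        "2016-08-01 12:00:00.250 INFO inspector checkpoint=marked-host-down",
        "2016-08-01 12:00:00.300 DEBUG inspector unrelated message",
    ]
    parsed = parse_log(sample)
    elapsed = parsed["marked-host-down"] - parsed["raw-failure"]
    print("raw-failure -> marked-host-down: %.0f ms" % (elapsed * 1000))
```

Log parsing needs no change to the components under test, whereas trace points
require touching the code in exchange for exact placement of each checkpoint.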

The performance profiler will be integrated into the verification job to provide
detailed test results. It can also be deployed independently to diagnose
performance issues in a specified environment.

Work Items
==========

1. PoC with limited checkpoints
2. Integration with the verification job
3. Collect timestamps at all checkpoints
4. Display the profiling result in the console
5. Report the profiling result to the test database
6. Independent package which can be installed in a specified environment