summaryrefslogtreecommitdiffstats
path: root/design_docs/report-host-fault-to-update-server-state-immediately.rst
blob: 0ee02064aa00e0eac506e73cd330cacc7aa258cb (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
====================================================
Report host fault to update server state immediately
====================================================

https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately

A new API is needed to report a host fault to change the state of the
instances and compute node immediately. This allows usage of evacuate API
without a delay. The new API provides the possibility for external monitoring
system to detect any kind of host failure fast and reliably and inform
OpenStack about it. Nova updates the compute node state and states of the
instances. This way the states in the Nova DB will be in sync with the
real state of the system.

Problem description
===================
* Nova state change for failed or unreachable host is slow and does not
  reliably state compute node is down or not. This might cause same instance
  to run twice if action taken to evacuate instance to another host.
* Nova state for instances on failed compute node will not change,
  but remains active and running. This gives user a false information about
  instance state. Currently one would need to call "nova reset-state" for each
  instance to have them in error state.
* OpenStack user cannot make HA actions fast and reliably by trusting instance
  state and compute node state.
* As compute node state changes slowly one cannot evacuate instances.

Use Cases
---------
Use case in general is that in case there is a host fault one should change
compute node state fast and reliably when using DB servicegroup backend.
On top of this here is the use cases that are not covered currently to have
instance states changed correctly:
* Management network connectivity lost between controller and compute node.
* Host HW failed.

Generic use case flow:

* The external monitoring system detects a host fault.
* The external monitoring system fences the host if not down already.
* The external system calls the new Nova API to force the failed compute node
  into down state as well as instances running on it.
* Nova updates the compute node state and state of the effected instances to
  Nova DB.

Currently nova-compute state will be changing "down", but it takes a long
time. Server state keeps as "vm_state: active" and "power_state:
running", which is not correct. By having external tool to detect host faults
fast, fence host by powering down and then report host down to OpenStack, all
these states would reflect to actual situation. Also if OpenStack will not
implement automatic actions for fault correlation, external tool can do that.
This could be configured for example in server instance METADATA easily and be
read by external tool.

Project Priority
-----------------
Liberty priorities have not yet been defined.

Proposed change
===============
There needs to be a new API for Admin to state host is down. This API is used
to mark compute node and instances running on it down to reflect the real
situation.

Example on compute node is:

* When compute node is up and running:
  vm_state: active and power_state: running
  nova-compute state: up status: enabled
* When compute node goes down and new API is called to state host is down:
  vm_state: stopped power_state: shutdown
  nova-compute state: down status: enabled

vm_state values: soft-delete, deleted, resized and error
should not be touched.
task_state effect needs to be worked out if needs to be touched.

Alternatives
------------
There is no attractive alternatives to detect all different host faults than
to have a external tool to detect different host faults. For this kind of tool
to exist there needs to be new API in Nova to report fault. Currently there
must have been some kind of workarounds implemented as cannot trust or get the
states from OpenStack fast enough.

Data model impact
-----------------
None

REST API impact
---------------
* Update CLI to report host is down

  nova host-update command

  usage: nova host-update [--status <enable|disable>]
                        [--maintenance <enable|disable>]
                        [--report-host-down]
                        <hostname>

  Update host settings.

  Positional arguments

  <hostname>
  Name of host.

  Optional arguments

  --status <enable|disable>
  Either enable or disable a host.

  --maintenance <enable|disable>
  Either put or resume host to/from maintenance.

  --down
  Report host down to update instance and compute node state in db.

* Update Compute API to report host is down:

  /v2.1/{tenant_id}/os-hosts/{host_name}

  Normal response codes: 200
  Request parameters

  Parameter     Style   Type          Description
  host_name     URI     xsd:string    The name of the host of interest to you.

  {
      "host": {
          "status": "enable",
          "maintenance_mode": "enable"
          "host_down_reported": "true"

      }

  }

  {
      "host": {
          "host": "65c5d5b7e3bd44308e67fc50f362aee6",
          "maintenance_mode": "enabled",
          "status": "enabled"
          "host_down_reported": "true"

      }

  }

* New method to nova.compute.api module HostAPI class to have a
  to mark host related instances and compute node down:
  set_host_down(context, host_name)

* class novaclient.v2.hosts.HostManager(api) method update(host, values)
  Needs to handle reporting host down.

* Schema does not need changes as in db only service and server states are to
  be changed.

Security impact
---------------
API call needs admin privileges (in the default policy configuration).

Notifications impact
--------------------
None

Other end user impact
---------------------
None

Performance Impact
------------------
Only impact is that user can get information faster about instance and
compute node state. This also gives possibility to evacuate faster.
No impact that would slow down. Host down should be rare occurrence.

Other deployer impact
---------------------
Developer can make use of any external tool to detect host fault and report it
to OpenStack.

Developer impact
----------------
None

Implementation
==============
Assignee(s)
-----------
Primary assignee:   Tomi Juvonen
Other contributors: Ryota Mibu

Work Items
----------
* Test cases.
* API changes.
* Documentation.

Dependencies
============
None

Testing
=======
Test cases that exists for enabling or putting host to maintenance should be
altered or similar new cases made test new functionality.

Documentation Impact
====================

New API needs to be documented:

* Compute API extensions documentation.
  http://developer.openstack.org/api-ref-compute-v2.1.html
* Nova commands documentation.
  http://docs.openstack.org/user-guide-admin/content/novaclient_commands.html
* Compute command-line client documentation.
  http://docs.openstack.org/cli-reference/content/novaclient_commands.html
* nova.compute.api documentation.
  http://docs.openstack.org/developer/nova/api/nova.compute.api.html
* High Availability guide might have page to tell external tool could provide
  ability to provide faster HA as able to update states by new API.
  http://docs.openstack.org/high-availability-guide/content/index.html

References
==========
* OPNFV Doctor project: https://wiki.opnfv.org/doctor
* OpenStack Instance HA Proposal:
  http://blog.russellbryant.net/2014/10/15/openstack-instance-ha-proposal/
* The Different Facets of OpenStack HA:
  http://blog.russellbryant.net/2015/03/10/
  the-different-facets-of-openstack-ha/