Update the maintenance design document

JIRA: DOCTOR-125
Change-Id: Ideb1482fa026213bfe5ebc7a1da89cfed634950f
Signed-off-by: Tomi Juvonen <tomi.juvonen@nokia.com>
author: Tomi Juvonen <tomi.juvonen@nokia.com> 2018-04-19 11:47:36 +0300
committer: Tomi Juvonen <tomi.juvonen@nokia.com> 2018-06-20 07:17:05 +0300
commit: 89a35669df9c1d07cb508be53856deae41bc24d6 (patch)
tree: a8afad3526d795542f4c4d3d36b508aa21f3a37a /docs/development/design/maintenance-design-guideline.rst
parent: db301edb5a06109628af0b2e3416751615f11153 (diff)
1 files changed, 302 insertions, 124 deletions
diff --git a/docs/development/design/maintenance-design-guideline.rst b/docs/development/design/maintenance-design-guideline.rst
index 93c3cf4e..47002b96 100644
--- a/docs/development/design/maintenance-design-guideline.rst
+++ b/docs/development/design/maintenance-design-guideline.rst
@@ -5,151 +5,329 @@
 Planned Maintenance Design Guideline
 ====================================
 
-.. NOTE::
-   This is spec draft of design guideline for planned maintenance.
-   JIRA ticket to track the update and collect comments: `DOCTOR-52`_.
-
-This document describes how one can implement planned maintenance by utilizing
-the `OPNFV Doctor project`_. framework and to meet the set requirements.
+This document describes how to implement infrastructure maintenance in
+interaction with the VNFM by utilizing the `OPNFV Doctor project`_ framework
+while meeting the set requirements. The document concentrates on OpenStack and
+VMs, although the concept is generic for any payload or even a different VIM.
+The admin tool should also cover controller and other cloud hardware, but that
+is not the main focus in OPNFV Doctor and should be defined better in the
+upstream implementation. The same goes for any more detailed work to be done.
 
 Problem Description
 ===================
 
-Telco application need to know when planned maintenance is going to happen in
-order to guarantee zero down time in its operation. It needs to be possible to
-make own actions to have application running on not affected resource or give
+A Telco application needs to know when infrastructure maintenance is going to
+happen in order to guarantee zero downtime in its operation. It must be able to
+take its own actions to keep running on unaffected resources or to give
 guidance to admin actions like migration. More details are defined in
 requirement documentation: `use cases`_, `architecture`_ and `implementation`_.
-Also discussion in the OPNFV summit about `planned maintenance session`_.
 
 Guidelines
 ==========
 
-Cloud admin needs to make a notification about planned maintenance including
-all details that application needs in order to make decisions upon his affected
-service. This notification payload can be consumed by application by subscribing
-to corresponding event alarm trough alarming service like OpenStack AODH.
+Concepts used:
+
+- `event`: Notification to RabbitMQ with a particular event type.
+
+- `state event`: Notification to RabbitMQ with a particular event type,
+  including a payload with a variable that defines the state.
+
+- `project event`: Notification to RabbitMQ that is meant for a project. A
+  single event type is used with different payload and state information.
+
+- `admin event`: Notification to RabbitMQ that is meant for the admin or for
+  any infrastructure service. A single event type is used with different state
+  information.
+
+- `rolling maintenance`: Node-by-node maintenance and upgrade where a single
+  node at a time is maintained after any application payload has been moved
+  away from the node.
+
+- `project` stands for `application` in the OpenStack context and both are used
+  in this document. `tenant` is often used with the same meaning.
+
+The infrastructure admin needs to send notifications with two different event
+types: one meant for the admin and one for the project. The notification payload
+can be consumed by the application and the admin by subscribing to the
+corresponding event alarm through an alarming service like OpenStack AODH.
+
+- The infrastructure admin needs to send a notification about infrastructure
+  maintenance including all details that the application needs in order to make
+  decisions about its affected service. The alarm payload can hold a link to
+  the infrastructure admin tool API for the reply and for other information.
+  There are many steps of communication between the admin tool and the
+  application, and the payload needed in each step is very similar. Because of
+  this, the same event type can be used, with a variable like `state` telling
+  the application what action is needed for each event.
+  If a project has not subscribed to the alarm, the admin tool responsible for
+  the maintenance will assume it can do maintenance operations without
+  interaction with the application on top of it.
+
+- The infrastructure admin needs to send one event telling when infrastructure
+  maintenance starts and another when it ends. This admin level event should
+  include the host name. It could be consumed by any admin level
+  infrastructure entity. In this document it is consumed by the `Inspector`,
+  which in `OPNFV Doctor project`_ terms is the infrastructure entity
+  responsible for automatic host fault management. Automated fault management
+  actions need to be disabled during planned maintenance.
 
 Before maintenance starts application needs to be able to make switch over for
 his ACT-STBY service affected, do operation to move service to not effected part
-of infra or give a hint for admin operation like migration that can be
+of infrastructure or give a hint for admin operation like migration that can be
 automatically issued by admin tool according to agreed policy.
 
-Flow diagram::
-
-  admin alarming project  controller  inspector
-    |   service  app manager   |           |
-    |  1.   |         |        |           |
-    +------------------------->+           |
-    +<-------------------------+           |
-    |  2.   |         |        |           |
-    +------>+    3.   |        |           |
-    |       +-------->+   4.   |           |
-    |       |         +------->+           |
-    |       |    5.   +<-------+           |
-    +<----------------+        |           |
-    |                 |   6.   |           |
-    +------------------------->+           |
-    +<-------------------------+     7.    |
-    +------------------------------------->+
-    |   8.  |         |        |           |
-    +------>+    9.   |        |           |
-    |       +-------->+        |           |
-    +--------------------------------------+
-    |                10.                   |
-    +--------------------------------------+
-    |  11.  |         |        |           |
-    +------------------------->+           |
-    +<-------------------------+           |
-    |  12.  |         |        |           |
-    +------>+-------->+        |    13.    |
-    +------------------------------------->+
-    +-------+---------+--------+-----------+
-
-Concepts used below:
-
-- `full maintenance`: This means maintenance will take a longer time and
-  resource should be emptied, meaning container or VM need to be moved or
-  deleted. Admin might need to test resource to work after maintenance.
-
-- `reboot`: Only a reboot is needed and admin does not need separate testing
-  after that. Container or VM can be left in place if so wanted.
-
-- `notification`: Notification to rabbitmq.
-
-Admin makes a planned maintenance session where he sets
-a `maintenance_session_id` that is a unique ID for all the hardware resources he
-is going to have the maintenance at the same time. Mostly maintenance should be
-done node by node, meaning a single compute node at a time would be in single
-planned maintenance session having unique `maintenance_session_id`. This ID will
-be carried trough the whole session in all places and can be used to query
-maintenance in admin tool API. Project running a Telco application should set
-a specific role for admin tool to know it cannot do planned maintenance unless
-project has agreed actions to be done for its VMs or containers. This means the
-project has configured itself to get alarms upon planned maintenance and it is
-capable of agreeing needed actions. Admin is supposed to use an admin tool to
-automate maintenance process partially or entirely.
-
-The flow of a successful planned maintenance session as in OpenStack example
-case:
-
-1.  Admin disables nova-compute in order to do planned maintenance on a compute
-    host and gets ACK from the API call. This action needs to be done to ensure
-    no thing will be placed in this compute host by any user. Action is always
-    done regardless the whole compute will be affected or not.
-2.  Admin sends a project specific maintenance notification with state
-    `planned maintenance`. This includes detailed information about maintenance,
-    like when it is going to start, is it `reboot` or `full maintenance`
-    including the information about project containers or VMs running on host or
-    the part of it that will need maintenance. Also default action like
-    migration will be mentioned that will be issued by admin before maintenance
-    starts if no other action is set by project. In case project has a specific
-    role set, planned maintenance cannot start unless project has agreed the
-    admin action. Available admin actions are also listed in notification.
-3.  Application manager of the project receives AODH alarm about the same.
-4.  Application manager can do switch over to his ACT-STBY service, delete and
-    re-instantiate his service on not affected resource if so wanted.
-5.  Application manager may call admin tool API to give preferred instructions
-    for leaving VMs and containers in place or do admin action to migrate them.
-    In case admin does not receive this instruction before maintenance is to
-    start it will do the pre-configured default action like migration to
-    projects without a specific role to say project need to agree the action.
-    VMs or Containers can be left on host if type of maintenance is just `reboot`.
-6.  Admin does possible actions to VMs and containers and receives an ACK.
-7.  In case everything went ok, Admin sends admin type of maintenance
-    notification with state `in maintenance`. This notification can be consumed
-    by Inspector and other cloud services to know there is ongoing maintenance
-    which means things like automatic fault management actions for the hardware
-    resources should be disabled.
-8.  If maintenance type is `reboot` and project is still having containers or
-    VMs running on affected hardware resource, Admin sends project specific
-    maintenance notification with state updated to `in maintenance`. If project
-    do not have anything left running on affected hardware resource, state will
-    be `maintenance over` instead. If maintenance can not be performed for some
-    reason state should be `maintenance cancelled`. In this case last operation
-    remaining for admin is to re-enable nova-compute service, ensure
-    everything is running and not to proceed any further steps.
-9.  Application manager of the project receives AODH alarm about the same.
-10. Admin will do the maintenance. This is out of Doctor scope.
-11. Admin enables nova-compute service when maintenance is over and host can be
-    put back to production. An ACK is received from API call.
-12. In case project had left containers or VMs on hardware resource over
-    maintenance, Admin sends project specific maintenance notification with
-    state updated to `maintenance over`.
-13. Admin sends admin type of maintenance notification with state updated to
-    `maintenance over`. Inspector and other
-    cloud services can consume this to know hardware resource is back in use.
+There should be at least one empty host compatible with the host under
+maintenance in order to do a smooth `rolling maintenance`. To make this
+possible, down scaling the application instances should also be supported.
+
+The infrastructure admin should have a tool that is responsible for hosting a
+maintenance work flow session with the APIs needed by admin and applications.
+The group of hosts in a single maintenance session should always have the same
+physical capabilities, so that rolling maintenance can be guaranteed.
+
+The flow diagram is meant to be as high level as possible. It does not try to
+be complete, but shows the most important interfaces needed between the VNFM
+and the infrastructure admin. For example, error handling is missing and can
+be defined later on.
+
+Flow diagram:
+
+.. figure:: images/maintenance-workflow.png
+   :alt: Work flow in OpenStack
+
+Flow diagram step by step:
+
+- The infrastructure admin creates a maintenance session to maintain and upgrade
+  a certain group of hardware. At least the compute hardware in a single session
+  should have the same capabilities, like the number of VCPUs, to ensure
+  the maintenance can be done node by node in a rolling fashion. The maintenance
+  session needs a `session_id`, a unique ID that is carried
+  throughout all events and can be used in the APIs needed when interacting with
+  the session. The maintenance session needs to know when
+  maintenance will start and what capabilities the possible upgrade to the
+  infrastructure will bring to the application payload on top of it. It is a
+  matter of the implementation to define in more detail whether more data is
+  needed when creating a session or whether it is defined in the admin tool
+  configuration.
+
+  There can be several parallel maintenance sessions, and a single session can
+  include the payload of multiple projects. Typically a maintenance session
+  should include a similar type of compute hardware, so that instances can be
+  guaranteed to move between the compute hosts.
+
+- State `MAINTENANCE` `project event` and reply `ACK_MAINTENANCE`. Immediately
+  after a maintenance session is created, the infrastructure admin tool sends
+  a project specific notification, which the application manager can consume by
+  subscribing to an AODH alarm for this event. As explained earlier, a
+  `project event` is only sent if the project subscribes to the alarm.
+  Otherwise there is no interaction with the application and operations could
+  be forced.
+
+  The state `MAINTENANCE` event should at least include:
+
+    - `session_id` to reference the correct maintenance session.
+    - `state` as `MAINTENANCE` to identify the event action needed.
+    - `instance_ids` to tell the project which of its instances will be affected
+      by the maintenance. This might be a link to the admin tool project specific
+      API, as AODH variables are limited to strings of 255 characters.
+    - `reply_url` for the application to call the admin tool project specific
+      API to answer `ACK_MAINTENANCE` including the `session_id`.
+    - `project_id` to identify the project.
+    - `actions_at` time stamp to indicate when the maintenance work flow will
+      start. The `ACK_MAINTENANCE` reply is needed before that time.
+    - `metadata` to include key-value pairs of capabilities coming with the
+      maintenance operation, like 'openstack_version': 'Queens'.
+
+- Optional state `DOWN_SCALE` `project event` and reply `ACK_DOWN_SCALE`. When
+  the time reaches the `actions_at` defined in the previous `state event`, the
+  maintenance work flow starts and the admin tool needs to check whether there is
+  already an empty compute host, as needed by the `rolling maintenance`. If there
+  is no empty host, the admin tool can ask the application to down scale by
+  sending a project specific `DOWN_SCALE` `state event`.
+
+  The state `DOWN_SCALE` event should at least include:
+
+    - `session_id` to reference the correct maintenance session.
+    - `state` as `DOWN_SCALE` to identify the event action needed.
+    - `reply_url` for the application to call the admin tool project specific
+      API to answer `ACK_DOWN_SCALE` including the `session_id`.
+    - `project_id` to identify the project.
+    - `actions_at` time stamp to indicate the last moment to send
+      `ACK_DOWN_SCALE`. This gives the application time to finish
+      ongoing transactions before down scaling its instances. This guarantees
+      zero downtime for its service.
+
+- Optional state `PREPARE_MAINTENANCE` `project event` and reply
+  `ACK_PREPARE_MAINTENANCE`. If there is still no empty compute host after down
+  scaling the applications, the admin tool needs to analyze the
+  situation on the compute hosts under maintenance. It needs to choose a compute
+  node that is now almost empty or otherwise has the least critical instances
+  running, for example by checking for floating IPs. When a compute host is
+  chosen, a `PREPARE_MAINTENANCE` `state event` can be sent to projects having
+  instances running on this host to migrate them to other compute hosts. It might
+  also be possible to have another round of `DOWN_SCALE` `state event` if
+  necessary, but this is not proposed here.
+
+  The state `PREPARE_MAINTENANCE` event should at least include:
+
+    - `session_id` to reference the correct maintenance session.
+    - `state` as `PREPARE_MAINTENANCE` to identify the event action needed.
+    - `instance_ids` to tell the project which of its instances will be affected
+      by the `state event`. This might be a link to the admin tool project
+      specific API, as AODH variables are limited to strings of 255 characters.
+    - `reply_url` for the application to call the admin tool project specific
+      API to answer `ACK_PREPARE_MAINTENANCE` including the `session_id` and
+      `instance_ids` as a list of key-value pairs with `instance_id` as key and
+      the action chosen from `allowed_actions` as value.
+    - `project_id` to identify the project.
+    - `actions_at` time stamp to indicate the last moment to send
+      `ACK_PREPARE_MAINTENANCE`. This gives the application time to finish
+      ongoing transactions within its instances and to make a possible
+      switch over. This guarantees zero downtime for its service.
+    - `allowed_actions` to tell which actions the admin tool supports to move
+      instances to another compute host. Typically a list like: `['MIGRATE', 'LIVE_MIGRATE']`
+
+- Optional state `INSTANCE_ACTION_DONE` `project event`. If the admin tool needed
+  to take an action to move an instance, like migrating it to another compute
+  host, this `state event` will be sent to tell the operation is complete.
+
+  The state `INSTANCE_ACTION_DONE` event should at least include:
+
+    - `session_id` to reference the correct maintenance session.
+    - `instance_ids` to tell the project which of its instances had the admin
+      action done.
+    - `project_id` to identify the project.
+
+- At this stage it is guaranteed that there is an empty compute host. It would
+  be maintained first through the `IN_MAINTENANCE` and `MAINTENANCE_COMPLETE`
+  steps, but following the flow chart, `PLANNED_MAINTENANCE` is explained next.
+
+- Optional state `PLANNED_MAINTENANCE` `project event` and reply
+  `ACK_PLANNED_MAINTENANCE`. If the compute host to be maintained has
+  instances, the projects owning them should get this `state event`. When a
+  project receives this `state event`, it knows that instances moved to another
+  compute host as a result will go to a host that is already maintained. That
+  host might have new capabilities that the project can take into use. This
+  gives the project the possibility to also upgrade its instances to support the
+  new capabilities as part of the action chosen to move the instances.
+
+  The state `PLANNED_MAINTENANCE` event should at least include:
+
+    - `session_id` to reference the correct maintenance session.
+    - `state` as `PLANNED_MAINTENANCE` to identify the event action needed.
+    - `instance_ids` to tell the project which of its instances will be affected
+      by the event. This might be a link to the admin tool project specific API,
+      as AODH variables are limited to strings of 255 characters.
+    - `reply_url` for the application to call the admin tool project specific
+      API to answer `ACK_PLANNED_MAINTENANCE` including the `session_id` and
+      `instance_ids` as a list of key-value pairs with `instance_id` as key and
+      the action chosen from `allowed_actions` as value.
+    - `project_id` to identify the project.
+    - `actions_at` time stamp to indicate the last moment to send
+      `ACK_PLANNED_MAINTENANCE`. This gives the application time to finish
+      ongoing transactions within its instances and to make a possible switch
+      over. This guarantees zero downtime for its service.
+    - `allowed_actions` to tell which actions the admin tool supports to move
+      instances to another compute host. Typically a list like: `['MIGRATE', 'LIVE_MIGRATE', 'OWN_ACTION']`.
+      `OWN_ACTION` means that the application may want to re-instantiate its
+      instance, perhaps to take into use the new capability coming with the
+      infrastructure maintenance. The re-instantiated instance will go to an
+      already maintained host having the new capability.
+    - `metadata` to include key-value pairs of capabilities coming with the
+      maintenance operation, like 'openstack_version': 'Queens'.
+
+- State `IN_MAINTENANCE` and `MAINTENANCE_COMPLETE` `admin event`. Just before
+  a host goes to maintenance, the `IN_MAINTENANCE` `state event` will be sent to
+  indicate the host is entering maintenance. The host is then taken out of
+  production and can be powered off, replaced, or rebooted during the operation.
+  During the maintenance and upgrade the host might be moved to the admin's own
+  host aggregate, so it can be tested before being put back to production.
+  After maintenance is complete, the `MAINTENANCE_COMPLETE` `state event` will be
+  sent to indicate the host is back in use. Adding or removing a host is not yet
+  included in this concept, but can be addressed later.
+
+  The state `IN_MAINTENANCE` and `MAINTENANCE_COMPLETE` events should at least
+  include:
+
+    - `session_id` to reference the correct maintenance session.
+    - `state` as `IN_MAINTENANCE` or `MAINTENANCE_COMPLETE` to indicate the host
+      state.
+    - `project_id` to identify the admin project needed by the AODH alarm.
+    - `host` to indicate the host name.
+
+- State `MAINTENANCE_COMPLETE` `project event` and reply
+  `ACK_MAINTENANCE_COMPLETE`. After all compute nodes in the maintenance session
+  have gone through the maintenance operation, this `state event` can be sent to
+  all projects that had instances running on any of those nodes. If a down scale
+  was done, the application can now scale back up to full operation.
+
+    - `session_id` to reference the correct maintenance session.
+    - `state` as `MAINTENANCE_COMPLETE` to identify the event action needed.
+    - `instance_ids` to tell the project which of its instances are currently
+      running on hosts maintained in this maintenance session. This might be a
+      link to the admin tool project specific API, as AODH variables are limited
+      to strings of 255 characters.
+    - `reply_url` for the application to call the admin tool project specific
+      API to answer `ACK_MAINTENANCE_COMPLETE` including the `session_id`.
+    - `project_id` to identify the project.
+    - `actions_at` time stamp to indicate when maintenance work flow will start.
+    - `metadata` to include key-value pairs of capabilities coming with the
+      maintenance operation, like 'openstack_version': 'Queens'.
+
+- At the end, the admin tool maintenance session can enter the
+  `MAINTENANCE_COMPLETE` state and the session can be removed.
+
+Benefits
+========
+
+- The application is guaranteed zero downtime as it is aware of the maintenance
+  action affecting its payload. The application is made aware of the maintenance
+  time window to make sure it can prepare for it.
+- The application gets to know new capabilities coming with infrastructure
+  maintenance and upgrade and can utilize those (like do its own upgrade).
+- Any application supporting the interaction being defined could be running on
+  top of the same infrastructure provider. No vendor lock-in for application.
+- Any infrastructure component can be aware of host(s) under maintenance via
+  `admin event` notifications about host state. No vendor lock-in for
+  infrastructure components.
+- Generic messaging makes it possible to use the same concept in different types
+  of clouds and application payloads. `instance_ids` will uniquely identify any
+  type of instance and a similar notification payload can be used regardless of
+  whether we are in OpenStack. The work flow just needs to support the
+  infrastructure management of the cloud in question.
+- No additional hardware is needed during maintenance operations as down- and
+  up-scaling can be supported for the applications. Scaling is optional and is
+  used when no extensive spare capacity is available for the maintenance, as is
+  typically the case in Telco environments.
+- Parallel maintenance sessions for different groups of hardware. The same
+  session should include hardware with the same capabilities to guarantee
+  `rolling maintenance` actions.
+- Multi-tenancy support. Project specific messaging about maintenance.
+
+Future considerations
+=====================
+
+- Pluggable architecture for infrastructure admin tool to handle different
+  clouds and payloads.
+- Pluggable architecture to handle specific maintenance/upgrade cases like
+  OpenStack upgrade between specific versions or admin testing before giving
+  host back to production.
+- Support for user specific details that need to be taken into account in admin
+  side actions (e.g. run a script, ...).
+- (Re-)Use existing implementations like Mistral for work flows.
+- Scaling hardware resources. Allow a critical application to be scaled at the
+  same time in a controlled fashion, or retire the application.
 
 POC
 ---
 
-There was a `Maintenance POC`_ for planned maintenance in the OPNFV Beijing
-summit to show the basic concept of using framework defined by the project.
+There was a `Maintenance POC`_ demo 'How to gain VNF zero down-time during
+Infrastructure Maintenance and Upgrade' at the OCP and ONS summits in March 2018.
+A similar concept is also being implemented as a new `OPNFV Doctor project`_
+test case scenario.
 
-.. _DOCTOR-52: https://jira.opnfv.org/browse/DOCTOR-52
 .. _OPNFV Doctor project: https://wiki.opnfv.org/doctor
 .. _use cases: http://artifacts.opnfv.org/doctor/docs/requirements/02-use_cases.html#nvfi-maintenance
 .. _architecture: http://artifacts.opnfv.org/doctor/docs/requirements/03-architecture.html#nfvi-maintenance
 .. _implementation:  http://artifacts.opnfv.org/doctor/docs/requirements/05-implementation.html#nfvi-maintenance
-.. _planned maintenance session: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2017-June/016677.html
-.. _Maintenance POC: https://wiki.opnfv.org/download/attachments/5046291/Doctor%20Maintenance%20PoC%202017.pptx?version=1&modificationDate=1498182869000&api=v2
+.. _Maintenance POC: https://youtu.be/7q496Tutzlo
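
Note on the patch: the project-side interaction that the guideline describes for the `MAINTENANCE` event and its `ACK_MAINTENANCE` reply can be sketched roughly as below. The field names are taken from the guideline text. The function names, the example values, the URLs, and the layout of the reply body are assumptions made only for illustration; they are not the admin tool's actual API, which the guideline leaves to the upstream implementation.

```python
"""Sketch of the MAINTENANCE project event and its ACK reply.

Field names follow the guideline; function names, example values and the
reply body layout are hypothetical.
"""

# Fields the guideline says a state MAINTENANCE project event should
# at least include.
REQUIRED_MAINTENANCE_FIELDS = (
    "session_id", "state", "instance_ids", "reply_url",
    "project_id", "actions_at", "metadata",
)

# The guideline notes AODH variables are limited to strings of 255 characters.
AODH_MAX_STRING = 255


def missing_fields(payload, required=REQUIRED_MAINTENANCE_FIELDS):
    """Return the required field names that are absent from the payload."""
    return [name for name in required if name not in payload]


def build_ack(payload):
    """Build a hypothetical reply for a project event.

    The reply state is 'ACK_' + the event state and the session_id is
    echoed back, as the guideline asks for ACK_MAINTENANCE.
    Returns the URL to call and the (assumed) reply body.
    """
    absent = missing_fields(payload, ("session_id", "state", "reply_url"))
    if absent:
        raise ValueError("event is missing fields: %s" % ", ".join(absent))
    body = {
        "session_id": payload["session_id"],
        "state": "ACK_" + payload["state"],
    }
    return payload["reply_url"], body


# Example event; every value here is made up.
example_event = {
    "session_id": "5a1c3f2e",
    "state": "MAINTENANCE",
    # May be a link to the admin tool API because of the AODH string limit.
    "instance_ids": "http://admin-tool.example/v1/maintenance/5a1c3f2e/project-1",
    "reply_url": "http://admin-tool.example/v1/maintenance/5a1c3f2e/project-1",
    "project_id": "project-1",
    "actions_at": "2018-06-20T07:17:05",
    "metadata": {"openstack_version": "Queens"},
}

if __name__ == "__main__":
    assert missing_fields(example_event) == []
    assert len(example_event["instance_ids"]) <= AODH_MAX_STRING
    url, body = build_ack(example_event)
    print(url, body)
```

An application manager would typically run a check like `missing_fields` on the alarm payload it receives and then send the body from `build_ack` to the returned URL before the `actions_at` time. The same `ACK_` + state pattern would apply to the other project events, with `instance_ids` and the chosen actions added to the reply where the guideline requires them.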