From 89a35669df9c1d07cb508be53856deae41bc24d6 Mon Sep 17 00:00:00 2001 From: Tomi Juvonen Date: Thu, 19 Apr 2018 11:47:36 +0300 Subject: Update the maintenance design document JIRA: DOCTOR-125 Change-Id: Ideb1482fa026213bfe5ebc7a1da89cfed634950f Signed-off-by: Tomi Juvonen --- .../design/images/maintenance-workflow.png | Bin 0 -> 81286 bytes .../design/maintenance-design-guideline.rst | 426 +++++++++++++++------ 2 files changed, 302 insertions(+), 124 deletions(-) create mode 100644 docs/development/design/images/maintenance-workflow.png diff --git a/docs/development/design/images/maintenance-workflow.png b/docs/development/design/images/maintenance-workflow.png new file mode 100644 index 00000000..9b65fd59 Binary files /dev/null and b/docs/development/design/images/maintenance-workflow.png differ diff --git a/docs/development/design/maintenance-design-guideline.rst b/docs/development/design/maintenance-design-guideline.rst index 93c3cf4e..47002b96 100644 --- a/docs/development/design/maintenance-design-guideline.rst +++ b/docs/development/design/maintenance-design-guideline.rst @@ -5,151 +5,329 @@ Planned Maintenance Design Guideline ==================================== -.. NOTE:: - This is spec draft of design guideline for planned maintenance. - JIRA ticket to track the update and collect comments: `DOCTOR-52`_. - -This document describes how one can implement planned maintenance by utilizing -the `OPNFV Doctor project`_. framework and to meet the set requirements. +This document describes how one can implement infrastructure maintenance in +interaction with VNFM by utilizing the `OPNFV Doctor project`_ framework and to +meet the set requirements. Document concentrates to OpenStack and VMs while +the concept designed is generic for any payload or even different VIM. Admin +tool should be also for controller and other cloud hardware, but that is not the +main focus in OPNFV Doctor and should be defined better in the upstream +implementation. Same goes for any more detailed work to be done. Problem Description =================== -Telco application need to know when planned maintenance is going to happen in -order to guarantee zero down time in its operation. It needs to be possible to -make own actions to have application running on not affected resource or give +Telco application need to know when infrastructure maintenance is going to happen +in order to guarantee zero down time in its operation. It needs to be possible +to make own actions to have application running on not affected resource or give guidance to admin actions like migration. More details are defined in requirement documentation: `use cases`_, `architecture`_ and `implementation`_. -Also discussion in the OPNFV summit about `planned maintenance session`_. Guidelines ========== -Cloud admin needs to make a notification about planned maintenance including -all details that application needs in order to make decisions upon his affected -service. This notification payload can be consumed by application by subscribing -to corresponding event alarm trough alarming service like OpenStack AODH. +Concepts used: + +- `event`: Notification to rabbitmq with particular event type. + +- `state event`: Notification to rabbitmq with particular event type including + payload with variable defined for state. + +- `project event`: Notification to rabbitmq that is meant for project. Single + event type is used with different payload and state information. + +- `admin event`: Notification to rabbitmq that is meant for admin or as for any + infrastructure service. Single event type is used with different state + information. + +- `rolling maintenance`: Node by Node rolling maintenance and upgrade where + a single node at a time will be maintained after a possible application + payload is moved away from the node. + +- `project` stands for `application` in OpenStack contents and both are used in + this document. `tenant` is many times used for the same. + +Infrastructure admin needs to make notification with two different event types. +One is meant for admin and one for project. Notification payload can be consumed +by application and admin by subscribing to corresponding event alarm trough +alarming service like OpenStack AODH. + +- Infrastructure admin needs to make a notification about infrastructure + maintenance including all details that application needs in order to make + a decisions upon his affected service. Alarm Payload can hold a link to + infrastructure admin tool API for reply and for other possible information. + There is many steps of communication between admin tool and application, thus + the payload needed for the information passed is very similar. Because of + this, the same event type can be used, but there can be a variable like + `state` to tell application what is needed as action for each event. + If a project have not subscribed to alarm, admin tool responsible for the + maintenance will assume it can do maintenance operations without interaction + with application on top of it. + +- Infrastructure admin needs to make an event about infrastructure maintenance + telling when the maintenance starts and another when it ends. This admin level + event should include the host name. This could be consumed by any admin level + infrastructure entity. In this document we consume this in `Inspector` that + is in `OPNFV Doctor project`_ terms infrastructure entity responsible for + automatic host fault management. Automated actions surely needs to be disabled + during planned maintenance. Before maintenance starts application needs to be able to make switch over for his ACT-STBY service affected, do operation to move service to not effected part -of infra or give a hint for admin operation like migration that can be +of infrastructure or give a hint for admin operation like migration that can be automatically issued by admin tool according to agreed policy. -Flow diagram:: - - admin alarming project controller inspector - | service app manager | | - | 1. | | | | - +------------------------->+ | - +<-------------------------+ | - | 2. | | | | - +------>+ 3. | | | - | +-------->+ 4. | | - | | +------->+ | - | | 5. +<-------+ | - +<----------------+ | | - | | 6. | | - +------------------------->+ | - +<-------------------------+ 7. | - +------------------------------------->+ - | 8. | | | | - +------>+ 9. | | | - | +-------->+ | | - +--------------------------------------+ - | 10. | - +--------------------------------------+ - | 11. | | | | - +------------------------->+ | - +<-------------------------+ | - | 12. | | | | - +------>+-------->+ | 13. | - +------------------------------------->+ - +-------+---------+--------+-----------+ - -Concepts used below: - -- `full maintenance`: This means maintenance will take a longer time and - resource should be emptied, meaning container or VM need to be moved or - deleted. Admin might need to test resource to work after maintenance. - -- `reboot`: Only a reboot is needed and admin does not need separate testing - after that. Container or VM can be left in place if so wanted. - -- `notification`: Notification to rabbitmq. - -Admin makes a planned maintenance session where he sets -a `maintenance_session_id` that is a unique ID for all the hardware resources he -is going to have the maintenance at the same time. Mostly maintenance should be -done node by node, meaning a single compute node at a time would be in single -planned maintenance session having unique `maintenance_session_id`. This ID will -be carried trough the whole session in all places and can be used to query -maintenance in admin tool API. Project running a Telco application should set -a specific role for admin tool to know it cannot do planned maintenance unless -project has agreed actions to be done for its VMs or containers. This means the -project has configured itself to get alarms upon planned maintenance and it is -capable of agreeing needed actions. Admin is supposed to use an admin tool to -automate maintenance process partially or entirely. - -The flow of a successful planned maintenance session as in OpenStack example -case: - -1. Admin disables nova-compute in order to do planned maintenance on a compute - host and gets ACK from the API call. This action needs to be done to ensure - no thing will be placed in this compute host by any user. Action is always - done regardless the whole compute will be affected or not. -2. Admin sends a project specific maintenance notification with state - `planned maintenance`. This includes detailed information about maintenance, - like when it is going to start, is it `reboot` or `full maintenance` - including the information about project containers or VMs running on host or - the part of it that will need maintenance. Also default action like - migration will be mentioned that will be issued by admin before maintenance - starts if no other action is set by project. In case project has a specific - role set, planned maintenance cannot start unless project has agreed the - admin action. Available admin actions are also listed in notification. -3. Application manager of the project receives AODH alarm about the same. -4. Application manager can do switch over to his ACT-STBY service, delete and - re-instantiate his service on not affected resource if so wanted. -5. Application manager may call admin tool API to give preferred instructions - for leaving VMs and containers in place or do admin action to migrate them. - In case admin does not receive this instruction before maintenance is to - start it will do the pre-configured default action like migration to - projects without a specific role to say project need to agree the action. - VMs or Containers can be left on host if type of maintenance is just `reboot`. -6. Admin does possible actions to VMs and containers and receives an ACK. -7. In case everything went ok, Admin sends admin type of maintenance - notification with state `in maintenance`. This notification can be consumed - by Inspector and other cloud services to know there is ongoing maintenance - which means things like automatic fault management actions for the hardware - resources should be disabled. -8. If maintenance type is `reboot` and project is still having containers or - VMs running on affected hardware resource, Admin sends project specific - maintenance notification with state updated to `in maintenance`. If project - do not have anything left running on affected hardware resource, state will - be `maintenance over` instead. If maintenance can not be performed for some - reason state should be `maintenance cancelled`. In this case last operation - remaining for admin is to re-enable nova-compute service, ensure - everything is running and not to proceed any further steps. -9. Application manager of the project receives AODH alarm about the same. -10. Admin will do the maintenance. This is out of Doctor scope. -11. Admin enables nova-compute service when maintenance is over and host can be - put back to production. An ACK is received from API call. -12. In case project had left containers or VMs on hardware resource over - maintenance, Admin sends project specific maintenance notification with - state updated to `maintenance over`. -13. Admin sends admin type of maintenance notification with state updated to - `maintenance over`. Inspector and other - cloud services can consume this to know hardware resource is back in use. +There should be at least one empty host compatible to host under maintenance in +order to have a smooth `rolling maintenance` done. For this to be possible also +down scaling the application instances should be possible. + +Infrastructure admin should have a tool that is responsible for hosting a +maintenance work flow session with needed APIs for admin and for applications. +The Group of hosts in single maintenance session should always have the same +physical capabilities, so the rolling maintenance can be guaranteed. + +Flow diagram is meant to be as high level as possible. It currently does not try +to be perfect, but to show the most important interfaces needed between VNFM and +infrastructure admin. This can be seen e.g. as missing error handling that can +be defined later on. + +Flow diagram: + +.. figure:: images/maintenance-workflow.png + :alt: Work flow in OpenStack + +Flow diagram step by step: + +- Infrastructure admin makes a maintenance session to maintain and upgrade + certain group of hardware. At least compute hardware in single session should + be having same capabilities like the amount number of VCPUs to ensure + the maintenance can be done node by node in rolling fashion. Maintenance + session need to have a `session_id` that is a unique ID to be carried + throughout all events and can be used in APIs needed when interacting with + the session. Maintenance session needs to have knowledge about when + maintenance will start and what capabilities the possible upgrade to + infrastructure will bring to application payload on top of it. It will be + matter of the implementation to define in more detail whether some more data is + needed when creating a session or if it is defined in the admin tool + configuration. + + There can be several parallel maintenance sessions and a single session can + include multiple projects payload. Typically maintenance session should include + similar type of compute hardware, so you can guarantee moving of instances on + top of them can work between the compute hosts. + +- State `MAINTENANCE` `project event` and reply `ACK_MAINTENANCE`. Immediately + after a maintenance session is created, infrastructure admin tool will send + a project specific 'notification' which application manager can consume by + subscribing to AODH alarm for this event. As explained already earlier all + `project event`s will only be sent in case the project subscribes to alarm and + otherwise the interaction with application will simply not be done and + operations could be forced. + + The state `MAINTENANCE` event should at least include: + + - `session_id` to reference correct maintenance session. + - `state` as `MAINTENANCE` to identify event action needed. + - `instance_ids` to tell project which of his instances will be affected by + the maintenance. This might be a link to admin tool project specific API + as AODH variables are limited to string of 255 character. + - `reply_url` for application to call admin tool project specific API to + answer `ACK_MAINTENANCE` including the `session_id`. + - `project_id` to identify project. + - `actions_at` time stamp to indicate when maintenance work flow will start. + `ACK_MAINTENANCE` reply is needed before that time. + - `metadata` to include key values pairs of a capabilities coming over the + maintenance operation like 'openstack_version': 'Queens' + +- Optional state `DOWN_SCALE` `project event` and reply `ACK_DOWN_SCALE`. When it + is time to start the maintenance work flow as the time reaches the `actions_at` + defined in previous `state event`, admin tool needs to check if there is already + an empty compute host needed by the `rolling maintenance`. In case there is no + empty host, admin tool can ask application to down scale by sending project + specific `DOWN_SCALE` `state event`. + + The state `DOWN_SCALE` event should at least include: + + - `session_id` to reference correct maintenance session. + - `state` as `DOWN_SCALE` to identify event action needed. + - `reply_url` for application to call admin tool project specific API to + answer `ACK_DOWN_SCALE` including the `session_id`. + - `project_id` to identify project. + - `actions_at` time stamp to indicate when is the last moment to send + `ACK_DOWN_SCALE`. This means application can have time to finish some + ongoing transactions before down scaling his instances. This guarantees + a zero downtime for his service. + +- Optional state `PREPARE_MAINTENANCE` `project event` and reply + `ACK_PREPARE_MAINTENANCE`. In case still after down scaling the applications + there is still no empty compute host, admin tools needs to analyze the + situation on compute host under maintenance. It needs to choose compute node + that is now almost empty or has otherwise least critical instances running if + possible, like looking if there is floating IPs. When compute host is chosen, + a `PREPARE_MAINTENANCE` `state event` can be sent to projects having instances + running on this host to migrate them to other compute hosts. It might also be + possible to have another round of `DOWN_SCALE` `state event` if necessary, but + this is not proposed here. + + The state `PREPARE_MAINTENANCE` event should at least include: + + - `session_id` to reference correct maintenance session. + - `state` as `PREPARE_MAINTENANCE` to identify event action needed. + - `instance_ids` to tell project which of his instances will be affected by + the `state event`. This might be a link to admin tool project specific API + as AODH variables are limited to string of 255 character. + - `reply_url` for application to call admin tool project specific API to + answer `ACK_PREPARE_MAINTENANCE` including the `session_id` and + `instance_ids` with list of key value pairs with key as `instance_id` and + chosen action from allowed actions given via `allowed_actions` as value. + - `project_id` to identify project. + - `actions_at` time stamp to indicate when is the last moment to send + `ACK_PREPARE_MAINTENANCE`. This means application can have time to finish + some ongoing transactions within his instances and make possible + switch over. This guarantees a zero downtime for his service. + - `allowed_actions` to tell what admin tool supports as action to move + instances to another compute host. Typically a list like: `['MIGRATE', 'LIVE_MIGRATE']` + +- Optional state `INSTANCE_ACTION_DONE` `project event`. In case admin tool needed + to make action to move instance like migrating it to another compute host, this + `state event` will be sent to tell the operation is complete. + + The state `INSTANCE_ACTION_DONE` event should at least include: + + - `session_id` to reference correct maintenance session. + - `instance_ids` to tell project which of his instance had the admin action + done. + - `project_id` to identify project. + +- At this state it is guaranteed there is an empty compute host. It would be + maintained first trough `IN_MAINTENANCE` and `MAINTENANCE_COMPLETE` steps, but + following the flow chart `PLANNED_MAINTENANCE` will be explained next. + +- Optional state `PLANNED_MAINTENANCE` `project event` and reply + `ACK_PLANNED_MAINTENANCE`. In case compute host to be maintained has + instances, projects owning those should have this `state event`. When project + receives this `state event` it knows instances moved to other compute host as + resulting actions will now go to host that is already maintained. This means + it might have new capabilities that project can take into use. This gives the + project the possibility to upgrade his instances also to support new + capabilities over the action chosen to move instances. + + The state `PLANNED_MAINTENANCE` event should at least include: + + - `session_id` to reference correct maintenance session. + - `state` as `PLANNED_MAINTENANCE` to identify event action needed. + - `instance_ids` to tell project which of his instances will be affected by + the event. This might be a link to admin tool project specific API as AODH + variables are limited to string of 255 character. + - `reply_url` for application to call admin tool project specific API to + answer `ACK_PLANNED_MAINTENANCE` including the `session_id` and + `instance_ids` with list of key value pairs with key as `instance_id` and + chosen action from allowed actions given via `allowed_actions` as value. + - `project_id` to identify project. + - `actions_at` time stamp to indicate when is the last moment to send + `ACK_PLANNED_MAINTENANCE`. This means application can have time to finish + some ongoing transactions within his instances and make possible switch + over. This guarantees a zero downtime for his service. + - `allowed_actions` to tell what admin tool supports as action to move + instances to another compute host. Typically a list like: `['MIGRATE', 'LIVE_MIGRATE', 'OWN_ACTION']` + `OWN_ACTION` means that application may want to re-instantiate his + instance perhaps to take into use the new capability coming over the + infrastructure maintenance. Re-instantiated instance will go to already + maintained host having the new capability. + - `metadata` to include key values pairs of a capabilities coming over the + maintenance operation like 'openstack_version': 'Queens' + +- `State IN_MAINTENANCE` and `MAINTENANCE_COMPLETE` `admin event`s. Just before + host goes to maintenance the IN_MAINTENANCE` `state event` will be send to + indicate host is entering to maintenance. Host is then taken out of production + and can be powered off, replaced, or rebooted during the operation. + During the maintenance and upgrade host might be moved to admin's own host + aggregate, so it can be tested to work before putting back to production. + After maintenance is complete `MAINTENANCE_COMPLETE` `state event` will be sent + to know host is back in use. Adding or removing of a host is yet not + included in this concept, but can be addressed later. + + The state `IN_MAINTENANCE` and `MAINTENANCE_COMPLETE` event should at least + include: + + - `session_id` to reference correct maintenance session. + - `state` as `IN_MAINTENANCE` or `MAINTENANCE_COMPLETE` to indicate host + state. + - `project_id` to identify admin project needed by AODH alarm. + - `host` to indicate the host name. + +- State `MAINTENANCE_COMPLETE` `project event` and reply + `MAINTENANCE_COMPLETE_ACK`. After all compute nodes in the maintenance session + have gone trough maintenance operation this `state event` can be send to all + projects that had instances running on any of those nodes. If there was a down + scale done, now the application could up scale back to full operation. + + - `session_id` to reference correct maintenance session. + - `state` as `MAINTENANCE_COMPLETE` to identify event action needed. + - `instance_ids` to tell project which of his instances are currently + running on hosts maintained in this maintenance session. This might be a + link to admin tool project specific API as AODH variables are limited to + string of 255 character. + - `reply_url` for application to call admin tool project specific API to + answer `ACK_MAINTENANCE` including the `session_id`. + - `project_id` to identify project. + - `actions_at` time stamp to indicate when maintenance work flow will start. + - `metadata` to include key values pairs of a capabilities coming over the + maintenance operation like 'openstack_version': 'Queens' + +- At the end admin tool maintenance session can enter to `MAINTENANCE_COMPLETE` + state and session can be removed. + +Benefits +======== + +- Application is guaranteed zero downtime as it is aware of the maintenance + action affecting its payload. The application is made aware of the maintenance + time window to make sure it can prepare for it. +- Application gets to know new capabilities over infrastructure maintenance and + upgrade and can utilize those (like do its own upgrade) +- Any application supporting the interaction being defined could be running on + top of the same infrastructure provider. No vendor lock-in for application. +- Any infrastructure component can be aware of host(s) under maintenance via + `admin event`s about host state. No vendor lock-in for infrastructure + components. +- Generic messaging making it possible to use same concept in different type of + clouds and application payloads. `instance_ids` will uniquely identify any + type of instance and similar notification payload can be used regardless we + are in OpenStack. Work flow just need to support different cloud + infrastructure management to support different cloud. +- No additional hardware is needed during maintenance operations as down- and + up-scaling can be supported for the applications. Optional, if no extensive + spare capacity is available for the maintenance - as typically the case in + Telco environments. +- Parallel maintenance sessions for different group of hardware. Same session + should include hardware with same capabilities to guarantee `rolling + maintenance` actions. +- Multi-tenancy support. Project specific messaging about maintenance. + +Future considerations +===================== + +- Pluggable architecture for infrastructure admin tool to handle different + clouds and payloads. +- Pluggable architecture to handle specific maintenance/upgrade cases like + OpenStack upgrade between specific versions or admin testing before giving + host back to production. +- Support for user specific details need to be taken into account in admin side + actions (e.g. run a script, ...). +- (Re-)Use existing implementations like Mistral for work flows. +- Scaling hardware resources. Allow critical application to be scaled at the + same time in controlled fashion or retire application. POC --- -There was a `Maintenance POC`_ for planned maintenance in the OPNFV Beijing -summit to show the basic concept of using framework defined by the project. +There was a `Maintenance POC`_ demo 'How to gain VNF zero down-time during +Infrastructure Maintenance and Upgrade' in the OCP and ONS summit March 2018. +Similar concept is also being made as `OPNFV Doctor project`_ new test case +scenario. -.. _DOCTOR-52: https://jira.opnfv.org/browse/DOCTOR-52 .. _OPNFV Doctor project: https://wiki.opnfv.org/doctor .. _use cases: http://artifacts.opnfv.org/doctor/docs/requirements/02-use_cases.html#nvfi-maintenance .. _architecture: http://artifacts.opnfv.org/doctor/docs/requirements/03-architecture.html#nfvi-maintenance .. _implementation: http://artifacts.opnfv.org/doctor/docs/requirements/05-implementation.html#nfvi-maintenance -.. _planned maintenance session: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2017-June/016677.html -.. _Maintenance POC: https://wiki.opnfv.org/download/attachments/5046291/Doctor%20Maintenance%20PoC%202017.pptx?version=1&modificationDate=1498182869000&api=v2 +.. _Maintenance POC: https://youtu.be/7q496Tutzlo -- cgit 1.2.3-korg