.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

====================================
Planned Maintenance Design Guideline
====================================

.. NOTE::
   This is a spec draft of the design guideline for planned maintenance.
   JIRA ticket to track the update and collect comments: `DOCTOR-52`_.

This document describes how one can implement planned maintenance by utilizing
the `OPNFV Doctor project`_ framework and meet the set requirements.

Problem Description
===================

A Telco application needs to know when planned maintenance is going to happen
in order to guarantee zero downtime in its operation. The application must be
able to take its own actions to keep running on unaffected resources, or to
give guidance for admin actions such as migration. More details are defined in
the requirement documentation: `use cases`_, `architecture`_ and
`implementation`_. See also the discussion in the OPNFV summit
`planned maintenance session`_.

Guidelines
==========

The cloud admin needs to make a notification about planned maintenance that
includes all the details the application needs in order to make decisions
about its affected service. The application can consume this notification
payload by subscribing to the corresponding event alarm through an alarming
service such as OpenStack AODH.

Before maintenance starts, the application needs to be able to switch over its
affected ACT-STBY service, move the service to an unaffected part of the
infrastructure, or give a hint for an admin operation such as migration that
the admin tool can issue automatically according to an agreed policy.

Flow diagram::

    admin alarming project controller inspector
    |     service app manager  |           |
    |  1.   |         |        |           |
    +------------------------->+           |
    +<-------------------------+           |
    |  2.   |         |        |           |
    +------>+   3.    |        |           |
    |       +-------->+   4.   |           |
    |       |         +------->+           |
    |       |   5.    +<-------+           |
    +<----------------+        |           |
    |                 |   6.   |           |
    +------------------------->+           |
    +<-------------------------+    7.     |
    +------------------------------------->+
    |  8.   |         |        |           |
    +------>+   9.    |        |           |
    |       +-------->+        |           |
    +--------------------------------------+
    |                 10.                  |
    +--------------------------------------+
    |  11.  |         |        |           |
    +------------------------->+           |
    +<-------------------------+           |
    |  12.  |         |        |           |
    +------>+-------->+        |    13.    |
    +------------------------------------->+
    +-------+---------+--------+-----------+

Concepts used below:

- `full maintenance`: Maintenance will take a longer time and the resource
  should be emptied, meaning containers or VMs need to be moved or deleted.
  The admin might need to test that the resource works after maintenance.

- `reboot`: Only a reboot is needed and the admin does not need separate
  testing after it. Containers or VMs can be left in place if so wanted.

- `notification`: Notification to RabbitMQ.

The admin creates a planned maintenance session and sets a
`maintenance_session_id`, a unique ID covering all the hardware resources that
will be under maintenance at the same time. Maintenance should mostly be done
node by node, meaning a single compute node at a time is in a single planned
maintenance session with its own unique `maintenance_session_id`. This ID is
carried through the whole session in all places and can be used to query the
maintenance in the admin tool API. A project running a Telco application
should set a specific role so that the admin tool knows it cannot do planned
maintenance unless the project has agreed to the actions to be taken for its
VMs or containers. This means the project has configured itself to get alarms
upon planned maintenance and is capable of agreeing to the needed actions. The
admin is supposed to use an admin tool to automate the maintenance process
partially or entirely.

The flow of a successful planned maintenance session, using OpenStack as the
example case:

1. Admin disables nova-compute in order to do planned maintenance on a compute
   host and gets an ACK from the API call. This action is needed to ensure
   that no user places anything on this compute host. The action is always
   done, regardless of whether the whole compute host will be affected or not.
2. Admin sends a project specific maintenance notification with state
   `planned maintenance`. This includes detailed information about the
   maintenance, such as when it is going to start and whether it is `reboot`
   or `full maintenance`, including information about the project containers
   or VMs running on the host or on the part of it that will need maintenance.
   The notification also mentions the default action, such as migration, that
   the admin will issue before maintenance starts if the project sets no other
   action. If the project has the specific role set, planned maintenance
   cannot start unless the project has agreed to the admin action. Available
   admin actions are also listed in the notification.
3. Application manager of the project receives an AODH alarm about the same.
4. Application manager can switch over to its ACT-STBY service, or delete and
   re-instantiate its service on an unaffected resource if so wanted.
5. Application manager may call the admin tool API to give preferred
   instructions for leaving VMs and containers in place or for an admin action
   to migrate them. If the admin does not receive this instruction before
   maintenance is due to start, it performs the pre-configured default action,
   such as migration, for projects that do not have the specific role
   requiring their agreement to the action. VMs or containers can be left on
   the host if the type of maintenance is just `reboot`.
6. Admin performs the possible actions on VMs and containers and receives an
   ACK.
7. If everything went well, Admin sends an admin type of maintenance
   notification with state `in maintenance`. This notification can be consumed
   by Inspector and other cloud services to know that there is ongoing
   maintenance, which means that things like automatic fault management
   actions for the hardware resources should be disabled.
8. If the maintenance type is `reboot` and the project still has containers or
   VMs running on the affected hardware resource, Admin sends a project
   specific maintenance notification with state updated to `in maintenance`.
   If the project has nothing left running on the affected hardware resource,
   the state will be `maintenance over` instead. If maintenance cannot be
   performed for some reason, the state should be `maintenance cancelled`. In
   this case the last operations remaining for the admin are to re-enable the
   nova-compute service and ensure everything is running; the admin does not
   proceed to any further steps.
9. Application manager of the project receives an AODH alarm about the same.
10. Admin does the maintenance. This is out of Doctor scope.
11. Admin enables the nova-compute service when maintenance is over and the
    host can be put back into production. An ACK is received from the API
    call.
12. If the project left containers or VMs on the hardware resource over the
    maintenance, Admin sends a project specific maintenance notification with
    state updated to `maintenance over`.
13. Admin sends an admin type of maintenance notification with state updated
    to `maintenance over`. Inspector and other cloud services can consume this
    to know that the hardware resource is back in use.

POC
---

There was a `Maintenance POC`_ for planned maintenance in the OPNFV Beijing
summit to show the basic concept of using the framework defined by the
project.

.. _DOCTOR-52: https://jira.opnfv.org/browse/DOCTOR-52
.. _OPNFV Doctor project: https://wiki.opnfv.org/doctor
.. _use cases: http://artifacts.opnfv.org/doctor/docs/requirements/02-use_cases.html#nvfi-maintenance
.. _architecture: http://artifacts.opnfv.org/doctor/docs/requirements/03-architecture.html#nfvi-maintenance
.. _implementation: http://artifacts.opnfv.org/doctor/docs/requirements/05-implementation.html#nfvi-maintenance
.. _planned maintenance session: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2017-June/016677.html
.. _Maintenance POC: https://wiki.opnfv.org/download/attachments/5046291/Doctor%20Maintenance%20PoC%202017.pptx?version=1&modificationDate=1498182869000&api=v2
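
As an illustration only, the decisions an admin tool makes in steps 5, 6 and 8
of the flow in the Guidelines section could be sketched in Python as below.
This is not part of the Doctor framework or of any OpenStack API. The class
names, field names other than `maintenance_session_id`, and the action strings
``"migrate"`` and ``"leave_in_place"`` are hypothetical. The handling of
`full maintenance` with instances still on the host is not covered by the
guideline; the sketch assumes it means maintenance cannot be performed.

.. code-block:: python

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional


    class State(Enum):
        """Notification states named in the guideline."""
        PLANNED = "planned maintenance"
        IN_MAINTENANCE = "in maintenance"
        OVER = "maintenance over"
        CANCELLED = "maintenance cancelled"


    class MaintenanceType(Enum):
        REBOOT = "reboot"
        FULL = "full maintenance"


    @dataclass
    class ProjectSession:
        """Hypothetical per-project view of one maintenance session."""
        maintenance_session_id: str
        maintenance_type: MaintenanceType
        requires_agreement: bool            # project has set the specific role
        default_action: str = "migrate"     # announced in step 2
        project_instruction: Optional[str] = None  # reply given in step 5
        instances_on_host: int = 0          # VMs or containers still in place


    def resolve_action(session: ProjectSession) -> Optional[str]:
        """Steps 5-6: pick the action to apply, or None if admin must wait."""
        instruction = session.project_instruction
        if instruction is not None:
            # Instances may stay in place only when the maintenance is a reboot.
            if (instruction == "leave_in_place"
                    and session.maintenance_type is not MaintenanceType.REBOOT):
                raise ValueError("leave_in_place is only valid for reboot")
            return instruction
        if session.requires_agreement:
            # Maintenance cannot start without the project's agreement.
            return None
        return session.default_action


    def project_state(session: ProjectSession, can_proceed: bool = True) -> State:
        """Step 8: state for the project specific notification."""
        if not can_proceed:
            return State.CANCELLED
        if session.instances_on_host == 0:
            return State.OVER
        if session.maintenance_type is MaintenanceType.REBOOT:
            return State.IN_MAINTENANCE
        # Assumption: full maintenance needs an emptied resource, so remaining
        # instances are treated as "maintenance cannot be performed".
        return State.CANCELLED

A caller would build one ``ProjectSession`` per affected project, call
``resolve_action`` before maintenance is due to start, and use
``project_state`` to choose the state carried in the project specific
notification.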