Diffstat (limited to 'docs')
 docs/development/design/index.rst                        |   1 +
 docs/development/design/maintenance-design-guideline.rst | 155 ++++++++++++++++
 docs/development/manuals/monitors.rst                    |  36 +++++
 3 files changed, 192 insertions(+), 0 deletions(-)
diff --git a/docs/development/design/index.rst b/docs/development/design/index.rst
index 87d14d42..e50c1704 100644
--- a/docs/development/design/index.rst
+++ b/docs/development/design/index.rst
@@ -26,3 +26,4 @@ See also https://wiki.opnfv.org/requirements_projects .
port-data-plane-status.rst
inspector-design-guideline.rst
performance-profiler.rst
+ maintenance-design-guideline.rst
diff --git a/docs/development/design/maintenance-design-guideline.rst b/docs/development/design/maintenance-design-guideline.rst
new file mode 100644
index 00000000..93c3cf4e
--- /dev/null
+++ b/docs/development/design/maintenance-design-guideline.rst
@@ -0,0 +1,155 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+====================================
+Planned Maintenance Design Guideline
+====================================
+
+.. note::
+   This is a draft specification of the design guideline for planned
+   maintenance. JIRA ticket to track the update and collect comments:
+   `DOCTOR-52`_.
+
+This document describes how one can implement planned maintenance by utilizing
+the `OPNFV Doctor project`_ framework and how to meet the set requirements.
+
+Problem Description
+===================
+
+A Telco application needs to know when planned maintenance is going to happen
+in order to guarantee zero downtime in its operation. It needs to be possible
+for the application to take its own actions, to keep the application running
+on unaffected resources, or to give guidance for admin actions such as
+migration. More details are defined in the requirement documentation:
+`use cases`_, `architecture`_ and `implementation`_. See also the discussion
+in the OPNFV summit `planned maintenance session`_.
+
+Guidelines
+==========
+
+The cloud admin needs to send a notification about planned maintenance,
+including all the details the application needs in order to make decisions
+concerning its affected service. This notification payload can be consumed by
+the application by subscribing to the corresponding event alarm through an
+alarming service such as OpenStack AODH.
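As a sketch of the subscription step, the body below shows what an application manager could POST to AODH's ``/v2/alarms`` API to be notified of project-specific maintenance events. The event type ``maintenance.planned``, the trait name and the webhook URL are illustrative assumptions, not names defined by Doctor.

```python
# Illustrative sketch only: an AODH event-alarm body that would deliver
# planned maintenance notifications to an application manager's webhook.
# "maintenance.planned", "traits.project_id" and the URL are assumptions.
def make_maintenance_alarm(project_id, webhook_url):
    """Build an event-alarm definition for AODH (POST /v2/alarms)."""
    return {
        "name": "planned-maintenance-%s" % project_id,
        "type": "event",
        "event_rule": {
            "event_type": "maintenance.planned",  # assumed event type
            "query": [
                {"field": "traits.project_id", "op": "eq", "value": project_id},
            ],
        },
        "alarm_actions": [webhook_url],  # where AODH delivers the alarm
    }

alarm = make_maintenance_alarm("demo-project",
                               "http://app-manager.example/maintenance")
```

The application manager then handles the alarm payload delivered to the webhook in order to decide on its actions.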
+
+Before the maintenance starts, the application needs to be able to switch over
+its affected ACT-STBY service, perform an operation to move the service to an
+unaffected part of the infrastructure, or give a hint for an admin operation
+such as migration, which can be issued automatically by the admin tool
+according to the agreed policy.
+
+Flow diagram::
+
+ admin alarming project controller inspector
+ | service app manager | |
+ | 1. | | | |
+ +------------------------->+ |
+ +<-------------------------+ |
+ | 2. | | | |
+ +------>+ 3. | | |
+ | +-------->+ 4. | |
+ | | +------->+ |
+ | | 5. +<-------+ |
+ +<----------------+ | |
+ | | 6. | |
+ +------------------------->+ |
+ +<-------------------------+ 7. |
+ +------------------------------------->+
+ | 8. | | | |
+ +------>+ 9. | | |
+ | +-------->+ | |
+ +--------------------------------------+
+ | 10. |
+ +--------------------------------------+
+ | 11. | | | |
+ +------------------------->+ |
+ +<-------------------------+ |
+ | 12. | | | |
+ +------>+-------->+ | 13. |
+ +------------------------------------->+
+ +-------+---------+--------+-----------+
+
+Concepts used below:
+
+- `full maintenance`: The maintenance will take a longer time and the resource
+  should be emptied, meaning the container or VM needs to be moved or deleted.
+  The admin might need to test that the resource works after the maintenance.
+
+- `reboot`: Only a reboot is needed and the admin does not need separate
+  testing afterwards. The container or VM can be left in place if so wanted.
+
+- `notification`: A notification sent to RabbitMQ.
+
+The admin creates a planned maintenance session and sets
+a `maintenance_session_id`, a unique ID covering all the hardware resources
+that will be in maintenance at the same time. Mostly maintenance should be
+done node by node, meaning a single compute node at a time would be in a
+single planned maintenance session with its own unique
+`maintenance_session_id`. This ID will be carried through the whole session in
+all places and can be used to query the maintenance in the admin tool API. A
+project running a Telco application should set a specific role so that the
+admin tool knows it cannot do planned maintenance unless the project has
+agreed on the actions to be done for its VMs or containers. This means the
+project has configured itself to get alarms upon planned maintenance and is
+capable of agreeing on the needed actions. The admin is supposed to use an
+admin tool to automate the maintenance process partially or entirely.
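To make this concrete, a project-specific `planned maintenance` notification payload could look roughly like the following. Apart from `maintenance_session_id` and the state names used in this document, all field names and values are illustrative assumptions.

```python
# Hypothetical payload of a project-specific maintenance notification.
# Only maintenance_session_id and the state value come from this document;
# every other field name and value is an illustrative assumption.
notification = {
    "maintenance_session_id": "8e4f9a2b-demo",   # unique per session
    "state": "planned maintenance",
    "start_time": "2017-10-01T06:00:00Z",        # when maintenance begins
    "maintenance_type": "full maintenance",      # or "reboot"
    "instances": ["vm-uuid-1", "vm-uuid-2"],     # project VMs on the host
    "default_action": "migrate",                 # issued if project is silent
    "allowed_actions": ["migrate", "leave in place"],
}
```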
+
+The flow of a successful planned maintenance session as in OpenStack example
+case:
+
+1. The admin disables nova-compute in order to do planned maintenance on a
+   compute host and gets an ACK from the API call. This action is needed to
+   ensure nothing new will be placed on this compute host by any user. It is
+   always done regardless of whether the whole compute host will be affected
+   or not.
+2. The admin sends a project-specific maintenance notification with state
+   `planned maintenance`. This includes detailed information about the
+   maintenance, such as when it is going to start and whether it is a `reboot`
+   or `full maintenance`, including information about the project's containers
+   or VMs running on the host, or on the part of it, that will need
+   maintenance. The default action, such as migration, that the admin will
+   issue before the maintenance starts if the project sets no other action is
+   also mentioned. In case the project has the specific role set, planned
+   maintenance cannot start unless the project has agreed to the admin action.
+   The available admin actions are also listed in the notification.
+3. The application manager of the project receives an AODH alarm about the
+   same.
+4. The application manager can switch over its ACT-STBY service, or delete
+   and re-instantiate its service on an unaffected resource if so wanted.
+5. The application manager may call the admin tool API to give preferred
+   instructions to leave VMs and containers in place, or to request an admin
+   action such as migrating them. If the admin does not receive these
+   instructions before the maintenance is to start, it performs the
+   pre-configured default action, such as migration, for projects without the
+   specific role stating that the project needs to agree on the action. VMs
+   or containers can be left on the host if the maintenance type is just
+   `reboot`.
+6. The admin performs the possible actions on the VMs and containers and
+   receives an ACK.
+7. In case everything went OK, the admin sends an admin type of maintenance
+   notification with state `in maintenance`. This notification can be
+   consumed by the Inspector and other cloud services to know there is
+   ongoing maintenance, which means things like automatic fault management
+   actions for the hardware resources should be disabled.
+8. If the maintenance type is `reboot` and the project still has containers
+   or VMs running on the affected hardware resource, the admin sends a
+   project-specific maintenance notification with the state updated to
+   `in maintenance`. If the project has nothing left running on the affected
+   hardware resource, the state will be `maintenance over` instead. If the
+   maintenance cannot be performed for some reason, the state should be
+   `maintenance cancelled`. In that case the last operation remaining for the
+   admin is to re-enable the nova-compute service, ensure everything is
+   running, and not proceed with any further steps.
+9. The application manager of the project receives an AODH alarm about the
+   same.
+10. The admin performs the maintenance. This is out of the Doctor scope.
+11. The admin enables the nova-compute service when the maintenance is over
+    and the host can be put back into production. An ACK is received from the
+    API call.
+12. In case the project left containers or VMs on the hardware resource over
+    the maintenance, the admin sends a project-specific maintenance
+    notification with the state updated to `maintenance over`.
+13. The admin sends an admin type of maintenance notification with the state
+    updated to `maintenance over`. The Inspector and other cloud services can
+    consume this to know the hardware resource is back in use.
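The project-side reactions in steps 3-5, 9 and 12 can be sketched as a small state handler. The function name and return values below are illustrative assumptions; the states are the ones used in the flow above.

```python
# Minimal sketch (names are illustrative, not Doctor APIs) of how an
# application manager could react to the maintenance states above.
def on_maintenance_alarm(state, maintenance_type="full maintenance"):
    """Map a received maintenance state to the application manager's reaction."""
    if state == "planned maintenance":
        # Steps 4-5: switch over ACT-STBY, then answer the admin tool.
        if maintenance_type == "reboot":
            return "switch over; instances may stay in place"
        return "switch over; request migration"
    if state == "in maintenance":
        return "wait until maintenance is over"   # steps 8-9
    if state in ("maintenance over", "maintenance cancelled"):
        return "resume normal operation"          # steps 12-13
    raise ValueError("unknown maintenance state: %s" % state)
```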
+
+POC
+---
+
+There was a `Maintenance POC`_ demo for planned maintenance at the OPNFV
+Beijing summit, showing the basic concept of using the framework defined by
+the project.
+
+.. _DOCTOR-52: https://jira.opnfv.org/browse/DOCTOR-52
+.. _OPNFV Doctor project: https://wiki.opnfv.org/doctor
+.. _use cases: http://artifacts.opnfv.org/doctor/docs/requirements/02-use_cases.html#nvfi-maintenance
+.. _architecture: http://artifacts.opnfv.org/doctor/docs/requirements/03-architecture.html#nfvi-maintenance
+.. _implementation: http://artifacts.opnfv.org/doctor/docs/requirements/05-implementation.html#nfvi-maintenance
+.. _planned maintenance session: https://lists.opnfv.org/pipermail/opnfv-tech-discuss/2017-June/016677.html
+.. _Maintenance POC: https://wiki.opnfv.org/download/attachments/5046291/Doctor%20Maintenance%20PoC%202017.pptx?version=1&modificationDate=1498182869000&api=v2
diff --git a/docs/development/manuals/monitors.rst b/docs/development/manuals/monitors.rst
new file mode 100644
index 00000000..0d22b1de
--- /dev/null
+++ b/docs/development/manuals/monitors.rst
@@ -0,0 +1,36 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+Monitor Types and Limitations
+=============================
+
+Currently two monitor types are supported: sample and collectd.
+
+Sample Monitor
+--------------
+
+The sample monitor type pings the compute host from the control host and
+calculates the notification time after the ping timeout. If the inspector
+type is sample, the compute node also needs to communicate with the control
+node on port 12345. This port needs to be opened for incoming traffic on the
+control node.
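As a simplified sketch of the measurement, the notification time here is the interval between the moment the ping timeout is detected and the moment the consumer receives the notification; the exact bookkeeping in the sample monitor may differ.

```python
# Simplified sketch: notification time as measured on a single host, i.e. the
# interval from ping-timeout detection to the consumer being notified.
def notification_time(timeout_detected_at, consumer_notified_at):
    if consumer_notified_at < timeout_detected_at:
        raise ValueError("consumer cannot be notified before detection")
    return consumer_notified_at - timeout_detected_at
```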
+
+Collectd Monitor
+----------------
+
+The collectd monitor type uses the collectd daemon running the ovs_events
+plugin. Collectd runs on the compute node to send an instant notification to
+the control node. The notification time is calculated as the difference
+between the time at which the compute node sends the notification to the
+control node and the time at which the consumer is notified. The time on the
+control and compute nodes has to be synchronized for this reason. For further
+details on setting up collectd on the compute node, see:
+http://docs.opnfv.org/en/stable-danube/submodules/barometer/docs/release/userguide/feature.userguide.html#id18
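Because the two timestamps are taken on different hosts, any clock skew between the compute and control node goes straight into the measured notification time; a minimal sketch of that effect (all numbers are illustrative):

```python
# Sketch: the measured notification time mixes timestamps from two hosts,
# so any clock skew between compute and control node biases the result.
def measured_notification_time(compute_send_ts, control_notify_ts):
    # Only meaningful if both clocks are synchronized (e.g. via NTP).
    return control_notify_ts - compute_send_ts

true_latency = 0.05   # actual network/processing delay in seconds
skew = 2.0            # control node clock running 2 s ahead of compute node
compute_ts = 1000.0   # send time read from the compute node's clock
control_ts = compute_ts + true_latency + skew  # notify time on control clock

# The measurement error equals the clock skew, not the real latency.
error = measured_notification_time(compute_ts, control_ts) - true_latency
```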
+
+Collectd monitors an interface managed by OVS. If the interface is not
+assigned an IP, the user has to provide the name of the interface to be
+monitored. The command to launch the doctor test in that case is::
+
+    MONITOR_TYPE=collectd INSPECTOR_TYPE=sample INTERFACE_NAME=example_iface ./run.sh
+
+If neither the interface name nor an IP is provided, the collectd monitor
+type will monitor the default management interface. This may cause the doctor
+run.sh test case to fail: the test case sets the monitored interface down,
+and if the inspector (sample or congress) is running on the same subnet, the
+collectd monitor will not be able to communicate with it.