summaryrefslogtreecommitdiffstats
path: root/docs/development/requirements/04-gaps.rst
diff options
context:
space:
mode:
Diffstat (limited to 'docs/development/requirements/04-gaps.rst')
-rw-r--r--docs/development/requirements/04-gaps.rst389
1 files changed, 389 insertions, 0 deletions
diff --git a/docs/development/requirements/04-gaps.rst b/docs/development/requirements/04-gaps.rst
new file mode 100644
index 00000000..b8ff7f2e
--- /dev/null
+++ b/docs/development/requirements/04-gaps.rst
@@ -0,0 +1,389 @@
+.. This work is licensed under a Creative Commons Attribution 4.0 International License.
+.. http://creativecommons.org/licenses/by/4.0
+
+Gap analysis in upstream projects
+=================================
+
+This section presents the findings of gaps on existing VIM platforms. The focus
+was to identify gaps based on the features and requirements specified in Section
+3.3. The analysis work determined gaps that are presented here.
+
+VIM Northbound Interface
+------------------------
+
+Immediate Notification
+^^^^^^^^^^^^^^^^^^^^^^
+
+* Type: 'deficiency in performance'
+* Description
+
+ + To-be
+
+ - VIM has to notify unavailability of virtual resource (fault) to VIM user
+ immediately.
+ - Notification should be passed in '1 second' after fault detected/notified
+ by VIM.
+ - Also, the following conditions/requirement have to be met:
+
+ - Only the owning user can receive notification of fault related to owned
+ virtual resource(s).
+
+ + As-is
+
+ - OpenStack Metering 'Ceilometer' can notify unavailability of virtual
+ resource (fault) to the owner of virtual resource based on alarm
+ configuration by the user.
+
+ - Ceilometer Alarm API:
+ http://docs.openstack.org/developer/ceilometer/webapi/v2.html#alarms
+
+ - Alarm notifications are triggered by alarm evaluator instead of
+ notification agents that might receive faults
+
+ - Ceilometer Architecture:
+ http://docs.openstack.org/developer/ceilometer/architecture.html#id1
+
+ - Evaluation interval should be equal to or larger than configured pipeline
+ interval for collection of underlying metrics.
+
+ - https://github.com/openstack/ceilometer/blob/stable/juno/ceilometer/alarm/service.py#L38-42
+
+ - The interval for collection has to be set large enough which depends on
+ the size of the deployment and the number of metrics to be collected.
+ - The interval may not be less than one second in even small deployments.
+ The default value is 60 seconds.
+ - Alternative: OpenStack has a message bus to publish system events.
+ The operator can allow the user to connect this, but there are no
+ functions to filter out other events that should not be passed to the user
+ or which were not requested by the user.
+
+ + Gap
+
+ - Fault notifications cannot be received immediately by Ceilometer.
+
+* Solved by
+
+ + Event Alarm Evaluator:
+ https://specs.openstack.org/openstack/ceilometer-specs/specs/liberty/event-alarm-evaluator.html
+ + New OpenStack alarms and notifications project AODH:
+ http://docs.openstack.org/developer/aodh/
+
+Maintenance Notification
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+* Type: 'missing'
+* Description
+
+ + To-be
+
+ - VIM has to notify unavailability of virtual resource triggered by NFVI
+ maintenance to VIM user.
+ - Also, the following conditions/requirements have to be met:
+
+ - VIM should accept maintenance message from administrator and mark target
+ physical resource "in maintenance".
+ - Only the owner of virtual resource hosted by target physical resource
+ can receive the notification that can trigger some process for
+ applications which are running on the virtual resource (e.g. cut off
+ VM).
+
+ + As-is
+
+ - OpenStack: None
+ - AWS (just for study)
+
+ - AWS provides API and CLI to view status of resource (VM) and to create
+ instance status and system status alarms to notify you when an instance
+ has a failed status check.
+ http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html
+ - AWS provides API and CLI to view scheduled events, such as a reboot or
+ retirement, for your instances. Also, those events will be notified
+ via e-mail.
+ http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html
+
+ + Gap
+
+ - VIM user cannot receive maintenance notifications.
+
+* Solved by
+
+ + https://blueprints.launchpad.net/nova/+spec/service-status-notification
+
+VIM Southbound interface
+------------------------
+
+Normalization of data collection models
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+* Type: 'missing'
+* Description
+
+ + To-be
+
+ - A normalized data format needs to be created to cope with the many data
+ models from different monitoring solutions.
+
+ + As-is
+
+ - Data can be collected from many places (e.g. Zabbix, Nagios, Cacti,
+ Zenoss). Although each solution establishes its own data models, no common
+ data abstraction models exist in OpenStack.
+
+ + Gap
+
+ - Normalized data format does not exist.
+
+* Solved by
+
+ + Specification in Section :ref:`southbound`.
+
+OpenStack
+---------
+
+Ceilometer
+^^^^^^^^^^
+
+OpenStack offers a telemetry service, Ceilometer, for collecting measurements of
+the utilization of physical and virtual resources [CEIL]_. Ceilometer can
+collect a number of metrics across multiple OpenStack components and watch for
+variations and trigger alarms based upon the collected data.
+
+Scalability of fault aggregation
+________________________________
+
+* Type: 'scalability issue'
+* Description
+
+ + To-be
+
+ - Be able to scale to a large deployment, where thousands of monitoring
+ events per second need to be analyzed.
+
+ + As-is
+
+ - Performance issue when scaling to medium-sized deployments.
+
+ + Gap
+
+ - Ceilometer seems to be unsuitable for monitoring medium and large scale
+ NFVI deployments.
+
+* Solved by
+
+ + Usage of Zabbix for fault aggregation [ZABB]_. Zabbix can support a much
+ higher number of fault events (up to 15 thousand events per second, but
+ obviously also has some upper bound:
+ http://blog.zabbix.com/scalable-zabbix-lessons-on-hitting-9400-nvps/2615/
+
+ + Decentralized/hierarchical deployment with multiple instances, where one
+ instance is only responsible for a small NFVI.
+
+Monitoring of hardware and software
+___________________________________
+
+* Type: 'missing (lack of functionality)'
+* Description
+
+ + To-be
+
+ - OpenStack (as VIM) should monitor various hardware and software in NFVI to
+ handle faults on them by Ceilometer.
+ - OpenStack may have monitoring functionality in itself and can be
+ integrated with third party monitoring tools.
+ - OpenStack need to be able to detect the faults listed in the Annex.
+
+ + As-is
+
+ - For each deployment of OpenStack, an operator has responsibility to
+ configure monitoring tools with relevant scripts or plugins in order to
+ monitor hardware and software.
+ - OpenStack Ceilometer does not monitor hardware and software to capture
+ faults.
+
+ + Gap
+
+ - Ceilometer is not able to detect and handle all faults listed in the Annex.
+
+* Solved by
+
+ + Use of dedicated monitoring tools like Zabbix or Monasca.
+ See :ref:`nfvi_faults`.
+
+Nova
+^^^^
+
+OpenStack Nova [NOVA]_ is a mature and widely known and used component in
+OpenStack cloud deployments. It is the main part of an
+"infrastructure-as-a-service" system providing a cloud computing fabric
+controller, supporting a wide diversity of virtualization and container
+technologies.
+
+Nova has proven throughout these past years to be highly available and
+fault-tolerant. Featuring its own API, it also provides a compatibility API with
+Amazon EC2 APIs.
+
+Correct states when compute host is down
+________________________________________
+
+* Type: 'missing (lack of functionality)'
+* Description
+
+ + To-be
+
+ - The API shall support to change VM power state in case host has failed.
+ - The API shall support to change nova-compute state.
+ - There could be single API to change different VM states for all VMs
+ belonging to a specific host.
+ - Support external systems that are monitoring the infrastructure and resources
+ that are able to call the API fast and reliable.
+ - Resource states are reliable such that correlation actions can be fast and automated.
+ - User shall be able to read states from OpenStack and trust they are correct.
+
+ + As-is
+
+ - When a VM goes down due to a host HW, host OS or hypervisor failure,
+ nothing happens in OpenStack. The VMs of a crashed host/hypervisor are
+ reported to be live and OK through the OpenStack API.
+ - nova-compute state might change too slowly or the state is not reliable
+ if expecting also VMs to be down. This leads to ability to schedule VMs
+ to a failed host and slowness blocks evacuation.
+
+ + Gap
+
+ - OpenStack does not change its states fast and reliably enough.
+ - The API does not support to have an external system to change states and to
+ trust the states are reliable (external system has fenced failed host).
+ - User cannot read all the states from OpenStack nor trust they are right.
+
+* Solved by
+
+ + https://blueprints.launchpad.net/nova/+spec/mark-host-down
+ + https://blueprints.launchpad.net/python-novaclient/+spec/support-force-down-service
+
+Evacuate VMs in Maintenance mode
+________________________________
+
+* Type: 'missing'
+* Description
+
+ + To-be
+
+ - When maintenance mode for a compute host is set, trigger VM evacuation to
+ available compute nodes before bringing the host down for maintenance.
+
+ + As-is
+
+ - If setting a compute node to a maintenance mode, OpenStack only schedules
+ evacuation of all VMs to available compute nodes if in-maintenance compute
+ node runs the XenAPI and VMware ESX hypervisors. Other hypervisors (e.g.
+ KVM) are not supported and, hence, guest VMs will likely stop running due
+ to maintenance actions administrator may perform (e.g. hardware upgrades,
+ OS updates).
+
+ + Gap
+
+ - Nova libvirt hypervisor driver does not implement automatic guest VMs
+ evacuation when compute nodes are set to maintenance mode (``$ nova
+ host-update --maintenance enable <hostname>``).
+
+Monasca
+^^^^^^^
+
+Monasca is an open-source monitoring-as-a-service (MONaaS) solution that
+integrates with OpenStack. Even though it is still in its early days, it is the
+interest of the community that the platform be multi-tenant, highly scalable,
+performant and fault-tolerant. It provides a streaming alarm engine, a
+notification engine, and a northbound REST API users can use to interact with
+Monasca. Hundreds of thousands of metrics per second can be processed
+[MONA]_.
+
+Anomaly detection
+_________________
+
+
+* Type: 'missing (lack of functionality)'
+* Description
+
+ + To-be
+
+ - Detect the failure and perform a root cause analysis to filter out other
+ alarms that may be triggered due to their cascading relation.
+
+ + As-is
+
+ - A mechanism to detect root causes of failures is not available.
+
+ + Gap
+
+ - Certain failures can trigger many alarms due to their dependency on the
+ underlying root cause of failure. Knowing the root cause can help filter
+ out unnecessary and overwhelming alarms.
+
+* Status
+
+ + Monasca as of now lacks this feature, although the community is aware and
+ working toward supporting it.
+
+Sensor monitoring
+_________________
+
+* Type: 'missing (lack of functionality)'
+* Description
+
+ + To-be
+
+ - It should support monitoring sensor data retrieval, for instance, from
+ IPMI.
+
+ + As-is
+
+ - Monasca does not monitor sensor data
+
+ + Gap
+
+ - Sensor monitoring is very important. It provides operators status
+ on the state of the physical infrastructure (e.g. temperature, fans).
+
+* Addressed by
+
+ + Monasca can be configured to use third-party monitoring solutions (e.g.
+ Nagios, Cacti) for retrieving additional data.
+
+Hardware monitoring tools
+-------------------------
+
+Zabbix
+^^^^^^
+
+Zabbix is an open-source solution for monitoring availability and performance of
+infrastructure components (i.e. servers and network devices), as well as
+applications [ZABB]_. It can be customized for use with OpenStack. It is a
+mature tool and has been proven to be able to scale to large systems with
+100,000s of devices.
+
+Delay in execution of actions
+_____________________________
+
+
+* Type: 'deficiency in performance'
+* Description
+
+ + To-be
+
+ - After detecting a fault, the monitoring tool should immediately execute
+ the appropriate action, e.g. inform the manager through the NB I/F
+
+ + As-is
+
+ - A delay of around 10 seconds was measured in two independent testbed
+ deployments
+
+ + Gap
+
+ - Cause of the delay is a periodic evaluation and notification. Periodicity is configured
+ as 30s default value and can be reduced to 5s but not below.
+ https://github.com/zabbix/zabbix/blob/trunk/conf/zabbix_server.conf#L329
+
+
+..
+ vim: set tabstop=4 expandtab textwidth=80: