Diffstat (limited to 'docs/development')
-rw-r--r-- | docs/development/requirements/01-intro.rst | 183
-rwxr-xr-x | docs/development/requirements/02-collectd.rst | 103
-rw-r--r-- | docs/development/requirements/03-dpdk.rst | 170
-rwxr-xr-x | docs/development/requirements/barometer_scope.png | bin | 0 -> 39958 bytes
-rw-r--r-- | docs/development/requirements/dpdk_ka.png | bin | 0 -> 100808 bytes
-rw-r--r-- | docs/development/requirements/index.rst | 14
-rw-r--r-- | docs/development/requirements/stats_and_timestamps.png | bin | 0 -> 52193 bytes
7 files changed, 470 insertions, 0 deletions
diff --git a/docs/development/requirements/01-intro.rst b/docs/development/requirements/01-intro.rst new file mode 100644 index 00000000..bc0e9ba0 --- /dev/null +++ b/docs/development/requirements/01-intro.rst @@ -0,0 +1,183 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 +.. (c) OPNFV, Intel Corporation and others. + +Introduction +============ +Barometer is the project that renames Software Fastpath service Quality Metrics +(SFQM) and updates its scope, which was networking centric. + +The goal of SFQM was to develop the utilities and libraries in DPDK to +support: + +* Measuring Telco Traffic and Performance KPIs. Including: + + * Packet Delay Variation (by enabling TX and RX time stamping). + * Packet loss (by exposing extended NIC stats). + +* Performance Monitoring of the DPDK interfaces (by exposing + extended NIC stats + collectd Plugin). +* Detecting and reporting violations that can be consumed by VNFs + and higher level management systems (through DPDK Keep Alive). + +With Barometer the scope is extended to monitoring the NFVI. The ability to +monitor the Network Function Virtualization Infrastructure (NFVI) where VNFs +are in operation will be a key part of Service Assurance within an NFV +environment, in order to enforce SLAs or to detect violations, faults or +degradation in the performance of NFVI resources so that events and relevant +metrics are reported to higher level fault management systems. +If physical appliances are going to be replaced by virtualized appliances +the service levels, manageability and service assurance need to remain +consistent or improve on what is available today. 
As such, the NFVI needs to +support the ability to monitor: + +* Traffic monitoring and performance monitoring of the components that provide + networking functionality to the VNF, including: physical interfaces, virtual + switch interfaces and flows, as well as the virtual interfaces themselves and + their status, etc. +* Platform monitoring including: CPU, memory, load, cache, thermals, fan speeds, + voltages and machine check exceptions, etc. + +All of the statistics and events gathered must be collected in-service and must +be capable of being reported by standard Telco mechanisms (e.g. SNMP), for +potential enforcement or correction actions. In addition, this information +could be fed to analytics systems to enable failure prediction, and can also be +used for intelligent workload placement. + + +All developed features will be upstreamed to Open Source projects relevant to +telemetry such as `collectd`_ and `Ceilometer`_. + +The OPNFV project wiki can be found @ `Barometer`_ + +Problem Statement +================== +Providing carrier grade Service Assurance is critical in the network +transformation to a software defined and virtualized network (NFV). +Medium-/large-scale cloud environments account for between hundreds and +hundreds of thousands of infrastructure systems. It is vital to monitor +systems for malfunctions that could lead to user application service +disruption and promptly react to these fault events to facilitate improving +overall system performance. As the size of infrastructure and virtual resources +grow, so does the effort of monitoring back-ends. SFQM aims to expose as much +useful information as possible off the platform so that faults and errors in +the NFVI can be detected promptly and reported to the appropriate fault +management entity. 
+ +The OPNFV platform (NFVI) requires functionality to: + +* Create a low latency, high performance packet processing path (fast path) + through the NFVI that VNFs can take advantage of; +* Measure Telco Traffic and Performance KPIs through that fast path; +* Detect and report violations that can be consumed by VNFs and higher level + EMS/OSS systems + +Examples of local measurable QoS factors for Traffic Monitoring which impact +both Quality of Experience and five 9's availability would be (using Metro Ethernet +Forum Guidelines as reference): + +* Packet loss +* Packet Delay Variation +* Uni-directional frame delay + +Other KPIs such as Call drops, Call Setup Success Rate, Call Setup time etc. are +measured by the VNF. + +In addition to Traffic Monitoring, the NFVI must also support Performance +Monitoring of the physical interfaces themselves (e.g. NICs), i.e. an ability to +monitor and trace errors on the physical interfaces and report them. + +All these traffic statistics for Traffic and Performance Monitoring must be +measured in-service and must be capable of being reported by standard Telco +mechanisms (e.g. SNMP traps), for potential enforcement actions. + +Barometer updated scope +======================= +The scope of the project is to provide interfaces to support monitoring of the +NFVI. The project will develop plugins for telemetry frameworks to enable the +collection of platform stats and events and relay gathered information to fault +management applications or the VIM. The scope is limited to +collecting/gathering the events and stats and relaying them to a relevant +endpoint. The project will not enforce or take any actions based on the +gathered information. + +.. image:: barometer_scope.png + +Scope of SFQM +============= +**NOTE:** The SFQM project has been replaced by Barometer. +The output of the project will provide interfaces and functions to support +monitoring of Packet Latency and Network Interfaces while the VNF is in service. 
+ +The DPDK interface/API will be updated to support: + +* Exposure of NIC MAC/PHY Level Counters +* Interface for Time stamp on RX +* Interface for Time stamp on TX +* Exposure of DPDK events + +collectd will be updated to support the exposure of DPDK metrics and events. + +Specific testing and integration will be carried out to cover: + +* Unit/Integration Test plans: A sample application provided to demonstrate packet + latency monitoring and interface monitoring + +The following list of features and functionality will be developed: + +* DPDK APIs and functions for latency and interface monitoring +* A sample application to demonstrate usage +* collectd plugins + +The scope of the project involves developing the relevant DPDK APIs, OVS APIs, +sample applications, as well as the utilities in collectd to export all the +relevant information to a telemetry and events consumer. + +VNF specific processing, Traffic Monitoring, Performance Monitoring and +Management Agent are out of scope. + +The Proposed Interface counters include: + +* Packet RX +* Packet TX +* Packet loss +* Interface errors + other stats + +The Proposed Packet Latency Monitor includes: + +* Cycle accurate stamping on ingress +* Supports latency measurements on egress + +Support for failover of DPDK enabled cores is also out of scope of the current +proposal. However, this is an important requirement and must-have functionality +for any DPDK enabled framework in the NFVI. To that end, a second phase of this +project will be to implement DPDK Keep Alive functionality that would address +this and would report to a VNF-level Failover and High Availability mechanism +that would then determine what actions, including failover, may be triggered. + +Consumption Models +=================== +In reality many VNFs will have an existing performance or traffic monitoring +utility used to monitor VNF behavior and report statistics, counters, etc. 
+ +The consumption of performance and traffic related information/events provided +by this project should be a logical extension of any existing VNF/NFVI monitoring +framework. It should not require a new framework to be developed. We do not see +the Barometer gathered metrics and events as major additional effort for +monitoring frameworks to consume; this project would be sympathetic to existing +monitoring frameworks. The intention is that this project represents an +interface for NFVI monitoring to be used by higher level fault management +entities (see below). + +Allowing the Barometer metrics and events to be handled within existing +telemetry frameworks makes it simpler for overall interfacing with higher +level management components in the VIM, MANO and OSS/BSS. The Barometer +proposal would be complementary to the Doctor project, which addresses NFVI Fault +Management support in the VIM, and the VES project, which addresses the +integration of VNF telemetry-related data into automated VNF management +systems. To that end, the project committers and contributors for the Barometer +project wish to collaborate with the Doctor and VES projects to facilitate this. + +.. _Barometer: https://wiki.opnfv.org/display/fastpath +.. _collectd: http://collectd.org/ +.. _Ceilometer: https://wiki.openstack.org/wiki/Telemetry diff --git a/docs/development/requirements/02-collectd.rst b/docs/development/requirements/02-collectd.rst new file mode 100755 index 00000000..2303fadc --- /dev/null +++ b/docs/development/requirements/02-collectd.rst @@ -0,0 +1,103 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 +.. (c) OPNFV, Intel Corporation and others. + +collectd +~~~~~~~~ +collectd is a daemon which collects system performance statistics periodically +and provides a variety of mechanisms to publish the collected metrics. It +supports more than 90 different input and output plugins. 
Input plugins retrieve +metrics and publish them to the collectd daemon, while output plugins publish +the data they receive to an end point. collectd also has infrastructure to +support thresholding and notification. + +collectd statistics and Notifications +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Within collectd notifications and performance data are dispatched in the same +way. There are producer plugins (plugins that create notifications/metrics), +and consumer plugins (plugins that receive notifications/metrics and do +something with them). + +Statistics in collectd consist of a value list. A value list includes: + +* Values, can be one of: + + * Derive: used for values where a change in the value since it was last + read is of interest. Can be used to calculate and store a rate. + + * Counter: similar to derive values, but take the possibility of a counter + wrap around into consideration. + + * Gauge: used for values that are stored as is. + + * Absolute: used for counters that are reset after reading. + +* Value length: the number of values in the data set. + +* Time: timestamp at which the value was collected. + +* Interval: interval at which to expect a new value. + +* Host: used to identify the host. + +* Plugin: used to identify the plugin. + +* Plugin instance (optional): used to group a set of values together. For example, + values belonging to a DPDK interface. + +* Type: unit used to measure a value. In other words used to refer to a data + set. + +* Type instance (optional): used to distinguish between values that have an + identical type. + +* meta data: an opaque data structure that enables the passing of additional + information about a value list. "Meta data in the global cache can be used to + store arbitrary information about an identifier" [7]. + +Host, plugin, plugin instance, type and type instance uniquely identify a +collectd value. + +Value lists are often accompanied by data sets that describe the values in more +detail. 
Data sets consist of: + +* A type: a name which uniquely identifies a data set. + +* One or more data sources (entries in a data set) which include: + + * The name of the data source. If there is only a single data source this is + set to "value". + + * The type of the data source, one of: counter, gauge, absolute or derive. + + * A min and a max value. + +Types in collectd are defined in types.db. Examples of types in types.db: + +.. code-block:: console + + bitrate value:GAUGE:0:4294967295 + counter value:COUNTER:U:U + if_octets rx:COUNTER:0:4294967295, tx:COUNTER:0:4294967295 + +In the example above if_octets has two data sources: tx and rx. + +Notifications in collectd are generic messages containing: + +* An associated severity, which can be one of OKAY, WARNING, and FAILURE. + +* A time. + +* A message. + +* A host. + +* A plugin. + +* A plugin instance (optional). + +* A type. + +* A type instance (optional). + +* Meta-data. diff --git a/docs/development/requirements/03-dpdk.rst b/docs/development/requirements/03-dpdk.rst new file mode 100644 index 00000000..ad7c8c78 --- /dev/null +++ b/docs/development/requirements/03-dpdk.rst @@ -0,0 +1,170 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 +.. (c) OPNFV, Intel Corporation and others. + +DPDK Enhancements +================== +This section will discuss the Barometer features that were integrated with DPDK. + +Measuring Telco Traffic and Performance KPIs +-------------------------------------------- +This section will discuss the Barometer features that enable Measuring Telco Traffic +and Performance KPIs. + +.. 
Figure:: stats_and_timestamps.png + + Measuring Telco Traffic and Performance KPIs + +* The very first thing Barometer enabled was a call-back API in DPDK and an + associated application that used the API to demonstrate how to timestamp + packets and measure packet latency in DPDK (the sample app is called + rxtx_callbacks). This was upstreamed to DPDK 2.0 and is represented by + the interfaces 1 and 2 in Figure 1.2. + +* The second thing Barometer implemented in DPDK is the extended NIC statistics API, + which exposes NIC stats including error stats to the DPDK user by reading the + registers on the NIC. This is represented by interface 3 in Figure 1.2. + + * For DPDK 2.1 this API was only implemented for the ixgbe (10Gb) NIC driver, + in association with a sample application that runs as a DPDK secondary + process and retrieves the extended NIC stats. + + * For DPDK 2.2 the API was implemented for igb, i40e and all the Virtual + Functions (VFs) for all drivers. + + * For DPDK 16.07 the API migrated from using string value pairs to using id + value pairs, improving the overall performance of the API. + +Monitoring DPDK interfaces +-------------------------- +With the features Barometer enabled in DPDK to enable measuring Telco traffic and +performance KPIs, we can now retrieve NIC statistics including error stats and +relay them to a DPDK user. The next step is to enable monitoring of the DPDK +interfaces based on the stats that we are retrieving from the NICs, by relaying +the information to a higher level Fault Management entity. To enable this Barometer +has been enabling a number of plugins for collectd. + +DPDK Keep Alive description +--------------------------- +SFQM aims to enable fault detection within DPDK, the very first feature to +meet this goal is the DPDK Keep Alive Sample app that is part of DPDK 2.2. 
+ +DPDK Keep Alive or KA is a sample application that acts as a heartbeat/watchdog +for DPDK packet processing cores, to detect application thread failure. The +application supports the detection of ‘failed’ DPDK cores and notification to a +HA/SA middleware. The purpose is to detect Packet Processing Core failures (e.g. +infinite loop) and ensure the failure of the core does not result in a fault +that is not detectable by a management entity. + +.. Figure:: dpdk_ka.png + + DPDK Keep Alive Sample Application + +Essentially the app demonstrates how to detect 'silent outages' on DPDK packet +processing cores. The application can be decomposed into two specific parts: +detection and notification. + +* The detection period is programmable/configurable but defaults to 5ms if no + timeout is specified. +* The Notification support is enabled by simply having a hook function that + can provide 'call back support' for a fault management application with a compliant + heartbeat mechanism. + +DPDK Keep Alive Sample App Internals +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +This section provides some explanation of the Keep-Alive/'Liveliness' +conceptual scheme as well as the DPDK Keep Alive App. The initialization and +run-time paths are very similar to those of the L2 forwarding application (see +`L2 Forwarding Sample Application (in Real and Virtualized Environments)`_ for more +information). + +There are two types of cores: a Keep Alive Monitor Agent Core (master DPDK core) +and Worker cores (Tx/Rx/Forwarding cores). The Keep Alive Monitor Agent Core +will supervise worker cores and report any failure (2 successive missed pings). +The Keep-Alive/'Liveliness' conceptual scheme is: + +* DPDK worker cores mark their liveliness as they forward traffic. +* A Keep Alive Monitor Agent Core runs a function every N Milliseconds to + inspect worker core liveliness. +* If the keep-alive agent detects time-outs, it notifies the fault management + entity through a call-back function. 
+ +**Note:** Only the worker cores' state is monitored. There is no mechanism or agent +to monitor the Keep Alive Monitor Agent Core. + +DPDK Keep Alive Sample App Code Internals +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The following section provides some explanation of the code aspects that are +specific to the Keep Alive sample application. + +The heartbeat functionality is initialized with a struct rte_heartbeat and the +callback function to invoke in the case of a timeout. + +.. code:: c + + rte_global_keepalive_info = rte_keepalive_create(&dead_core, NULL); + if (rte_global_keepalive_info == NULL) + rte_exit(EXIT_FAILURE, "keepalive_create() failed"); + +The function that issues the pings, hbeat_dispatch_pings(), is configured to run +every check_period milliseconds. + +.. code:: c + + if (rte_timer_reset(&hb_timer, + (check_period * rte_get_timer_hz()) / 1000, + PERIODICAL, + rte_lcore_id(), + &hbeat_dispatch_pings, rte_global_keepalive_info + ) != 0 ) + rte_exit(EXIT_FAILURE, "Keepalive setup failure.\n"); + +The rest of the initialization and run-time path follows the same paths as the +L2 forwarding application. The only addition to the main processing loop is +the mark alive functionality and the example random failures. + +.. code:: c + + rte_keepalive_mark_alive(rte_global_keepalive_info); + cur_tsc = rte_rdtsc(); + + /* Die randomly within 7 secs for demo purposes.. */ + if (cur_tsc - tsc_initial > tsc_lifetime) + break; + +The rte_keepalive_mark_alive() function simply sets the core state to alive. + +.. code:: c + + static inline void + rte_keepalive_mark_alive(struct rte_heartbeat *keepcfg) + { + keepcfg->state_flags[rte_lcore_id()] = 1; + } + +Keep Alive Monitor Agent Core Monitoring Options +The application can run on either a host or a guest. 
As such there are a number +of options for monitoring the Keep Alive Monitor Agent Core through a Local +Agent on the compute node: + + ====================== ========== ============= + Application Location DPDK KA LOCAL AGENT + ====================== ========== ============= + HOST X HOST/GUEST + GUEST X HOST/GUEST + ====================== ========== ============= + + +For the first implementation of a Local Agent SFQM will enable: + + ====================== ========== ============= + Application Location DPDK KA LOCAL AGENT + ====================== ========== ============= + HOST X HOST + ====================== ========== ============= + +Through extending the dpdkstat plugin for collectd with KA functionality, and +integrating the extended plugin with Monasca for high performing, resilient, +and scalable fault detection. + +.. _L2 Forwarding Sample Application (in Real and Virtualized Environments): http://dpdk.org/doc/guides/sample_app_ug/l2_forward_real_virtual.html diff --git a/docs/development/requirements/barometer_scope.png b/docs/development/requirements/barometer_scope.png Binary files differnew file mode 100755 index 00000000..03783bde --- /dev/null +++ b/docs/development/requirements/barometer_scope.png diff --git a/docs/development/requirements/dpdk_ka.png b/docs/development/requirements/dpdk_ka.png Binary files differnew file mode 100644 index 00000000..4a45e10c --- /dev/null +++ b/docs/development/requirements/dpdk_ka.png diff --git a/docs/development/requirements/index.rst b/docs/development/requirements/index.rst new file mode 100644 index 00000000..e5d04896 --- /dev/null +++ b/docs/development/requirements/index.rst @@ -0,0 +1,14 @@ +.. This work is licensed under a Creative Commons Attribution 4.0 International License. +.. http://creativecommons.org/licenses/by/4.0 +.. (c) OPNFV, Intel Corporation and others. + +********************** +Barometer Requirements +********************** +.. 
toctree:: + :maxdepth: 3 + :numbered: + + 01-intro.rst + 02-collectd.rst + 03-dpdk.rst diff --git a/docs/development/requirements/stats_and_timestamps.png b/docs/development/requirements/stats_and_timestamps.png Binary files differ new file mode 100644 index 00000000..84aef726 --- /dev/null +++ b/docs/development/requirements/stats_and_timestamps.png
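The Keep-Alive/'Liveliness' scheme described in 03-dpdk.rst above (worker cores mark themselves alive, a monitor core inspects them periodically, and two successive missed pings trigger a callback) can be sketched in plain C. This is a minimal illustration under stated assumptions, not the DPDK implementation: every name in it (ka_monitor, ka_init, ka_register_core, ka_mark_alive, ka_dispatch_pings, ka_record_failure, KA_MAX_CORES, KA_MISS_LIMIT) is hypothetical, it runs single-threaded, and it does not call the DPDK keepalive API.

```c
#include <stdbool.h>
#include <stddef.h>

#define KA_MAX_CORES  8   /* hypothetical upper bound for this sketch */
#define KA_MISS_LIMIT 2   /* two successive missed pings count as a failure */

/* Callback invoked by the monitor when a worker core is declared failed. */
typedef void (*ka_failure_cb)(void *ctx, int core_id);

struct ka_monitor {
    bool alive[KA_MAX_CORES];   /* set by workers, cleared by the monitor */
    bool active[KA_MAX_CORES];  /* which cores are being supervised */
    int missed[KA_MAX_CORES];   /* successive checks without a mark */
    ka_failure_cb on_failure;
    void *cb_ctx;
};

static void ka_init(struct ka_monitor *m, ka_failure_cb cb, void *ctx)
{
    for (size_t i = 0; i < KA_MAX_CORES; i++) {
        m->alive[i] = false;
        m->active[i] = false;
        m->missed[i] = 0;
    }
    m->on_failure = cb;
    m->cb_ctx = ctx;
}

static void ka_register_core(struct ka_monitor *m, int core_id)
{
    m->active[core_id] = true;
}

/* Called by a worker from its packet processing loop. Real multi-core
 * code would need an atomic or otherwise synchronized store here. */
static void ka_mark_alive(struct ka_monitor *m, int core_id)
{
    m->alive[core_id] = true;
}

/* Called by the monitor core once per check period. */
static void ka_dispatch_pings(struct ka_monitor *m)
{
    for (int i = 0; i < KA_MAX_CORES; i++) {
        if (!m->active[i])
            continue;
        if (m->alive[i]) {
            m->alive[i] = false;   /* consume the mark for this period */
            m->missed[i] = 0;
        } else if (++m->missed[i] == KA_MISS_LIMIT && m->on_failure != NULL) {
            /* Fires once, when the limit is first reached. */
            m->on_failure(m->cb_ctx, i);
        }
    }
}

/* Example callback: records the id of the failed core in an int. */
static void ka_record_failure(void *ctx, int core_id)
{
    *(int *)ctx = core_id;
}
```

In this sketch a fault management hook would take the place of ka_record_failure, and ka_dispatch_pings would be driven by a periodic timer on the monitor core, in the same spirit as the rte_timer_reset() call shown in the patch.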