diff options
Diffstat (limited to 'docs/requirements/05-implementation.rst')
-rw-r--r-- | docs/requirements/05-implementation.rst | 1050 |
1 files changed, 0 insertions, 1050 deletions
diff --git a/docs/requirements/05-implementation.rst b/docs/requirements/05-implementation.rst deleted file mode 100644 index 84979772..00000000 --- a/docs/requirements/05-implementation.rst +++ /dev/null @@ -1,1050 +0,0 @@ -.. This work is licensed under a Creative Commons Attribution 4.0 International License. -.. http://creativecommons.org/licenses/by/4.0 - -Detailed architecture and interface specification -================================================= - -This section describes a detailed implementation plan, which is based on the -high level architecture introduced in Section 3. Section 5.1 describes the -functional blocks of the Doctor architecture, which is followed by a high level -message flow in Section 5.2. Section 5.3 provides a mapping of selected existing -open source components to the building blocks of the Doctor architecture. -Thereby, the selection of components is based on their maturity and the gap -analysis executed in Section 4. Sections 5.4 and 5.5 detail the specification of -the related northbound interface and the related information elements. Finally, -Section 5.6 provides a first set of blueprints to address selected gaps required -for the realization functionalities of the Doctor project. - -.. _impl_fb: - -Functional Blocks ------------------ - -This section introduces the functional blocks to form the VIM. OpenStack was -selected as the candidate for implementation. Inside the VIM, 4 different -building blocks are defined (see :numref:`figure6`). - -.. figure:: images/figure6.png - :name: figure6 - :width: 100% - - Functional blocks - -Monitor -^^^^^^^ - -The Monitor module has the responsibility for monitoring the virtualized -infrastructure. There are already many existing tools and services (e.g. Zabbix) -to monitor different aspects of hardware and software resources which can be -used for this purpose. - -Inspector -^^^^^^^^^ - -The Inspector module has the ability a) to receive various failure notifications -regarding physical resource(s) from Monitor module(s), b) to find the affected -virtual resource(s) by querying the resource map in the Controller, and c) to -update the state of the virtual resource (and physical resource). - -The Inspector has drivers for different types of events and resources to -integrate any type of Monitor and Controller modules. It also uses a failure -policy database to decide on the failure selection and aggregation from raw -events. This failure policy database is configured by the Administrator. - -The reason for separation of the Inspector and Controller modules is to make the -Controller focus on simple operations by avoiding a tight integration of various -health check mechanisms into the Controller. - -Controller -^^^^^^^^^^ - -The Controller is responsible for maintaining the resource map (i.e. the mapping -from physical resources to virtual resources), accepting update requests for the -resource state(s) (exposing as provider API), and sending all failure events -regarding virtual resources to the Notifier. Optionally, the Controller has the -ability to force the state of a given physical resource to down in the resource -mapping when it receives failure notifications from the Inspector for that -given physical resource. -The Controller also re-calculates the capacity of the NVFI when receiving a -failure notification for a physical resource. - -In a real-world deployment, the VIM may have several controllers, one for each -resource type, such as Nova, Neutron and Cinder in OpenStack. Each controller -maintains a database of virtual and physical resources which shall be the master -source for resource information inside the VIM. - -Notifier -^^^^^^^^ - -The focus of the Notifier is on selecting and aggregating failure events -received from the controller based on policies mandated by the Consumer. -Therefore, it allows the Consumer to subscribe for alarms regarding virtual -resources using a method such as API endpoint. After receiving a fault -event from a Controller, it will notify the fault to the Consumer by referring -to the alarm configuration which was defined by the Consumer earlier on. - -To reduce complexity of the Controller, it is a good approach for the -Controllers to emit all notifications without any filtering mechanism and have -another service (i.e. Notifier) handle those notifications properly. This is the -general philosophy of notifications in OpenStack. Note that a fault message -consumed by the Notifier is different from the fault message received by the -Inspector; the former message is related to virtual resources which are visible -to users with relevant ownership, whereas the latter is related to raw devices -or small entities which should be handled with an administrator privilege. - -The northbound interface between the Notifier and the Consumer/Administrator is -specified in :ref:`impl_nbi`. - -Sequence --------- - -Fault Management -^^^^^^^^^^^^^^^^ - -The detailed work flow for fault management is as follows (see also :numref:`figure7`): - -1. Request to subscribe to monitor specific virtual resources. A query filter - can be used to narrow down the alarms the Consumer wants to be informed - about. -2. Each subscription request is acknowledged with a subscribe response message. - The response message contains information about the subscribed virtual - resources, in particular if a subscribed virtual resource is in "alarm" - state. -3. The NFVI sends monitoring events for resources the VIM has been subscribed - to. Note: this subscription message exchange between the VIM and NFVI is not - shown in this message flow. -4. Event correlation, fault detection and aggregation in VIM. -5. Database lookup to find the virtual resources affected by the detected fault. -6. Fault notification to Consumer. -7. The Consumer switches to standby configuration (STBY) -8. Instructions to VIM requesting certain actions to be performed on the - affected resources, for example migrate/update/terminate specific - resource(s). After reception of such instructions, the VIM is executing the - requested action, e.g. it will migrate or terminate a virtual resource. -a. Query request from Consumer to VIM to get information about the current - status of a resource. -b. Response to the query request with information about the current status of - the queried resource. In case the resource is in "fault" state, information - about the related fault(s) is returned. - -In order to allow for quick reaction to failures, the time interval between -fault detection in step 3 and the corresponding recovery actions in step 7 and 8 -shall be less than 1 second. - -.. figure:: images/figure7.png - :name: figure7 - :width: 100% - - Fault management work flow - -.. figure:: images/figure8.png - :name: figure8 - :width: 100% - - Fault management scenario - -:numref:`figure8` shows a more detailed message flow (Steps 4 to 6) between -the 4 building blocks introduced in :ref:`impl_fb`. - -4. The Monitor observed a fault in the NFVI and reports the raw fault to the - Inspector. - The Inspector filters and aggregates the faults using pre-configured - failure policies. - -5. - a) The Inspector queries the Resource Map to find the virtual resources - affected by the raw fault in the NFVI. - b) The Inspector updates the state of the affected virtual resources in the - Resource Map. - c) The Controller observes a change of the virtual resource state and informs - the Notifier about the state change and the related alarm(s). - Alternatively, the Inspector may directly inform the Notifier about it. - -6. The Notifier is performing another filtering and aggregation of the changes - and alarms based on the pre-configured alarm configuration. Finally, a fault - notification is sent to northbound to the Consumer. - -NFVI Maintenance -^^^^^^^^^^^^^^^^ -.. figure:: images/figure9.png - :name: figure9 - :width: 100% - - NFVI maintenance work flow - -The detailed work flow for NFVI maintenance is shown in :numref:`figure9` -and has the following steps. Note that steps 1, 2, and 5 to 8a in the NFVI -maintenance work flow are very similar to the steps in the fault management work -flow and share a similar implementation plan in Release 1. - -1. Subscribe to fault/maintenance notifications. -2. Response to subscribe request. -3. Maintenance trigger received from administrator. -4. VIM switches NFVI resources to "maintenance" state. This, e.g., means they - should not be used for further allocation/migration requests -5. Database lookup to find the virtual resources affected by the detected - maintenance operation. -6. Maintenance notification to Consumer. -7. The Consumer switches to standby configuration (STBY) -8. Instructions from Consumer to VIM requesting certain recovery actions to be - performed (step 8a). After reception of such instructions, the VIM is - executing the requested action in order to empty the physical resources (step - 8b). -9. Maintenance response from VIM to inform the Administrator that the physical - machines have been emptied (or the operation resulted in an error state). -10. Administrator is coordinating and executing the maintenance operation/work - on the NFVI. -a) Query request from Administrator to VIM to get information about the - current state of a resource. -b) Response to the query request with information about the current state of - the queried resource(s). In case the resource is in "maintenance" state, - information about the related maintenance operation is returned. - -.. figure:: images/figure10.png - :name: figure10 - :width: 100% - - NFVI Maintenance implementation plan - -:numref:`figure10` shows a more detailed message flow (Steps 3 to 6 and 9) -between the 4 building blocks introduced in Section 5.1.. - -3. The Administrator is sending a StateChange request to the Controller residing - in the VIM. -4. The Controller queries the Resource Map to find the virtual resources - affected by the planned maintenance operation. -5. - - a) The Controller updates the state of the affected virtual resources in the - Resource Map database. - - b) The Controller informs the Notifier about the virtual resources that will - be affected by the maintenance operation. - -6. A maintenance notification is sent to northbound to the Consumer. - -... - -9. The Controller informs the Administrator after the physical resources have - been freed. - - - -Implementation plan for OPNFV Release 1 ---------------------------------------- - -Fault management -^^^^^^^^^^^^^^^^ - -:numref:`figure11` shows the implementation plan based on OpenStack and -related components as planned for Release 1. Hereby, the Monitor can be realized -by Zabbix. The Controller is realized by OpenStack Nova [NOVA]_, Neutron -[NEUT]_, and Cinder [CIND]_ for compute, network, and storage, -respectively. The Inspector can be realized by Monasca [MONA]_ or a simple -script querying Nova in order to map between physical and virtual resources. The -Notifier will be realized by Ceilometer [CEIL]_ receiving failure events -on its notification bus. - -:numref:`figure12` shows the inner-workings of Ceilometer. After receiving -an "event" on its notification bus, first a notification agent will grab the -event and send a "notification" to the Collector. The collector writes the -notifications received to the Ceilometer databases. - -In the existing Ceilometer implementation, an alarm evaluator is periodically -polling those databases through the APIs provided. If it finds new alarms, it -will evaluate them based on the pre-defined alarm configuration, and depending -on the configuration, it will hand a message to the Alarm Notifier, which in -turn will send the alarm message northbound to the Consumer. :numref:`figure12` -also shows an optimized work flow for Ceilometer with the goal to -reduce the delay for fault notifications to the Consumer. The approach is to -implement a new notification agent (called "publisher" in Ceilometer -terminology) which is directly sending the alarm through the "Notification Bus" -to a new "Notification-driven Alarm Evaluator (NAE)" (see Sections 5.6.2 and -5.6.3), thereby bypassing the Collector and avoiding the additional delay of the -existing polling-based alarm evaluator. The NAE is similar to the OpenStack -"Alarm Evaluator", but is triggered by incoming notifications instead of -periodically polling the OpenStack "Alarms" database for new alarms. The -Ceilometer "Alarms" database can hold three states: "normal", "insufficient -data", and "fired". It is representing a persistent alarm database. In order to -realize the Doctor requirements, we need to define new "meters" in the database -(see Section 5.6.1). - -.. figure:: images/figure11.png - :name: figure11 - :width: 100% - - Implementation plan in OpenStack (OPNFV Release 1 ”Arno”) - - -.. figure:: images/figure12.png - :name: figure12 - :width: 100% - - Implementation plan in Ceilometer architecture - - -NFVI Maintenance -^^^^^^^^^^^^^^^^ - -For NFVI Maintenance, a quite similar implementation plan exists. Instead of a -raw fault being observed by the Monitor, the Administrator is sending a -Maintenance Request through the northbound interface towards the Controller -residing in the VIM. Similar to the Fault Management use case, the Controller -(in our case OpenStack Nova) will send a maintenance event to the Notifier (i.e. -Ceilometer in our implementation). Within Ceilometer, the same workflow as -described in the previous section applies. In addition, the Controller(s) will -take appropriate actions to evacuate the physical machines in order to prepare -them for the planned maintenance operation. After the physical machines are -emptied, the Controller will inform the Administrator that it can initiate the -maintenance. Alternatively the VMs can just be shut down and boot up on the -same host after maintenance is over. There needs to be policy for administrator -to know the plan for VMs in maintenance. - -Information elements --------------------- - -This section introduces all attributes and information elements used in the -messages exchange on the northbound interfaces between the VIM and the VNFO and -VNFM. - -Note: The information elements will be aligned with current work in ETSI NFV IFA -working group. - - -Simple information elements: - -* SubscriptionID (Identifier): identifies a subscription to receive fault or maintenance - notifications. -* NotificationID (Identifier): identifies a fault or maintenance notification. -* VirtualResourceID (Identifier): identifies a virtual resource affected by a - fault or a maintenance action of the underlying physical resource. -* PhysicalResourceID (Identifier): identifies a physical resource affected by a - fault or maintenance action. -* VirtualResourceState (String): state of a virtual resource, e.g. "normal", - "maintenance", "down", "error". -* PhysicalResourceState (String): state of a physical resource, e.g. "normal", - "maintenance", "down", "error". -* VirtualResourceType (String): type of the virtual resource, e.g. "virtual - machine", "virtual memory", "virtual storage", "virtual CPU", or "virtual - NIC". -* FaultID (Identifier): identifies the related fault in the underlying physical - resource. This can be used to correlate different fault notifications caused - by the same fault in the physical resource. -* FaultType (String): Type of the fault. The allowed values for this parameter - depend on the type of the related physical resource. For example, a resource - of type "compute hardware" may have faults of type "CPU failure", "memory - failure", "network card failure", etc. -* Severity (Integer): value expressing the severity of the fault. The higher the - value, the more severe the fault. -* MinSeverity (Integer): value used in filter information elements. Only faults - with a severity higher than the MinSeverity value will be notified to the - Consumer. -* EventTime (Datetime): Time when the fault was observed. -* EventStartTime and EventEndTime (Datetime): Datetime range that can be used in - a FaultQueryFilter to narrow down the faults to be queried. -* ProbableCause (String): information about the probable cause of the fault. -* CorrelatedFaultID (Integer): list of other faults correlated to this fault. -* isRootCause (Boolean): Parameter indicating if this fault is the root for - other correlated faults. If TRUE, then the faults listed in the parameter - CorrelatedFaultID are caused by this fault. -* FaultDetails (Key-value pair): provides additional information about the - fault, e.g. information about the threshold, monitored attributes, indication - of the trend of the monitored parameter. -* FirmwareVersion (String): current version of the firmware of a physical - resource. -* HypervisorVersion (String): current version of a hypervisor. -* ZoneID (Identifier): Identifier of the resource zone. A resource zone is the - logical separation of physical and software resources in an NFVI deployment - for physical isolation, redundancy, or administrative designation. -* Metadata (Key-value pair): provides additional information of a physical - resource in maintenance/error state. - -Complex information elements (see also UML diagrams in :numref:`figure13` -and :numref:`figure14`): - -* VirtualResourceInfoClass: - - + VirtualResourceID [1] (Identifier) - + VirtualResourceState [1] (String) - + Faults [0..*] (FaultClass): For each resource, all faults - including detailed information about the faults are provided. - -* FaultClass: The parameters of the FaultClass are partially based on ETSI TS - 132 111-2 (V12.1.0) [*]_, which is specifying fault management in 3GPP, in - particular describing the information elements used for alarm notifications. - - - FaultID [1] (Identifier) - - FaultType [1] (String) - - Severity [1] (Integer) - - EventTime [1] (Datetime) - - ProbableCause [1] (String) - - CorrelatedFaultID [0..*] (Identifier) - - FaultDetails [0..*] (Key-value pair) - -.. [*] http://www.etsi.org/deliver/etsi_ts/132100_132199/13211102/12.01.00_60/ts_13211102v120100p.pdf - -* SubscribeFilterClass - - - VirtualResourceType [0..*] (String) - - VirtualResourceID [0..*] (Identifier) - - FaultType [0..*] (String) - - MinSeverity [0..1] (Integer) - -* FaultQueryFilterClass: narrows down the FaultQueryRequest, for example it - limits the query to certain physical resources, a certain zone, a given fault - type/severity/cause, or a specific FaultID. - - - VirtualResourceType [0..*] (String) - - VirtualResourceID [0..*] (Identifier) - - FaultType [0..*] (String) - - MinSeverity [0..1] (Integer) - - EventStartTime [0..1] (Datetime) - - EventEndTime [0..1] (Datetime) - -* PhysicalResourceStateClass: - - - PhysicalResourceID [1] (Identifier) - - PhysicalResourceState [1] (String): mandates the new state of the physical - resource. - - Metadata [0..*] (Key-value pair) - -* PhysicalResourceInfoClass: - - - PhysicalResourceID [1] (Identifier) - - PhysicalResourceState [1] (String) - - FirmwareVersion [0..1] (String) - - HypervisorVersion [0..1] (String) - - ZoneID [0..1] (Identifier) - - Metadata [0..*] (Key-value pair) - -* StateQueryFilterClass: narrows down a StateQueryRequest, for example it limits - the query to certain physical resources, a certain zone, or a given resource - state (e.g., only resources in "maintenance" state). - - - PhysicalResourceID [1] (Identifier) - - PhysicalResourceState [1] (String) - - ZoneID [0..1] (Identifier) - -.. _impl_nbi: - -Detailed northbound interface specification -------------------------------------------- - -This section is specifying the northbound interfaces for fault management and -NFVI maintenance between the VIM on the one end and the Consumer and the -Administrator on the other ends. For each interface all messages and related -information elements are provided. - -Note: The interface definition will be aligned with current work in ETSI NFV IFA -working group . - -All of the interfaces described below are produced by the VIM and consumed by -the Consumer or Administrator. - -Fault management interface -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This interface allows the VIM to notify the Consumer about a virtual resource -that is affected by a fault, either within the virtual resource itself or by the -underlying virtualization infrastructure. The messages on this interface are -shown in :numref:`figure13` and explained in detail in the following -subsections. - -Note: The information elements used in this section are described in detail in -Section 5.4. - -.. figure:: images/figure13.png - :name: figure13 - :width: 100% - - Fault management NB I/F messages - - -SubscribeRequest (Consumer -> VIM) -__________________________________ - -Subscription from Consumer to VIM to be notified about faults of specific -resources. The faults to be notified about can be narrowed down using a -subscribe filter. - -Parameters: - -- SubscribeFilter [1] (SubscribeFilterClass): Optional information to narrow - down the faults that shall be notified to the Consumer, for example limit to - specific VirtualResourceID(s), severity, or cause of the alarm. - -SubscribeResponse (VIM -> Consumer) -___________________________________ - -Response to a subscribe request message including information about the -subscribed resources, in particular if they are in "fault/error" state. - -Parameters: - -* SubscriptionID [1] (Identifier): Unique identifier for the subscription. It - can be used to delete or update the subscription. -* VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional - information about the subscribed resources, i.e., a list of the related - resources, the current state of the resources, etc. - -FaultNotification (VIM -> Consumer) -___________________________________ - -Notification about a virtual resource that is affected by a fault, either within -the virtual resource itself or by the underlying virtualization infrastructure. -After reception of this request, the Consumer will decide on the optimal -action to resolve the fault. This includes actions like switching to a hot -standby virtual resource, migration of the fault virtual resource to another -physical machine, termination of the faulty virtual resource and instantiation -of a new virtual resource in order to provide a new hot standby resource. In -some use cases the Consumer can leave virtual resources on failed host to be -booted up again after fault is recovered. Existing resource management -interfaces and messages between the Consumer and the VIM can be used for those -actions, and there is no need to define additional actions on the Fault -Management Interface. - -Parameters: - -* NotificationID [1] (Identifier): Unique identifier for the notification. -* VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of faulty - resources with detailed information about the faults. - -FaultQueryRequest (Consumer -> VIM) -___________________________________ - -Request to find out about active alarms at the VIM. A FaultQueryFilter can be -used to narrow down the alarms returned in the response message. - -Parameters: - -* FaultQueryFilter [1] (FaultQueryFilterClass): narrows down the - FaultQueryRequest, for example it limits the query to certain physical - resources, a certain zone, a given fault type/severity/cause, or a specific - FaultID. - -FaultQueryResponse (VIM -> Consumer) -____________________________________ - -List of active alarms at the VIM matching the FaultQueryFilter specified in the -FaultQueryRequest. - -Parameters: - -* VirtualResourceInfo [0..*] (VirtualResourceInfoClass): List of faulty - resources. For each resource all faults including detailed information about - the faults are provided. - -NFVI maintenance -^^^^^^^^^^^^^^^^ - -The NFVI maintenance interfaces Consumer-VIM allows the Consumer to subscribe to -maintenance notifications provided by the VIM. The related maintenance interface -Administrator-VIM allows the Administrator to issue maintenance requests to the -VIM, i.e. requesting the VIM to take appropriate actions to empty physical -machine(s) in order to execute maintenance operations on them. The interface -also allows the Administrator to query the state of physical machines, e.g., in -order to get details in the current status of the maintenance operation like a -firmware update. - -The messages defined in these northbound interfaces are shown in :numref:`figure14` -and described in detail in the following subsections. - -.. figure:: images/figure14.png - :name: figure14 - :width: 100% - - NFVI maintenance NB I/F messages - -SubscribeRequest (Consumer -> VIM) -__________________________________ - -Subscription from Consumer to VIM to be notified about maintenance operations -for specific virtual resources. The resources to be informed about can be -narrowed down using a subscribe filter. - -Parameters: - -* SubscribeFilter [1] (SubscribeFilterClass): Information to narrow down the - faults that shall be notified to the Consumer, for example limit to specific - virtual resource type(s). - -SubscribeResponse (VIM -> Consumer) -___________________________________ - -Response to a subscribe request message, including information about the -subscribed virtual resources, in particular if they are in "maintenance" state. - -Parameters: - -* SubscriptionID [1] (Identifier): Unique identifier for the subscription. It - can be used to delete or update the subscription. -* VirtualResourceInfo [0..*] (VirtalResourceInfoClass): Provides additional - information about the subscribed virtual resource(s), e.g., the ID, type and - current state of the resource(s). - -MaintenanceNotification (VIM -> Consumer) -_________________________________________ - -Notification about a physical resource switched to "maintenance" state. After -reception of this request, the Consumer will decide on the optimal action to -address this request, e.g., to switch to the standby (STBY) configuration. - -Parameters: - -* VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of virtual - resources where the state has been changed to maintenance. - -StateChangeRequest (Administrator -> VIM) -_________________________________________ - -Request to change the state of a list of physical resources, e.g. to -"maintenance" state, in order to prepare them for a planned maintenance -operation. - -Parameters: - -* PhysicalResourceState [1..*] (PhysicalResourceStateClass) - -StateChangeResponse (VIM -> Administrator) -__________________________________________ - -Response message to inform the Administrator that the requested resources are -now in maintenance state (or the operation resulted in an error) and the -maintenance operation(s) can be executed. - -Parameters: - -* PhysicalResourceInfo [1..*] (PhysicalResourceInfoClass) - -StateQueryRequest (Administrator -> VIM) -________________________________________ - -In this procedure, the Administrator would like to get the information about -physical machine(s), e.g. their state ("normal", "maintenance"), firmware -version, hypervisor version, update status of firmware and hypervisor, etc. It -can be used to check the progress during firmware update and the confirmation -after update. A filter can be used to narrow down the resources returned in the -response message. - -Parameters: - -* StateQueryFilter [1] (StateQueryFilterClass): narrows down the - StateQueryRequest, for example it limits the query to certain physical - resources, a certain zone, or a given resource state. - -StateQueryResponse (VIM -> Administrator) -_________________________________________ - -List of physical resources matching the filter specified in the -StateQueryRequest. - -Parameters: - -* PhysicalResourceInfo [0..*] (PhysicalResourceInfoClass): List of physical - resources. For each resource, information about the current state, the - firmware version, etc. is provided. - -NFV IFA, OPNFV Doctor and AODH alarms -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This section compares the alarm interfaces of ETSI NFV IFA with the specifications -of this document and the alarm class of AODH. - -ETSI NFV specifies an interface for alarms from virtualised resources in ETSI GS -NFV-IFA 005 [ENFV]_. The interface specifies an Alarm class and two notifications plus -operations to query alarm instances and to subscribe to the alarm notifications. - -The specification in this document has a structure that is very similar to the -ETSI NFV specifications. The notifications differ in that an alarm notification -in the NFV interface defines a single fault for a single resource while the -notification specified in this document can contain multiple faults for -multiple resources. The Doctor specification is lacking the detailed time stamps -of the NFV specification essential for synchronizaion of the alarm list -using the query operation. The detailed time stamps are also of value in the event -and alarm history DBs. - -AODH defines a base class for alarms, not the notifications. This means that -some of the dynamic attributes of the ETSI NFV alarm type, like alarmRaisedTime, -are not applicable to the AODH alarm class but are attributes of in the actual -notifications. (Description of these attributes will be added later.) The AODH alarm -class is lacking some attributes present in the NFV specification, fault details -and correlated alarms. Instead the AODH alarm class has attributes for actions, -rules and user and project id. - - -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| ETSI NFV Alarm Type | OPNFV Doctor | AODH Event Alarm | Description / Comment | Recommendations | -| | Requirement Specs | Notification | | | -+========================+========================+=====================+=============================================+=======================================+ -| alarmId | FaultId | alarm_id | Identifier of an alarm. | \- | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| \- | \- | alarm_name | Human readable alarm name. | May be added in ETSI NFV Stage 3. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| managedObjectId | VirtualResourceId | (reason) | Identifier of the affected virtual resource | \- | -| | | | is part of the AODH reason parameter. | | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| \- | \- | user_id, project_id | User and project identifiers. | May be added in ETSI NFV Stage 3. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| alarmRaisedTime | \- | \- | Timestamp when alarm was raised. | To be added to Doctor and AODH. May | -| | | | | be derived (e.g. in a shimlayer) from | -| | | | | the AODH alarm history. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| alarmChangedTime | \- | \- | Timestamp when alarm was changed/updated. | see above | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| alarmClearedTime | \- | \- | Timestamp when alarm was cleared. | see above | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| eventTime | \- | \- | Timestamp when alarm was first observed by | see above | -| | | | the Monitor. | | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| \- | EventTime | generated | Timestamp of the Notification. | Update parameter name in Doctor spec. | -| | | | | May be added in ETSI NFV Stage 3. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| state: | VirtualResourceState: | current: ok, alarm, | ETSI NFV IFA 005/006 lists example alarm | Maintenance state is missing in AODH. | -| E.g. Fired, Updated | E.g. normal, down | insufficient_data | states. | List of alarm states will be | -| Cleared | maintenance, error | | | specified in ETSI NFV Stage 3. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| perceivedSeverity: | Severity (Integer) | Severity: | ETSI NFV IFA 005/006 lists example | List of alarm states will be | -| E.g. Critical, Major, | | low (default), | perceived severity values. | specified in ETSI NFV Stage 3. | -| Minor, Warning, | | moderate, critical | | | -| Indeterminate, Cleared | | | | **OPNFV: Severity (Integer)**: | -| | | | | * update OPNFV Doctor specification | -| | | | | to *Enum* | -| | | | | | -| | | | | **perceivedSeverity=Indetermined**: | -| | | | | * remove value *Indetermined* in | -| | | | | IFA and map undefined values to | -| | | | | “minor” severity, or | -| | | | | * add value *indetermined* in AODH | -| | | | | and make it the default value. | -| | | | | | -| | | | | **perceivedSeverity=Cleared**: | -| | | | | * remove value *Cleared* in IFA as | -| | | | | the information about a cleared | -| | | | | alarm alarm can be derived from | -| | | | | the alarm state parameter, or | -| | | | | * add value *cleared* in AODH and | -| | | | | set a rule that the severity is | -| | | | | “cleared” when the state is *ok*. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| faultType | FaultType | event_type in | Type of the fault, e.g. “CPU failure” of a | OpenStack Alarming (Aodh) can use a | -| | | reason_data | compute resource, in machine interpretable | fuzzy matching with wildcard string, | -| | | | format. | "compute.cpu.failure". | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| N/A | N/A | type = "event" | Type of the notification. For fault | \- | -| | | | notifications the type in AODH is “event”. | | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| probableCause | ProbableCause | \- | Probable cause of the alarm. | May be provided (e.g. in a shimlayer) | -| | | | | based on Vitrage topology awareness / | -| | | | | root-cause-analysis. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| isRootCause | IsRootCause | \- | Boolean indicating whether the fault is the | see above | -| | | | root cause of other faults. | | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| correlatedAlarmId | CorrelatedFaultId | \- | List of IDs of correlated faults. | see above | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| faultDetails | FaultDetails | \- | Additional details about the fault/alarm. | FaultDetails information element will | -| | | | | be specified in ETSI NFV Stage 3. | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ -| \- | \- | action, previous | Additional AODH alarm related parameters. | \- | -+------------------------+------------------------+---------------------+---------------------------------------------+---------------------------------------+ - -Table: Comparison of alarm attributes - -The primary area of improvement should be alignment of the perceived severity. This -is important for a quick and accurate evaluation of the alarm. AODH thus should -support also the X.733 values Critical, Major, Minor, Warning and Indeterminate. - -The detailed time stamps (raised, changed, cleared) which are essential for -synchronizing the alarm list using a query operation should be added to the -Doctor specification. - -Other areas that need alignment is the so called alarm state in NFV. Here we must -however consider what can be attributes of the notification vs. what should be a -property of the alarm instance. This will be analyzed later. - -.. _southbound: - -Detailed southbound interface specification -------------------------------------------- - -This section is specifying the southbound interfaces for fault management -between the Monitors and the Inspector. -Although southbound interfaces should be flexible to handle various events from -different types of Monitors, we define unified event API in order to improve -interoperability between the Monitors and the Inspector. -This is not limiting implementation of Monitor and Inspector as these could be -extended in order to support failures from intelligent inspection like prediction. - -Note: The interface definition will be aligned with current work in ETSI NFV IFA -working group. - -Fault event interface -^^^^^^^^^^^^^^^^^^^^^ - -This interface allows the Monitors to notify the Inspector about an event which -was captured by the Monitor and may effect resources managed in the VIM. - -EventNotification -_________________ - - -Event notification including fault description. -The entity of this notification is event, and not fault or error specifically. -This allows us to use generic event format or framework build out of Doctor project. -The parameters below shall be mandatory, but keys in 'Details' can be optional. - -Parameters: - -* Time [1]: Datetime when the fault was observed in the Monitor. -* Type [1]: Type of event that will be used to process correlation in Inspector. -* Details [0..1]: Details containing additional information with Key-value pair style. - Keys shall be defined depending on the Type of the event. - -E.g.: - -.. code-block:: bash - - { - 'event': { - 'time': '2016-04-12T08:00:00', - 'type': 'compute.host.down', - 'details': { - 'hostname': 'compute-1', - 'source': 'sample_monitor', - 'cause': 'link-down', - 'severity': 'critical', - 'status': 'down', - 'monitor_id': 'monitor-1', - 'monitor_event_id': '123', - } - } - } - -Optional parameters in 'Details': - -* Hostname: the hostname on which the event occurred. -* Source: the display name of reporter of this event. This is not limited to monitor, other entity can be specified such as 'KVM'. -* Cause: description of the cause of this event which could be different from the type of this event. -* Severity: the severity of this event set by the monitor. -* Status: the status of target object in which error occurred. -* MonitorID: the ID of the monitor sending this event. -* MonitorEventID: the ID of the event in the monitor. This can be used by operator while tracking the monitor log. -* RelatedTo: the array of IDs which related to this event. - -Also, we can have bulk API to receive multiple events in a single HTTP POST -message by using the 'events' wrapper as follows: - -.. code-block:: bash - - { - 'events': [ - 'event': { - 'time': '2016-04-12T08:00:00', - 'type': 'compute.host.down', - 'details': {}, - }, - 'event': { - 'time': '2016-04-12T08:00:00', - 'type': 'compute.host.nic.error', - 'details': {}, - } - ] - } - - - - -Blueprints ----------- - -This section is listing a first set of blueprints that have been proposed by the -Doctor project to the open source community. Further blueprints addressing other -gaps identified in Section 4 will be submitted at a later stage of the OPNFV. In -this section the following definitions are used: - -* "Event" is a message emitted by other OpenStack services such as Nova and - Neutron and is consumed by the "Notification Agents" in Ceilometer. -* "Notification" is a message generated by a "Notification Agent" in Ceilometer - based on an "event" and is delivered to the "Collectors" in Ceilometer that - store those notifications (as "sample") to the Ceilometer "Databases". - -Instance State Notification (Ceilometer) [*]_ -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The Doctor project is planning to handle "events" and "notifications" regarding -Resource Status; Instance State, Port State, Host State, etc. Currently, -Ceilometer already receives "events" to identify the state of those resources, -but it does not handle and store them yet. This is why we also need a new event -definition to capture those resource states from "events" created by other -services. - -This BP proposes to add a new compute notification state to handle events from -an instance (server) from nova. It also creates a new meter "instance.state" in -OpenStack. - -.. [*] https://etherpad.opnfv.org/p/doctor_bps - -Event Publisher for Alarm (Ceilometer) [*]_ -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -**Problem statement:** - - The existing "Alarm Evaluator" in OpenStack Ceilometer is periodically - querying/polling the databases in order to check all alarms independently from - other processes. This is adding additional delay to the fault notification - send to the Consumer, whereas one requirement of Doctor is to react on faults - as fast as possible. - - The existing message flow is shown in :numref:`figure12`: after receiving - an "event", a "notification agent" (i.e. "event publisher") will send a - "notification" to a "Collector". The "collector" is collecting the - notifications and is updating the Ceilometer "Meter" database that is storing - information about the "sample" which is capured from original "event". The - "Alarm Evaluator" is periodically polling this databases then querying "Meter" - database based on each alarm configuration. - - In the current Ceilometer implementation, there is no possibility to directly - trigger the "Alarm Evaluator" when a new "event" was received, but the "Alarm - Evaluator" will only find out that requires firing new notification to the - Consumer when polling the database. - -**Change/feature request:** - - This BP proposes to add a new "event publisher for alarm", which is bypassing - several steps in Ceilometer in order to avoid the polling-based approach of - the existing Alarm Evaluator that makes notification slow to users. - - After receiving an "(alarm) event" by listening on the Ceilometer message - queue ("notification bus"), the new "event publisher for alarm" immediately - hands a "notification" about this event to a new Ceilometer component - "Notification-driven alarm evaluator" proposed in the other BP (see Section - 5.6.3). - - Note, the term "publisher" refers to an entity in the Ceilometer architecture - (it is a "notification agent"). It offers the capability to provide - notifications to other services outside of Ceilometer, but it is also used to - deliver notifications to other Ceilometer components (e.g. the "Collectors") - via the Ceilometer "notification bus". - -**Implementation detail** - - * "Event publisher for alarm" is part of Ceilometer - * The standard AMQP message queue is used with a new topic string. - * No new interfaces have to be added to Ceilometer. - * "Event publisher for Alarm" can be configured by the Administrator of - Ceilometer to be used as "Notification Agent" in addition to the existing - "Notifier" - * Existing alarm mechanisms of Ceilometer can be used allowing users to - configure how to distribute the "notifications" transformed from "events", - e.g. there is an option whether an ongoing alarm is re-issued or not - ("repeat_actions"). - -.. [*] https://etherpad.opnfv.org/p/doctor_bps - -Notification-driven alarm evaluator (Ceilometer) [*]_ -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -**Problem statement:** - -The existing "Alarm Evaluator" in OpenStack Ceilometer is periodically -querying/polling the databases in order to check all alarms independently from -other processes. This is adding additional delay to the fault notification send -to the Consumer, whereas one requirement of Doctor is to react on faults as fast -as possible. - -**Change/feature request:** - -This BP is proposing to add an alternative "Notification-driven Alarm Evaluator" -for Ceilometer that is receiving "notifications" sent by the "Event Publisher -for Alarm" described in the other BP. Once this new "Notification-driven Alarm -Evaluator" received "notification", it finds the "alarm" configurations which -may relate to the "notification" by querying the "alarm" database with some keys -i.e. resource ID, then it will evaluate each alarm with the information in that -"notification". - -After the alarm evaluation, it will perform the same way as the existing "alarm -evaluator" does for firing alarm notification to the Consumer. Similar to the -existing Alarm Evaluator, this new "Notification-driven Alarm Evaluator" is -aggregating and correlating different alarms which are then provided northbound -to the Consumer via the OpenStack "Alarm Notifier". The user/administrator can -register the alarm configuration via existing Ceilometer API [*]_. Thereby, he -can configure whether to set an alarm or not and where to send the alarms to. - -**Implementation detail** - -* The new "Notification-driven Alarm Evaluator" is part of Ceilometer. -* Most of the existing source code of the "Alarm Evaluator" can be re-used to - implement this BP -* No additional application logic is needed -* It will access the Ceilometer Databases just like the existing "Alarm - evaluator" -* Only the polling-based approach will be replaced by a listener for - "notifications" provided by the "Event Publisher for Alarm" on the Ceilometer - "notification bus". -* No new interfaces have to be added to Ceilometer. - - -.. [*] https://etherpad.opnfv.org/p/doctor_bps -.. [*] https://wiki.openstack.org/wiki/Ceilometer/Alerting - -Report host fault to update server state immediately (Nova) [*]_ -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -**Problem statement:** - -* Nova state change for failed or unreachable host is slow and does not reliably - state host is down or not. This might cause same server instance to run twice - if action taken to evacuate instance to another host. -* Nova state for server(s) on failed host will not change, but remains active - and running. This gives the user false information about server state. -* VIM northbound interface notification of host faults towards VNFM and NFVO - should be in line with OpenStack state. This fault notification is a Telco - requirement defined in ETSI and will be implemented by OPNFV Doctor project. -* Openstack user cannot make HA actions fast and reliably by trusting server - state and host state. - -**Proposed change:** - -There needs to be a new API for Admin to state host is down. This API is used to -mark services running in host down to reflect the real situation. - -Example on compute node is: - -* When compute node is up and running::: - - vm_state: activeand power_state: running - nova-compute state: up status: enabled - -* When compute node goes down and new API is called to state host is down::: - - vm_state: stopped power_state: shutdown - nova-compute state: down status: enabled - -**Alternatives:** - -There is no attractive alternative to detect all different host faults than to -have an external tool to detect different host faults. For this kind of tool to -exist there needs to be new API in Nova to report fault. Currently there must be -some kind of workarounds implemented as cannot trust or get the states from -OpenStack fast enough. - -.. [*] https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately - -Other related BPs -^^^^^^^^^^^^^^^^^ - -This section lists some BPs related to Doctor, but proposed by drafters outside -the OPNFV community. - -pacemaker-servicegroup-driver [*]_ -__________________________________ - -This BP will detect and report host down quite fast to OpenStack. This however -might not work properly for example when management network has some problem and -host reported faulty while VM still running there. This might lead to launching -same VM instance twice causing problems. Also NB IF message needs fault reason -and for that the source needs to be a tool that detects different kind of faults -as Doctor will be doing. Also this BP might need enhancement to change server -and service states correctly. - -.. [*] https://blueprints.launchpad.net/nova/+spec/pacemaker-servicegroup-driver |