Diffstat (limited to 'docs/requirements/03-architecture.rst')
-rw-r--r-- | docs/requirements/03-architecture.rst | 340 |
1 files changed, 0 insertions, 340 deletions
diff --git a/docs/requirements/03-architecture.rst b/docs/requirements/03-architecture.rst
deleted file mode 100644
index b7417691..00000000
--- a/docs/requirements/03-architecture.rst
+++ /dev/null
@@ -1,340 +0,0 @@
-.. This work is licensed under a Creative Commons Attribution 4.0 International License.
-.. http://creativecommons.org/licenses/by/4.0
-
-High level architecture and general features
-============================================
-
-Functional overview
--------------------
-
-The Doctor project circles around two distinct use cases: 1) management of
-failures of virtualized resources and 2) planned maintenance, e.g. migration, of
-virtualized resources. Both of them may affect a VNF/application and the network
-service it provides, but there is a difference in frequency and how they can be
-handled.
-
-Failures are spontaneous events that may or may not have an impact on the
-virtual resources. The Consumer should as soon as possible react to the failure,
-e.g., by switching to the STBY node. The Consumer will then instruct the VIM on
-how to clean up or repair the lost virtual resources, i.e. restore the VM, VLAN
-or virtualized storage. How much the applications are affected varies.
-Applications with built-in HA support might experience a short decrease in
-retainability (e.g. an ongoing session might be lost) while keeping availability
-(establishment or re-establishment of sessions are not affected), whereas the
-impact on applications without built-in HA may be more serious. How much the
-network service is impacted depends on how the service is implemented. With
-sufficient network redundancy the service may be unaffected even when a specific
-resource fails.
-
-On the other hand, planned maintenance impacting virtualized resources are events
-that are known in advance. This group includes e.g. migration due to software
-upgrades of OS and hypervisor on a compute host. Some of these might have been
-requested by the application or its management solution, but there is also a
-need for coordination on the actual operations on the virtual resources. There
-may be an impact on the applications and the service, but since they are not
-spontaneous events there is room for planning and coordination between the
-application management organization and the infrastructure management
-organization, including performing whatever actions that would be required to
-minimize the problems.
-
-Failure prediction is the process of pro-actively identifying situations that
-may lead to a failure in the future unless acted on by means of maintenance
-activities. From applications' point of view, failure prediction may impact them
-in two ways: either the warning time is so short that the application or its
-management solution does not have time to react, in which case it is equal to
-the failure scenario, or there is sufficient time to avoid the consequences by
-means of maintenance activities, in which case it is similar to planned
-maintenance.
-
-Architecture Overview
----------------------
-
-NFV and the Cloud platform provide virtual resources and related control
-functionality to users and administrators. :numref:`figure3` shows the high
-level architecture of NFV focusing on the NFVI, i.e., the virtualized
-infrastructure. The NFVI provides virtual resources, such as virtual machines
-(VM) and virtual networks. Those virtual resources are used to run applications,
-i.e. VNFs, which could be components of a network service which is managed by
-the consumer of the NFVI. The VIM provides functionalities of controlling and
-viewing virtual resources on hardware (physical) resources to the consumers,
-i.e., users and administrators. OpenStack is a prominent candidate for this VIM.
-The administrator may also directly control the NFVI without using the VIM.
-
-Although OpenStack is the target upstream project where the new functional
-elements (Controller, Notifier, Monitor, and Inspector) are expected to be
-implemented, a particular implementation method is not assumed. Some of these
-elements may sit outside of OpenStack and offer a northbound interface to
-OpenStack.
-
-General Features and Requirements
----------------------------------
-
-The following features are required for the VIM to achieve high availability of
-applications (e.g., MME, S/P-GW) and the Network Services:
-
-1. Monitoring: Monitor physical and virtual resources.
-2. Detection: Detect unavailability of physical resources.
-3. Correlation and Cognition: Correlate faults and identify affected virtual
-   resources.
-4. Notification: Notify unavailable virtual resources to their Consumer(s).
-5. Fencing: Shut down or isolate a faulty resource.
-6. Recovery action: Execute actions to process fault recovery and maintenance.
-
-The time interval between the instant that an event is detected by the
-monitoring system and the Consumer notification of unavailable resources shall
-be < 1 second (e.g., Step 1 to Step 4 in :numref:`figure4`).
-
-.. figure:: images/figure3.png
-   :name: figure3
-   :width: 100%
-
-   High level architecture
-
-Monitoring
-^^^^^^^^^^
-
-The VIM shall monitor physical and virtual resources for unavailability and
-suspicious behavior.
-
-Detection
-^^^^^^^^^
-
-The VIM shall detect unavailability and failures of physical resources that
-might cause errors/faults in virtual resources running on top of them.
-Unavailability of a physical resource is detected by various monitoring and
-managing tools for hardware and software components. This may also include
-predicting upcoming faults. Note that fault prediction is out of scope of this
-project and is investigated in the OPNFV "Data Collection for Failure
-Prediction" project [PRED]_.
-
-The fault items/events to be detected shall be configurable.
-
-The configuration shall enable Failure Selection and Aggregation. Failure
-aggregation means the VIM determines unavailability of physical resource from
-more than two non-critical failures related to the same resource.
-
-There are two types of unavailability - immediate and future:
-
-* Immediate unavailability can be detected by setting traps of raw failures on
-  hardware monitoring tools.
-* Future unavailability can be found by receiving maintenance instructions
-  issued by the administrator of the NFVI or by failure prediction mechanisms.
-
-Correlation and Cognition
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The VIM shall correlate each fault to the impacted virtual resource, i.e., the
-VIM shall identify unavailability of virtualized resources that are or will be
-affected by failures on the physical resources under them. Unavailability of a
-virtualized resource is determined by referring to the mapping of physical and
-virtualized resources.
-
-VIM shall allow configuration of fault correlation between physical and
-virtual resources. VIM shall support correlating faults:
-
-* between a physical resource and another physical resource
-* between a physical resource and a virtual resource
-* between a virtual resource and another virtual resource
-
-Failure aggregation is also required in this feature, e.g., a user may request
-to be only notified if failures on more than two standby VMs in an (N+M)
-deployment model occurred.
-
-Notification
-^^^^^^^^^^^^
-
-The VIM shall notify the alarm, i.e., unavailability of virtual resource(s), to
-the Consumer owning it over the northbound interface, such that the Consumers
-impacted by the failure can take appropriate actions to recover from the
-failure.
-
-The VIM shall also notify the unavailability of physical resources to its
-Administrator.
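The correlation and owner lookup described above can be sketched as follows. This is only an illustration under stated assumptions, not the Doctor implementation: the ``HOST_TO_VMS`` and ``VM_OWNER`` mappings, the function names and the notification dictionaries are all hypothetical and stand in for whatever resource database the VIM actually uses.

```python
from collections import defaultdict

# Hypothetical mapping of physical hosts to the virtual resources on them.
HOST_TO_VMS = {
    "host-1": ["vm-a", "vm-b"],
    "host-2": ["vm-c"],
}

# Hypothetical mapping of each virtual resource to the Consumer owning it.
VM_OWNER = {"vm-a": "consumer-1", "vm-b": "consumer-2", "vm-c": "consumer-1"}


def correlate(faulty_host):
    """Return per-Consumer lists of virtual resources affected by a host fault."""
    affected = defaultdict(list)
    for vm in HOST_TO_VMS.get(faulty_host, []):
        affected[VM_OWNER[vm]].append(vm)
    return dict(affected)


def build_notifications(faulty_host):
    """Build one notification per owning Consumer plus one for the Administrator."""
    notifications = [
        {"to": owner, "unavailable": vms}
        for owner, vms in sorted(correlate(faulty_host).items())
    ]
    # The Administrator is told about the physical resource itself.
    notifications.append({"to": "administrator", "unavailable": [faulty_host]})
    return notifications
```

Each Consumer only sees its own virtual resources, while the physical fault is reported to the Administrator alone, which mirrors the split between the two notification targets in the text.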
-
-All notifications shall be transferred immediately in order to minimize the
-stalling time of the network service and to avoid over assignment caused by
-delay of capability updates.
-
-There may be multiple consumers, so the VIM has to find out the owner of a
-faulty resource. Moreover, there may be a large number of virtual and physical
-resources in a real deployment, so polling the VIM for the state of all
-resources would lead to heavy signaling traffic. Thus, a publication/subscription
-messaging model is better suited for these notifications, as notifications are
-only sent to subscribed consumers.
-
-Notifications will be sent out according to the configuration set by the
-consumer. The configuration includes endpoint(s) in which the consumers can
-specify multiple targets for the notification subscription, so that various and
-multiple receiver functions can consume the notification message.
-Also, the conditions for notifications shall be configurable, such that
-the consumer can set policies accordingly, e.g. whether it wants to receive
-fault notifications or not.
-
-Note: the VIM should only accept notification subscriptions for each resource
-by its owner or administrator.
-Notifications to the Consumer about the unavailability of virtualized
-resources will include a description of the fault, preferably with sufficient
-abstraction rather than detailed physical fault information.
-
-.. _fencing:
-
-Fencing
-^^^^^^^
-Recovery actions, e.g. safe VM evacuation, have to be preceded by fencing the
-failed host. Fencing hereby means to isolate or shut down a faulty resource.
-Without fencing -- when the perceived disconnection is due to some transient
-or partial failure -- the evacuation might lead to two identical instances
-running together and having a dangerous conflict.
-
-There is a cross-project definition in OpenStack of how to implement
-fencing, but there has not been any progress. The general description is
-available here:
-https://wiki.openstack.org/wiki/Fencing_Instances_of_an_Unreachable_Host
-
-OpenStack provides some mechanisms that allow fencing of faulty resources. Some
-are automatically invoked by the platform itself (e.g. Nova disables the
-compute service when libvirtd stops running, preventing new VMs from being
-scheduled to that node), while other mechanisms are consumer trigger-based
-actions (e.g. Neutron port admin-state-up). For other fencing actions not
-supported by OpenStack, the Doctor project may suggest ways to address the gap
-(e.g. by resorting to external tools and orchestration methods), or
-documenting or implementing them upstream.
-
-The Doctor Inspector component will be responsible for marking resources down in
-OpenStack and back up if necessary.
-
-Recovery Action
-^^^^^^^^^^^^^^^
-
-In the basic :ref:`uc-fault1` use case, no automatic actions will be taken by
-the VIM, but all recovery actions executed by the VIM and the NFVI will be
-instructed and coordinated by the Consumer.
-
-In a more advanced use case, the VIM may be able to recover the failed virtual
-resources according to a pre-defined behavior for that resource. In principle
-this means that the owner of the resource (i.e., its consumer or administrator)
-can define which recovery actions shall be taken by the VIM. Examples are a
-restart of the VM or migration/evacuation of the VM.
-
-
-
-High level northbound interface specification
----------------------------------------------
-
-Fault Management
-^^^^^^^^^^^^^^^^
-
-This interface allows the Consumer to subscribe to fault notification from the
-VIM. Using a filter, the Consumer can narrow down which faults should be
-notified. A fault notification may trigger the Consumer to switch from ACT to
-STBY configuration and initiate fault recovery actions. A fault query
-request/response message exchange allows the Consumer to find out about active
-alarms at the VIM. A filter can be used to narrow down the alarms returned in
-the response message.
-
-.. figure:: images/figure4.png
-   :name: figure4
-   :width: 100%
-
-   High-level message flow for fault management
-
-The high level message flow for the fault management use case is shown in
-:numref:`figure4`.
-It consists of the following steps:
-
-1. The VIM monitors the physical and virtual resources and the fault management
-   workflow is triggered by a monitored fault event.
-2. Event correlation, fault detection and aggregation in VIM. Note: this may
-   also happen after Step 3.
-3. Database lookup to find the virtual resources affected by the detected fault.
-4. Fault notification to Consumer.
-5. The Consumer switches to standby configuration (STBY).
-6. Instructions to VIM requesting certain actions to be performed on the
-   affected resources, for example migrate/update/terminate specific
-   resource(s). After reception of such instructions, the VIM executes the
-   requested action, e.g., it will migrate or terminate a virtual resource.
-
-NFVI Maintenance
-^^^^^^^^^^^^^^^^
-
-The NFVI maintenance interface allows the Administrator to notify the VIM about
-a planned maintenance operation on the NFVI. A maintenance operation may for
-example be an update of the server firmware or the hypervisor. The
-MaintenanceRequest message contains instructions to change the state of the
-physical resource from 'enabled' to 'going-to-maintenance' and a timeout [#timeout]_.
-After receiving the MaintenanceRequest, the VIM decides on the actions to be taken
-based on maintenance policies predefined by the affected Consumer(s).
-
-.. [#timeout] Timeout is set by the Administrator and corresponds to the maximum time
-   to empty the physical resources.
-
-.. figure:: images/figure5a.png
-   :name: figure5a
-   :width: 100%
-
-   High-level message flow for maintenance policy enforcement
-
-The high level message flow for the NFVI maintenance policy enforcement is shown
-in :numref:`figure5a`. It consists of the following steps:
-
-1. Maintenance trigger received from Administrator.
-2. VIM switches the affected physical resources to "going-to-maintenance" state e.g. so that no new
-   VM will be scheduled on the physical servers.
-3. Database lookup to find the Consumer(s) and virtual resources affected by the maintenance
-   operation.
-4. Maintenance policies are enforced in the VIM, e.g. affected VM(s) are shut down
-   on the physical server(s), or affected Consumer(s) are notified about the planned
-   maintenance operation (steps 4a/4b).
-
-
-Once the affected Consumer(s) have been notified, they take specific actions (e.g. switch to standby
-(STBY) configuration, request to terminate the virtual resource(s)) to allow the maintenance
-action to be executed. After the physical resources have been emptied, the VIM puts the physical
-resources in "in-maintenance" state and sends a MaintenanceResponse back to the Administrator.
-
-.. figure:: images/figure5b.png
-   :name: figure5b
-   :width: 100%
-
-   Successful NFVI maintenance
-
-The high level message flow for a successful NFVI maintenance is shown in :numref:`figure5b`.
-It consists of the following steps:
-
-5. The Consumer C3 switches to standby configuration (STBY).
-6. Instructions from Consumers C2/C3 are sent to the VIM requesting certain actions to be performed
-   (steps 6a, 6b). After receiving such instructions, the VIM executes the requested
-   action in order to empty the physical resources (step 6c) and informs the
-   Consumer about the result of the actions (steps 6d, 6e).
-7. The VIM switches the physical resources to "in-maintenance" state.
-8. Maintenance response is sent from VIM to inform the Administrator that the physical
-   servers have been emptied.
-9. The Administrator coordinates and executes the maintenance
-   operation/work on the NFVI. Note: this step is out of scope of Doctor project.
-
-The requested actions to empty the physical resources may not be successful (e.g. migration fails
-or takes too long) and in such a case, the VIM puts the physical resources back to 'enabled' and
-informs the Administrator about the problem.
-
-.. figure:: images/figure5c.png
-   :name: figure5c
-   :width: 100%
-
-   Example of failed NFVI maintenance
-
-An example of a high level message flow to cover the failed NFVI maintenance case is
-shown in :numref:`figure5c`.
-It consists of the following steps:
-
-5. The Consumer C3 switches to standby configuration (STBY).
-6. Instructions from Consumers C2/C3 are sent to the VIM requesting certain actions to be performed
-   (steps 6a, 6b). The VIM executes the requested actions and sends back a NACK to consumer C2
-   (step 6d) as the migration of the virtual resource(s) is not completed by the given timeout.
-7. The VIM switches the physical resources to "enabled" state.
-8. MaintenanceNotification is sent from VIM to inform the Administrator that the maintenance action
-   cannot start.
-
-
-..
- vim: set tabstop=4 expandtab textwidth=80:
-
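The maintenance flows above (successful and failed) amount to a small state machine on the physical resource. The sketch below illustrates it under stated assumptions: the state strings and message names come from the text, while the ``Host`` class, ``handle_maintenance_request`` and the ``empty_host`` callback are hypothetical names invented for this example and do not correspond to any OpenStack or Doctor API.

```python
ENABLED = "enabled"
GOING_TO_MAINTENANCE = "going-to-maintenance"
IN_MAINTENANCE = "in-maintenance"


class Host:
    """Hypothetical physical resource with the virtual resources it carries."""

    def __init__(self, name, vms):
        self.name = name
        self.vms = list(vms)
        self.state = ENABLED


def handle_maintenance_request(host, timeout, empty_host):
    """Process a MaintenanceRequest for one host and return the reply message.

    empty_host(host, timeout) is a hypothetical callback standing in for
    policy enforcement and Consumer instructions. It returns True if the
    host was emptied within the timeout.
    """
    # From here on no new VM should be scheduled on the host.
    host.state = GOING_TO_MAINTENANCE
    if empty_host(host, timeout) and not host.vms:
        host.state = IN_MAINTENANCE
        return "MaintenanceResponse"
    # Emptying failed or timed out: put the host back into service.
    host.state = ENABLED
    return "MaintenanceNotification"
```

A caller would pass a callback that migrates or terminates the virtual resources. If that callback reports failure, or leaves any resource on the host, the host returns to 'enabled' and the Administrator receives a MaintenanceNotification instead of a MaintenanceResponse.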