diff options
Diffstat (limited to 'docs/requirements/03-architecture.rst')
-rw-r--r-- | docs/requirements/03-architecture.rst | 298 |
1 files changed, 298 insertions, 0 deletions
diff --git a/docs/requirements/03-architecture.rst b/docs/requirements/03-architecture.rst new file mode 100644 index 00000000..9b618e01 --- /dev/null +++ b/docs/requirements/03-architecture.rst @@ -0,0 +1,298 @@ +High level architecture and general features +============================================ + +Functional overview +------------------- + +The Doctor project circles around two distinct use cases: 1) management of +failures of virtualized resources and 2) planned maintenance, e.g. migration, of +virtualized resources. Both of them may affect a VNF/application and the network +service it provides, but there is a difference in frequency and how they can be +handled. + +Failures are spontaneous events that may or may not have an impact on the +virtual resources. The Consumer should as soon as possible react to the failure, +e.g., by switching to the STBY node. The Consumer will then instruct the VIM on +how to clean up or repair the lost virtual resources, i.e. restore the VM, VLAN +or virtualized storage. How much the applications are affected varies. +Applications with built-in HA support might experience a short decrease in +retainability (e.g. an ongoing session might be lost) while keeping availability +(establishment or re-establishment of sessions are not affected), whereas the +impact on applications without built-in HA may be more serious. How much the +network service is impacted depends on how the service is implemented. With +sufficient network redundancy the service may be unaffected even when a specific +resource fails. + +On the other hand, planned maintenance impacting virtualized resources are events +that are known in advance. This group includes e.g. migration due to software +upgrades of OS and hypervisor on a compute host. Some of these might have been +requested by the application or its management solution, but there is also a +need for coordination on the actual operations on the virtual resources. There +may be an impact on the applications and the service, but since they are not +spontaneous events there is room for planning and coordination between the +application management organization and the infrastructure management +organization, including performing whatever actions that would be required to +minimize the problems. + +Failure prediction is the process of pro-actively identifying situations that +may lead to a failure in the future unless acted on by means of maintenance +activities. From applications' point of view, failure prediction may impact them +in two ways: either the warning time is so short that the application or its +management solution does not have time to react, in which case it is equal to +the failure scenario, or there is sufficient time to avoid the consequences by +means of maintenance activities, in which case it is similar to planned +maintenance. + +Architecture Overview +--------------------- + +NFV and the Cloud platform provide virtual resources and related control +functionality to users and administrators. :numref:`figure3` shows the high +level architecture of NFV focusing on the NFVI, i.e., the virtualized +infrastructure. The NFVI provides virtual resources, such as virtual machines +(VM) and virtual networks. Those virtual resources are used to run applications, +i.e. VNFs, which could be components of a network service which is managed by +the consumer of the NFVI. The VIM provides functionalities of controlling and +viewing virtual resources on hardware (physical) resources to the consumers, +i.e., users and administrators. OpenStack is a prominent candidate for this VIM. +The administrator may also directly control the NFVI without using the VIM. + +Although OpenStack is the target upstream project where the new functional +elements (Controller, Notifier, Monitor, and Inspector) are expected to be +implemented, a particular implementation method is not assumed. Some of these +elements may sit outside of OpenStack and offer a northbound interface to +OpenStack. + +General Features and Requirements +--------------------------------- + +The following features are required for the VIM to achieve high availability of +applications (e.g., MME, S/P-GW) and the Network Services: + +1. Monitoring: Monitor physical and virtual resources. +2. Detection: Detect unavailability of physical resources. +3. Correlation and Cognition: Correlate faults and identify affected virtual + resources. +4. Notification: Notify unavailable virtual resources to their Consumer(s). +5. Fencing: Shut down or isolate a faulty resource +6. Recovery action: Execute actions to process fault recovery and maintenance. + +The time interval between the instant that an event is detected by the +monitoring system and the Consumer notification of unavailable resources shall +be < 1 second (e.g., Step 1 to Step 4 in :numref:`figure4` and :numref:`figure5`). + +.. figure:: images/figure3.png + :name: figure3 + :width: 100% + + High level architecture + +Monitoring +^^^^^^^^^^ + +The VIM shall monitor physical and virtual resources for unavailability and +suspicious behavior. + +Detection +^^^^^^^^^ + +The VIM shall detect unavailability and failures of physical resources that +might cause errors/faults in virtual resources running on top of them. +Unavailability of physical resource is detected by various monitoring and +managing tools for hardware and software components. This may include also +predicting upcoming faults. Note, fault prediction is out of scope of this +project and is investigated in the OPNFV "Data Collection for Failure +Prediction" project [PRED]_. + +The fault items/events to be detected shall be configurable. + +The configuration shall enable Failure Selection and Aggregation. Failure +aggregation means the VIM determines unavailability of physical resource from +more than two non-critical failures related to the same resource. + +There are two types of unavailability - immediate and future: + +* Immediate unavailability can be detected by setting traps of raw failures on + hardware monitoring tools. +* Future unavailability can be found by receiving maintenance instructions + issued by the administrator of the NFVI or by failure prediction mechanisms. + +Correlation and Cognition +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The VIM shall correlate each fault to the impacted virtual resource, i.e., the +VIM shall identify unavailability of virtualized resources that are or will be +affected by failures on the physical resources under them. Unavailability of a +virtualized resource is determined by referring to the mapping of physical and +virtualized resources. + +VIM shall allow configuration of fault correlation between physical and +virtual resources. VIM shall support correlating faults: + +* between a physical resource and another physical resource +* between a physical resource and a virtual resource +* between a virtual resource and another virtual resource + +Failure aggregation is also required in this feature, e.g., a user may request +to be only notified if failures on more than two standby VMs in an (N+M) +deployment model occurred. + +Notification +^^^^^^^^^^^^ + +The VIM shall notify the alarm, i.e., unavailability of virtual resource(s), to +the Consumer owning it over the northbound interface, such that the Consumers +impacted by the failure can take appropriate actions to recover from the +failure. + +The VIM shall also notify the unavailability of physical resources to its +Administrator. + +All notifications shall be transferred immediately in order to minimize the +stalling time of the network service and to avoid over assignment caused by +delay of capability updates. + +There may be multiple consumers, so the VIM has to find out the owner of a +faulty resource. Moreover, there may be a large number of virtual and physical +resources in a real deployment, so polling the state of all resources to the VIM +would lead to heavy signaling traffic. Thus, a publication/subscription +messaging model is better suited for these notifications, as notifications are +only sent to subscribed consumers. + +Notifications will be send out along with the configuration by the consumer. +The configuration includes endpoint(s) in which the consumers can specify +multiple targets for the notification subscription, so that various and +multiple receiver functions can consume the notification message. +Also, the conditions for notifications shall be configurable, such that +the consumer can set according policies, e.g. whether it wants to receive +fault notifications or not. + +Note: the VIM should only accept notification subscriptions for each resource +by its owner or administrator. +Notifications to the Consumer about the unavailability of virtualized +resources will include a description of the fault, preferably with sufficient +abstraction rather than detailed physical fault information. + +.. _fencing: + +Fencing +^^^^^^^ +Recovery actions, e.g. safe VM evacuation, have to be preceded by fencing the +failed host. Fencing hereby means to isolate or shut down a faulty resource. +Without fencing -- when the perceived disconnection is due to some transient +or partial failure -- the evacuation might lead into two identical instances +running together and having a dangerous conflict. + +There is a cross-project definition in OpenStack of how to implement +fencing, but there has not been any progress. The general description is +available here: +https://wiki.openstack.org/wiki/Fencing_Instances_of_an_Unreachable_Host + +As OpenStack does not cover fencing it is in the responsibility of the Doctor +project to make sure fencing is done by using tools like pacemaker and by +calling OpenStack APIs. Only after fencing is done OpenStack resources can be +marked as down. In case there are gaps in OpenStack projects to have all +relevant resources marked as down, those gaps need to be identified and fixed. +The Doctor Inspector component will be responsible of marking resources down in +the OpenStack and back up if necessary. + +Recovery Action +^^^^^^^^^^^^^^^ + +In the basic :ref:`uc-fault1` use case, no automatic actions will be taken by +the VIM, but all recovery actions executed by the VIM and the NFVI will be +instructed and coordinated by the Consumer. + +In a more advanced use case, the VIM shall be able to recover the failed virtual +resources according to a pre-defined behavior for that resource. In principle +this means that the owner of the resource (i.e., its consumer or administrator) +can define which recovery actions shall be taken by the VIM. Examples are a +restart of the VM, migration/evacuation of the VM, or no action. + + + +High level northbound interface specification +--------------------------------------------- + +Fault management +^^^^^^^^^^^^^^^^ + +This interface allows the Consumer to subscribe to fault notification from the +VIM. Using a filter, the Consumer can narrow down which faults should be +notified. A fault notification may trigger the Consumer to switch from ACT to +STBY configuration and initiate fault recovery actions. A fault query +request/response message exchange allows the Consumer to find out about active +alarms at the VIM. A filter can be used to narrow down the alarms returned in +the response message. + +.. figure:: images/figure4.png + :name: figure4 + :width: 100% + + High-level message flow for fault management + +The high level message flow for the fault management use case is shown in +:numref:`figure4`. +It consists of the following steps: + +1. The VIM monitors the physical and virtual resources and the fault management + workflow is triggered by a monitored fault event. +2. Event correlation, fault detection and aggregation in VIM. Note: this may + also happen after Step 3. +3. Database lookup to find the virtual resources affected by the detected fault. +4. Fault notification to Consumer. +5. The Consumer switches to standby configuration (STBY) +6. Instructions to VIM requesting certain actions to be performed on the + affected resources, for example migrate/update/terminate specific + resource(s). After reception of such instructions, the VIM is executing the + requested action, e.g., it will migrate or terminate a virtual resource. + +NFVI Maintenance +^^^^^^^^^^^^^^^^ + +The NFVI maintenance interface allows the Administrator to notify the VIM about +a planned maintenance operation on the NFVI. A maintenance operation may for +example be an update of the server firmware or the hypervisor. The +MaintenanceRequest message contains instructions to change the state of the +resource from 'normal' to 'maintenance'. After receiving the MaintenanceRequest, +the VIM will notify the Consumer about the planned maintenance operation, +whereupon the Consumer will switch to standby (STBY) configuration to allow the +maintenance action to be executed. After the request was executed successfully +(i.e., the physical resources have been emptied) or the operation resulted in an +error state, the VIM sends a MaintenanceResponse message back to the +Administrator. + +.. figure:: images/figure5.png + :name: figure5 + :width: 100% + + High-level message flow for NFVI maintenance + +The high level message flow for the NFVI maintenance use case is shown in +:numref:`figure5`. +It consists of the following steps: + +1. Maintenance trigger received from administrator. +2. VIM switches the affected NFVI resources to "maintenance" state, i.e., the + NFVI resources are prepared for the maintenance operation. For example, the + virtual resources should not be used for further allocation/migration + requests and the VIM will coordinate with the Consumer on how to best empty + the physical resources. +3. Database lookup to find the virtual resources affected by the detected + maintenance operation. +4. StateChange notification to inform Consumer about planned maintenance + operation. +5. The Consumer switches to standby configuration (STBY) +6. Instructions from Consumer to VIM requesting certain actions to be performed + (step 6a). After receiving such instructions, the VIM executes the requested + action in order to empty the physical resources (step 6b) and informs the + Consumer is about the result of the actions. Note: this step is out of scope + of Doctor. +7. Maintenance response from VIM to inform the Administrator that the physical + machines have been emptied (or the operation resulted in an error state). +8. The Administrator is coordinating and executing the maintenance + operation/work on the NFVI. Note: this step is out of scope of Doctor. + +.. + vim: set tabstop=4 expandtab textwidth=80: + |