-rw-r--r-- | Section_1.rst | 237 |
-rw-r--r-- | Section_2_Hardware_HA.rst | 186 |
2 files changed, 423 insertions, 0 deletions
diff --git a/Section_1.rst b/Section_1.rst
new file mode 100644
index 0000000..0b8d72f
--- /dev/null
+++ b/Section_1.rst
@@ -0,0 +1,237 @@
+===================================================
+1.0 Overall Principle for High Availability in NFV
+===================================================
+
+The ultimate goal for the High Availability schema is to provide high
+availability to the upper layer services.
+
+High availability is provided by the following steps once a failure happens:
+
+    Step 1: Failover of services once a failure happens and the service is out of work.
+
+    Step 2: Recovery of failed parts in each layer.
+
+******************************************
+1.1 Framework for High Availability in NFV
+******************************************
+
+Framework for Carrier Grade High Availability:
+
+A layered approach to availability is required for the following reasons:
+
+* fault isolation
+* fault tolerance
+* fault recovery
+
+Among the OPNFV projects the OPNFV-HA project's focus is on requirements related
+to service high availability. This is complemented by other projects such as the
+OPNFV-Doctor project, whose focus is reporting and management of faults along
+with maintenance, the OPNFV-Escalator project that considers the upgrade of the
+NFVI and VIM, or the OPNFV-Multisite that adds geographical redundancy to the
+picture.
+
+A layered approach allows the definition of failure domains (e.g., the
+networking hardware, the distributed storage system, etc.). If possible, a fault
+shall be handled at the layer (failure domain) where it occurs. If a failure
+cannot be handled at its corresponding layer, the next higher layer needs to be
+able to handle it. In no case shall a failure cause cascading failures at other
+layers.
+
+The layers are:
+
+
++---------------------------+-------------------------------------+
+| Service                   | End customer visible service        |
++===========================+=====================================+
+| Application               | VNFs, VNFCs                         |
++---------------------------+-------------------------------------+
+| NFVI/VIM                  | Infrastructure, VIM, VNFM, VM       |
++---------------------------+-------------------------------------+
+| Hardware                  | Servers, COTS platforms             |
++---------------------------+-------------------------------------+
+
+The following document describes the various layers and how they need to
+address high availability.
+
+***************
+1.2 Definitions
+***************
+
+The definitions below reference the ETSI NFV documents.
+
+**Availability:** Ability of an item to be in a state to perform a required
+function at a given instant of time or at any instant of time within a given
+time interval, assuming that the external resources, if required, are provided.
+
+**Accessibility:** It is the ability of a service to access (physical) resources
+necessary to provide that service. If the target service satisfies the minimum
+level of accessibility, it is possible to provide this service to end users.
+
+**Admission control:** It is the administrative decision (e.g. by operator's policy) to actually
+provide a service. In order to provide a more stable and reliable service, admission control may
+require better performance and/or additional resources than the minimum requirement.
+
+**Failure:** Deviation of the delivered service from fulfilling the system function.
+
+**Fault:** Adjudged or hypothesized cause of an error.
+
+**Service availability:** Service availability of <Service X> is the long-term
+average of the ratio of aggregate time between interruptions to scheduled
+service time of <Service X> (expressed as a percentage) on a user-to-user basis.
+The time between interruptions is categorized as Available (Up time) using the
+availability criteria as defined by the parameter thresholds that are relevant
+for <Service X>.
+
+According to the ETSI GS NFV-REL 001 V1.1.1 (2015-01) document, service
+availability in the context of NFV is defined as End-to-End Service Availability.
+
+.. (MT) The relevant parts in NFV-REL define SA as:
+
+Service Availability refers to the End-to-End Service Availability which
+includes all the elements in the end-to-end service (VNFs and infrastructure
+components) with the exception of the customer terminal. This is a customer
+facing (end user) availability definition and it is the result of accessibility
+and admission control (see their respective definitions above).
+
+Service Availability = total service available time /
+    (total service available time + total restoration time)
+
+**Service continuity:** Continuous delivery of service in conformance with
+service's functional and behavioural specification and SLA requirements,
+both in the control and data planes, for any initiated transaction or session
+until its full completion even in the events of intervening exceptions or
+anomalies, whether scheduled or unscheduled, malicious, intentional
+or unintentional.
+
+The relevant parts in NFV-REL:
+The basic property of service continuity is that the same service is provided
+during VNF scaling in/out operations, or when the VNF offering that service
+needs to be relocated to another site due to an anomaly event
+(e.g. CPU overload, hardware failure or security threat).
+
+**Service failover:** When the instance providing a service/VNF becomes
+unavailable due to fault or failure, another instance will (automatically) take
+over the service, and this whole process is transparent to the user. It is
+possible that an entire VNF instance becomes unavailable while providing its
+service.
+
+.. (MT) I think the service or an instance of it is a logical entity on its own and the service availability and continuity is with respect to this logical entity. For example if a HTTP server serves a given URL, the HTTP server is the provider while that URL is the service it is providing. As long as I have an HTTP server running and serving this URL I have the service available. But no matter how many HTTP servers I'm running if they are not assigned to serve the URL, then it is not available. Unfortunately in the ETSI NFV documents there's not a clear distinction between the service and the provider logical entities. The distinction is more on the level of the different incarnations of the provider entity, i.e. VNF and its instances or VNFC and its instances. I don't know if I'm clear enough and to what extent we should go into this, but I tried to modify the definition along these lines. Now regarding the user perception and whether it's automatic I agreed that we want it automatic and seamless for the user, but I don't think that this is part of the failover definition. If it's done manually or if the user detects it it's still a failover. It's just not seamless. Requiring it being automatic and seamless should be in the requirement section as appropriate.
+
+.. (fq) Agree.
+
+**Service failover time:** Service failover is when the instance providing a
+service becomes unavailable due to a fault or a failure and another healthy
+instance takes over in providing the service. In the HA context this should be
+an automatic action and this whole process should be transparent to the user.
+It is possible that an entire VNF instance becomes unavailable while providing
+its service.
+
+.. (MT) Aligned with the above I would say that the service failover time is the time from the moment of detecting the failure of the instance providing the service until the service is provided again by a new instance.
+
+.. (fq) So in such definition, the time duration for the failure of the service = failure detection time + service failover time. Am I correct?
+
+.. (bb) I feel, it is; "time duration for failover of the service = failure detection time + service failover time".
+.. (MT) I would say that the "failure detection time" + "service failover time" = "service outage time" or actually we defined it below as the "service recovery time" . To reduce the outage we probably can't do much with the "service failover time", it is whatever is needed to perform the failover procedure, so it's tied to the implementation. It's somewhat "given". We may have more control over the detection time as that depends on the frequency of the health-check/heartbeat as this is often configurable.
+
+.. (fq) Got it. Agree.
+
+**Failure detection:** If a failure is detected, the failure must be identified
+to the component responsible for correction.
+
+.. (MT) I would rather say "failure detection" as the fault is not detectable until it becomes a failure, even then we may not know where the actual fault is. We only know what failed due to the fault. E.g. we can detect the memory leak, something may crash due to it, but it's much more difficult to figure out where the fault is, i.e. the bug in the software.
+
+.. (MT) Also I think failures may be detected by different entities in the system, e.g. it could be a monitoring entity, a watchdog, the hypervisor, the VNF itself or a VNF trying to use the services of a failed VNF. For me all these are failure detections regardless whether they are reported to the VNF. I think from an HA perspective what's important is the error report API(s) that entities should use if they detect a failure they are not in charge of correcting.
+.. (fq) Agree. I modify the definition.
+
+**Failure detection time:** Failure detection time is the time interval from the
+moment the failure occurs till it is reported as a detected failure.
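The two relations discussed so far (the Service Availability ratio, and the review note that failure detection time plus service failover time gives the service outage time) can be sketched in a few lines of Python. This is only an illustration and not part of the patch: the function names and the example numbers (one failure per year, 1s detection, 4s failover) are assumptions.

```python
def service_availability(available_s: float, restoration_s: float) -> float:
    """Availability = available time / (available time + restoration time)."""
    total = available_s + restoration_s
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return available_s / total


def service_outage_time(detection_s: float, failover_s: float) -> float:
    """Outage (recovery) time = failure detection time + service failover time."""
    if detection_s < 0 or failover_s < 0:
        raise ValueError("times must be non-negative")
    return detection_s + failover_s


if __name__ == "__main__":
    year_s = 365 * 24 * 3600  # hypothetical observation window: one year
    # Hypothetical single failure: 1 s to detect, 4 s to fail over.
    outage = service_outage_time(detection_s=1.0, failover_s=4.0)
    availability = service_availability(year_s - outage, outage)
    print(f"outage: {outage:.1f} s, availability: {availability:.9f}")
```

As the review comments point out, the failover term is largely fixed by the implementation, so the detection term is usually the one to tune, for example through the health-check period.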
+
+**Alarm:** Alarms are notifications (not queried) that are activated in response
+to an event, a set of conditions, or the state of an inventory object. They
+also require attention from an entity external to the reporting entity (if not
+then the entity should cope with it and not raise the alarm).
+
+.. (MT) According to NFV-INF 004: Alarms are notifications (not queried) that are activated in response to an event, a set of conditions, or the state of an inventory object. I would add also that they also require attention from an entity external to the reporting entity (if not then the entity should cope with it and not raise the alarm).
+
+**Alarm threshold condition detection:** Alarm threshold condition is detected
+by the component responsible for it. The component periodically evaluates the
+condition associated with the alarm and if the threshold is reached, it
+generates an alarm on the appropriate channel, which in turn delivers it to the
+entity(ies) responsible, such as the VIM.
+
+.. (fq) I don't think the VNF needs to know all the alarms, so I use VIM as the terminal point for the alarm detection
+
+.. (MT) The same way as for the faults/failures, I don't think it's the receiving end that is important but the generating end and that it has the right and appropriate channel to communicate the alarm. But I have the impression that you are focusing on a particular type of alarm (i.e. threshold alarm) and not alarms in general.
+
+.. (fq) Yes, I actually had the threshold alarm in my mind when I wrote this. So I think VIM might be the right receiving end for these alarms. I agree with your ideas about the right channel. I am just not sure whether we should put this part in a high level perspective or we should define some details. After all OPNFV is an opensource project and we don't want it to be like standardization projects in ETSI. But I agree for the definition part we should have a high level and abstract definition for these, and then we can specify the detail channels in the API definition.
+
+.. (MT) I tried to modify accordingly. Pls check. I think when it comes to the receiver we don't need to be specific from the detection perspective as usually there is a well-known notification channel that the management entity if it exists would listen to. The alarm detection does not require this entity, it just states that something is wrong and someone should deal with it hence the alarm.
+
+**Alarm threshold detection time:** The time interval between the
+metrics exceeding the threshold and the alarm being detected.
+
+.. (MT) I assume you are focusing on these threshold alarms, and not alarms in general.
+.. (MT) Here similar to the failover time, we may have some control over the detection time (i.e. shorten the evaluation period), but may not on the delivery time.
+.. (MT2) I changed "condition" to "threshold" to make it clearer as failure is a "condition" too :-)
+
+**Service recovery:** The restoration of the service state after the instance of
+a service/VNF is unavailable due to fault or failure or manual interruption.
+
+.. (MT) I think the service recovery is the restoration of the state in which the required function is provided
+
+**Service recovery time:** Service recovery time is the time interval from the
+occurrence of an abnormal event (e.g. failure, manual interruption of service,
+etc.) until recovery of the service.
+
+.. (MT) in NFV-REL: Service recovery time is the time interval from the occurrence of an abnormal event (e.g. failure, manual interruption of service, etc.) until recovery of the service.
+
+**SAL:** Service Availability Level
+
+************************
+1.3 Overall requirements
+************************
+
+Service availability shall be considered with respect to the delivery of end to
+end services.
+
+* There should be no single point of failure in the NFV framework.
+* All resiliency mechanisms shall be designed for a multi-vendor environment,
+  where for example the NFVI, NFV-MANO, and VNFs may be supplied by different
+  vendors.
+* Resiliency related information shall always be explicitly specified and
+  communicated using the reference interfaces (including policies/templates) of
+  the NFV framework.
+
+*********************
+1.4 Time requirements
+*********************
+
+The time requirements below are examples that break out the failure detection
+times from the service recovery times presented as examples for the different
+service availability levels in the ETSI GS NFV-REL 001 V1.1.1 (2015-01)
+document.
+
+The table below maps failure modes to example failure detection and recovery times.
+
++------------------------------------------------------------+---------------+
+|Failure Mode                                                | Time          |
++============================================================+===============+
+|Failure detection of HW                                     | <1s           |
++------------------------------------------------------------+---------------+
+|Failure detection of virtual resource                       | <1s           |
++------------------------------------------------------------+---------------+
+|Alarm threshold detection                                   | <1min         |
++------------------------------------------------------------+---------------+
+|Failure detection of SAL 1                                  | <1s           |
++------------------------------------------------------------+---------------+
+|Recovery of SAL 1                                           | 5-6s          |
++------------------------------------------------------------+---------------+
+|Failure detection of SAL 2                                  | <5s           |
++------------------------------------------------------------+---------------+
+|Recovery of SAL 2                                           | 10-15s        |
++------------------------------------------------------------+---------------+
+|Failure detection of SAL 3                                  | <10s          |
++------------------------------------------------------------+---------------+
+|Recovery of SAL 3                                           | 20-25s        |
++------------------------------------------------------------+---------------+
+
diff --git a/Section_2_Hardware_HA.rst b/Section_2_Hardware_HA.rst
new file mode 100644
index 0000000..7f4e054
--- /dev/null
+++ b/Section_2_Hardware_HA.rst
@@ -0,0 +1,186 @@
+===============
+2.0 Hardware HA
+===============
+
+The hardware HA can be solved by several legacy HA schemes. However, when
+considering the NFV scenarios, a hardware failure will cause collateral damage
+not only to the services but also to the virtual infrastructure running on it.
+
+A redundant architecture and automatic failover for the hardware are required
+for the NFV scenario. At the same time, HW failures must be detected and
+reported from the hardware to the VIM, the VNFM and, if necessary, the Orchestrator to achieve HA in OPNFV. A
+sample fault table can be found in the Doctor project. (https://wiki.opnfv.org/doctor/faults)
+All the critical hardware failures should be reported to the VIM within 1s.
+
+.. (MT2) Should we keep the 50ms here? Other places have been modified to <1sec, e.g. for SAL 1.
+
+.. (fq2) agree with 1s
+
+Other warnings for the hardware should also be reported to the VIM in a
+timely manner.
+
+*********************
+General Requirements:
+*********************
+
+.. (MT) Are these general requirements or just for the servers?
+
+.. (fq) I think these should be the general requirements, not just the server.
+
+* Hardware failures should be reported to the hypervisor and the VIM.
+* Hardware failures should not be directly reported to the VNF as in the traditional ATCA
+  architecture.
+* Hardware failure detection messages should be sent to the VIM within a specified period of time,
+  based on the SAL as defined in Section 1.
+* Alarm thresholds should be detected and the alarm delivered to the VIM within 1min. A certain
+  threshold can be set for such notification.
+* Direct notification from the hardware to some specific VNF should be possible.
+  Such notification should be within 1s.
+* Periodical update of hardware running conditions (operational state?) to the
+  NFVI and VIM is required for further operation, which may include fault
+  prediction, failure analysis, etc. Such info should be updated every 60s.
+* Transparent failover is required once the failure of storage and network
+  hardware happens.
+* Hardware should support SNMP and IPMI for centralized management, monitoring and
+  control.
+
+.. (MT) I would assume that this is OK if no guest was impacted, if there was a guest impact I think the VIM etc should know about the issue; in any case logging the failure and its correction would be still important
+.. (fq) It seems the hardware failure detection message should be sent to the VIM, shall we delete the hypervisor part?
+.. (MT) The reason I asked the question whether this is about the servers was the hypervisor. I agree to remove this from the general requirement.
+.. (Yifei) Shall we take VIM user (VNFM & NFVO) into consideration? As some of the messages should be sent to VIM user.
+.. (fq) yifei, I am a little bit confused, do you mean the hardware sends messages directly to the VIM user? I myself think this may not be possible?
+.. (Yifei) Yes, you're right, they should be sent to VIM first.
+.. (MT) I agree, they should be sent to the VIM, the hypervisor can only be conditional because it may not be relevant as in a general requirement or may be dead with the HW.
+.. (fq) Agree. I have deleted the hypervisor part so that it is not a general requirement.
+.. may require realtime features in openstack
+
+.. (fq) We may need some discussion about the time constraints? including failure detection time, VNF failover time, warning for abnormal situations. A table might be needed to clarify these. Different levels of VNF may require different failover times.
+
+.. (MT) I agree. A VNF that manages its own availability with "built-in" redundancy wouldn't really care whether it's 1s or 1min because it would detect the failure and do the failover at the VNF level. But if the availability is managed by the VIM and VNFM then this time becomes critical.
+
+.. (joe) VIM can only rescue or migrate the VM onto another host in case of hardware failure. The VNF should have already finished the failover before the failed/faulty VM is rescued or migrated. VIM's responsibility is to keep the number of alive VM instances required by VNF, even for auto scaling, but not to replace the VNF failover. That's why the hardware failure detection message for VIM is not so time sensitive, because VM creation is often a slow task compared to failover (although there is a lot of technology to accelerate the VM generation speed or use a spare VM pool).
+
+.. (fq) Yes. But here we just mean failure detection, not rescue or migration of the VM. I mean the hardware and NFVI failure should be reported to the VIM and the VNF in a timely manner, then the VNF can do the failover, and the VIM can do the migration and rescue afterwards.
+
+.. (bb) There is confusion regarding time span within which hardware failure should be reported to VIM. In 2nd paragraph (of Hardware HA), it has been mentioned as "within 50ms" and in this point it is "1s".
+
+.. (fq) I try to modify the 50ms to 1s.
+
+.. (chayi) hard for openstack
+
+.. VNF failover time < 1s
+
+.. (MT) Indeed, it's not designed for that
+
+.. (MT) Do the "hardware failure detection message" and the "alarm of hardware failure" refer to the same notification? It may be better to speak about hardware failure detection (and reporting) time.
+
+.. (fq) I have made the modification. see if it makes sense to you now.
+
+.. (MT) Based on the definition section I think you are talking about these threshold alarms only, because a failure is also an abnormal situation, but you want to detect it within a second
+
+.. (fq) Actually, I want to define Alarm as messages that might lead to failure in the near future, for example, a high temperature, or maybe a prediction of failure. These alarms may be important, but they do not need to be answered and solved within seconds.
+
+.. Alarms for abnormal situations and performance decrease (i.e. overuse of cpu)
+.. should be raised to the VIM within 1min(?).
+
+
+.. (MT) It should be possible to set some threshold at which the notification should be triggered and probably ceilometer is not reliable enough to deliver such notifications since it has no real-time requirement nor is it expected to be lossless.
+
+.. (fq) modification made.
+
+.. (MT) agree with the realtime extension part :-)
+
+.. (MT) Considering the modified definitions can we say that: Alarm conditions should be detected and the alarm delivered to the VIM within 1min?
+
+.. This effectively results in two requirements: one on the detection and one on the
+.. delivery mechanism.
+
+.. (fq) Agree. I have made the modification.
+
+
+
+.. In the meantime, I see the discussion of
+.. this requirement is still open.
+
+.. (Yifei) As before I do not think it is needed to send HW fault/failure to VNF. For it is different from traditional integrated NF, all the lifecycle of VNF is managed by VNFM.
+
+.. (joe) the HW fault/failure to VNF is required directly for VNF failover purpose. For example, memory or nic failure should be noticed by VNF ASAP, so that the service can be taken over and handled correctly by another VNF instance.
+
+.. (YY) In what case HW failure to VNF directly? Next is my understanding, may be not correct. If cpu/memory fails hostOS may be crashed at the same time the failure occurred then no notification could be sent anywhere. If it is not crashed in some well managed smp OS, and if we use cpu-pinning to VM, the vm guestOS may be crashed. If cpu-pinning is not applied to VM, the hypervisor can continue scheduling the VMs on the server just like over-allocation mode. Another point, to accelerate the failover, the failure should be sent to standby service entity not the failed one. The standby vm should not be in same server because of anti-affinity scheme. How can "direct notice" apply?
+
+.. (joe) not all HW faults lead to the VNF being crashed. For example, the nic can not send packets as usual, then it'll affect the service, but the VNF is still running.
+
+
+.. Maybe 10 min is too long. As far as I know, Zabbix which is used by Doctor can
+.. achieve 60s.
+
+.. (fq) change the constraint to 60s
+
+.. (MT2) I think this applies primarily to storage, network hardware and maybe some controllers, which also run in some type of redundancy e.g. active/active or active/standby. For compute, we need redundancy, but it's more of the spare concept to replace any failed compute in the cluster (e.g. N+1). In this context the failover doesn't mean the recovery of a state, it only means replacing the failed HW with a healthy one in the initial state and that's not transparent at the HW level at least, i.e. the host is not brought up with the same identity as the failed one.
+
+.. (fq) agree. I have made some modification. I wonder what controller do you mean? is it SDN controller?
+
+.. (MT3) Yes, SDN, storage controllers. I don't know if any of the OpenStack controllers would also have such requirement, e.g. Ironic
+
+
+
+.. (MT) Is it expected for _all_ hardware?
+
+.. (YY) As general requirement should we add that the hardware should allow for
+.. centralized management and control? Maybe we could be even more specific
+.. e.g. what protocol should be supported.
+
+.. (fq) I agree. as far as I know, the protocols we use for hardware include SNMP and IPMI.
+
+.. (MT) OK, we can start with those as minimum requirement, i.e. HW should support at least them. Also I think the Ironic project in OpenStack manages the HW and also supports these. I was thinking maybe it could also be used for the HW management although that's not the general goal of Ironic as far as I know.
+
+***************************
+Network plane Requirements:
+***************************
+
+* The hardware should provide a redundant architecture for the network plane.
+* Failures of the network plane should be reported to the VIM within 1s.
+* QoS should be used to protect against link congestion.
+
+.. (MT) Do you mean the failure of the entire network plane?
+.. (fq) no, I mean the failure of the network connection of a certain HW, or a VNF.
+
+********************
+Power supply system:
+********************
+
+* The power supply architecture should be redundant at the server and site level.
+* Fault of the power supply system should be reported to the VIM within 1s.
+* Failure of a power supply will trigger automatic failover to the redundant supply.
+
+***************
+Cooling system:
+***************
+
+* The architecture of the cooling system should be redundant.
+* Fault of the cooling system should be reported to the VIM within 1s.
+* Failure of the cooling system will trigger automatic failover of the system.
+
+***********
+Disk Array:
+***********
+
+* The architecture for the disk array should be redundant.
+* Fault of the disk array should be reported to the VIM within 1s.
+* Failure of the disk array will trigger automatic failover of the system.
+* The disk array should support a protected cache after an unexpected power loss.
+
+* Data shall be stored redundantly in the storage backend
+  (e.g., by means of RAID across disks.)
+* Upon failures of storage hardware components (e.g., disks services, storage
+  nodes) automatic repair mechanisms (re-build/re-balance of data) shall be
+  triggered automatically.
+* Centralized storage arrays shall consist of redundant hardware.
+
+********
+Servers:
+********
+
+* Support precise timing with accuracy better than 4.6ppm
+
+.. (MT2) Should we have time synchronization requirements in the other parts? I.e. having NTP in control nodes or even in all hosts
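To give a feel for the 4.6ppm figure in the server requirement, the sketch below converts a frequency accuracy in ppm into worst-case drift of a free-running clock, and into the longest interval between synchronizations for a given drift budget. It is an illustration only, not part of the patch: the function names, the one-day interval and the 10 ms budget are assumptions.

```python
def max_drift_seconds(accuracy_ppm: float, interval_s: float) -> float:
    """Worst-case drift of a free-running clock over interval_s seconds."""
    if accuracy_ppm < 0 or interval_s < 0:
        raise ValueError("accuracy and interval must be non-negative")
    return accuracy_ppm * 1e-6 * interval_s


def max_resync_interval_seconds(accuracy_ppm: float, drift_budget_s: float) -> float:
    """Longest interval between synchronizations that keeps drift within budget."""
    if accuracy_ppm <= 0:
        raise ValueError("accuracy must be positive")
    if drift_budget_s < 0:
        raise ValueError("drift budget must be non-negative")
    return drift_budget_s / (accuracy_ppm * 1e-6)


if __name__ == "__main__":
    day_s = 24 * 3600
    # Hypothetical inputs: 4.6 ppm clock, one day unsynchronized, 10 ms budget.
    print(f"drift per day: {max_drift_seconds(4.6, day_s):.3f} s")
    print(f"resync every: {max_resync_interval_seconds(4.6, 0.010):.0f} s")
```

At 4.6ppm an unsynchronized clock can drift by roughly 0.4 s per day, which is one way to frame the open question above about requiring NTP on control nodes or on all hosts.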