diff options
Diffstat (limited to 'docs/release/Calipso-usage-stories.rst')
-rw-r--r-- | docs/release/Calipso-usage-stories.rst | 446 |
1 files changed, 446 insertions, 0 deletions
diff --git a/docs/release/Calipso-usage-stories.rst b/docs/release/Calipso-usage-stories.rst new file mode 100644 index 0000000..4c0c753 --- /dev/null +++ b/docs/release/Calipso-usage-stories.rst @@ -0,0 +1,446 @@ +***The following are fake stories, although providing real examples of +real problems that are faced today by cloud providers, and showing +possible resolutions provided by Calipso:*** + +***Enterprise use-case story (Calipso ‘S’ release):*** + +Moz is a website publishing and managing product, Moz provides +reputation and popularity tracking, helps with distributions, listing, +and ratings and provides content distributions for industry marketing. + +Moz considers moving their main content distribution application to be +hosted on https://www.dreamhost.com/, which provides shared and +dedicated IaaS and PaaS hosting based on OpenStack. + +As a major milestone for Moz’s due diligence for choosing Dreamhost, Moz +acquires a shared hosting facility from Dreamhost, that is +cost-effective and stable, it includes 4 mid-sized Web-servers, 4 +large-sized Application-servers and 2 large-sized DB servers, connected +using several networks, with some security services. + +Dreamhost executives instruct their infrastructure operations department +to make sure proper SLA and Monitoring is in-place so the due diligence +and final production deployment of Moz’s services in the Dreamhost +datacenter goes well and that Moz’s engineers receive excellent service +experience. + +Moz received the following SLA with their current VPS contract: + +- 97-day money back guarantee, in case of a single service down event + or any dissatisfaction. + +- 99.5 % uptime/availability with a weekly total downtime of 30 + minutes. + +- 24/7/365 on-call service with a total of 6 hours MTTR. + +- Full HA for all networking services. + +- Managed VPS using own Control Panel IaaS provisioning with overall + health visibility. + +- Scalable RAM, starts with 1GB can grow per request to 16GB from + within control panel. + +- Guaranteed usage of SSD or equivalent speeds, storage capacity from + 30GB to 240GB. + +- Backup service based on cinder-backup and Ceph’s dedicated backup + volumes, with restoration time below 4 hours. + +Dreamhost‘s operations factored all requirement and has decided to +include real-time monitoring and analysis for the VPS for Moz. + +One of the tools used now for Moz environment in Dreamhost is Calipso +for virtual networking. + +Here are some benefits provided by Calipso for Dreamhost operations +during service cycles: + +*Reporting:* + +Special handling of virtual networking is in place: + +- Dreamhost designed a certain virtual networking setup and + connectivity that provides the HA and performance required by the SLA + and decided on several physical locations for Moz’s virtual servers + in different availability zones. + +- Scheduling of discovery has been created, Calipso takes a snapshot of + Moz’s environment every Sunday at midnight, reporting on connectivity + among all 20 servers (10 main and 10 backups) and overall health of + that connectivity. + +- Every Sunday morning at 8am, before the week’s automatic + snapshotting, the NOC administrator runs a manual discovery and saves + that snapshot, she then runs a comparison check against last week’s + snapshot and against initial design to find any gaps or changes that + might happen due to other shared services deployments, virtual + instances and their connectivity are analyzed and reported with + Calipso’s topology and health monitoring. + +- Reports are saved for a bi-weekly reporting sent to Moz’s networking + engineers. + + *Change management:* + + If infrastructure changes needs to happen on any virtual service + (routers, switches, firewalls etc.) or on any physical server or + physical switch the following special guidelines apply: + +- Run a search on Calipso for the name of the virtual service, switch + or host. Lookup if Moz environment is using this object (using the + object’s attributes). + +- Using Calipso’s impact analysis, fill a report stating all Moz’s + objects, on which host, connected to which switch that is affected by + the planed change. + +- Run clique-type scan, using the specific object as ‘focal-point’ to + create a dedicated topology with accompanied health report before + conducting the change itself, use this a *pre snapshot*. + +- Simulate the change, using Moz’s testing environment only, make sure + HA services are in places and downtime is confirmed to be in the SLA + boundaries. + +- Using all reports provided by Calipso, along with application and + storage reports, send a detailed change request to NOC and later to + the end-customer for review. + +- During the change, make sure HA is operational, by running the same + clique-type snapshotting every 10 minutes and running a comparison. + +- NOC, while waiting for the change to complete, looks at Calipso’s + dashboard focused on MOZ’s environment, monitoring results for + service down event (as expected), impact on other objects in the + service chain - the entire Calipso clique for that object (as + expected). + +- Once operations has reported back to NOC about change done, run the + same snapshotting again as *post snapshot* and run a comparison to + make sure all virtual networking are back to the ‘as designed’ stage + and all networking services are back. + +**Example snapshot taken at one stage on Calipso for the Moz virtual +networking:** + +|image0| + + *Troubleshooting:* + + Dreamhost NOC uses Calipso dashboards for Moz’s environment for + their daily health-check. Troubleshooting starts in two cases: + +1. When a failure is detected on Calipso for any of Moz’s objects on + their virtual networking topologies, + +2. When a service case has been opened by Moz with “High Priority, + service down” flag. + +3. Networking department needs to know which virtual services are + connected to which ACI switches ports. + + The following actions are taken, using Calipso dashboards: + +- Kickoff a discovery through Calipso API for all objects related to + Moz. + +- For a service request with no Calipso error detected: using Calipso’s + impact analysis, create all cliques for all objects as focal point. + +- For an error detected by Calipso: using Calipso’s impact analysis, + create cliques for objects with errors as focal point. + +- Resulted cliques are then analyzed using detailed messaging facility + in Calipso (looking deeply into any message generated regarding the + related objects). + +- Report with ACI ports to virtual services mappings is sent to + networking department for further analysis. + + |image1| + +- If this is a failure on any physical device (host or switch) and/or + on any physical NIC (switch or host side), Calipso immediately points + this out and using the specific set of messages generated the + administrator can figure out the root cause (like optical failure, + driver, disconnect etc.). + +- In virtual object failures Calipso saves time pinpointing the servers + where erroneous objects are running, and their previous and new + connectivity details. + +- Calipso alerts on dependencies for : + +1. All related objects in the clique for that objects. + +2. Related hosts + +3. Related projects and networks + +4. Related application (\* in case Murano app has been added) + +- Administrators connects directly to the specific servers and now, + using the specific object attributes can start he’s manual + troubleshooting (actual fixing of the software issues is not + currently part of the Calipso features). + +- The NOC operators approves closing the service ticket only when all + related Calipso cliques are showing up as healthy and connectivity is + back to it’s original “as designed” stage, using Calipso older + snapshots. + +**Lookup of message – to – graph object in messaging facility:** + +|image2| + +**Finding the right object related to a specific logging/monitoring +message**: + +|image3| + +***Service Provider use-case story (Calipso ‘P’ release):*** + +BoingBoing is a specialized video casting service and blogging site. It +is using several locations to run their service (regional hubs and +central corporate campus, some hosted and some are private). + +BoingBoing contracted AT&T to build an NFV service for them, deployed on +2 new hosted regional hubs, to be brought up dynamically for special +sporting, news or cloture events. On each one of the 2 hosted virtual +environments the following service chain is created: + +1. Two vyatta 5600 virtual routers are front-end routing aggregation + function. + +2. Two Steelhead virtual wan acceleration appliances connected to + central campus for accelerating and caching of video casting + services. + +3. Two f5 BIG-IP Traffic Management (load balancing) virtual appliances. + +4. Two Cisco vASA for virtual firewall and remote-access VPN services. + +As a major milestone for BoingBoing’s due diligence for choosing AT&T +NFV service, BoingBoing acquires 2 shared hosting facilities and +automatic service from AT&T, that is cost-effective and stable, it +includes This NFV service consist of a total of 16 virtual appliance +across those 2 sites, to be created on-demand and maintained with a +certain SLA once provisioned, all NFV devices are connected using +several networks, provisioned using VPP ml2 on an OpenStack based +environment.. + +AT&T executives instruct their infrastructure operations department to +make sure proper SLA and Monitoring is in-place so the due diligence and +final production deployment of BoingBoing’s services in the AT&T +datacenters goes well and that BoingBoing’s engineers receive excellent +service experience. + +BoingBoing received the following SLA with their current VPS contract: + +- 30-day money back guarantee, in case of a single service down event + or any dissatisfaction. + +- 99.9 % uptime/availability with a weekly total downtime of 10 + minutes. + +- 24/7/365 on-call service with a total of 2 hours MTTR. + +- Full HA for all networking services. + +- Managed service using Control Panel IaaS provisioning with overall + health visibility. + +- Dedicated RAM, from16GB to 64GB from within control panel. + +- Guaranteed usage of SSD or equivalent speeds, storage capacity from + 10GB to 80GB. + +- Backup service based on cinder-backup and Ceph’s dedicated backup + volumes, with restoration time below 4 hours. + +- End-to-end throughput from central campus to dynamically created + regional sites to be always above 2Gbps, including all devices on the + service chain and the virtual networking in place. + +AT&T’s operations factored all requirement and has decided to include +real-time monitoring and analysis for the NFV environment for +BoingBoing. + +One of the tools used now for BoingBoing environment in AT&T is Calipso +for virtual networking. + +Here are some benefits provided by Calipso for AT&T operations during +service cycles: + +*Reporting:* + +Special handling of virtual networking is in place: + +- AT&T designed a certain virtual networking (SFC) setup and + connectivity that provides the HA and performance required by the SLA + and decided on several physical locations for BoingBoing’s virtual + appliances in different availability zones. + +- Scheduling of discovery has been created, Calipso takes a snapshot of + BoingBoing’s environment every Sunday at midnight, reporting on + connectivity among all 16 instances (8 per regional site, 4 pairs on + each) and overall health of that connectivity. + +- Every Sunday morning at 8am, before the week’s automatic + snapshotting, the NOC administrator runs a manual discovery and saves + that snapshot, she then runs a comparison check against last week’s + snapshot and against initial design to find any gaps or changes that + might happen due to other shared services deployments, virtual + instances and their connectivity are analyzed and reported with + Calipso’s topology and health monitoring. + +- Reports are saved for a bi-weekly reporting sent to BoingBoing’s + networking engineers. + +- Throughput is measured by a special traffic sampling technology + inside the VPP virtual switches and sent back to Calipso for + references to virtual objects and topological inventory. Dependencies + are analyzed so SFC topologies are now visualized across all sites + and includes graphing facility on the Calipso UI to visualize the + throughput. + + *Change management:* + + If infrastructure changes needs to happen on any virtual service + (NFV virtual appliances, internal routers, switches, firewalls etc.) + or on any physical server or physical switch the following special + guidelines apply: + +- Run a lookup on Calipso search-engine for the name of the virtual + service, switch or host, including names of NFV appliances as updated + in the Calipso inventory by the NFV provisioning application. Lookup + if BoingBoing environment is using this object (using the object’s + attributes). + + **Running a lookup on Calipso search-engine** + +|image4| + +- Using Calipso’s impact analysis, fill a report stating all + BoingBoing’s objects, on which host, connected to which switch that + is affected by the planed change. + +- Run clique-type scan, using the specific object as ‘focal-point’ to + create a dedicated topology with accompanied health report before + conducting the change itself, use this a *pre snapshot*. + +- Simulate the change, using BoingBoing’s testing environment only, + make sure HA services are in places and downtime is confirmed to be + in the SLA boundaries. + +- Using all reports provided by Calipso, along with application and + storage reports, send a detailed change request to NOC and later to + the end-customer for review. + +- During the change, make sure HA is operational, by running the same + clique-type snapshotting every 10 minutes and running a comparison. + +- NOC, while waiting for the change to complete, looks at Calipso’s + dashboard focused on BoingBoing’s environment, monitoring results for + SFC service down event (as expected), impact on other objects in the + service chain - the entire Calipso clique for that object (as + expected). + +- Once operations has reported back to NOC about change done, run the + same snapshotting again as *post snapshot* and run a comparison to + make sure all virtual networking are back to the ‘as designed’ stage + and all networking services are back. + +**Example snapshot taken at one stage for the BoingBoing virtual +networking and SFC:** + +|image5| + + *Troubleshooting:* + + AT&T NOC uses Calipso dashboards for BoingBoing’s environment for + their daily health-check. Troubleshooting starts in two cases: + +1. When a failure is detected on Calipso for any of BoingBoing’s objects + on their virtual networking topologies, + +2. When a service case has been opened by BoingBoing with “High + Priority, SFC down” flag. + + The following actions are taken, using Calipso dashboards: + +- Kickoff a discovery through Calipso API for all objects related to + BoingBoing. + +- For a service request with no Calipso error detected: using Calipso’s + impact analysis, create all cliques for all objects as focal point. + +- For an error detected by Calipso: using Calipso’s impact analysis, + create cliques for objects with errors as focal point. + +- Resulted cliques are then analyzed using detailed messaging facility + in Calipso (looking deeply into any message generated regarding the + related objects). + +- If this is a failure on any physical device (host or switch) and/or + on any physical NIC (switch or host side), Calipso immediately points + this out and using the specific set of messages generated the + administrator can figure out the root cause (like optical failure, + driver, disconnect etc.). + +- In virtual object failures Calipso saves time pinpointing the servers + where erroneous objects are running, and their previous and new + connectivity details. + +- \*Sources of alerts ...OpenStack, Calipso’s and Sensu are built-in + sources, other NFV related monitoring and alerting sources can be + added to Calipso messaging system. + +- Calipso alerts on dependencies for : + +1. All related objects in the clique for that objects. + +2. Related hosts + +3. Related projects and networks + +4. Related NFV service and SFC (\* in case NFV tacker has been added) + +- Administrators connects directly to the specific servers and now, + using the specific object attributes can start he’s manual + troubleshooting (actual fixing of the software issues is not + currently part of the Calipso features). + +- The NOC operators approves closing the service ticket only when all + related Calipso cliques are showing up as healthy and connectivity is + back to it’s original “as designed” stage, using Calipso older + snapshots. + +**Calipso’s monitoring dashboard shows virtual services are back to +operational state:** + +|image6| + +.. |image0| image:: media/image101.png + :width: 7.14372in + :height: 2.84375in +.. |image1| image:: media/image102.png + :width: 6.99870in + :height: 2.87500in +.. |image2| image:: media/image103.png + :width: 6.50000in + :height: 0.49444in +.. |image3| image:: media/image104.png + :width: 6.50000in + :height: 5.43472in +.. |image4| image:: media/image105.png + :width: 7.24398in + :height: 0.77083in +.. |image5| image:: media/image106.png + :width: 6.50000in + :height: 3.58611in +.. |image6| image:: media/image107.png + :width: 7.20996in + :height: 2.94792in |