Docs: Add monitoring cluster related documentation

This patch adds documentation related to deployment, configuration and usage of K8s monitoring cluster. Also adds the devguide explaining mapping of each yaml file with its associated task. Signed-off-by: Aditya Srivastava <adityasrivastava301199@gmail.com> Change-Id: Ib6252f7c853a643eb5cb9f562a55ee366f9c71ea
author: Aditya Srivastava <adityasrivastava301199@gmail.com> 2020-08-24 02:06:54 +0530
committer: Aditya Srivastava <adityasrivastava301199@gmail.com> 2020-09-17 16:57:49 +0530
commit: 8f3d8b3d1072ca33cf3503e95f8fd3bc629ace18 (patch)
tree: dcdf7029d4ccd00daf058e9e534962fc55135098 /docs/lma
parent: 38a2852c84bb9ce692a79d3f1ab941b9f11106a4 (diff)
4 files changed, 704 insertions, 0 deletions
diff --git a/docs/lma/metrics/devguide.rst b/docs/lma/metrics/devguide.rst
new file mode 100644
index 00000000..93d33016
--- /dev/null
+++ b/docs/lma/metrics/devguide.rst
@@ -0,0 +1,474 @@
+====================
+Metrics Dev Guide
+====================
+Table of Contents
+=================
+.. contents::
+.. section-numbering::
+
+
+Anible File Organization
+============================
+
+Ansible-Server
+----------------
+
+Please follow the following file structure:
+
+.. code-block:: bash
+
+    ansible-server
+    |   ansible.cfg
+    |   hosts
+    |
+    +---group_vars
+    |       all.yml
+    |
+    +---playbooks
+    |       clean.yaml
+    |       setup.yaml
+    |
+    \---roles
+    +---clean-monitoring
+    |   \---tasks
+    |           main.yml
+    |
+    +---monitoring
+        +---files
+        |   |   monitoring-namespace.yaml
+        |   |
+        |   +---alertmanager
+        |   |       alertmanager-config.yaml
+        |   |       alertmanager-deployment.yaml
+        |   |       alertmanager-service.yaml
+        |   |       alertmanager1-deployment.yaml
+        |   |       alertmanager1-service.yaml
+        |   |
+        |   +---cadvisor
+        |   |       cadvisor-daemonset.yaml
+        |   |       cadvisor-service.yaml
+        |   |
+        |   +---collectd-exporter
+        |   |       collectd-exporter-deployment.yaml
+        |   |       collectd-exporter-service.yaml
+        |   |
+        |   +---grafana
+        |   |       grafana-datasource-config.yaml
+        |   |       grafana-deployment.yaml
+        |   |       grafana-pv.yaml
+        |   |       grafana-pvc.yaml
+        |   |       grafana-service.yaml
+        |   |
+        |   +---kube-state-metrics
+        |   |       kube-state-metrics-deployment.yaml
+        |   |       kube-state-metrics-service.yaml
+        |   |
+        |   +---node-exporter
+        |   |       nodeexporter-daemonset.yaml
+        |   |       nodeexporter-service.yaml
+        |   |
+        |   \---prometheus
+        |           main-prometheus-service.yaml
+        |           prometheus-config.yaml
+        |           prometheus-deployment.yaml
+        |           prometheus-pv.yaml
+        |           prometheus-pvc.yaml
+        |           prometheus-service.yaml
+        |           prometheus1-deployment.yaml
+        |           prometheus1-service.yaml
+        |
+        \---tasks
+               main.yml
+
+
+Ansible - Client
+------------------
+
+Please follow the following file structure:
+
+.. code-block:: bash
+
+    ansible-server
+    |   ansible.cfg
+    |   hosts
+    |
+    +---group_vars
+    |       all.yml
+    |
+    +---playbooks
+    |       clean.yaml
+    |       setup.yaml
+    |
+    \---roles
+        +---clean-collectd
+        |   \---tasks
+        |           main.yml
+        |
+        +---collectd
+            +---files
+            |       collectd.conf.j2
+            |
+            \---tasks
+                    main.yml
+
+
+Summary of Roles
+==================
+
+A brief description of the Ansible playbook roles,
+which are used to deploy the  monitoring cluster
+
+Ansible Server Roles
+----------------------
+
+Ansible Server, this part consists of the roles used to deploy
+Prometheus Alertmanager Grafana stack on the server-side
+
+Role: Monitoring
+~~~~~~~~~~~~~~~~~~
+
+Deployment and configuration of PAG stack along with collectd-exporter,
+cadvisor and node-exporter.
+
+Role: Clean-Monitoring
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Removes all the components deployed by the Monitoring role.
+
+
+File-Task Mapping and Configurable Parameters
+================================================
+
+Ansible Server
+----------------
+
+Role: Monitoring
+~~~~~~~~~~~~~~~~~~~
+
+Alert Manager
+^^^^^^^^^^^^^^^
+
+File: alertmanager-config.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/alertmanager/alertmanager-config.yaml
+
+Task: Configures Receivers for alertmanager
+
+Summary: A configmap, currently configures webhook for alertmanager,
+can be used to configure any kind of receiver
+
+Configurable Parameters:
+    receiver.url: change to the webhook receiver's URL
+    route: Can be used to add receivers
+
+
+File: alertmanager-deployment.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/alertmanager/alertmanager-deployment.yaml
+
+Task: Deploys alertmanager instance
+
+Summary: A Deployment, deploys 1 replica of alertmanager
+
+
+File: alertmanager-service.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/alertmanager/alertmanager-service.yaml
+
+Task: Creates a K8s service for alertmanager
+
+Summary: A Nodeport type of service, so that user can create "silences",
+view the status of alerts from the native alertmanager dashboard / UI.
+
+Configurable Parameters:
+    spec.type: Options : NodePort, ClusterIP, LoadBalancer
+    spec.ports: Edit / add ports to be handled by the service
+
+**Note: alertmanager1-deployment, alertmanager1-service are the same as
+alertmanager-deployment and alertmanager-service respectively.**
+
+CAdvisor
+^^^^^^^^^^^
+
+File: cadvisor-daemonset.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/cadvisor/cadvisor-daemonset.yaml
+
+Task: To create a cadvisor daemonset
+
+Summary: A daemonset, used to scrape data of the kubernetes cluster itself,
+its a daemonset so an instance is run on every node.
+
+Configurable Parameters:
+    spec.template.spec.ports: Port of the container
+
+
+File: cadvisor-service.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/cadvisor/cadvisor-service.yaml
+
+Task: To create a cadvisor service
+
+Summary: A ClusterIP service for cadvisor to communicate with prometheus
+
+Configurable Parameters:
+    spec.ports: Add / Edit ports
+
+
+Collectd Exporter
+^^^^^^^^^^^^^^^^^^^^
+
+File: collectd-exporter-deployment.yaml
+''''''''''''''''''''''''''''''''''''''''''
+Path : monitoring/files/collectd-exporter/collectd-exporter-deployment.yaml
+
+Task: To create a collectd replica
+
+Summary: A deployment, acts as receiver for collectd data sent by client machines,
+prometheus pulls data from this exporter
+
+Configurable Parameters:
+    spec.template.spec.ports: Port of the container
+
+
+File: collectd-exporter.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/collectd-exporter/collectd-exporter.yaml
+
+Task: To create a collectd service
+
+Summary: A NodePort service for collectd-exporter to hold data for prometheus
+to scrape
+
+Configurable Parameters:
+    spec.ports: Add / Edit ports
+
+
+Grafana
+^^^^^^^^^
+
+File: grafana-datasource-config.yaml
+''''''''''''''''''''''''''''''''''''''''''
+Path : monitoring/files/grafana/grafana-datasource-config.yaml
+
+Task: To create config file for grafana
+
+Summary: A configmap, adds prometheus datasource in grafana
+
+
+File: grafana-deployment.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/grafana/grafana-deployment.yaml
+
+Task: To create a grafana deployment
+
+Summary: The grafana deployment creates a single replica of grafana,
+with preconfigured prometheus datasource.
+
+Configurable Parameters:
+    spec.template.spec.ports: Edit ports
+    spec.template.spec.env: Add / Edit environment variables
+
+
+File: grafana-pv.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/grafana/grafana-pv.yaml
+
+Task: To create a persistent volume for grafana
+
+Summary: A persistent volume for grafana.
+
+Configurable Parameters:
+    spec.capacity.storage: Increase / decrease size
+    spec.accessModes: To change the way PV is accessed.
+    spec.nfs.server: To change the ip address of NFS server
+    spec.nfs.path: To change the path of the server
+
+
+File: grafana-pvc.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/grafana/grafana-pvc.yaml
+
+Task: To create a persistent volume claim for grafana
+
+Summary: A persistent volume claim for grafana.
+
+Configurable Parameters:
+    spec.resources.requests.storage: Increase / decrease size
+
+
+File: grafana-service.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/grafana/grafana-service.yaml
+
+Task: To create a service for grafana
+
+Summary: A Nodeport type of service, so that users actually connect to,
+view the dashboard / UI.
+
+Configurable Parameters:
+    spec.type: Options : NodePort, ClusterIP, LoadBalancer
+    spec.ports: Edit / add ports to be handled by the service
+
+
+Kube State Metrics
+^^^^^^^^^^^^^^^^^^^^
+
+File: kube-state-metrics-deployment.yaml
+''''''''''''''''''''''''''''''''''''''''''
+Path : monitoring/files/kube-state-metrics/kube-state-metrics-deployment.yaml
+
+Task: To create a kube-state-metrics instance
+
+Summary: A deployment, used to collect metrics of the kubernetes cluster iteself
+
+Configurable Parameters:
+    spec.template.spec.containers.ports: Port of the container
+
+
+File: kube-state-metrics-service.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/kube-state-metrics/kube-state-metrics-service.yaml
+
+Task: To create a collectd service
+
+Summary: A NodePort service for collectd-exporter to hold data for prometheus
+to scrape
+
+Configurable Parameters:
+    spec.ports: Add / Edit ports
+
+
+Node Exporter
+^^^^^^^^^^^^^^^
+
+File: node-exporter-daemonset.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/node-exporter/node-exporter-daemonset.yaml
+
+Task: To create a node exporter daemonset
+
+Summary: A daemonset, used to scrape data of the host machines / node,
+its a daemonset so an instance is run on every node.
+
+Configurable Parameters:
+    spec.template.spec.ports: Port of the container
+
+
+File: node-exporter-service.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/node-exporter/node-exporter-service.yaml
+
+Task: To create a node exporter service
+
+Summary: A ClusterIP service for node exporter to communicate with Prometheus
+
+Configurable Parameters:
+    spec.ports: Add / Edit ports
+
+
+Prometheus
+^^^^^^^^^^^^^
+
+File: prometheus-config.yaml
+''''''''''''''''''''''''''''''''''''''''''
+Path : monitoring/files/prometheus/prometheus-config.yaml
+
+Task: To create a config file for Prometheus
+
+Summary: A configmap, adds alert rules.
+
+Configurable Parameters:
+    data.alert.rules: Add / Edit alert rules
+
+
+File: prometheus-deployment.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/prometheus/prometheus-deployment.yaml
+
+Task: To create a Prometheus deployment
+
+Summary: The Prometheus deployment creates a single replica of Prometheus,
+with preconfigured Prometheus datasource.
+
+Configurable Parameters:
+    spec.template.spec.affinity: To change the node affinity,
+                                 make sure only 1 instance of prometheus is
+                                 running on 1 node.
+
+    spec.template.spec.ports: Add / Edit container port
+
+
+File: prometheus-pv.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/prometheus/prometheus-pv.yaml
+
+Task: To create a persistent volume for Prometheus
+
+Summary: A persistent volume for Prometheus.
+
+Configurable Parameters:
+    spec.capacity.storage: Increase / decrease size
+    spec.accessModes: To change the way PV is accessed.
+    spec.hostpath.path: To change the path of the volume
+
+
+File: prometheus-pvc.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/prometheus/prometheus-pvc.yaml
+
+Task: To create a persistent volume claim for Prometheus
+
+Summary: A persistent volume claim for Prometheus.
+
+Configurable Parameters:
+    spec.resources.requests.storage: Increase / decrease size
+
+
+File: prometheus-service.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/prometheus/prometheus-service.yaml
+
+Task: To create a service for prometheus
+
+Summary: A Nodeport type of service, prometheus native dashboard
+available here.
+
+Configurable Parameters:
+    spec.type: Options : NodePort, ClusterIP, LoadBalancer
+    spec.ports: Edit / add ports to be handled by the service
+
+
+File: main-prometheus-server.yaml
+'''''''''''''''''''''''''''''''''''
+Path: monitoring/files/prometheus/main-prometheus-service.yaml
+
+Task: A service that connects both prometheus instances.
+
+Summary: A Nodeport service for other services to connect to the Prometheus cluster.
+As HA Prometheus needs to independent instances of Prometheus scraping the same inputs
+having the same configuration
+
+**Note: prometheus-deployment, prometheus1-service are the same as
+prometheus-deployment and prometheus-service respectively.**
+
+
+Ansible Client Roles
+----------------------
+
+Role: Collectd
+~~~~~~~~~~~~~~~~~~
+
+File: main.yml
+^^^^^^^^^^^^^^^^
+Path: collectd/tasks/main.yaml
+
+Task: Install collectd along with prerequisites
+
+Associated template file:
+
+- collectd.conf.j2
+Path: collectd/files/collectd.conf.j2
+
+Summary: Edit this file to change the default configuration to
+be installed on the client's machine
diff --git a/docs/lma/metrics/images/dataflow.png b/docs/lma/metrics/images/dataflow.png
new file mode 100644
index 00000000..ca1ec908
--- /dev/null
+++ b/docs/lma/metrics/images/dataflow.png
diff --git a/docs/lma/metrics/images/setup.png b/docs/lma/metrics/images/setup.png
new file mode 100644
index 00000000..ce6a1274
--- /dev/null
+++ b/docs/lma/metrics/images/setup.png
diff --git a/docs/lma/metrics/userguide.rst b/docs/lma/metrics/userguide.rst
new file mode 100644
index 00000000..0ee4a238
--- /dev/null
+++ b/docs/lma/metrics/userguide.rst
@@ -0,0 +1,230 @@
+=================
+Metrics
+=================
+Table of Contents
+=================
+.. contents::
+.. section-numbering::
+
+Setup
+=======
+
+Prerequisites
+-------------------------
+- Require 3 VMs to setup K8s
+- ``$ sudo yum install ansible``
+- ``$ pip install openshift pyyaml kubernetes`` (required for ansible K8s module)
+- Update IPs in all these files (if changed)
+    - ``ansible-server/group_vars/all.yml`` (IP of apiserver and hostname)
+    - ``ansible-server/hosts`` (IP of VMs to install)
+    - ``ansible-server/roles/monitoring/files/grafana/grafana-pv.yaml`` (IP of NFS-Server)
+    - ``ansible-server/roles/monitoring/files/alertmanager/alertmanager-config.yaml`` (IP of alert-receiver)
+
+Setup Structure
+---------------
+.. image:: images/setup.png
+
+Installation - Client Side
+----------------------------
+
+Nodes
+`````
+- **Node1** = 10.10.120.21
+- **Node4** = 10.10.120.24
+
+How installation is done?
+`````````````````````````
+Ansible playbook available in ``tools/lma/ansible-client`` folder
+
+- ``cd tools/lma/ansible-client``
+- ``ansible-playbook setup.yaml``
+
+This deploys collectd and configures it to send data to collectd exporter
+configured at 10.10.120.211 (ip address of current instance of collectd-exporter)
+Please make appropriate changes in the config file present in ``tools/lma/ansible-client/roles/collectd/files/``
+
+Installation - Server Side
+----------------------------
+
+Nodes
+``````
+
+Inside Jumphost - POD12
+   - **VM1** = 10.10.120.211
+   - **VM2** = 10.10.120.203
+   - **VM3** = 10.10.120.204
+
+
+How installation is done?
+`````````````````````````
+**Using Ansible:**
+   - **K8s**
+      - **Prometheus:** 2 independent deployments
+      - **Alertmanager:** 2 independent deployments (cluster peers)
+      - **Grafana:** 1 Replica deployment
+      - **cAdvisor:** 1 daemonset, i.e 3 replicas, one on each node
+      - **collectd-exporter:** 1 Replica
+      - **node-exporter:** 1 statefulset with 3 replicas
+      - **kube-state-metrics:** 1 deployment
+   - **NFS Server:** at each VM to store grafana data at following path
+      - ``/usr/share/monitoring_data/grafana``
+
+How to setup?
+`````````````
+- **To setup K8s cluster, EFK and PAG:** Run the ansible-playbook ``ansible/playbooks/setup.yaml``
+- **To clean everything:** Run the ansible-playbook ``ansible/playbooks/clean.yaml``
+
+Do we have HA?
+````````````````
+Yes
+
+Configuration
+=============
+
+K8s
+---
+Path to all yamls (Server Side)
+````````````````````````````````
+``tools/lma/ansible-server/roles/monitoring/files/``
+
+K8s namespace
+`````````````
+``monitoring``
+
+Configuration
+---------------------------
+
+Serivces and Ports
+``````````````````````````
+
+Services and their ports are listed below,
+one can go to IP of any node on the following ports,
+service will correctly redirect you
+
+
+  ======================       =======
+      Service                   Port
+  ======================       =======
+     Prometheus                 30900
+     Prometheus1                30901
+     Main-Prometheus            30902
+     Alertmanager               30930
+     Alertmanager1              30931
+     Grafana                    30000
+     Collectd-exporter          30130
+  ======================       =======
+
+How to change Configuration?
+------------------------------
+- Ports, names of the containers, pretty much every configuration can be modified by changing the required values in the respective yaml files (``/tools/lma/ansible-server/roles/monitoring/``)
+- For metrics, on the client's machine, edit the collectd's configuration (jinja2 template) file, and add required plugins (``/tools/lma/ansible-client/roles/collectd/files/collectd.conf.j2``).
+  For more details refer `this <https://collectd.org/wiki/index.php/First_steps>`_
+
+Where to send metrics?
+------------------------
+
+Metrics are sent to collectd exporter.
+UDP packets are sent to port 38026
+(can be configured and checked at
+``tools/lma/ansible-server/roles/monitoring/files/collectd-exporter/collectd-exporter-deployment.yaml``)
+
+Data Management
+================================
+
+DataFlow:
+--------------
+.. image:: images/dataFlow.png
+
+Where is the data stored now?
+----------------------------------
+  - Grafana data (including dashboards) ==> On master, at ``/usr/share/monitoring_data/grafana`` (its accessed by Presistent volume via NFS)
+  - Prometheus Data ==> On VM2 and VM3, at /usr/share/monitoring_data/prometheus
+
+  **Note: Promethei data also are independent of each other, a shared data solution gave errors**
+
+Do we have backup of data?
+-------------------------------
+  Promethei even though independent scrape same targets,
+  have same alert rules, therefore generate very similar data.
+
+  Grafana's NFS part of the data has no backup
+  Dashboards' json are available in the ``/tools/lma/metrics/dashboards`` directory
+
+When containers are restarted, the data is still accessible?
+-----------------------------------------------------------------
+  Yes, unless the data directories are deleted ``(/usr/share/monitoring_data/*)`` from each node
+
+Alert Management
+==================
+
+Configure Alert receiver
+--------------------------
+- Go to file ``/tools/lma/ansible-server/roles/monitoring/files/alertmanager/alertmanager-config.yaml``
+- Under the config.yml section under receivers, add, update, delete receivers
+- Currently ip of unified alert receiver is used.
+- Alertmanager supports multiple types of receivers, you can get a `list here <https://prometheus.io/docs/alerting/latest/configuration/>`_
+
+Add new alerts
+--------------------------------------
+- Go to file ``/tools/lma/ansible-server/roles/monitoring/files/prometheus/prometheus-config.yaml``
+- Under the data section alert.rules file is mounted on the config-map.
+- In this file alerts are divided in 4 groups, namely:
+        - targets
+        - host and hardware
+        - container
+        - kubernetes
+- Add alerts under exisiting group or add new group. Please follow the structure of the file for adding new group
+- To add new alert:
+    - Use the following structure:
+
+               alert: alertname
+
+               expr: alert rule (generally promql conditional query)
+
+               for: time-range (eg. 5m, 10s, etc, the amount of time the condition needs to be true for the alert to be triggered)
+
+               labels:
+
+                      severity: critical (other severity options and other labels can be added here)
+
+                      type: hardware
+
+               annotations:
+
+                      summary: <summary of the alert>
+
+                      description: <descibe the alert here>
+
+- For an exhaustive alerts list you can have a look `here <https://awesome-prometheus-alerts.grep.to/>`_
+
+Troubleshooting
+===============
+No metrics received in grafana plot
+---------------------------------------------
+- Check if all configurations are correctly done.
+- Go to main-prometheus's port and any one VMs' ip, and check if prometheus is getting the metrics
+- If prometheus is getting them, read grafana's logs (``kubectl -n monitoring logs <name_of_grafana_pod>``)
+- Else, have a look at collectd exporter's metrics endpoint (eg. 10.10.120.211:30103/metrics)
+- If collectd is getting them, check prometheus's config file if collectd's ip is correct over there.
+- Else ssh to master, check which node collectd-exporter is scheduled (lets say vm2)
+- Now ssh to vm2
+- Use ``tcpdump -i ens3 #the interface used to connect to the internet > testdump``
+- Grep your client node's ip and check if packets are reaching our monitoring cluster (``cat testdump | grep <ip of client>``)
+- Ideally you should see packets reaching the node, if so please see if the collectd-exporter is running correctly, check its logs.
+- If no packets are received, error is on the client side, check collectd's config file and make sure correct collectd-exporter ip is used in the ``<network>`` section.
+
+If no notification received
+---------------------------
+- Go to main-prometheus's port and any one VMs' ip,(eg. 10.10.120.211:30902) and check if prometheus is getting the metrics
+- If no, read "No metrics received in grafana plot" section, else read ahead.
+- Check IP of alert-receiver, you can see this by going to alertmanager-ip:port and check if alertmanager is configured correctly.
+- If yes, paste the alert rule in the prometheus' query-box and see if any metric staisfy the condition.
+- You may need to change alert rules in the alert.rules section of prometheus-config.yaml if there was a bug in the alert's rule. (please read the "Add new alerts" section for detailed instructions)
+
+Reference
+=========
+- `Prometheus K8S deployment <https://www.metricfire.com/blog/how-to-deploy-prometheus-on-kubernetes/>`_
+- `HA Prometheus <https://prometheus.io/docs/introduction/faq/#can-prometheus-be-made-highly-available>`_
+- `Data Flow Diagram <https://drive.google.com/file/d/1D--LXFqU_H-fqpD57H3lJFOqcqWHoF0U/view?usp=sharing>`_
+- `Collectd Configuration <https://docs.opnfv.org/en/stable-fraser/submodules/barometer/docs/release/userguide/docker.userguide.html#build-the-collectd-docker-image>`_
+- `Alertmanager Rule Config <https://awesome-prometheus-alerts.grep.to/>`_
author	Aditya Srivastava <adityasrivastava301199@gmail.com>	2020-08-24 02:06:54 +0530
committer	Aditya Srivastava <adityasrivastava301199@gmail.com>	2020-09-17 16:57:49 +0530
commit	8f3d8b3d1072ca33cf3503e95f8fd3bc629ace18 (patch)
tree	dcdf7029d4ccd00daf058e9e534962fc55135098 /docs/lma
parent	38a2852c84bb9ce692a79d3f1ab941b9f11106a4 (diff)