Diffstat (limited to 'docs'):

 -rw-r--r--  docs/lma/metrics/devguide.rst        | 474
 -rw-r--r--  docs/lma/metrics/images/dataflow.png | bin 0 -> 42443 bytes
 -rw-r--r--  docs/lma/metrics/images/setup.png    | bin 0 -> 15019 bytes
 -rw-r--r--  docs/lma/metrics/userguide.rst       | 230

 4 files changed, 704 insertions(+), 0 deletions(-)
diff --git a/docs/lma/metrics/devguide.rst b/docs/lma/metrics/devguide.rst
new file mode 100644
index 00000000..93d33016
--- /dev/null
+++ b/docs/lma/metrics/devguide.rst
@@ -0,0 +1,474 @@

====================
Metrics Dev Guide
====================

Table of Contents
=================
.. contents::
.. section-numbering::


Ansible File Organization
============================

Ansible Server
----------------

Please use the following file structure:

.. code-block:: bash

    ansible-server
    |   ansible.cfg
    |   hosts
    |
    +---group_vars
    |       all.yml
    |
    +---playbooks
    |       clean.yaml
    |       setup.yaml
    |
    \---roles
        +---clean-monitoring
        |   \---tasks
        |           main.yml
        |
        \---monitoring
            +---files
            |   |   monitoring-namespace.yaml
            |   |
            |   +---alertmanager
            |   |       alertmanager-config.yaml
            |   |       alertmanager-deployment.yaml
            |   |       alertmanager-service.yaml
            |   |       alertmanager1-deployment.yaml
            |   |       alertmanager1-service.yaml
            |   |
            |   +---cadvisor
            |   |       cadvisor-daemonset.yaml
            |   |       cadvisor-service.yaml
            |   |
            |   +---collectd-exporter
            |   |       collectd-exporter-deployment.yaml
            |   |       collectd-exporter-service.yaml
            |   |
            |   +---grafana
            |   |       grafana-datasource-config.yaml
            |   |       grafana-deployment.yaml
            |   |       grafana-pv.yaml
            |   |       grafana-pvc.yaml
            |   |       grafana-service.yaml
            |   |
            |   +---kube-state-metrics
            |   |       kube-state-metrics-deployment.yaml
            |   |       kube-state-metrics-service.yaml
            |   |
            |   +---node-exporter
            |   |       nodeexporter-daemonset.yaml
            |   |       nodeexporter-service.yaml
            |   |
            |   \---prometheus
            |           main-prometheus-service.yaml
            |           prometheus-config.yaml
            |           prometheus-deployment.yaml
            |           prometheus-pv.yaml
            |           prometheus-pvc.yaml
            |           prometheus-service.yaml
            |           prometheus1-deployment.yaml
            |           prometheus1-service.yaml
            |
            \---tasks
                    main.yml


Ansible Client
------------------

Please use the following file structure:

.. code-block:: bash

    ansible-client
    |   ansible.cfg
    |   hosts
    |
    +---group_vars
    |       all.yml
    |
    +---playbooks
    |       clean.yaml
    |       setup.yaml
    |
    \---roles
        +---clean-collectd
        |   \---tasks
        |           main.yml
        |
        \---collectd
            +---files
            |       collectd.conf.j2
            |
            \---tasks
                    main.yml


Summary of Roles
==================

A brief description of the Ansible playbook roles used to deploy
the monitoring cluster.

Ansible Server Roles
----------------------

These roles deploy the Prometheus-Alertmanager-Grafana (PAG) stack
on the server side.

Role: Monitoring
~~~~~~~~~~~~~~~~~~

Deploys and configures the PAG stack along with collectd-exporter,
cAdvisor and node-exporter.

Role: Clean-Monitoring
~~~~~~~~~~~~~~~~~~~~~~~~

Removes all the components deployed by the Monitoring role.
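As an illustration of what such a cleanup can look like, the minimal sketch
below deletes the ``monitoring`` namespace with Ansible's ``k8s`` module.
It is a hypothetical example, not the actual contents of
``clean-monitoring/tasks/main.yml``:

.. code-block:: yaml

    # Hypothetical sketch only -- the real clean-monitoring/tasks/main.yml may differ.
    # Deleting the namespace removes every monitoring resource deployed into it.
    - name: Remove the monitoring namespace and all resources in it
      k8s:
        state: absent
        api_version: v1
        kind: Namespace
        name: monitoring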


File-Task Mapping and Configurable Parameters
================================================

Ansible Server
----------------

Role: Monitoring
~~~~~~~~~~~~~~~~~~~

Alert Manager
^^^^^^^^^^^^^^^

File: alertmanager-config.yaml
''''''''''''''''''''''''''''''''
Path: monitoring/files/alertmanager/alertmanager-config.yaml

Task: Configures receivers for Alertmanager

Summary: A ConfigMap. It currently configures a webhook receiver for
Alertmanager and can be used to configure any kind of receiver.

Configurable Parameters:
    receiver.url: Change to the webhook receiver's URL
    route: Can be used to add receivers


File: alertmanager-deployment.yaml
''''''''''''''''''''''''''''''''''''
Path: monitoring/files/alertmanager/alertmanager-deployment.yaml

Task: Deploys an Alertmanager instance

Summary: A Deployment that runs one replica of Alertmanager.


File: alertmanager-service.yaml
'''''''''''''''''''''''''''''''''
Path: monitoring/files/alertmanager/alertmanager-service.yaml

Task: Creates a K8s service for Alertmanager

Summary: A NodePort service, so that users can create "silences" and
view the status of alerts from the native Alertmanager dashboard / UI.

Configurable Parameters:
    spec.type: Options: NodePort, ClusterIP, LoadBalancer
    spec.ports: Edit / add ports to be handled by the service

**Note: alertmanager1-deployment and alertmanager1-service are the same as
alertmanager-deployment and alertmanager-service respectively.**

cAdvisor
^^^^^^^^^^^

File: cadvisor-daemonset.yaml
''''''''''''''''''''''''''''''''
Path: monitoring/files/cadvisor/cadvisor-daemonset.yaml

Task: To create a cAdvisor DaemonSet

Summary: A DaemonSet used to scrape container data of the Kubernetes cluster
itself; because it is a DaemonSet, one instance runs on every node.

Configurable Parameters:
    spec.template.spec.ports: Port of the container


File: cadvisor-service.yaml
''''''''''''''''''''''''''''''
Path: monitoring/files/cadvisor/cadvisor-service.yaml

Task: To create a cAdvisor service

Summary: A ClusterIP service for cAdvisor to communicate with Prometheus.

Configurable Parameters:
    spec.ports: Add / edit ports


Collectd Exporter
^^^^^^^^^^^^^^^^^^^^

File: collectd-exporter-deployment.yaml
''''''''''''''''''''''''''''''''''''''''''
Path: monitoring/files/collectd-exporter/collectd-exporter-deployment.yaml

Task: To create a collectd-exporter instance

Summary: A Deployment that acts as the receiver for collectd data sent by the
client machines; Prometheus pulls data from this exporter.

Configurable Parameters:
    spec.template.spec.ports: Port of the container


File: collectd-exporter-service.yaml
''''''''''''''''''''''''''''''''''''''
Path: monitoring/files/collectd-exporter/collectd-exporter-service.yaml

Task: To create a collectd-exporter service

Summary: A NodePort service that exposes the collectd-exporter data for
Prometheus to scrape.

Configurable Parameters:
    spec.ports: Add / edit ports


Grafana
^^^^^^^^^

File: grafana-datasource-config.yaml
''''''''''''''''''''''''''''''''''''''
Path: monitoring/files/grafana/grafana-datasource-config.yaml

Task: To create the config file for Grafana

Summary: A ConfigMap that adds the Prometheus datasource to Grafana.


File: grafana-deployment.yaml
''''''''''''''''''''''''''''''''
Path: monitoring/files/grafana/grafana-deployment.yaml

Task: To create a Grafana deployment

Summary: The Grafana Deployment creates a single replica of Grafana
with the Prometheus datasource preconfigured.

Configurable Parameters:
    spec.template.spec.ports: Edit ports
    spec.template.spec.env: Add / edit environment variables
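For illustration, the ``spec.template.spec.env`` list mentioned above is where
Grafana's standard ``GF_*`` environment variables would go. The fragment below
is a hypothetical sketch, not the literal contents of ``grafana-deployment.yaml``:

.. code-block:: yaml

    # Hypothetical fragment of a Grafana Deployment spec -- adjust to the real file.
    spec:
      template:
        spec:
          containers:
          - name: grafana
            image: grafana/grafana
            ports:
            - containerPort: 3000   # default Grafana HTTP port
            env:
            - name: GF_SECURITY_ADMIN_USER
              value: admin
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: changeme       # override the default admin password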


File: grafana-pv.yaml
''''''''''''''''''''''''
Path: monitoring/files/grafana/grafana-pv.yaml

Task: To create a persistent volume for Grafana

Summary: A persistent volume for Grafana.

Configurable Parameters:
    spec.capacity.storage: Increase / decrease size
    spec.accessModes: To change the way the PV is accessed
    spec.nfs.server: To change the IP address of the NFS server
    spec.nfs.path: To change the exported path on the server


File: grafana-pvc.yaml
'''''''''''''''''''''''''
Path: monitoring/files/grafana/grafana-pvc.yaml

Task: To create a persistent volume claim for Grafana

Summary: A persistent volume claim for Grafana.

Configurable Parameters:
    spec.resources.requests.storage: Increase / decrease size


File: grafana-service.yaml
''''''''''''''''''''''''''''
Path: monitoring/files/grafana/grafana-service.yaml

Task: To create a service for Grafana

Summary: A NodePort service that users connect to in order to view the
Grafana dashboard / UI.

Configurable Parameters:
    spec.type: Options: NodePort, ClusterIP, LoadBalancer
    spec.ports: Edit / add ports to be handled by the service


Kube State Metrics
^^^^^^^^^^^^^^^^^^^^

File: kube-state-metrics-deployment.yaml
''''''''''''''''''''''''''''''''''''''''''
Path: monitoring/files/kube-state-metrics/kube-state-metrics-deployment.yaml

Task: To create a kube-state-metrics instance

Summary: A Deployment used to collect metrics of the Kubernetes cluster itself.

Configurable Parameters:
    spec.template.spec.containers.ports: Port of the container


File: kube-state-metrics-service.yaml
'''''''''''''''''''''''''''''''''''''''
Path: monitoring/files/kube-state-metrics/kube-state-metrics-service.yaml

Task: To create a kube-state-metrics service

Summary: A service that exposes the kube-state-metrics data for Prometheus
to scrape.

Configurable Parameters:
    spec.ports: Add / edit ports


Node Exporter
^^^^^^^^^^^^^^^

File: node-exporter-daemonset.yaml
''''''''''''''''''''''''''''''''''''
Path: monitoring/files/node-exporter/node-exporter-daemonset.yaml

Task: To create a node-exporter DaemonSet

Summary: A DaemonSet used to scrape data of the host machines / nodes;
because it is a DaemonSet, one instance runs on every node.

Configurable Parameters:
    spec.template.spec.ports: Port of the container


File: node-exporter-service.yaml
''''''''''''''''''''''''''''''''''
Path: monitoring/files/node-exporter/node-exporter-service.yaml

Task: To create a node-exporter service

Summary: A ClusterIP service for node-exporter to communicate with Prometheus.

Configurable Parameters:
    spec.ports: Add / edit ports


Prometheus
^^^^^^^^^^^^^

File: prometheus-config.yaml
'''''''''''''''''''''''''''''''
Path: monitoring/files/prometheus/prometheus-config.yaml

Task: To create the config file for Prometheus

Summary: A ConfigMap that adds the alert rules.

Configurable Parameters:
    data.alert.rules: Add / edit alert rules


File: prometheus-deployment.yaml
''''''''''''''''''''''''''''''''''
Path: monitoring/files/prometheus/prometheus-deployment.yaml

Task: To create a Prometheus deployment

Summary: The Prometheus Deployment creates a single replica of Prometheus,
preconfigured via the prometheus-config ConfigMap.

Configurable Parameters:
    spec.template.spec.affinity: Node / pod affinity; ensures that only one
                                 instance of Prometheus runs on each node

    spec.template.spec.ports: Add / edit container port
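For reference, an anti-affinity rule of the kind described above typically
looks like the sketch below. The field names follow the standard Kubernetes
API, and the ``app: prometheus`` label is an assumption; the actual
``prometheus-deployment.yaml`` may differ:

.. code-block:: yaml

    # Illustrative anti-affinity fragment: prevents two Prometheus pods
    # (assumed label app: prometheus) from being scheduled on the same node.
    spec:
      template:
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchLabels:
                    app: prometheus
                topologyKey: kubernetes.io/hostname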


File: prometheus-pv.yaml
''''''''''''''''''''''''''
Path: monitoring/files/prometheus/prometheus-pv.yaml

Task: To create a persistent volume for Prometheus

Summary: A persistent volume for Prometheus.

Configurable Parameters:
    spec.capacity.storage: Increase / decrease size
    spec.accessModes: To change the way the PV is accessed
    spec.hostPath.path: To change the path of the volume


File: prometheus-pvc.yaml
'''''''''''''''''''''''''''
Path: monitoring/files/prometheus/prometheus-pvc.yaml

Task: To create a persistent volume claim for Prometheus

Summary: A persistent volume claim for Prometheus.

Configurable Parameters:
    spec.resources.requests.storage: Increase / decrease size


File: prometheus-service.yaml
'''''''''''''''''''''''''''''''
Path: monitoring/files/prometheus/prometheus-service.yaml

Task: To create a service for Prometheus

Summary: A NodePort service; the native Prometheus dashboard is available here.

Configurable Parameters:
    spec.type: Options: NodePort, ClusterIP, LoadBalancer
    spec.ports: Edit / add ports to be handled by the service


File: main-prometheus-service.yaml
''''''''''''''''''''''''''''''''''''
Path: monitoring/files/prometheus/main-prometheus-service.yaml

Task: A service that connects both Prometheus instances.

Summary: A NodePort service through which other services reach the Prometheus
cluster, since HA Prometheus requires two independent Prometheus instances
scraping the same inputs with the same configuration.

**Note: prometheus1-deployment and prometheus1-service are the same as
prometheus-deployment and prometheus-service respectively.**


Ansible Client Roles
----------------------

Role: Collectd
~~~~~~~~~~~~~~~~~~

File: main.yml
^^^^^^^^^^^^^^^^
Path: collectd/tasks/main.yml

Task: Installs collectd along with its prerequisites

Associated template file:

- collectd.conf.j2 (Path: collectd/files/collectd.conf.j2)

Summary: Edit this template to change the default configuration that is
installed on the client machine.
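As a rough sketch of what the tasks in this role might look like (hypothetical,
assuming a yum-based client; the real ``collectd/tasks/main.yml`` may differ):

.. code-block:: yaml

    # Hypothetical sketch of collectd/tasks/main.yml
    - name: Install collectd
      yum:
        name: collectd
        state: present

    - name: Render collectd configuration from the Jinja2 template
      template:
        src: collectd.conf.j2
        dest: /etc/collectd.conf

    - name: Enable and restart collectd
      service:
        name: collectd
        state: restarted
        enabled: true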
diff --git a/docs/lma/metrics/images/dataflow.png b/docs/lma/metrics/images/dataflow.png
new file mode 100644
index 00000000..ca1ec908
--- /dev/null
+++ b/docs/lma/metrics/images/dataflow.png
Binary files differ
diff --git a/docs/lma/metrics/images/setup.png b/docs/lma/metrics/images/setup.png
new file mode 100644
index 00000000..ce6a1274
--- /dev/null
+++ b/docs/lma/metrics/images/setup.png
Binary files differ
diff --git a/docs/lma/metrics/userguide.rst b/docs/lma/metrics/userguide.rst
new file mode 100644
index 00000000..0ee4a238
--- /dev/null
+++ b/docs/lma/metrics/userguide.rst
@@ -0,0 +1,230 @@

=================
Metrics
=================

Table of Contents
=================
.. contents::
.. section-numbering::

Setup
=======

Prerequisites
-------------------------
- 3 VMs are required to set up K8s
- ``$ sudo yum install ansible``
- ``$ pip install openshift pyyaml kubernetes`` (required for the Ansible K8s module)
- Update IPs in all these files (if changed):
    - ``ansible-server/group_vars/all.yml`` (IP of apiserver and hostname)
    - ``ansible-server/hosts`` (IP of VMs to install on)
    - ``ansible-server/roles/monitoring/files/grafana/grafana-pv.yaml`` (IP of NFS server)
    - ``ansible-server/roles/monitoring/files/alertmanager/alertmanager-config.yaml`` (IP of alert receiver)

Setup Structure
---------------
.. image:: images/setup.png

Installation - Client Side
----------------------------

Nodes
`````
- **Node1** = 10.10.120.21
- **Node4** = 10.10.120.24

How is the installation done?
`````````````````````````````
An Ansible playbook is available in the ``tools/lma/ansible-client`` folder:

- ``cd tools/lma/ansible-client``
- ``ansible-playbook setup.yaml``

This deploys collectd and configures it to send data to the collectd-exporter
at 10.10.120.211 (the IP address of the current collectd-exporter instance).
Please make the appropriate changes in the config file present in
``tools/lma/ansible-client/roles/collectd/files/``.

Installation - Server Side
----------------------------

Nodes
``````

Inside Jumphost - POD12:
    - **VM1** = 10.10.120.211
    - **VM2** = 10.10.120.203
    - **VM3** = 10.10.120.204


How is the installation done?
`````````````````````````````
**Using Ansible:**
    - **K8s**
    - **Prometheus:** 2 independent deployments
    - **Alertmanager:** 2 independent deployments (cluster peers)
    - **Grafana:** 1-replica deployment
    - **cAdvisor:** 1 daemonset, i.e. 3 replicas, one on each node
    - **collectd-exporter:** 1 replica
    - **node-exporter:** 1 statefulset with 3 replicas
    - **kube-state-metrics:** 1 deployment
    - **NFS server:** on each VM, storing Grafana data at the following path
        - ``/usr/share/monitoring_data/grafana``

How to set up?
``````````````
- **To set up the K8s cluster, EFK and PAG:** run the Ansible playbook ``ansible/playbooks/setup.yaml``
- **To clean everything:** run the Ansible playbook ``ansible/playbooks/clean.yaml``

Do we have HA?
````````````````
Yes

Configuration
=============

K8s
---
Path to all yamls (server side)
````````````````````````````````
``tools/lma/ansible-server/roles/monitoring/files/``

K8s namespace
`````````````
``monitoring``

Configuration
---------------------------

Services and Ports
``````````````````````````

Services and their ports are listed below; you can reach any node's IP on the
following ports and the service will redirect you correctly.


    ====================== =======
    Service                Port
    ====================== =======
    Prometheus             30900
    Prometheus1            30901
    Main-Prometheus        30902
    Alertmanager           30930
    Alertmanager1          30931
    Grafana                30000
    Collectd-exporter      30130
    ====================== =======

How to change the configuration?
----------------------------------
- Ports, container names and nearly every other configuration value can be modified by changing the required values in the respective yaml files (``/tools/lma/ansible-server/roles/monitoring/``). An illustrative fragment is shown after this list.
- For metrics, on the client's machine, edit collectd's configuration (Jinja2 template) file and add the required plugins (``/tools/lma/ansible-client/roles/collectd/files/collectd.conf.j2``).
  For more details, refer to `this <https://collectd.org/wiki/index.php/First_steps>`_.
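For example, to expose Grafana on a different external port you would change
the ``nodePort`` value in ``grafana-service.yaml``. The fragment below is an
illustrative sketch of the relevant part of a standard Kubernetes Service
manifest (the port values are assumptions based on the table above; the real
file may differ):

.. code-block:: yaml

    # Illustrative Service fragment -- the real grafana-service.yaml may differ.
    spec:
      type: NodePort
      ports:
      - port: 3000          # port exposed inside the cluster
        targetPort: 3000    # Grafana container port
        nodePort: 30000     # external port listed in the table above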
Where to send metrics?
------------------------

Metrics are sent to the collectd-exporter.
UDP packets are sent to port 38026
(this can be configured and checked in
``tools/lma/ansible-server/roles/monitoring/files/collectd-exporter/collectd-exporter-deployment.yaml``).

Data Management
================================

Data Flow
--------------
.. image:: images/dataflow.png

Where is the data stored now?
----------------------------------
    - Grafana data (including dashboards) ==> on the master, at ``/usr/share/monitoring_data/grafana`` (accessed through a persistent volume via NFS)
    - Prometheus data ==> on VM2 and VM3, at ``/usr/share/monitoring_data/prometheus``

    **Note: The data of the two Prometheus instances is also independent; a shared-data solution gave errors.**

Do we have backup of data?
-------------------------------
    The two Prometheus instances, although independent, scrape the same targets
    and have the same alert rules, so they generate very similar data.

    Grafana's NFS data has no backup, but the dashboards' JSON files are
    available in the ``/tools/lma/metrics/dashboards`` directory.

Is the data still accessible when containers are restarted?
-------------------------------------------------------------
    Yes, unless the data directories (``/usr/share/monitoring_data/*``) are deleted from each node.

Alert Management
==================

Configure the alert receiver
------------------------------
- Go to the file ``/tools/lma/ansible-server/roles/monitoring/files/alertmanager/alertmanager-config.yaml``
- In the config.yml section, under receivers, add, update or delete receivers.
- Currently the IP of the unified alert receiver is used.
- Alertmanager supports multiple types of receivers; you can get a `list here <https://prometheus.io/docs/alerting/latest/configuration/>`_

Add new alerts
--------------------------------------
- Go to the file ``/tools/lma/ansible-server/roles/monitoring/files/prometheus/prometheus-config.yaml``
- Under the data section, the alert.rules file is mounted into the ConfigMap.
- In this file, alerts are divided into 4 groups, namely:
    - targets
    - host and hardware
    - container
    - kubernetes
- Add alerts under an existing group or add a new group; please follow the structure of the file when adding a new group.
- To add a new alert, use the following structure (an example rule is shown after this list):

      alert: alertname

      expr: alert rule (generally a PromQL conditional query)

      for: time range (e.g. 5m, 10s, etc.; the amount of time the condition needs to be true for the alert to be triggered)

      labels:

        severity: critical (other severity options and other labels can be added here)

        type: hardware

      annotations:

        summary: <summary of the alert>

        description: <describe the alert here>

- For an exhaustive list of alerts, have a look `here <https://awesome-prometheus-alerts.grep.to/>`_
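Below is an illustrative rule following that structure. It is a sketch only:
the alert name is made up, the expression assumes node-exporter's standard
``node_cpu_seconds_total`` metric, and the Prometheus 2.x rule-group format is
assumed; adapt it to the metrics you actually collect.

.. code-block:: yaml

    # Hypothetical alert rule -- adapt the expression and labels to your setup.
    groups:
    - name: host and hardware
      rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: critical
          type: hardware
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU load has been above 80% for more than 5 minutes."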
Troubleshooting
===============

No metrics received in the Grafana plot
---------------------------------------------
- Check whether all configurations have been done correctly.
- Go to Main-Prometheus's port on any one VM's IP and check whether Prometheus is receiving the metrics.
- If Prometheus is receiving them, read Grafana's logs (``kubectl -n monitoring logs <name_of_grafana_pod>``).
- Otherwise, have a look at collectd-exporter's metrics endpoint (e.g. 10.10.120.211:30103/metrics).
- If collectd-exporter is receiving them, check in Prometheus's config file whether collectd-exporter's IP is correct there.
- Otherwise, SSH to the master and check which node collectd-exporter is scheduled on (let's say VM2).
- Now SSH to VM2.
- Run ``tcpdump -i ens3 > testdump`` (``ens3`` being the interface used to connect to the internet).
- Grep for your client node's IP and check whether packets are reaching the monitoring cluster (``cat testdump | grep <ip of client>``).
- Ideally you should see packets reaching the node; if so, check whether collectd-exporter is running correctly and read its logs.
- If no packets are received, the error is on the client side: check collectd's config file and make sure the correct collectd-exporter IP is used in the ``<network>`` section.

If no notification is received
--------------------------------
- Go to Main-Prometheus's port on any one VM's IP (e.g. 10.10.120.211:30902) and check whether Prometheus is receiving the metrics.
- If not, read the "No metrics received in the Grafana plot" section; otherwise read ahead.
- Check the IP of the alert receiver; you can do this by going to <alertmanager-ip>:<port> and checking whether Alertmanager is configured correctly.
- If it is, paste the alert rule into Prometheus's query box and see whether any metric satisfies the condition.
- If there was a bug in the alert's rule, you may need to change it in the alert.rules section of prometheus-config.yaml (please read the "Add new alerts" section for detailed instructions).

Reference
=========
- `Prometheus K8S deployment <https://www.metricfire.com/blog/how-to-deploy-prometheus-on-kubernetes/>`_
- `HA Prometheus <https://prometheus.io/docs/introduction/faq/#can-prometheus-be-made-highly-available>`_
- `Data Flow Diagram <https://drive.google.com/file/d/1D--LXFqU_H-fqpD57H3lJFOqcqWHoF0U/view?usp=sharing>`_
- `Collectd Configuration <https://docs.opnfv.org/en/stable-fraser/submodules/barometer/docs/release/userguide/docker.userguide.html#build-the-collectd-docker-image>`_
- `Alertmanager Rule Config <https://awesome-prometheus-alerts.grep.to/>`_