Diffstat (limited to 'docs/lma')
-rw-r--r-- docs/lma/devguide.rst 147
-rw-r--r-- docs/lma/logs/images/elasticsearch.png bin 0 -> 36046 bytes
-rw-r--r-- docs/lma/logs/images/fluentd-cs.png bin 0 -> 40226 bytes
-rw-r--r-- docs/lma/logs/images/fluentd-ss.png bin 0 -> 18331 bytes
-rw-r--r-- docs/lma/logs/images/nginx.png bin 0 -> 36737 bytes
-rw-r--r-- docs/lma/logs/images/setup.png bin 0 -> 43503 bytes
-rw-r--r-- docs/lma/logs/userguide.rst 348
-rw-r--r-- docs/lma/metrics/devguide.rst 474
-rw-r--r-- docs/lma/metrics/images/dataflow.png bin 0 -> 42443 bytes
-rw-r--r-- docs/lma/metrics/images/setup.png bin 0 -> 15019 bytes
-rw-r--r-- docs/lma/metrics/userguide.rst 230
11 files changed, 1199 insertions, 0 deletions
diff --git a/docs/lma/devguide.rst b/docs/lma/devguide.rst
new file mode 100644
index 00000000..c72b8b12
--- /dev/null
+++ b/docs/lma/devguide.rst
@@ -0,0 +1,147 @@
+=================
+Table of Contents
+=================
+.. contents::
+.. section-numbering::
+
+Ansible Client-side
+====================
+
+Ansible File Organisation
+--------------------------
+Files Structure::
+
+ ansible-client
+ ├── ansible.cfg
+ ├── hosts
+ ├── playbooks
+ │   └── setup.yaml
+ └── roles
+     ├── clean-td-agent
+     │   └── tasks
+     │       └── main.yml
+     └── td-agent
+         ├── files
+         │   └── td-agent.conf
+         └── tasks
+             └── main.yml
+
+Summary of roles
+-----------------
+======================  ============================================
+Roles                   Description
+======================  ============================================
+``td-agent``            Install Td-agent & change configuration file
+``clean-td-agent``      Uninstall Td-agent
+======================  ============================================
+
+Configurable Parameters
+------------------------
+==================================  ==========  ======================
+File (ansible-client/roles/)        Parameter   Description
+==================================  ==========  ======================
+``td-agent/files/td-agent.conf``    host        Fluentd-server IP
+``td-agent/files/td-agent.conf``    port        Fluentd-server port
+==================================  ==========  ======================
+
+Ansible Server-side
+====================
+
+Ansible File Organisation
+--------------------------
+Files Structure::
+
+ ansible-server
+ ├── ansible.cfg
+ ├── group_vars
+ │   └── all.yml
+ ├── hosts
+ ├── playbooks
+ │   └── setup.yaml
+ └── roles
+     ├── clean-logging
+     │   └── tasks
+     │       └── main.yml
+     ├── k8s-master
+     │   └── tasks
+     │       └── main.yml
+     ├── k8s-pre
+     │   └── tasks
+     │       └── main.yml
+     ├── k8s-worker
+     │   └── tasks
+     │       └── main.yml
+     ├── logging
+     │   ├── files
+     │   │   ├── elastalert
+     │   │   │   ├── ealert-conf-cm.yaml
+     │   │   │   ├── ealert-key-cm.yaml
+     │   │   │   ├── ealert-rule-cm.yaml
+     │   │   │   └── elastalert.yaml
+     │   │   ├── elasticsearch
+     │   │   │   ├── elasticsearch.yaml
+     │   │   │   └── user-secret.yaml
+     │   │   ├── fluentd
+     │   │   │   ├── fluent-cm.yaml
+     │   │   │   ├── fluent-service.yaml
+     │   │   │   └── fluent.yaml
+     │   │   ├── kibana
+     │   │   │   └── kibana.yaml
+     │   │   ├── namespace.yaml
+     │   │   ├── nginx
+     │   │   │   ├── nginx-conf-cm.yaml
+     │   │   │   ├── nginx-key-cm.yaml
+     │   │   │   ├── nginx-service.yaml
+     │   │   │   └── nginx.yaml
+     │   │   ├── persistentVolume.yaml
+     │   │   └── storageClass.yaml
+     │   └── tasks
+     │       └── main.yml
+     └── nfs
+         └── tasks
+             └── main.yml
+
+Summary of roles
+-----------------
+======================  ======================
+Roles                   Description
+======================  ======================
+``k8s-pre``             Prerequisites for installing K8s, such as installing docker & K8s, disabling swap etc.
+``k8s-master``          Reset K8s & make a master
+``k8s-worker``          Join worker nodes with token
+``logging``             EFK & elastalert setup in K8s
+``clean-logging``       Remove EFK & elastalert setup from K8s
+``nfs``                 Start an NFS server to store Elasticsearch data
+======================  ======================
+
+Configurable Parameters
+------------------------
+====================================================  ===========================================  ======================
+File (ansible-server/roles/)                          Parameter name                               Description
+====================================================  ===========================================  ======================
+**Role: logging**
+``logging/files/persistentVolume.yaml``               storage                                      Increase or decrease the storage size of the Persistent Volume for each VM
+``logging/files/kibana/kibana.yaml``                  version                                      Change the Kibana version
+``logging/files/kibana/kibana.yaml``                  count                                        Increase or decrease the number of replicas
+``logging/files/elasticsearch/elasticsearch.yaml``    version                                      Change the Elasticsearch version
+``logging/files/elasticsearch/elasticsearch.yaml``    nodePort                                     Change the service port
+``logging/files/elasticsearch/elasticsearch.yaml``    storage                                      Increase or decrease the storage size of Elasticsearch data for each VM
+``logging/files/elasticsearch/elasticsearch.yaml``    nodeAffinity -> values (hostname)            VM on which the Elasticsearch master or data pod runs (change the hostname to run the pod on a specific node)
+``logging/files/elasticsearch/user-secret.yaml``      stringData                                   Add an Elasticsearch user & its roles (`Elastic Docs <https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-users-and-roles.html#k8s_file_realm>`_)
+``logging/files/fluentd/fluent.yaml``                 replicas                                     Increase or decrease the number of replicas
+``logging/files/fluentd/fluent-service.yaml``         nodePort                                     Change the service port
+``logging/files/fluentd/fluent-cm.yaml``              index_template.json -> number_of_replicas    Increase or decrease the number of replicas of data in Elasticsearch
+``logging/files/fluentd/fluent-cm.yaml``              fluent.conf                                  Server port & other Fluentd configuration
+``logging/files/nginx/nginx.yaml``                    replicas                                     Increase or decrease the number of replicas
+``logging/files/nginx/nginx-service.yaml``            nodePort                                     Change the service port
+``logging/files/nginx/nginx-key-cm.yaml``             kibana-access.key, kibana-access.pem         Key file for HTTPS connection
+``logging/files/nginx/nginx-conf-cm.yaml``            -                                            Nginx configuration
+``logging/files/elastalert/elastalert.yaml``          replicas                                     Increase or decrease the number of replicas
+``logging/files/elastalert/ealert-key-cm.yaml``       elastalert.key, elastalert.pem               Key file for HTTPS connection
+``logging/files/elastalert/ealert-conf-cm.yaml``      run_every                                    How often ElastAlert will query Elasticsearch
+``logging/files/elastalert/ealert-conf-cm.yaml``      alert_time_limit                             If an alert fails for some reason, ElastAlert will retry sending the alert until this time period has elapsed
+``logging/files/elastalert/ealert-conf-cm.yaml``      es_host, es_port                             Elasticsearch service name & port in K8s
+``logging/files/elastalert/ealert-rule-cm.yaml``      http_post_url                                Alert receiver IP (`Elastalert Rule Config <https://elastalert.readthedocs.io/en/latest/ruletypes.html>`_)
+**Role: nfs**
+``nfs/tasks/main.yml``                                line                                         Path of NFS storage
+====================================================  ===========================================  ======================
diff --git a/docs/lma/logs/images/elasticsearch.png b/docs/lma/logs/images/elasticsearch.png
new file mode 100644
index 00000000..f0b876f5
--- /dev/null
+++ b/docs/lma/logs/images/elasticsearch.png
Binary files differ
diff --git a/docs/lma/logs/images/fluentd-cs.png b/docs/lma/logs/images/fluentd-cs.png
new file mode 100644
index 00000000..513bb3ef
--- /dev/null
+++ b/docs/lma/logs/images/fluentd-cs.png
Binary files differ
diff --git a/docs/lma/logs/images/fluentd-ss.png b/docs/lma/logs/images/fluentd-ss.png
new file mode 100644
index 00000000..4e9ab112
--- /dev/null
+++ b/docs/lma/logs/images/fluentd-ss.png
Binary files differ
diff --git a/docs/lma/logs/images/nginx.png b/docs/lma/logs/images/nginx.png
new file mode 100644
index 00000000..a0b00514
--- /dev/null
+++ b/docs/lma/logs/images/nginx.png
Binary files differ
diff --git a/docs/lma/logs/images/setup.png b/docs/lma/logs/images/setup.png
new file mode 100644
index 00000000..267685fa
--- /dev/null
+++ b/docs/lma/logs/images/setup.png
Binary files differ
diff --git a/docs/lma/logs/userguide.rst b/docs/lma/logs/userguide.rst
new file mode 100644
index 00000000..b410ee6c
--- /dev/null
+++ b/docs/lma/logs/userguide.rst
@@ -0,0 +1,348 @@
+=================
+Table of Contents
+=================
+.. contents::
+.. section-numbering::
+
+Setup
+======
+
+Prerequisites
+-------------------------
+- 3 VMs are required to set up K8s
+- ``$ sudo yum install ansible``
+- ``$ pip install openshift pyyaml kubernetes`` (required for ansible K8s module)
+- Update IPs in all these files (if changed)
+
+  ======================================================================  ======================
+  Path                                                                    Description
+  ======================================================================  ======================
+  ``ansible-server/group_vars/all.yml``                                   IP of K8s apiserver and VM hostname
+  ``ansible-server/hosts``                                                IP of VMs to install
+  ``ansible-server/roles/logging/files/persistentVolume.yaml``            IP of NFS-Server
+  ``ansible-server/roles/logging/files/elastalert/ealert-rule-cm.yaml``   IP of alert-receiver
+  ======================================================================  ======================
+
+Architecture
+--------------
+.. image:: images/setup.png
+
+Installation - Clientside
+-------------------------
+
+Nodes
+`````
+- **Node1** = 10.10.120.21
+- **Node4** = 10.10.120.24
+
+How is installation done?
+`````````````````````````
+- TD-agent installation
+ ``$ curl -L https://toolbelt.treasuredata.com/sh/install-redhat-td-agent3.sh | sh``
+- Copy the TD-agent config file in **Node1**
+ ``$ cp tdagent-client-config/node1.conf /etc/td-agent/td-agent.conf``
+- Copy the TD-agent config file in **Node4**
+ ``$ cp tdagent-client-config/node4.conf /etc/td-agent/td-agent.conf``
+- Restart the service
+ ``$ sudo service td-agent restart``
+
+Installation - Serverside
+-------------------------
+
+Nodes
+`````
+Inside Jumphost - POD12
+ - **VM1** = 10.10.120.211
+ - **VM2** = 10.10.120.203
+ - **VM3** = 10.10.120.204
+
+
+How is installation done?
+`````````````````````````
+**Using Ansible:**
+ - **K8s**
+ - **Elasticsearch:** 1 Master & 1 Data node on each VM
+ - **Kibana:** 1 Replica
+ - **Nginx:** 2 Replicas
+ - **Fluentd:** 2 Replicas
+ - **Elastalert:** 1 Replica (running more than one replica produces duplicate alerts)
+ - **NFS Server:** on each VM, to store Elasticsearch data at the following paths
+ - ``/srv/nfs/master``
+ - ``/srv/nfs/data``
+
+How to set up?
+``````````````
+- **To setup K8s cluster and EFK:** Run the ansible-playbook ``ansible/playbooks/setup.yaml``
+- **To clean everything:** Run the ansible-playbook ``ansible/playbooks/clean.yaml``
+
+Do we have HA?
+````````````````
+Yes
+
+Configuration
+=============
+
+K8s
+---
+Path of all yamls (Serverside)
+````````````````````````````````
+``ansible-server/roles/logging/files/``
+
+K8s namespace
+`````````````
+``logging``
+
+K8s Service details
+````````````````````
+``$ kubectl get svc -n logging``
+
+Elasticsearch Configuration
+---------------------------
+
+Elasticsearch Setup Structure
+`````````````````````````````
+.. image:: images/elasticsearch.png
+
+Elasticsearch service details
+`````````````````````````````
+| **Service Name:** ``logging-es-http``
+| **Service Port:** ``9200``
+| **Service Type:** ``ClusterIP``
+
+How to get elasticsearch default username & password?
+`````````````````````````````````````````````````````
+- User1 (custom user):
+ | **Username:** ``elasticsearch``
+ | **Password:** ``password123``
+- User2 (by default created by Elastic Operator):
+ | **Username:** ``elastic``
+ | To get default password:
+ | ``$ PASSWORD=$(kubectl get secret -n logging logging-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')``
+ | ``$ echo $PASSWORD``
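The `base64decode` call in the go-template is needed because Kubernetes stores the values under a Secret's `data` field base64-encoded. A minimal Python sketch of that decoding step, assuming the encoded value has already been fetched by some other means; `decode_secret_value` and the sample value are hypothetical and not part of this setup:

```python
import base64


def decode_secret_value(encoded: str) -> str:
    """Decode one base64-encoded value taken from a Kubernetes Secret's `data` map."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    # Made-up sample value for illustration only, not a real password.
    sample = base64.b64encode(b"example-password").decode("ascii")
    print(decode_secret_value(sample))
```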
+
+How to increase replica of any index?
+````````````````````````````````````````
+| $ curl -k -u "elasticsearch:password123" -H 'Content-Type: application/json' -XPUT "https://10.10.120.211:9200/indexname*/_settings" -d '
+| {
+| "index" : {
+| "number_of_replicas" : "2" }
+| }'
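The same settings update can be prepared from Python. This is a minimal sketch using only the standard library, assuming the URL, index pattern, and credentials from the curl example above; `build_replica_request` is a hypothetical helper, and it only builds the request without sending it:

```python
import base64
import json
import urllib.request


def build_replica_request(base_url, index_pattern, replicas, user, password):
    """Build (but do not send) the PUT request that sets number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    # HTTP Basic auth header, equivalent to curl's -u "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_settings",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    req = build_replica_request(
        "https://10.10.120.211:9200", "indexname*", 2, "elasticsearch", "password123"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Sending the request would be a call to `urllib.request.urlopen`; the `-k` flag in the curl command corresponds to disabling certificate verification, which is left out of this sketch.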
+
+Index Life
+```````````
+**30 Days**
+
+Kibana Configuration
+--------------------
+
+Kibana Service details
+````````````````````````
+| **Service Name:** ``logging-kb-http``
+| **Service Port:** ``5601``
+| **Service Type:** ``ClusterIP``
+
+Nginx Configuration
+--------------------
+IP
+````
+https://10.10.120.211:32000
+
+Nginx Setup Structure
+`````````````````````
+.. image:: images/nginx.png
+
+Nginx Service details
+`````````````````````
+| **Service Name:** ``nginx``
+| **Service Port:** ``32000``
+| **Service Type:** ``NodePort``
+
+Why is NGINX used?
+```````````````````
+`Securing ELK using Nginx <https://logz.io/blog/securing-elk-nginx/>`_
+
+Nginx Configuration
+````````````````````
+**Path:** ``ansible-server/roles/logging/files/nginx/nginx-conf-cm.yaml``
+
+Fluentd Configuration - Clientside (Td-agent)
+---------------------------------------------
+
+Fluentd Setup Structure
+````````````````````````
+.. image:: images/fluentd-cs.png
+
+Log collection paths
+`````````````````````
+- ``/tmp/result*/*.log``
+- ``/tmp/result*/*.dat``
+- ``/tmp/result*/*.csv``
+- ``/tmp/result*/stc-liveresults.dat.*``
+- ``/var/log/userspace*.log``
+- ``/var/log/sriovdp/*.log.*``
+- ``/var/log/pods/**/*.log``
+
+Logs are sent to
+````````````````
+Another Fluentd instance, running in the K8s cluster (K8s Master: 10.10.120.211) on the Jumphost.
+
+Td-agent logs
+`````````````
+Path of td-agent logs: ``/var/log/td-agent/td-agent.log``
+
+Td-agent configuration
+````````````````````````
+| Path of conf file: ``/etc/td-agent/td-agent.conf``
+| **If any change is made to td-agent.conf, restart the td-agent service:** ``$ sudo service td-agent restart``
+
+Config Description
+````````````````````
+- Get the logs from the collection paths
+- | Convert each log line to this format
+  | {
+  | msg: "log line"
+  | log_path: "/file/path"
+  | file: "file.name"
+  | host: "pod12-node4"
+  | }
+- Send it to the server-side Fluentd
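As a rough illustration of the record layout above, the sketch below wraps one log line in Python. It is not the td-agent configuration itself; `to_record` is a hypothetical helper, and it assumes `log_path` is the full path of the file and `file` is its base name:

```python
import posixpath


def to_record(line: str, path: str, host: str) -> dict:
    """Wrap one raw log line in the four-field record layout."""
    return {
        "msg": line.rstrip("\n"),          # the log line without its newline
        "log_path": path,                  # assumed: full path of the source file
        "file": posixpath.basename(path),  # assumed: file name without directories
        "host": host,                      # name of the node that produced the line
    }


if __name__ == "__main__":
    print(to_record("vswitchd started\n", "/tmp/result_1/vswitchd.log", "pod12-node4"))
```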
+
+Fluentd Configuration - Serverside
+----------------------------------
+
+Fluentd Setup Structure
+````````````````````````
+.. image:: images/fluentd-ss.png
+
+Fluentd Service details
+````````````````````````
+| **Service Name:** ``fluentd``
+| **Service Port:** ``32224``
+| **Service Type:** ``NodePort``
+
+Logs are sent to
+````````````````
+Elasticsearch service (https://logging-es-http:9200)
+
+Config Description
+````````````````````
+- **Step 1**
+
+  - Get the logs from Node1 & Node4
+
+- **Step 2**
+
+  ========================================  ======================
+  log_path                                  add tag (for routing)
+  ========================================  ======================
+  ``/tmp/result.*/.*errors.dat``            errordat.log
+  ``/tmp/result.*/.*counts.dat``            countdat.log
+  ``/tmp/result.*/stc-liveresults.dat.tx``  stcdattx.log
+  ``/tmp/result.*/stc-liveresults.dat.rx``  stcdatrx.log
+  ``/tmp/result.*/.*Statistics.csv``        ixia.log
+  ``/tmp/result.*/vsperf-overall*``         vsperf.log
+  ``/tmp/result.*/vswitchd*``               vswitchd.log
+  ``/var/log/userspace*``                   userspace.log
+  ``/var/log/sriovdp*``                     sriovdp.log
+  ``/var/log/pods*``                        pods.log
+  ========================================  ======================
+
+- **Step 3**
+
+  Then parse each type using tags.
+
+  - error.conf: to find any error
+  - time-series.conf: to parse time series data
+  - time-analysis.conf: to calculate time analysis
+
+- **Step 4**
+
+  ================================  ======================
+  host                              add tag (for routing)
+  ================================  ======================
+  ``pod12-node4``                   node4
+  ``worker``                        node1
+  ================================  ======================
+
+- **Step 5**
+
+  ================================  ======================
+  Tag                               elasticsearch
+  ================================  ======================
+  ``node4``                         index "node4*"
+  ``node1``                         index "node1*"
+  ================================  ======================
+
+Elastalert
+----------
+
+Send alert if
+``````````````
+- Blacklist
+ - "Failed to run test"
+ - "Failed to execute in '30' seconds"
+ - "('Result', 'Failed')"
+ - "could not open socket: connection refused"
+ - "Input/output error"
+ - "dpdk|ERR|EAL: Error - exiting with code: 1"
+ - "dpdk|ERR|EAL: Driver cannot attach the device"
+ - "dpdk|EMER|Cannot create lock on"
+ - "dpdk|ERR|VHOST_CONFIG: * device not found"
+- Time
+ - vswitch_duration > 3 sec
+
+How to configure alert?
+````````````````````````
+- Add your rule in ``ansible/roles/logging/files/elastalert/ealert-rule-cm.yaml`` (`Elastalert Rule Config <https://elastalert.readthedocs.io/en/latest/ruletypes.html>`_)
+ | name: anything
+ | type: <check-above-link> #The RuleType to use
+ | index: node4* #index name
+ | realert:
+ | minutes: 0 #to get alert for all cases after each interval
+ | alert: post #To send alert as HTTP POST
+ | http_post_url: "http://url"
+
+- Mount this file to elastalert pod in ``ansible/roles/logging/files/elastalert/elastalert.yaml``.
+
+Alert Format
+````````````
+{"type": "pattern-match", "label": "failed", "index": "node4-20200815", "log": "error-log-line", "log-path": "/tmp/result/file.log", "reson": "error-message" }
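A receiver for these alerts could validate the payload before acting on it. A minimal Python sketch, assuming the POST body is the JSON object shown above with exactly those key names (including `reson` as written); `parse_alert` is a hypothetical helper:

```python
import json

# Keys copied verbatim from the alert format above (including "reson").
ALERT_KEYS = ("type", "label", "index", "log", "log-path", "reson")


def parse_alert(body: bytes) -> dict:
    """Decode one alert POST body and keep only the documented keys."""
    payload = json.loads(body)
    missing = [key for key in ALERT_KEYS if key not in payload]
    if missing:
        raise ValueError(f"alert is missing keys: {missing}")
    return {key: payload[key] for key in ALERT_KEYS}


if __name__ == "__main__":
    sample = (
        b'{"type": "pattern-match", "label": "failed", "index": "node4-20200815", '
        b'"log": "error-log-line", "log-path": "/tmp/result/file.log", '
        b'"reson": "error-message"}'
    )
    print(parse_alert(sample)["index"])
```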
+
+Data Management
+===============
+
+Elasticsearch
+-------------
+
+Where is data stored now?
+`````````````````````````
+Data is stored on the NFS server with 1 replica of each index (default). The data paths are:
+ - ``/srv/nfs/data (VM1)``
+ - ``/srv/nfs/data (VM2)``
+ - ``/srv/nfs/data (VM3)``
+ - ``/srv/nfs/master (VM1)``
+ - ``/srv/nfs/master (VM2)``
+ - ``/srv/nfs/master (VM3)``
+
+Can the user change from NFS to local storage?
+``````````````````````````````````````````````````
+Yes, by configuring the persistent volume (``ansible-server/roles/logging/files/persistentVolume.yaml``).
+
+Do we have backup of data?
+````````````````````````````
+1 replica of each index
+
+Is the data still accessible after K8s restarts?
+`````````````````````````````````````````````````````
+Yes (if the data has not been deleted from /srv/nfs/data)
+
+Troubleshooting
+===============
+If no logs are received in Elasticsearch
+----------------------------------------
+- Check IP & port of server-fluentd in client config.
+- Check client-fluentd logs, ``$ sudo tail -f /var/log/td-agent/td-agent.log``
+- Check server-fluentd logs, ``$ sudo kubectl logs -n logging <fluentd-pod-name>``
+
+If no notification received
+---------------------------
+- Search your "log" in Elasticsearch.
+- Check config of elastalert
+- Check IP of alert-receiver
+
+Reference
+=========
+- `Elastic cloud on K8s <https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-quickstart.html>`_
+- `HA Elasticsearch on K8s <https://www.elastic.co/blog/high-availability-elasticsearch-on-kubernetes-with-eck-and-gke>`_
+- `Fluentd Configuration <https://docs.fluentd.org/configuration/config-file>`_
+- `Elastalert Rule Config <https://elastalert.readthedocs.io/en/latest/ruletypes.html>`_
\ No newline at end of file
diff --git a/docs/lma/metrics/devguide.rst b/docs/lma/metrics/devguide.rst
new file mode 100644
index 00000000..93d33016
--- /dev/null
+++ b/docs/lma/metrics/devguide.rst
@@ -0,0 +1,474 @@
+====================
+Metrics Dev Guide
+====================
+Table of Contents
+=================
+.. contents::
+.. section-numbering::
+
+
+Ansible File Organization
+============================
+
+Ansible-Server
+----------------
+
+Use the following file structure:
+
+.. code-block:: bash
+
+ ansible-server
+ |   ansible.cfg
+ |   hosts
+ |
+ +---group_vars
+ |       all.yml
+ |
+ +---playbooks
+ |       clean.yaml
+ |       setup.yaml
+ |
+ \---roles
+     +---clean-monitoring
+     |   \---tasks
+     |           main.yml
+     |
+     +---monitoring
+         +---files
+         |   |   monitoring-namespace.yaml
+         |   |
+         |   +---alertmanager
+         |   |       alertmanager-config.yaml
+         |   |       alertmanager-deployment.yaml
+         |   |       alertmanager-service.yaml
+         |   |       alertmanager1-deployment.yaml
+         |   |       alertmanager1-service.yaml
+         |   |
+         |   +---cadvisor
+         |   |       cadvisor-daemonset.yaml
+         |   |       cadvisor-service.yaml
+         |   |
+         |   +---collectd-exporter
+         |   |       collectd-exporter-deployment.yaml
+         |   |       collectd-exporter-service.yaml
+         |   |
+         |   +---grafana
+         |   |       grafana-datasource-config.yaml
+         |   |       grafana-deployment.yaml
+         |   |       grafana-pv.yaml
+         |   |       grafana-pvc.yaml
+         |   |       grafana-service.yaml
+         |   |
+         |   +---kube-state-metrics
+         |   |       kube-state-metrics-deployment.yaml
+         |   |       kube-state-metrics-service.yaml
+         |   |
+         |   +---node-exporter
+         |   |       nodeexporter-daemonset.yaml
+         |   |       nodeexporter-service.yaml
+         |   |
+         |   \---prometheus
+         |           main-prometheus-service.yaml
+         |           prometheus-config.yaml
+         |           prometheus-deployment.yaml
+         |           prometheus-pv.yaml
+         |           prometheus-pvc.yaml
+         |           prometheus-service.yaml
+         |           prometheus1-deployment.yaml
+         |           prometheus1-service.yaml
+         |
+         \---tasks
+                 main.yml
+
+
+Ansible-Client
+------------------
+
+Use the following file structure:
+
+.. code-block:: bash
+
+ ansible-client
+ |   ansible.cfg
+ |   hosts
+ |
+ +---group_vars
+ |       all.yml
+ |
+ +---playbooks
+ |       clean.yaml
+ |       setup.yaml
+ |
+ \---roles
+     +---clean-collectd
+     |   \---tasks
+     |           main.yml
+     |
+     +---collectd
+         +---files
+         |       collectd.conf.j2
+         |
+         \---tasks
+                 main.yml
+
+
+Summary of Roles
+==================
+
+A brief description of the Ansible playbook roles
+that are used to deploy the monitoring cluster.
+
+Ansible Server Roles
+----------------------
+
+This part covers the roles used to deploy the
+Prometheus, Alertmanager and Grafana (PAG) stack on the server side.
+
+Role: Monitoring
+~~~~~~~~~~~~~~~~~~
+
+Deployment and configuration of PAG stack along with collectd-exporter,
+cadvisor and node-exporter.
+
+Role: Clean-Monitoring
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Removes all the components deployed by the Monitoring role.
+
+
+File-Task Mapping and Configurable Parameters
+================================================
+
+Ansible Server
+----------------
+
+Role: Monitoring
+~~~~~~~~~~~~~~~~~~~
+
+Alert Manager
+^^^^^^^^^^^^^^^
+
+File: alertmanager-config.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/alertmanager/alertmanager-config.yaml
+
+Task: Configures Receivers for alertmanager
+
+Summary: A configmap that currently configures a webhook receiver for alertmanager;
+it can be used to configure any kind of receiver
+
+Configurable Parameters:
+ receiver.url: change to the webhook receiver's URL
+ route: Can be used to add receivers
+
+
+File: alertmanager-deployment.yaml
+''''''''''''''''''''''''''''''''''''''''''
+Path : monitoring/files/alertmanager/alertmanager-deployment.yaml
+
+Task: Deploys alertmanager instance
+
+Summary: A Deployment, deploys 1 replica of alertmanager
+
+
+File: alertmanager-service.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/alertmanager/alertmanager-service.yaml
+
+Task: Creates a K8s service for alertmanager
+
+Summary: A NodePort service, so that users can create "silences" and
+view the status of alerts from the native alertmanager dashboard / UI.
+
+Configurable Parameters:
+ spec.type: Options : NodePort, ClusterIP, LoadBalancer
+ spec.ports: Edit / add ports to be handled by the service
+
+**Note: alertmanager1-deployment, alertmanager1-service are the same as
+alertmanager-deployment and alertmanager-service respectively.**
+
+CAdvisor
+^^^^^^^^^^^
+
+File: cadvisor-daemonset.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/cadvisor/cadvisor-daemonset.yaml
+
+Task: To create a cadvisor daemonset
+
+Summary: A daemonset used to scrape data of the kubernetes cluster itself;
+because it is a daemonset, an instance runs on every node.
+
+Configurable Parameters:
+ spec.template.spec.ports: Port of the container
+
+
+File: cadvisor-service.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/cadvisor/cadvisor-service.yaml
+
+Task: To create a cadvisor service
+
+Summary: A ClusterIP service for cadvisor to communicate with prometheus
+
+Configurable Parameters:
+ spec.ports: Add / Edit ports
+
+
+Collectd Exporter
+^^^^^^^^^^^^^^^^^^^^
+
+File: collectd-exporter-deployment.yaml
+''''''''''''''''''''''''''''''''''''''''''
+Path : monitoring/files/collectd-exporter/collectd-exporter-deployment.yaml
+
+Task: To create a collectd-exporter replica
+
+Summary: A deployment, acts as receiver for collectd data sent by client machines,
+prometheus pulls data from this exporter
+
+Configurable Parameters:
+ spec.template.spec.ports: Port of the container
+
+
+File: collectd-exporter-service.yaml
+''''''''''''''''''''''''''''''''''''''''''
+Path : monitoring/files/collectd-exporter/collectd-exporter-service.yaml
+
+Task: To create a collectd-exporter service
+
+Summary: A NodePort service for collectd-exporter to hold data for prometheus
+to scrape
+
+Configurable Parameters:
+ spec.ports: Add / Edit ports
+
+
+Grafana
+^^^^^^^^^
+
+File: grafana-datasource-config.yaml
+''''''''''''''''''''''''''''''''''''''''''
+Path : monitoring/files/grafana/grafana-datasource-config.yaml
+
+Task: To create config file for grafana
+
+Summary: A configmap, adds prometheus datasource in grafana
+
+
+File: grafana-deployment.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/grafana/grafana-deployment.yaml
+
+Task: To create a grafana deployment
+
+Summary: The grafana deployment creates a single replica of grafana,
+with preconfigured prometheus datasource.
+
+Configurable Parameters:
+ spec.template.spec.ports: Edit ports
+ spec.template.spec.env: Add / Edit environment variables
+
+
+File: grafana-pv.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/grafana/grafana-pv.yaml
+
+Task: To create a persistent volume for grafana
+
+Summary: A persistent volume for grafana.
+
+Configurable Parameters:
+ spec.capacity.storage: Increase / decrease size
+ spec.accessModes: To change the way PV is accessed.
+ spec.nfs.server: To change the ip address of NFS server
+ spec.nfs.path: To change the path of the server
+
+
+File: grafana-pvc.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/grafana/grafana-pvc.yaml
+
+Task: To create a persistent volume claim for grafana
+
+Summary: A persistent volume claim for grafana.
+
+Configurable Parameters:
+ spec.resources.requests.storage: Increase / decrease size
+
+
+File: grafana-service.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/grafana/grafana-service.yaml
+
+Task: To create a service for grafana
+
+Summary: A NodePort service that users connect to in order to
+view the dashboard / UI.
+
+Configurable Parameters:
+ spec.type: Options : NodePort, ClusterIP, LoadBalancer
+ spec.ports: Edit / add ports to be handled by the service
+
+
+Kube State Metrics
+^^^^^^^^^^^^^^^^^^^^
+
+File: kube-state-metrics-deployment.yaml
+''''''''''''''''''''''''''''''''''''''''''
+Path : monitoring/files/kube-state-metrics/kube-state-metrics-deployment.yaml
+
+Task: To create a kube-state-metrics instance
+
+Summary: A deployment used to collect metrics of the kubernetes cluster itself
+
+Configurable Parameters:
+ spec.template.spec.containers.ports: Port of the container
+
+
+File: kube-state-metrics-service.yaml
+''''''''''''''''''''''''''''''''''''''''''
+Path : monitoring/files/kube-state-metrics/kube-state-metrics-service.yaml
+
+Task: To create a kube-state-metrics service
+
+Summary: A service for kube-state-metrics that exposes its data for prometheus
+to scrape
+
+Configurable Parameters:
+ spec.ports: Add / Edit ports
+
+
+Node Exporter
+^^^^^^^^^^^^^^^
+
+File: nodeexporter-daemonset.yaml
+''''''''''''''''''''''''''''''''''''''''''
+Path : monitoring/files/node-exporter/nodeexporter-daemonset.yaml
+
+Task: To create a node exporter daemonset
+
+Summary: A daemonset used to scrape data of the host machines / nodes;
+because it is a daemonset, an instance runs on every node.
+
+Configurable Parameters:
+ spec.template.spec.ports: Port of the container
+
+
+File: nodeexporter-service.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/node-exporter/nodeexporter-service.yaml
+
+Task: To create a node exporter service
+
+Summary: A ClusterIP service for node exporter to communicate with Prometheus
+
+Configurable Parameters:
+ spec.ports: Add / Edit ports
+
+
+Prometheus
+^^^^^^^^^^^^^
+
+File: prometheus-config.yaml
+''''''''''''''''''''''''''''''''''''''''''
+Path : monitoring/files/prometheus/prometheus-config.yaml
+
+Task: To create a config file for Prometheus
+
+Summary: A configmap, adds alert rules.
+
+Configurable Parameters:
+ data.alert.rules: Add / Edit alert rules
+
+
+File: prometheus-deployment.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/prometheus/prometheus-deployment.yaml
+
+Task: To create a Prometheus deployment
+
+Summary: The Prometheus deployment creates a single replica of Prometheus,
+with preconfigured Prometheus datasource.
+
+Configurable Parameters:
+ spec.template.spec.affinity: To change the node affinity,
+ make sure only 1 instance of prometheus is
+ running on 1 node.
+
+ spec.template.spec.ports: Add / Edit container port
+
+
+File: prometheus-pv.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/prometheus/prometheus-pv.yaml
+
+Task: To create a persistent volume for Prometheus
+
+Summary: A persistent volume for Prometheus.
+
+Configurable Parameters:
+ spec.capacity.storage: Increase / decrease size
+ spec.accessModes: To change the way PV is accessed.
+ spec.hostPath.path: To change the path of the volume
+
+
+File: prometheus-pvc.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/prometheus/prometheus-pvc.yaml
+
+Task: To create a persistent volume claim for Prometheus
+
+Summary: A persistent volume claim for Prometheus.
+
+Configurable Parameters:
+ spec.resources.requests.storage: Increase / decrease size
+
+
+File: prometheus-service.yaml
+'''''''''''''''''''''''''''''''''
+Path : monitoring/files/prometheus/prometheus-service.yaml
+
+Task: To create a service for prometheus
+
+Summary: A NodePort service; the native prometheus dashboard is
+available through it.
+
+Configurable Parameters:
+ spec.type: Options : NodePort, ClusterIP, LoadBalancer
+ spec.ports: Edit / add ports to be handled by the service
+
+
+File: main-prometheus-service.yaml
+''''''''''''''''''''''''''''''''''''
+Path: monitoring/files/prometheus/main-prometheus-service.yaml
+
+Task: A service that connects both prometheus instances.
+
+Summary: A NodePort service for other services to connect to the Prometheus cluster.
+HA Prometheus needs two independent instances of Prometheus that scrape the same
+inputs with the same configuration.
+
+**Note: prometheus1-deployment, prometheus1-service are the same as
+prometheus-deployment and prometheus-service respectively.**
+
+
+Ansible Client Roles
+----------------------
+
+Role: Collectd
+~~~~~~~~~~~~~~~~~~
+
+File: main.yml
+^^^^^^^^^^^^^^^^
+Path: collectd/tasks/main.yml
+
+Task: Install collectd along with prerequisites
+
+Associated template file:
+
+- collectd.conf.j2
+
+Path: collectd/files/collectd.conf.j2
+
+Summary: Edit this file to change the default configuration
+installed on the client machines
diff --git a/docs/lma/metrics/images/dataflow.png b/docs/lma/metrics/images/dataflow.png
new file mode 100644
index 00000000..ca1ec908
--- /dev/null
+++ b/docs/lma/metrics/images/dataflow.png
Binary files differ
diff --git a/docs/lma/metrics/images/setup.png b/docs/lma/metrics/images/setup.png
new file mode 100644
index 00000000..ce6a1274
--- /dev/null
+++ b/docs/lma/metrics/images/setup.png
Binary files differ
diff --git a/docs/lma/metrics/userguide.rst b/docs/lma/metrics/userguide.rst
new file mode 100644
index 00000000..0ee4a238
--- /dev/null
+++ b/docs/lma/metrics/userguide.rst
@@ -0,0 +1,230 @@
+=================
+Metrics
+=================
+Table of Contents
+=================
+.. contents::
+.. section-numbering::
+
+Setup
+=======
+
+Prerequisites
+-------------------------
+- 3 VMs are required to set up K8s
+- ``$ sudo yum install ansible``
+- ``$ pip install openshift pyyaml kubernetes`` (required for ansible K8s module)
+- Update IPs in all these files (if changed)
+ - ``ansible-server/group_vars/all.yml`` (IP of apiserver and hostname)
+ - ``ansible-server/hosts`` (IP of VMs to install)
+ - ``ansible-server/roles/monitoring/files/grafana/grafana-pv.yaml`` (IP of NFS-Server)
+ - ``ansible-server/roles/monitoring/files/alertmanager/alertmanager-config.yaml`` (IP of alert-receiver)
+
+Setup Structure
+---------------
+.. image:: images/setup.png
+
+Installation - Client Side
+----------------------------
+
+Nodes
+`````
+- **Node1** = 10.10.120.21
+- **Node4** = 10.10.120.24
+
+How is installation done?
+`````````````````````````
+The Ansible playbook is available in the ``tools/lma/ansible-client`` folder:
+
+- ``cd tools/lma/ansible-client``
+- ``ansible-playbook playbooks/setup.yaml``
+
+This deploys collectd and configures it to send data to the collectd exporter
+at 10.10.120.211 (the IP address of the current collectd-exporter instance).
+If needed, change this in the config file in ``tools/lma/ansible-client/roles/collectd/files/``.
+
+Installation - Server Side
+----------------------------
+
+Nodes
+``````
+
+Inside Jumphost - POD12
+ - **VM1** = 10.10.120.211
+ - **VM2** = 10.10.120.203
+ - **VM3** = 10.10.120.204
+
+
+How is installation done?
+`````````````````````````
+**Using Ansible:**
+ - **K8s**
+ - **Prometheus:** 2 independent deployments
+ - **Alertmanager:** 2 independent deployments (cluster peers)
+ - **Grafana:** 1 Replica deployment
+ - **cAdvisor:** 1 daemonset, i.e. 3 replicas, one on each node
+ - **collectd-exporter:** 1 Replica
+ - **node-exporter:** 1 statefulset with 3 replicas
+ - **kube-state-metrics:** 1 deployment
+ - **NFS Server:** on each VM, to store Grafana data at the following path
+ - ``/usr/share/monitoring_data/grafana``
+
+How to set up?
+``````````````
+- **To set up the K8s cluster, EFK and PAG:** Run the Ansible playbook ``ansible/playbooks/setup.yaml``
+- **To clean everything:** Run the Ansible playbook ``ansible/playbooks/clean.yaml``
+
+Do we have HA?
+````````````````
+Yes. Prometheus and Alertmanager each run as two independent deployments.
+
+Configuration
+=============
+
+K8s
+---
+Path to all yamls (Server Side)
+````````````````````````````````
+``tools/lma/ansible-server/roles/monitoring/files/``
+
+K8s namespace
+`````````````
+``monitoring``
+
+Configuration
+---------------------------
+
+Services and Ports
+``````````````````````````
+
+The services and their ports are listed below.
+Open the IP of any node on one of these ports and
+the service will redirect you correctly.
+
+
+ ====================== =======
+ Service Port
+ ====================== =======
+ Prometheus 30900
+ Prometheus1 30901
+ Main-Prometheus 30902
+ Alertmanager 30930
+ Alertmanager1 30931
+ Grafana 30000
+ Collectd-exporter 30130
+ ====================== =======
+
+How to change Configuration?
+------------------------------
+- Ports, container names and almost every other setting can be modified by changing the required values in the respective YAML files (``/tools/lma/ansible-server/roles/monitoring/``)
+- For metrics, on the client machine, edit collectd's configuration file (a Jinja2 template) and add the required plugins (``/tools/lma/ansible-client/roles/collectd/files/collectd.conf.j2``).
+  For more details, refer to the `collectd first steps guide <https://collectd.org/wiki/index.php/First_steps>`_
+
+Where to send metrics?
+------------------------
+
+Metrics are sent to the collectd exporter
+as UDP packets on port 38026
+(the port can be checked and changed in
+``tools/lma/ansible-server/roles/monitoring/files/collectd-exporter/collectd-exporter-deployment.yaml``)
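+
+As a rough sketch, the client-side counterpart of this setting could look like the
+following fragment of ``collectd.conf.j2``. It assumes collectd's ``network`` plugin is
+used and that the exporter is reached at 10.10.120.211 (the address given in the
+client-side installation section); the template shipped in the repository may differ::
+
+    # Assumed example: forward collectd metrics over UDP to the collectd exporter
+    LoadPlugin network
+    <Plugin network>
+      # exporter address and UDP port; adjust both if the deployment changes
+      Server "10.10.120.211" "38026"
+    </Plugin>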
+
+Data Management
+================================
+
+DataFlow:
+--------------
+.. image:: images/dataflow.png
+
+Where is the data stored now?
+----------------------------------
+ - Grafana data (including dashboards) ==> On master, at ``/usr/share/monitoring_data/grafana`` (it is accessed by a Persistent Volume via NFS)
+ - Prometheus data ==> On VM2 and VM3, at ``/usr/share/monitoring_data/prometheus``
+
+ **Note: The data of the two Prometheus instances is also independent of each other; a shared-data solution gave errors**
+
+Do we have backup of data?
+-------------------------------
+ Although independent, the two Prometheus instances scrape the same targets
+ and have the same alert rules, so they generate very similar data.
+
+ The Grafana data stored on NFS has no backup.
+ The dashboards' JSON files are available in the ``/tools/lma/metrics/dashboards`` directory.
+
+Is the data still accessible when containers are restarted?
+-----------------------------------------------------------------
+ Yes, unless the data directories (``/usr/share/monitoring_data/*``) are deleted from each node
+
+Alert Management
+==================
+
+Configure Alert receiver
+--------------------------
+- Open the file ``/tools/lma/ansible-server/roles/monitoring/files/alertmanager/alertmanager-config.yaml``
+- In the ``config.yml`` section, under ``receivers``, add, update or delete receivers
+- Currently, the IP of the unified alert receiver is used.
+- Alertmanager supports multiple types of receivers; you can get a `list here <https://prometheus.io/docs/alerting/latest/configuration/>`_
+
+Add new alerts
+--------------------------------------
+- Open the file ``/tools/lma/ansible-server/roles/monitoring/files/prometheus/prometheus-config.yaml``
+- Under the ``data`` section, the ``alert.rules`` file is mounted on the config map.
+- In this file, alerts are divided into 4 groups, namely:
+ - targets
+ - host and hardware
+ - container
+ - kubernetes
+- Add alerts under an existing group or add a new group. When adding a new group, follow the structure of the file
+- To add a new alert:
+ - Use the following structure:
+
+ alert: alertname
+
+ expr: alert rule (generally promql conditional query)
+
+ for: time-range (e.g. 5m, 10s; the amount of time the condition needs to be true for the alert to be triggered)
+
+ labels:
+
+ severity: critical (other severity options and other labels can be added here)
+
+ type: hardware
+
+ annotations:
+
+ summary: <summary of the alert>
+
+ description: <describe the alert here>
+
+- For an extensive list of alert rules, see `Awesome Prometheus alerts <https://awesome-prometheus-alerts.grep.to/>`_
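+
+As an illustration of the alert structure described above, a rule inside one of the
+groups could look like the sketch below. The alert name, threshold and annotation texts
+are hypothetical, and the expression assumes that node-exporter's
+``node_cpu_seconds_total`` metric is being scraped::
+
+    # Hypothetical rule: fires when average CPU usage on a host stays above 80%
+    - alert: HostHighCpuLoad
+      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
+      for: 5m
+      labels:
+        severity: warning
+        type: hardware
+      annotations:
+        summary: "High CPU load on {{ $labels.instance }}"
+        description: "CPU usage has been above 80% for 5 minutes (current value: {{ $value }})"
+
+Pasting the ``expr`` line into the Prometheus query box first is a quick way to check
+that the condition returns the expected series before adding the rule.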
+
+Troubleshooting
+===============
+No metrics received in grafana plot
+---------------------------------------------
+- Check that all configuration has been done correctly.
+- Open main-prometheus's port on any VM's IP (e.g. 10.10.120.211:30902) and check whether Prometheus is receiving the metrics
+- If Prometheus is receiving them, read Grafana's logs (``kubectl -n monitoring logs <name_of_grafana_pod>``)
+- Otherwise, have a look at the collectd exporter's metrics endpoint (e.g. 10.10.120.211:30103/metrics)
+- If the collectd exporter is receiving them, check that the collectd exporter's IP in Prometheus's config file is correct.
+- Otherwise, SSH to the master and check which node collectd-exporter is scheduled on, for example with ``kubectl -n monitoring get pods -o wide`` (let's say it is vm2)
+- Now SSH to vm2
+- Run ``tcpdump -i ens3 > testdump``, where ``ens3`` is the interface used to connect to the internet
+- Grep for your client node's IP and check whether packets are reaching the monitoring cluster (``cat testdump | grep <ip of client>``)
+- Ideally you should see packets reaching the node. If so, check that collectd-exporter is running correctly and read its logs.
+- If no packets are received, the error is on the client side: check collectd's config file and make sure the correct collectd-exporter IP is used in the ``<network>`` section.
+
+If no notification received
+---------------------------
+- Open main-prometheus's port on any VM's IP (e.g. 10.10.120.211:30902) and check whether Prometheus is receiving the metrics
+- If not, read the "No metrics received in grafana plot" section; otherwise read ahead.
+- Check the IP of the alert receiver: open alertmanager-ip:port and check whether Alertmanager is configured correctly.
+- If it is, paste the alert rule into Prometheus's query box and see whether any metric satisfies the condition.
+- If there is a bug in the alert rule, change it in the alert.rules section of prometheus-config.yaml (see the "Add new alerts" section for detailed instructions)
+
+Reference
+=========
+- `Prometheus K8S deployment <https://www.metricfire.com/blog/how-to-deploy-prometheus-on-kubernetes/>`_
+- `HA Prometheus <https://prometheus.io/docs/introduction/faq/#can-prometheus-be-made-highly-available>`_
+- `Data Flow Diagram <https://drive.google.com/file/d/1D--LXFqU_H-fqpD57H3lJFOqcqWHoF0U/view?usp=sharing>`_
+- `Collectd Configuration <https://docs.opnfv.org/en/stable-fraser/submodules/barometer/docs/release/userguide/docker.userguide.html#build-the-collectd-docker-image>`_
+- `Alertmanager Rule Config <https://awesome-prometheus-alerts.grep.to/>`_