Diffstat (limited to 'docs')
-rw-r--r--  docs/lma/devguide.rst                  |  147
-rw-r--r--  docs/lma/logs/images/elasticsearch.png |  bin 0 -> 36046 bytes
-rw-r--r--  docs/lma/logs/images/fluentd-cs.png    |  bin 0 -> 40226 bytes
-rw-r--r--  docs/lma/logs/images/fluentd-ss.png    |  bin 0 -> 18331 bytes
-rw-r--r--  docs/lma/logs/images/nginx.png         |  bin 0 -> 36737 bytes
-rw-r--r--  docs/lma/logs/images/setup.png         |  bin 0 -> 43503 bytes
-rw-r--r--  docs/lma/logs/userguide.rst            |  348
-rw-r--r--  docs/lma/metrics/devguide.rst          |  474
-rw-r--r--  docs/lma/metrics/images/dataflow.png   |  bin 0 -> 42443 bytes
-rw-r--r--  docs/lma/metrics/images/setup.png      |  bin 0 -> 15019 bytes
-rw-r--r--  docs/lma/metrics/userguide.rst         |  230
11 files changed, 1199 insertions, 0 deletions
diff --git a/docs/lma/devguide.rst b/docs/lma/devguide.rst
new file mode 100644
index 00000000..c72b8b12
--- /dev/null
+++ b/docs/lma/devguide.rst
@@ -0,0 +1,147 @@

=================
Table of Contents
=================
.. contents::
.. section-numbering::

Ansible Client-side
===================

Ansible File Organisation
-------------------------
File structure::

    ansible-client
    ├── ansible.cfg
    ├── hosts
    ├── playbooks
    │   └── setup.yaml
    └── roles
        ├── clean-td-agent
        │   └── tasks
        │       └── main.yml
        └── td-agent
            ├── files
            │   └── td-agent.conf
            └── tasks
                └── main.yml

Summary of roles
----------------
====================== ==================================================
Role                   Description
====================== ==================================================
``td-agent``           Install td-agent & change the configuration file
``clean-td-agent``     Uninstall td-agent
====================== ==================================================

Configurable Parameters
-----------------------
================================ =========== ============================================
File (ansible-client/roles/)     Parameter   Description
================================ =========== ============================================
``td-agent/files/td-agent.conf`` host        Fluentd-server IP (see the sketch below)
``td-agent/files/td-agent.conf`` port        Fluentd-server port
================================ =========== ============================================
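Both parameters live in the ``<server>`` stanza of td-agent's forward output.
A minimal sketch of the relevant block, assuming the server-side Fluentd
NodePort values quoted in the user guide (not necessarily the shipped
defaults)::

    # Forward all collected records to the server-side Fluentd instance.
    <match **>
      @type forward
      <server>
        host 10.10.120.211   # Fluentd-server IP
        port 32224           # Fluentd-server port
      </server>
    </match>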
Ansible Server-side
===================

Ansible File Organisation
-------------------------
File structure::

    ansible-server
    ├── ansible.cfg
    ├── group_vars
    │   └── all.yml
    ├── hosts
    ├── playbooks
    │   └── setup.yaml
    └── roles
        ├── clean-logging
        │   └── tasks
        │       └── main.yml
        ├── k8s-master
        │   └── tasks
        │       └── main.yml
        ├── k8s-pre
        │   └── tasks
        │       └── main.yml
        ├── k8s-worker
        │   └── tasks
        │       └── main.yml
        ├── logging
        │   ├── files
        │   │   ├── elastalert
        │   │   │   ├── ealert-conf-cm.yaml
        │   │   │   ├── ealert-key-cm.yaml
        │   │   │   ├── ealert-rule-cm.yaml
        │   │   │   └── elastalert.yaml
        │   │   ├── elasticsearch
        │   │   │   ├── elasticsearch.yaml
        │   │   │   └── user-secret.yaml
        │   │   ├── fluentd
        │   │   │   ├── fluent-cm.yaml
        │   │   │   ├── fluent-service.yaml
        │   │   │   └── fluent.yaml
        │   │   ├── kibana
        │   │   │   └── kibana.yaml
        │   │   ├── namespace.yaml
        │   │   ├── nginx
        │   │   │   ├── nginx-conf-cm.yaml
        │   │   │   ├── nginx-key-cm.yaml
        │   │   │   ├── nginx-service.yaml
        │   │   │   └── nginx.yaml
        │   │   ├── persistentVolume.yaml
        │   │   └── storageClass.yaml
        │   └── tasks
        │       └── main.yml
        └── nfs
            └── tasks
                └── main.yml

Summary of roles
----------------
====================== ================================================================================
Role                   Description
====================== ================================================================================
``k8s-pre``            Prerequisites for installing K8s, e.g. installing Docker & K8s, disabling swap
``k8s-master``         Reset K8s & make a master
``k8s-worker``         Join worker nodes with a token
``logging``            Set up EFK & elastalert in K8s
``clean-logging``      Remove the EFK & elastalert setup from K8s
``nfs``                Start an NFS server to store Elasticsearch data
====================== ================================================================================

Configurable Parameters
-----------------------
==================================================== ========================================== ======================
File (ansible-server/roles/)                         Parameter name                             Description
==================================================== ========================================== ======================
**Role: logging**
``logging/files/persistentVolume.yaml``              storage                                    Increase or decrease the persistent-volume size for each VM
``logging/files/kibana/kibana.yaml``                 version                                    Change the Kibana version
``logging/files/kibana/kibana.yaml``                 count                                      Increase or decrease the replicas
``logging/files/elasticsearch/elasticsearch.yaml``   version                                    Change the Elasticsearch version
``logging/files/elasticsearch/elasticsearch.yaml``   nodePort                                   Change the service port
``logging/files/elasticsearch/elasticsearch.yaml``   storage                                    Increase or decrease the Elasticsearch data storage size for each VM
``logging/files/elasticsearch/elasticsearch.yaml``   nodeAffinity -> values (hostname)          Change the hostname to run the Elasticsearch master or data pod on a specific node
``logging/files/elasticsearch/user-secret.yaml``     stringData                                 Add an Elasticsearch user & its roles, see the sketch below (`Elastic Docs <https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-users-and-roles.html#k8s_file_realm>`_)
``logging/files/fluentd/fluent.yaml``                replicas                                   Increase or decrease the replicas
``logging/files/fluentd/fluent-service.yaml``        nodePort                                   Change the service port
``logging/files/fluentd/fluent-cm.yaml``             index_template.json -> number_of_replicas Increase or decrease the replicas of data in Elasticsearch
``logging/files/fluentd/fluent-cm.yaml``             fluent.conf                                Server port & other Fluentd configuration
``logging/files/nginx/nginx.yaml``                   replicas                                   Increase or decrease the replicas
``logging/files/nginx/nginx-service.yaml``           nodePort                                   Change the service port
``logging/files/nginx/nginx-key-cm.yaml``            kibana-access.key, kibana-access.pem      Key files for the HTTPS connection
``logging/files/nginx/nginx-conf-cm.yaml``           -                                          Nginx configuration
``logging/files/elastalert/elastalert.yaml``         replicas                                   Increase or decrease the replicas
``logging/files/elastalert/ealert-key-cm.yaml``      elastalert.key, elastalert.pem             Key files for the HTTPS connection
``logging/files/elastalert/ealert-conf-cm.yaml``     run_every                                  How often ElastAlert queries Elasticsearch
``logging/files/elastalert/ealert-conf-cm.yaml``     alert_time_limit                           If an alert fails for some reason, ElastAlert retries sending it until this period has elapsed
``logging/files/elastalert/ealert-conf-cm.yaml``     es_host, es_port                           Elasticsearch service name & port in K8s
``logging/files/elastalert/ealert-rule-cm.yaml``     http_post_url                              Alert receiver IP (`Elastalert Rule Config <https://elastalert.readthedocs.io/en/latest/ruletypes.html>`_)
**Role: nfs**
``nfs/tasks/main.yml``                               line                                       Path of the NFS storage
==================================================== ========================================== ======================
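For the ``stringData`` entry above, a file-realm user secret would look
roughly like the sketch below, following the linked Elastic documentation
(the secret name, bcrypt hash and role mapping are placeholders, not the
shipped values)::

    # user-secret.yaml (sketch): file-realm user for ECK
    kind: Secret
    apiVersion: v1
    metadata:
      name: logging-es-filerealm-secret
    stringData:
      users: |-
        elasticsearch:$2a$10$<bcrypt-hash-of-password>
      users_roles: |-
        superuser:elasticsearch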
diff --git a/docs/lma/logs/images/elasticsearch.png b/docs/lma/logs/images/elasticsearch.png
new file mode 100644
index 00000000..f0b876f5
--- /dev/null
+++ b/docs/lma/logs/images/elasticsearch.png
Binary files differ
diff --git a/docs/lma/logs/images/fluentd-cs.png b/docs/lma/logs/images/fluentd-cs.png
new file mode 100644
index 00000000..513bb3ef
--- /dev/null
+++ b/docs/lma/logs/images/fluentd-cs.png
Binary files differ
diff --git a/docs/lma/logs/images/fluentd-ss.png b/docs/lma/logs/images/fluentd-ss.png
new file mode 100644
index 00000000..4e9ab112
--- /dev/null
+++ b/docs/lma/logs/images/fluentd-ss.png
Binary files differ
diff --git a/docs/lma/logs/images/nginx.png b/docs/lma/logs/images/nginx.png
new file mode 100644
index 00000000..a0b00514
--- /dev/null
+++ b/docs/lma/logs/images/nginx.png
Binary files differ
diff --git a/docs/lma/logs/images/setup.png b/docs/lma/logs/images/setup.png
new file mode 100644
index 00000000..267685fa
--- /dev/null
+++ b/docs/lma/logs/images/setup.png
Binary files differ
diff --git a/docs/lma/logs/userguide.rst b/docs/lma/logs/userguide.rst
new file mode 100644
index 00000000..b410ee6c
--- /dev/null
+++ b/docs/lma/logs/userguide.rst
@@ -0,0 +1,348 @@

=================
Table of Contents
=================
.. contents::
.. section-numbering::

Setup
=====

Prerequisites
-------------
- Requires 3 VMs to set up K8s
- ``$ sudo yum install ansible``
- ``$ pip install openshift pyyaml kubernetes`` (required for the Ansible K8s module)
- Update the IPs in all of these files (if changed)

  ====================================================================== =======================================
  Path                                                                   Description
  ====================================================================== =======================================
  ``ansible-server/group_vars/all.yml``                                  IP of the K8s apiserver and VM hostname
  ``ansible-server/hosts``                                               IPs of the VMs to install on
  ``ansible-server/roles/logging/files/persistentVolume.yaml``           IP of the NFS server
  ``ansible-server/roles/logging/files/elastalert/ealert-rule-cm.yaml``  IP of the alert receiver
  ====================================================================== =======================================

Architecture
------------
.. image:: images/setup.png

Installation - Client Side
--------------------------

Nodes
`````
- **Node1** = 10.10.120.21
- **Node4** = 10.10.120.24

How is the installation done?
`````````````````````````````
- Install td-agent:
  ``$ curl -L https://toolbelt.treasuredata.com/sh/install-redhat-td-agent3.sh | sh``
- Copy the td-agent config file on **Node1**:
  ``$ cp tdagent-client-config/node1.conf /etc/td-agent/td-agent.conf``
- Copy the td-agent config file on **Node4**:
  ``$ cp tdagent-client-config/node4.conf /etc/td-agent/td-agent.conf``
- Restart the service:
  ``$ sudo service td-agent restart``

Installation - Server Side
--------------------------

Nodes
`````
Inside Jumphost - POD12:
  - **VM1** = 10.10.120.211
  - **VM2** = 10.10.120.203
  - **VM3** = 10.10.120.204

How is the installation done?
`````````````````````````````
**Using Ansible:**
  - **K8s**
  - **Elasticsearch:** 1 master & 1 data node on each VM
  - **Kibana:** 1 replica
  - **Nginx:** 2 replicas
  - **Fluentd:** 2 replicas
  - **Elastalert:** 1 replica (duplicate alerts are received if the replica count is increased)
  - **NFS server:** on each VM, storing Elasticsearch data at the following paths:
    - ``/srv/nfs/master``
    - ``/srv/nfs/data``

How to set up?
``````````````
- **To set up the K8s cluster and EFK:** run the Ansible playbook ``ansible/playbooks/setup.yaml``
- **To clean everything:** run the Ansible playbook ``ansible/playbooks/clean.yaml``
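For example, assuming the inventory from ``ansible-server/hosts`` is picked
up via ``ansible.cfg`` (invocation details may differ in your checkout)::

    $ cd ansible-server
    $ ansible-playbook playbooks/setup.yaml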
Do we have HA?
``````````````
Yes

Configuration
=============

K8s
---

Path of all YAMLs (server side)
```````````````````````````````
``ansible-server/roles/logging/files/``

K8s namespace
`````````````
``logging``

K8s service details
```````````````````
``$ kubectl get svc -n logging``

Elasticsearch Configuration
---------------------------

Elasticsearch setup structure
`````````````````````````````
.. image:: images/elasticsearch.png

Elasticsearch service details
`````````````````````````````
| **Service Name:** ``logging-es-http``
| **Service Port:** ``9200``
| **Service Type:** ``ClusterIP``

How to get the default Elasticsearch username & password?
``````````````````````````````````````````````````````````
- User1 (custom user):
  | **Username:** ``elasticsearch``
  | **Password:** ``password123``
- User2 (created by default by the Elastic operator):
  | **Username:** ``elastic``
  | To get the default password:
  | ``$ PASSWORD=$(kubectl get secret -n logging logging-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')``
  | ``$ echo $PASSWORD``

How to increase the replica count of an index?
``````````````````````````````````````````````
::

    $ curl -k -u "elasticsearch:password123" -H 'Content-Type: application/json' \
        -XPUT "https://10.10.120.211:9200/indexname*/_settings" -d '
    {
      "index" : {
        "number_of_replicas" : "2"
      }
    }'

Index life
``````````
**30 days**

Kibana Configuration
--------------------

Kibana service details
``````````````````````
| **Service Name:** ``logging-kb-http``
| **Service Port:** ``5601``
| **Service Type:** ``ClusterIP``

Nginx Configuration
-------------------

IP
``
https://10.10.120.211:32000

Nginx setup structure
`````````````````````
.. image:: images/nginx.png

Nginx service details
`````````````````````
| **Service Name:** ``nginx``
| **Service Port:** ``32000``
| **Service Type:** ``NodePort``

Why is Nginx used?
``````````````````
`Securing ELK using Nginx <https://logz.io/blog/securing-elk-nginx/>`_

Nginx configuration
```````````````````
**Path:** ``ansible-server/roles/logging/files/nginx/nginx-conf-cm.yaml``

Fluentd Configuration - Client Side (td-agent)
----------------------------------------------

Fluentd setup structure
```````````````````````
.. image:: images/fluentd-cs.png

Log collection paths
````````````````````
- ``/tmp/result*/*.log``
- ``/tmp/result*/*.dat``
- ``/tmp/result*/*.csv``
- ``/tmp/result*/stc-liveresults.dat.*``
- ``/var/log/userspace*.log``
- ``/var/log/sriovdp/*.log.*``
- ``/var/log/pods/**/*.log``

Logs are sent to
````````````````
Another Fluentd instance in the K8s cluster (K8s master: 10.10.120.211) on the jumphost.

Td-agent logs
`````````````
Path of the td-agent logs: ``/var/log/td-agent/td-agent.log``

Td-agent configuration
``````````````````````
| Path of the conf file: ``/etc/td-agent/td-agent.conf``
| **If any change is made to td-agent.conf, restart the td-agent service:** ``$ sudo service td-agent restart``

Config description
``````````````````
- Get the logs from the collection paths
- Convert each log line to the following format::

    {
      "msg": "log line",
      "log_path": "/file/path",
      "file": "file.name",
      "host": "pod12-node4"
    }

- Send it to the server-side Fluentd (a sketch of such a pipeline follows below)
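A minimal sketch of such a pipeline in td-agent syntax (this is not the
shipped node1.conf/node4.conf; the pos_file path and tag are assumptions)::

    # Tail one of the collection paths; path_key records the source file path.
    <source>
      @type tail
      path /tmp/result*/*.log
      pos_file /var/log/td-agent/results.log.pos
      path_key log_path
      tag results.log
      <parse>
        @type none
        message_key msg
      </parse>
    </source>

    # Attach the file and host fields described above.
    <filter results.log>
      @type record_transformer
      enable_ruby true
      <record>
        file ${File.basename(record["log_path"])}
        host "pod12-node4"
      </record>
    </filter>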
Fluentd Configuration - Server Side
-----------------------------------

Fluentd setup structure
```````````````````````
.. image:: images/fluentd-ss.png

Fluentd service details
```````````````````````
| **Service Name:** ``fluentd``
| **Service Port:** ``32224``
| **Service Type:** ``NodePort``

Logs are sent to
````````````````
The Elasticsearch service (https://logging-es-http:9200)

Config description
``````````````````
- **Step 1:** get the logs from Node1 & Node4.
- **Step 2:** add a tag (for routing) based on the log path:

  ======================================== =======================
  log_path                                 tag added (for routing)
  ======================================== =======================
  ``/tmp/result.*/.*errors.dat``           errordat.log
  ``/tmp/result.*/.*counts.dat``           countdat.log
  ``/tmp/result.*/stc-liveresults.dat.tx`` stcdattx.log
  ``/tmp/result.*/stc-liveresults.dat.rx`` stcdatrx.log
  ``/tmp/result.*/.*Statistics.csv``       ixia.log
  ``/tmp/result.*/vsperf-overall*``        vsperf.log
  ``/tmp/result.*/vswitchd*``              vswitchd.log
  ``/var/log/userspace*``                  userspace.log
  ``/var/log/sriovdp*``                    sriovdp.log
  ``/var/log/pods*``                       pods.log
  ======================================== =======================

- **Step 3:** parse each type using its tag:

  - error.conf: to find any errors
  - time-series.conf: to parse time-series data
  - time-analysis.conf: to calculate the time analysis

- **Step 4:** add a tag based on the host:

  ================================ =======================
  host                             tag added (for routing)
  ================================ =======================
  ``pod12-node4``                  node4
  ``worker``                       node1
  ================================ =======================

- **Step 5:** route to Elasticsearch based on the tag:

  ================================ =======================
  Tag                              Elasticsearch
  ================================ =======================
  ``node4``                        index "node4*"
  ``node1``                        index "node1*"
  ================================ =======================

Elastalert
----------

Alerts are sent if
``````````````````
- Blacklist match:
  - "Failed to run test"
  - "Failed to execute in '30' seconds"
  - "('Result', 'Failed')"
  - "could not open socket: connection refused"
  - "Input/output error"
  - "dpdk|ERR|EAL: Error - exiting with code: 1"
  - "dpdk|ERR|EAL: Driver cannot attach the device"
  - "dpdk|EMER|Cannot create lock on"
  - "dpdk|ERR|VHOST_CONFIG: * device not found"
- Time:
  - vswitch_duration > 3 sec

How to configure alerts?
````````````````````````
- Add your rule in ``ansible/roles/logging/files/elastalert/ealert-rule-cm.yaml``
  (`Elastalert Rule Config <https://elastalert.readthedocs.io/en/latest/ruletypes.html>`_); a complete example follows below::

    name: anything
    type: <check-above-link>    # the RuleType to use
    index: node4*               # index name
    realert:
      minutes: 0                # alert on every match after each interval
    alert: post                 # send the alert as an HTTP POST
    http_post_url: "http://url"

- Mount this file into the elastalert pod in ``ansible/roles/logging/files/elastalert/elastalert.yaml``.
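Putting the pieces together, a complete blacklist rule might look like the
sketch below (the rule name, match strings and receiver URL are illustrative;
``compare_key: msg`` assumes the record format produced by the client-side
td-agent)::

    name: vsperf-test-failure
    type: blacklist
    index: node4*
    compare_key: msg
    blacklist:
      - "Failed to run test"
      - "('Result', 'Failed')"
    realert:
      minutes: 0
    alert: post
    http_post_url: "http://<alert-receiver-ip>:<port>"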
Alert format
````````````
::

    {"type": "pattern-match", "label": "failed", "index": "node4-20200815",
     "log": "error-log-line", "log-path": "/tmp/result/file.log",
     "reason": "error-message"}

Data Management
===============

Elasticsearch
-------------

Where is the data stored now?
`````````````````````````````
Data is stored on the NFS server with 1 replica of each index (default).
The data paths are:
  - ``/srv/nfs/data`` (VM1)
  - ``/srv/nfs/data`` (VM2)
  - ``/srv/nfs/data`` (VM3)
  - ``/srv/nfs/master`` (VM1)
  - ``/srv/nfs/master`` (VM2)
  - ``/srv/nfs/master`` (VM3)

Can the user change from NFS to local storage?
``````````````````````````````````````````````
Yes, the user can do this by reconfiguring the persistent volume
(``ansible-server/roles/logging/files/persistentVolume.yaml``); a sketch
follows below.
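A hedged sketch of what a local (hostPath) variant of the persistent volume
could look like; the PV name, size and path are placeholders, not values from
the repository::

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: elasticsearch-data-local
    spec:
      capacity:
        storage: 5Gi
      accessModes:
        - ReadWriteOnce
      hostPath:
        path: /srv/local/elasticsearch-data   # local path on the node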
Do we have a backup of the data?
````````````````````````````````
1 replica of each index

Is the data still accessible after a K8s restart?
`````````````````````````````````````````````````
Yes (as long as the data is not deleted from ``/srv/nfs/data``)

Troubleshooting
===============

If Elasticsearch is receiving no logs
-------------------------------------
- Check the IP & port of the server-side Fluentd in the client config.
- Check the client-side Fluentd logs: ``$ sudo tail -f /var/log/td-agent/td-agent.log``
- Check the server-side Fluentd logs: ``$ sudo kubectl logs -n logging <fluentd-pod-name>``

If no notification is received
------------------------------
- Search for your "log" in Elasticsearch.
- Check the elastalert config.
- Check the IP of the alert receiver.

Reference
=========
- `Elastic cloud on K8s <https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-quickstart.html>`_
- `HA Elasticsearch on K8s <https://www.elastic.co/blog/high-availability-elasticsearch-on-kubernetes-with-eck-and-gke>`_
- `Fluentd Configuration <https://docs.fluentd.org/configuration/config-file>`_
- `Elastalert Rule Config <https://elastalert.readthedocs.io/en/latest/ruletypes.html>`_
\ No newline at end of file
diff --git a/docs/lma/metrics/devguide.rst b/docs/lma/metrics/devguide.rst
new file mode 100644
index 00000000..93d33016
--- /dev/null
+++ b/docs/lma/metrics/devguide.rst
@@ -0,0 +1,474 @@

=================
Metrics Dev Guide
=================

Table of Contents
=================
.. contents::
.. section-numbering::

Ansible File Organization
=========================

Ansible - Server
----------------

Please follow this file structure:

.. code-block:: bash

    ansible-server
    |   ansible.cfg
    |   hosts
    |
    +---group_vars
    |       all.yml
    |
    +---playbooks
    |       clean.yaml
    |       setup.yaml
    |
    \---roles
        +---clean-monitoring
        |   \---tasks
        |           main.yml
        |
        \---monitoring
            +---files
            |   |   monitoring-namespace.yaml
            |   |
            |   +---alertmanager
            |   |       alertmanager-config.yaml
            |   |       alertmanager-deployment.yaml
            |   |       alertmanager-service.yaml
            |   |       alertmanager1-deployment.yaml
            |   |       alertmanager1-service.yaml
            |   |
            |   +---cadvisor
            |   |       cadvisor-daemonset.yaml
            |   |       cadvisor-service.yaml
            |   |
            |   +---collectd-exporter
            |   |       collectd-exporter-deployment.yaml
            |   |       collectd-exporter-service.yaml
            |   |
            |   +---grafana
            |   |       grafana-datasource-config.yaml
            |   |       grafana-deployment.yaml
            |   |       grafana-pv.yaml
            |   |       grafana-pvc.yaml
            |   |       grafana-service.yaml
            |   |
            |   +---kube-state-metrics
            |   |       kube-state-metrics-deployment.yaml
            |   |       kube-state-metrics-service.yaml
            |   |
            |   +---node-exporter
            |   |       nodeexporter-daemonset.yaml
            |   |       nodeexporter-service.yaml
            |   |
            |   \---prometheus
            |           main-prometheus-service.yaml
            |           prometheus-config.yaml
            |           prometheus-deployment.yaml
            |           prometheus-pv.yaml
            |           prometheus-pvc.yaml
            |           prometheus-service.yaml
            |           prometheus1-deployment.yaml
            |           prometheus1-service.yaml
            |
            \---tasks
                    main.yml

Ansible - Client
----------------

Please follow this file structure:

.. code-block:: bash

    ansible-client
    |   ansible.cfg
    |   hosts
    |
    +---group_vars
    |       all.yml
    |
    +---playbooks
    |       clean.yaml
    |       setup.yaml
    |
    \---roles
        +---clean-collectd
        |   \---tasks
        |           main.yml
        |
        \---collectd
            +---files
            |       collectd.conf.j2
            |
            \---tasks
                    main.yml

Summary of Roles
================

A brief description of the Ansible playbook roles,
which are used to deploy the monitoring cluster.

Ansible Server Roles
--------------------

The server side consists of the roles used to deploy the
Prometheus Alertmanager Grafana (PAG) stack.

Role: Monitoring
~~~~~~~~~~~~~~~~

Deployment and configuration of the PAG stack along with collectd-exporter,
cadvisor and node-exporter.

Role: Clean-Monitoring
~~~~~~~~~~~~~~~~~~~~~~

Removes all the components deployed by the Monitoring role.

File-Task Mapping and Configurable Parameters
=============================================

Ansible Server
--------------

Role: Monitoring
~~~~~~~~~~~~~~~~

Alert Manager
^^^^^^^^^^^^^

File: alertmanager-config.yaml
''''''''''''''''''''''''''''''
Path: ``monitoring/files/alertmanager/alertmanager-config.yaml``

Task: Configures receivers for Alertmanager.

Summary: A configmap; currently configures a webhook for Alertmanager,
but can be used to configure any kind of receiver.

Configurable Parameters:
  | receiver.url: change to the webhook receiver's URL
  | route: can be used to add receivers (sketched below)
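A sketch of the embedded ``config.yml`` with a single webhook receiver (the
receiver name and URL are placeholders, not the shipped values):

.. code-block:: yaml

    route:
      receiver: alert-receiver
    receivers:
      - name: alert-receiver
        webhook_configs:
          - url: "http://<alert-receiver-ip>:<port>"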
File: alertmanager-deployment.yaml
''''''''''''''''''''''''''''''''''
Path: ``monitoring/files/alertmanager/alertmanager-deployment.yaml``

Task: Deploys an Alertmanager instance.

Summary: A deployment; deploys 1 replica of Alertmanager.

File: alertmanager-service.yaml
'''''''''''''''''''''''''''''''
Path: ``monitoring/files/alertmanager/alertmanager-service.yaml``

Task: Creates a K8s service for Alertmanager.

Summary: A NodePort type of service, so that users can create "silences" and
view the status of alerts from the native Alertmanager dashboard / UI.

Configurable Parameters:
  | spec.type: options: NodePort, ClusterIP, LoadBalancer
  | spec.ports: edit / add ports to be handled by the service

**Note: alertmanager1-deployment and alertmanager1-service are the same as
alertmanager-deployment and alertmanager-service respectively.**

CAdvisor
^^^^^^^^

File: cadvisor-daemonset.yaml
'''''''''''''''''''''''''''''
Path: ``monitoring/files/cadvisor/cadvisor-daemonset.yaml``

Task: To create a cadvisor daemonset.

Summary: A daemonset, used to scrape data of the Kubernetes cluster itself;
since it is a daemonset, an instance runs on every node.

Configurable Parameters:
  | spec.template.spec.ports: port of the container

File: cadvisor-service.yaml
'''''''''''''''''''''''''''
Path: ``monitoring/files/cadvisor/cadvisor-service.yaml``

Task: To create a cadvisor service.

Summary: A ClusterIP service for cadvisor to communicate with Prometheus.

Configurable Parameters:
  | spec.ports: add / edit ports

Collectd Exporter
^^^^^^^^^^^^^^^^^

File: collectd-exporter-deployment.yaml
'''''''''''''''''''''''''''''''''''''''
Path: ``monitoring/files/collectd-exporter/collectd-exporter-deployment.yaml``

Task: To create a collectd-exporter replica.

Summary: A deployment; acts as the receiver for collectd data sent by the
client machines, and Prometheus pulls data from this exporter.

Configurable Parameters:
  | spec.template.spec.ports: port of the container

File: collectd-exporter-service.yaml
''''''''''''''''''''''''''''''''''''
Path: ``monitoring/files/collectd-exporter/collectd-exporter-service.yaml``

Task: To create a collectd-exporter service.

Summary: A NodePort service for collectd-exporter to hold data for Prometheus
to scrape.

Configurable Parameters:
  | spec.ports: add / edit ports

Grafana
^^^^^^^

File: grafana-datasource-config.yaml
''''''''''''''''''''''''''''''''''''
Path: ``monitoring/files/grafana/grafana-datasource-config.yaml``

Task: To create the config file for Grafana.

Summary: A configmap; adds the Prometheus datasource in Grafana (a sketch
follows below).
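A hedged sketch of a Grafana datasource provisioning entry for this configmap
(the datasource URL is an assumption; any address reaching the
Main-Prometheus service would do):

.. code-block:: yaml

    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://main-prometheus:30902
        isDefault: true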
File: grafana-deployment.yaml
'''''''''''''''''''''''''''''
Path: ``monitoring/files/grafana/grafana-deployment.yaml``

Task: To create a Grafana deployment.

Summary: The Grafana deployment creates a single replica of Grafana,
with the preconfigured Prometheus datasource.

Configurable Parameters:
  | spec.template.spec.ports: edit ports
  | spec.template.spec.env: add / edit environment variables

File: grafana-pv.yaml
'''''''''''''''''''''
Path: ``monitoring/files/grafana/grafana-pv.yaml``

Task: To create a persistent volume for Grafana.

Summary: A persistent volume for Grafana.

Configurable Parameters:
  | spec.capacity.storage: increase / decrease size
  | spec.accessModes: to change the way the PV is accessed
  | spec.nfs.server: to change the IP address of the NFS server
  | spec.nfs.path: to change the path on the server

File: grafana-pvc.yaml
''''''''''''''''''''''
Path: ``monitoring/files/grafana/grafana-pvc.yaml``

Task: To create a persistent volume claim for Grafana.

Summary: A persistent volume claim for Grafana.

Configurable Parameters:
  | spec.resources.requests.storage: increase / decrease size

File: grafana-service.yaml
''''''''''''''''''''''''''
Path: ``monitoring/files/grafana/grafana-service.yaml``

Task: To create a service for Grafana.

Summary: A NodePort type of service that users actually connect to in order
to view the dashboard / UI.

Configurable Parameters:
  | spec.type: options: NodePort, ClusterIP, LoadBalancer
  | spec.ports: edit / add ports to be handled by the service

Kube State Metrics
^^^^^^^^^^^^^^^^^^

File: kube-state-metrics-deployment.yaml
''''''''''''''''''''''''''''''''''''''''
Path: ``monitoring/files/kube-state-metrics/kube-state-metrics-deployment.yaml``

Task: To create a kube-state-metrics instance.

Summary: A deployment, used to collect metrics of the Kubernetes cluster itself.

Configurable Parameters:
  | spec.template.spec.containers.ports: port of the container

File: kube-state-metrics-service.yaml
'''''''''''''''''''''''''''''''''''''
Path: ``monitoring/files/kube-state-metrics/kube-state-metrics-service.yaml``

Task: To create a kube-state-metrics service.

Summary: A service exposing kube-state-metrics for Prometheus to scrape.

Configurable Parameters:
  | spec.ports: add / edit ports

Node Exporter
^^^^^^^^^^^^^

File: node-exporter-daemonset.yaml
''''''''''''''''''''''''''''''''''
Path: ``monitoring/files/node-exporter/node-exporter-daemonset.yaml``

Task: To create a node-exporter daemonset.

Summary: A daemonset, used to scrape data of the host machines / nodes;
since it is a daemonset, an instance runs on every node.

Configurable Parameters:
  | spec.template.spec.ports: port of the container

File: node-exporter-service.yaml
''''''''''''''''''''''''''''''''
Path: ``monitoring/files/node-exporter/node-exporter-service.yaml``

Task: To create a node-exporter service.

Summary: A ClusterIP service for node-exporter to communicate with Prometheus.

Configurable Parameters:
  | spec.ports: add / edit ports

Prometheus
^^^^^^^^^^

File: prometheus-config.yaml
''''''''''''''''''''''''''''
Path: ``monitoring/files/prometheus/prometheus-config.yaml``

Task: To create the config file for Prometheus.

Summary: A configmap; adds the alert rules.

Configurable Parameters:
  | data.alert.rules: add / edit alert rules (sketched below)
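A hedged sketch of how a rule group sits under ``data.alert.rules`` in this
configmap (the rule shown is illustrative, not one of the shipped rules):

.. code-block:: yaml

    data:
      alert.rules: |
        groups:
        - name: targets
          rules:
          - alert: MonitorServiceDown
            expr: up == 0
            for: 30s
            labels:
              severity: critical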
File: prometheus-deployment.yaml
''''''''''''''''''''''''''''''''
Path: ``monitoring/files/prometheus/prometheus-deployment.yaml``

Task: To create a Prometheus deployment.

Summary: The Prometheus deployment creates a single replica of Prometheus,
with the preconfigured alert rules.

Configurable Parameters:
  | spec.template.spec.affinity: to change the node affinity,
    and make sure only 1 instance of Prometheus runs per node
  | spec.template.spec.ports: add / edit container port

File: prometheus-pv.yaml
''''''''''''''''''''''''
Path: ``monitoring/files/prometheus/prometheus-pv.yaml``

Task: To create a persistent volume for Prometheus.

Summary: A persistent volume for Prometheus.

Configurable Parameters:
  | spec.capacity.storage: increase / decrease size
  | spec.accessModes: to change the way the PV is accessed
  | spec.hostpath.path: to change the path of the volume

File: prometheus-pvc.yaml
'''''''''''''''''''''''''
Path: ``monitoring/files/prometheus/prometheus-pvc.yaml``

Task: To create a persistent volume claim for Prometheus.

Summary: A persistent volume claim for Prometheus.

Configurable Parameters:
  | spec.resources.requests.storage: increase / decrease size

File: prometheus-service.yaml
'''''''''''''''''''''''''''''
Path: ``monitoring/files/prometheus/prometheus-service.yaml``

Task: To create a service for Prometheus.

Summary: A NodePort type of service; the native Prometheus dashboard is
available here.

Configurable Parameters:
  | spec.type: options: NodePort, ClusterIP, LoadBalancer
  | spec.ports: edit / add ports to be handled by the service

File: main-prometheus-service.yaml
''''''''''''''''''''''''''''''''''
Path: ``monitoring/files/prometheus/main-prometheus-service.yaml``

Task: A service that connects both Prometheus instances.

Summary: A NodePort service for other services to connect to the Prometheus
cluster, as HA Prometheus needs two independent instances of Prometheus
scraping the same inputs with the same configuration.

**Note: prometheus1-deployment and prometheus1-service are the same as
prometheus-deployment and prometheus-service respectively.**

Ansible Client Roles
--------------------

Role: Collectd
~~~~~~~~~~~~~~

File: main.yml
^^^^^^^^^^^^^^
Path: ``collectd/tasks/main.yml``

Task: Install collectd along with its prerequisites.

Associated template file:

- collectd.conf.j2 (path: ``collectd/files/collectd.conf.j2``)

Summary: Edit this file to change the default configuration to be installed
on the client's machine; the network section is sketched below.
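A hedged sketch of the part of ``collectd.conf.j2`` that points collectd at
the collectd-exporter (the IP and UDP port match the values quoted in the
user guide, but verify them against your deployment)::

    # Send metrics to the collectd-exporter over UDP.
    LoadPlugin network
    <Plugin network>
      Server "10.10.120.211" "38026"
    </Plugin>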
diff --git a/docs/lma/metrics/images/dataflow.png b/docs/lma/metrics/images/dataflow.png
new file mode 100644
index 00000000..ca1ec908
--- /dev/null
+++ b/docs/lma/metrics/images/dataflow.png
Binary files differ
diff --git a/docs/lma/metrics/images/setup.png b/docs/lma/metrics/images/setup.png
new file mode 100644
index 00000000..ce6a1274
--- /dev/null
+++ b/docs/lma/metrics/images/setup.png
Binary files differ
diff --git a/docs/lma/metrics/userguide.rst b/docs/lma/metrics/userguide.rst
new file mode 100644
index 00000000..0ee4a238
--- /dev/null
+++ b/docs/lma/metrics/userguide.rst
@@ -0,0 +1,230 @@

=======
Metrics
=======

Table of Contents
=================
.. contents::
.. section-numbering::

Setup
=====

Prerequisites
-------------
- Requires 3 VMs to set up K8s
- ``$ sudo yum install ansible``
- ``$ pip install openshift pyyaml kubernetes`` (required for the Ansible K8s module)
- Update the IPs in all of these files (if changed)

  - ``ansible-server/group_vars/all.yml`` (IP of the apiserver and hostname)
  - ``ansible-server/hosts`` (IPs of the VMs to install on)
  - ``ansible-server/roles/monitoring/files/grafana/grafana-pv.yaml`` (IP of the NFS server)
  - ``ansible-server/roles/monitoring/files/alertmanager/alertmanager-config.yaml`` (IP of the alert receiver)

Setup Structure
---------------
.. image:: images/setup.png

Installation - Client Side
--------------------------

Nodes
`````
- **Node1** = 10.10.120.21
- **Node4** = 10.10.120.24

How is the installation done?
`````````````````````````````
An Ansible playbook is available in the ``tools/lma/ansible-client`` folder:

- ``cd tools/lma/ansible-client``
- ``ansible-playbook setup.yaml``

This deploys collectd and configures it to send data to the collectd-exporter
at 10.10.120.211 (the IP address of the current collectd-exporter instance).
Please make the appropriate changes in the config file present in
``tools/lma/ansible-client/roles/collectd/files/``.

Installation - Server Side
--------------------------

Nodes
`````
Inside Jumphost - POD12:
  - **VM1** = 10.10.120.211
  - **VM2** = 10.10.120.203
  - **VM3** = 10.10.120.204

How is the installation done?
`````````````````````````````
**Using Ansible:**
  - **K8s**
  - **Prometheus:** 2 independent deployments
  - **Alertmanager:** 2 independent deployments (cluster peers)
  - **Grafana:** 1-replica deployment
  - **cAdvisor:** 1 daemonset, i.e. 3 replicas, one on each node
  - **collectd-exporter:** 1 replica
  - **node-exporter:** 1 daemonset, i.e. 3 replicas, one on each node
  - **kube-state-metrics:** 1 deployment
  - **NFS server:** on each VM, storing Grafana data at the following path:
    - ``/usr/share/monitoring_data/grafana``

How to set up?
``````````````
- **To set up the K8s cluster, EFK and PAG:** run the Ansible playbook ``ansible/playbooks/setup.yaml``
- **To clean everything:** run the Ansible playbook ``ansible/playbooks/clean.yaml``

Do we have HA?
``````````````
Yes

Configuration
=============

K8s
---
Path to all YAMLs (server side)
```````````````````````````````
``tools/lma/ansible-server/roles/monitoring/files/``

K8s namespace
`````````````
``monitoring``

Services and Ports
------------------

Services and their ports are listed below; one can go to the IP of any node
on the following ports, and the service will correctly redirect you:

  ====================== =======
  Service                Port
  ====================== =======
  Prometheus             30900
  Prometheus1            30901
  Main-Prometheus        30902
  Alertmanager           30930
  Alertmanager1          30931
  Grafana                30000
  Collectd-exporter      30130
  ====================== =======

How to change the configuration?
--------------------------------
- Ports, container names, and pretty much every other setting can be modified by changing the required values in the respective YAML files (``/tools/lma/ansible-server/roles/monitoring/``)
- For metrics, on the client's machine, edit collectd's configuration (Jinja2 template) file and add the required plugins (``/tools/lma/ansible-client/roles/collectd/files/collectd.conf.j2``).
  For more details refer to the `collectd first steps guide <https://collectd.org/wiki/index.php/First_steps>`_
Where are metrics sent?
-----------------------

Metrics are sent to the collectd-exporter. UDP packets are sent to port 38026
(this can be configured and checked in
``tools/lma/ansible-server/roles/monitoring/files/collectd-exporter/collectd-exporter-deployment.yaml``).

Data Management
===============

Data flow
---------
.. image:: images/dataflow.png

Where is the data stored now?
-----------------------------
  - Grafana data (including dashboards) ==> on the master, at ``/usr/share/monitoring_data/grafana`` (accessed through a persistent volume via NFS)
  - Prometheus data ==> on VM2 and VM3, at ``/usr/share/monitoring_data/prometheus``

  **Note: the two Prometheus instances' data are also independent of each
  other; a shared-data solution gave errors.**

Do we have a backup of the data?
--------------------------------
  The Prometheus instances, even though independent, scrape the same targets
  and have the same alert rules, so they generate very similar data.

  Grafana's NFS-stored data has no backup; the dashboards' JSON files are
  available in the ``/tools/lma/metrics/dashboards`` directory.

Is the data still accessible when containers are restarted?
-----------------------------------------------------------
  Yes, unless the data directories ``(/usr/share/monitoring_data/*)`` are
  deleted from each node.

Alert Management
================

Configure the alert receiver
----------------------------
- Go to the file ``/tools/lma/ansible-server/roles/monitoring/files/alertmanager/alertmanager-config.yaml``
- Under the ``config.yml`` section, under ``receivers``, add, update or delete receivers
- Currently the IP of the unified alert receiver is used
- Alertmanager supports multiple types of receivers; you can get a `list here <https://prometheus.io/docs/alerting/latest/configuration/>`_

Add new alerts
--------------
- Go to the file ``/tools/lma/ansible-server/roles/monitoring/files/prometheus/prometheus-config.yaml``
- Under the data section, the alert.rules file is mounted via the configmap
- In this file, alerts are divided into 4 groups, namely:

  - targets
  - host and hardware
  - container
  - kubernetes

- Add alerts under an existing group or add a new group; please follow the structure of the file when adding a new group
- To add a new alert, use the following structure (a worked example follows below):

  | alert: alertname
  | expr: alert rule (generally a PromQL conditional query)
  | for: time range (e.g. 5m, 10s, etc.; the amount of time the condition needs to be true for the alert to be triggered)
  | labels:
  |   severity: critical (other severity options and other labels can be added here)
  |   type: hardware
  | annotations:
  |   summary: <summary of the alert>
  |   description: <describe the alert here>

- For an exhaustive list of alerts you can have a look `here <https://awesome-prometheus-alerts.grep.to/>`_
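A worked example following the structure above (the expression and threshold
are illustrative only; ``node_cpu_seconds_total`` is exposed by node-exporter):

.. code-block:: yaml

    - alert: HostHighCpuLoad
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m
      labels:
        severity: warning
        type: hardware
      annotations:
        summary: "High CPU load on {{ $labels.instance }}"
        description: "CPU load has been above 80% for more than 5 minutes."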
Troubleshooting
===============

No metrics received in the Grafana plot
---------------------------------------
- Check that all configurations are correctly done.
- Go to Main-Prometheus's port on any of the VMs' IPs and check whether Prometheus is getting the metrics.
- If Prometheus is getting them, read Grafana's logs (``kubectl -n monitoring logs <name_of_grafana_pod>``).
- Else, have a look at the collectd-exporter's metrics endpoint (e.g. 10.10.120.211:30130/metrics).
- If collectd-exporter is getting them, check in Prometheus's config file whether collectd-exporter's IP is correct there.
- Else, ssh to the master and check on which node collectd-exporter is scheduled (let's say VM2).
- Now ssh to VM2.
- Use ``tcpdump -i ens3 > testdump`` (ens3 being the interface used to connect to the Internet).
- Grep for your client node's IP and check whether packets are reaching the monitoring cluster (``cat testdump | grep <ip of client>``).
- Ideally you should see packets reaching the node; if so, check whether the collectd-exporter is running correctly, and check its logs.
- If no packets are received, the error is on the client side: check collectd's config file and make sure the correct collectd-exporter IP is used in the ``<network>`` section.

If no notification is received
------------------------------
- Go to Main-Prometheus's port on any of the VMs' IPs (e.g. 10.10.120.211:30902) and check whether Prometheus is getting the metrics.
- If not, read the "No metrics received in the Grafana plot" section; else read ahead.
- Check the IP of the alert receiver: go to the Alertmanager IP:port and check whether Alertmanager is configured correctly.
- If yes, paste the alert rule into Prometheus's query box and see whether any metric satisfies the condition.
- You may need to change the alert rules in the alert.rules section of prometheus-config.yaml if there was a bug in the alert's rule (please read the "Add new alerts" section for detailed instructions).

Reference
=========
- `Prometheus K8S deployment <https://www.metricfire.com/blog/how-to-deploy-prometheus-on-kubernetes/>`_
- `HA Prometheus <https://prometheus.io/docs/introduction/faq/#can-prometheus-be-made-highly-available>`_
- `Data Flow Diagram <https://drive.google.com/file/d/1D--LXFqU_H-fqpD57H3lJFOqcqWHoF0U/view?usp=sharing>`_
- `Collectd Configuration <https://docs.opnfv.org/en/stable-fraser/submodules/barometer/docs/release/userguide/docker.userguide.html#build-the-collectd-docker-image>`_
- `Alertmanager Rule Config <https://awesome-prometheus-alerts.grep.to/>`_