author Sridhar K. N. Rao <srao@linuxfoundation.org> 2023-01-02 11:04:44 +0530
committer Sridhar K. N. Rao <srao@linuxfoundation.org> 2023-01-02 11:10:35 +0530
commit 7e53707db0a93c21a64f00ac8308a342c29e5a7b (patch)
tree 2a2f279a6178f580231b8801871650c872ed6d55
parent 9e9df400ba7f9259a38484d232fe11e08edb4da4 (diff)
Tools: Time Series Analysis
This patch adds time series analysis.

Signed-off-by: Sridhar K. N. Rao <srao@linuxfoundation.org>
Change-Id: I162ab447167aa21c19b9056883748a42328db43f
-rw-r--r-- tools/timeseriesanalysis/TimeSeriesAnalysis.ipynb | 825
1 file changed, 825 insertions(+), 0 deletions(-)
diff --git a/tools/timeseriesanalysis/TimeSeriesAnalysis.ipynb b/tools/timeseriesanalysis/TimeSeriesAnalysis.ipynb
new file mode 100644
index 0000000..81d960b
--- /dev/null
+++ b/tools/timeseriesanalysis/TimeSeriesAnalysis.ipynb
@@ -0,0 +1,825 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "10f11159",
+ "metadata": {},
+ "source": [
+ "Contributors: Sridhar K. N. Rao\n",
+ "\n",
+ "Copyright 2022-23 The Linux Foundation\n",
+ "\n",
+ "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
+ "\n",
+ "http://www.apache.org/licenses/LICENSE-2.0\n",
+ "\n",
+ "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "02ef34c2",
+ "metadata": {},
+ "source": [
+ "# Time-Series Analysis\n",
+ "\n",
+ "This Notebook includes time-series analysis of:\n",
+ "\n",
+ "1. CPU\n",
+ "2. Memory\n",
+ "3. System Load\n",
+ "\n",
+ "The analysis includes both individual and comparative analysis "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6f524f5e",
+ "metadata": {},
+ "source": [
+ "## Data Preparation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "88b1a616",
+ "metadata": {},
+ "source": [
+ "### CPU"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7f4b9f76",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import pandas as pd\n",
+ "\n",
+ "# If there are separate cpu-<n> files for each core, then we will have combine them.\n",
+ "# Here combining is done by taking the average value.\n",
+ "\n",
+ "paths_one = []\n",
+ "for path, currentDirectory, files in os.walk(\"/data/pod18-node4/\"):\n",
+ " for file in files:\n",
+ " if file.startswith(\"percent-idle\"):\n",
+ " #print(file)\n",
+ " paths_one.append(os.path.join(path, file))\n",
+ "\n",
+ "dfs_one = (pd.read_csv(f, index_col=False) for f in paths_one)\n",
+ "data_one = pd.concat(dfs_one).groupby(level=0).mean()\n",
+ "\n",
+ "# if the data is already in a single file\n",
+ "data_two = pd.read_csv('/data/node5/percent-idle.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dc4f81f7",
+ "metadata": {},
+ "source": [
+ "### Memory"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b89e54f2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import pandas as pd\n",
+ "data_one = pd.read_csv('/data/pod18-node4/memory/memory-used-2022-06-19', index_col=False)\n",
+ "data_two = pd.read_csv('/data/pod18-node5/memory/memory-used.csv', index_col=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "be90260a",
+ "metadata": {},
+ "source": [
+ "### System Load"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "334a1799",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import pandas as pd\n",
+ "data_one = pd.read_csv('/data/pod18-node4/load/load-2022-06-19', index_col=False)\n",
+ "data_two = pd.read_csv('/data/pod18-node5/load/load.csv', index_col=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9fc3fac7",
+ "metadata": {},
+ "source": [
+ "### Convert Epoch, Set Index make the size of two dataset equal"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "775d6e5a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_one['epoch'] = pd.to_datetime(data_one['epoch'],unit='s')\n",
+ "data_one.set_index('epoch', inplace=True)\n",
+ "data_two['epoch'] = pd.to_datetime(data_two['epoch'],unit='s')\n",
+ "data_two.set_index('epoch', inplace=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ad5fd126",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "if data_one.shape[0] > data_two.shape[0]:\n",
+ " diff = data_one.shape[0] - data_two.shape[0]\n",
+ " data_one = data_one.drop(data_one.index[(data_one.shape[0] - diff):])\n",
+ "else:\n",
+ " diff2 = data_two.shape[0] - data_one.shape[0]\n",
+ " data_two = data_two.drop(data_two.index[(data_two.shape[0] - diff2):])\n",
+ "\n",
+ "print(data_one.shape[0])\n",
+ "print(data_two.shape[0])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8857cb43",
+ "metadata": {},
+ "source": [
+ "## Autocorrelation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "03624971",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from pandas.plotting import autocorrelation_plot\n",
+ "from statsmodels.tsa.arima.model import ARIMA"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dbb2c76b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "autocorrelation_plot(data_one)\n",
+ "pyplot.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "99fb7a81",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "autocorrelation_plot(data_two)\n",
+ "pyplot.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fadd1df4",
+ "metadata": {},
+ "source": [
+ "## ARIMA"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "95a925f5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "c1_copy = data_one.copy(deep=True)\n",
+ "c2_copy = data_two.copy(deep=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f7b197a9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "c1_copy.index = c1_copy.index.to_period('s')\n",
+ "c1_copy.index"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d1eed816",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#concat_two.index = concat_two.index.to_timestamp(freq='s')\n",
+ "c2_copy.index = c2_copy.index.to_period('s')\n",
+ "c2_copy.index"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "95124eb2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model = ARIMA(c2_copy, order=(1,1,0))\n",
+ "model_fit = model.fit()\n",
+ "print(model_fit.summary())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bd8314cf",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model = ARIMA(c1_copy, order=(1,1,0))\n",
+ "model_fit = model.fit()\n",
+ "print(model_fit.summary())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b5771d3",
+ "metadata": {},
+ "source": [
+ "## Histograms"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ef9daa03",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_two.hist(column=\"value\", bins=20)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c4d30477",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_one.hist(column=\"value\", bins=20)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2b15015e",
+ "metadata": {},
+ "source": [
+ "### Load"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cb0adbc7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_one.hist(column=\"longterm\", bins=20)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e19191b6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_two.hist(column=\"longterm\", bins=20)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "700d9d01",
+ "metadata": {},
+ "source": [
+ "# Compute Probabilities"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2b906c19",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import seaborn as sns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "740fa874",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_one['value'] = data_one['value'].round(decimals=0)\n",
+ "probabilities_one = data_one['value'].value_counts(normalize=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4fecd994",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_two['value'] = data_two['value'].round(decimals=0)\n",
+ "probabilities_two = data_two['value'].value_counts(normalize=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "916cb16d",
+ "metadata": {},
+ "source": [
+ "### Load"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a9ab7583",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_one['longterm'] = data_one['longterm'].round(decimals=0)\n",
+ "probabilities_one = data_one['longterm'].value_counts(normalize=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dfe0f350",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_two['longterm'] = data_two['longterm'].round(decimals=0)\n",
+ "probabilities_two = data_two['longterm'].value_counts(normalize=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4887cdcc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sns.barplot(x=probabilities_one.index, y=probabilities_one.values, color='blue')\n",
+ "sns.barplot(x=probabilities_two.index, y=probabilities_two.values, color='blue')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4fcfe0f0",
+ "metadata": {},
+ "source": [
+ "# Comparative Analysis"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ae4a8fe7",
+ "metadata": {},
+ "source": [
+ "## Dynamic Time Warping"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b7765d1c",
+ "metadata": {},
+ "source": [
+ "## Interpretation\n",
+ "Dynamic time warping is a seminal time series comparison technique.The objective of time series comparison methods is to produce a distance metric between two input time series. The similarity or dissimilarity of two-time series is typically calculated by converting the data into vectors and calculating the Euclidean distance between those points in vector space."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d2a9001d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np \n",
+ "from scipy.spatial.distance import euclidean\n",
+ "from fastdtw import fastdtw\n",
+ "x = data_one.to_numpy()\n",
+ "y = data_two.to_numpy()\n",
+ "distance, path = fastdtw(x,y, dist=euclidean)\n",
+ "print(distance)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2beb0134",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from scipy.spatial.distance import chebyshev, cityblock\n",
+ "values_one = data_one[['value']].to_numpy()\n",
+ "values_two = data_two[['value']].to_numpy()\n",
+ "dist, path = fastdtw(values_one, values_two, dist=cityblock)\n",
+ "print(dist)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "91943e69",
+ "metadata": {},
+ "source": [
+ "### Load"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e79422c2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from scipy.spatial.distance import chebyshev, cityblock\n",
+ "values_one = data_one[['longterm']].to_numpy()\n",
+ "values_two = data_two[['longterm']].to_numpy()\n",
+ "dist, path = fastdtw(values_one, values_two, dist=cityblock)\n",
+ "print(dist)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1a6cccb1",
+ "metadata": {},
+ "source": [
+ "# Wasserstein Distance"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2fb0612c",
+ "metadata": {},
+ "source": [
+ "## Interpretation\n",
+ "Wasserstein distance provide a meaningful and smooth representation of the distance between distributions. They measure the minimal effort required to reconfigure the probability mass of one distribution in order to recover the other distribution.\n",
+ "Expectation: Less than 2."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c5e9c598",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from scipy.stats import wasserstein_distance\n",
+ "wd = wasserstein_distance (data_one['value'], data_two['value'])\n",
+ "wd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b9991c19",
+ "metadata": {},
+ "source": [
+ "### Load"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1270dbc7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from scipy.stats import wasserstein_distance\n",
+ "wd = wasserstein_distance (data_one['longterm'], data_two['longterm'])\n",
+ "wd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c2fff2b3",
+ "metadata": {},
+ "source": [
+ "# Maximum Mean Discrepancy"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3b533aed",
+ "metadata": {},
+ "source": [
+ "## Reference\n",
+ "https://www.kaggle.com/code/onurtunali/maximum-mean-discrepancy/notebook"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2539bbc4",
+ "metadata": {},
+ "source": [
+ "## Interpretation\n",
+ "MMD is defined by the idea of representing distances between distributions as distances between mean embeddings of features. Two distributions are similar if their moments are similar. By applying a kernel, we can transform the variable such that all moments (first, second, third etc.) are computed. In the latent space we can compute the difference between the moments and average it. This gives a measure of the similarity/dissimilarity between the datasets."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b58e4b2c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import torch\n",
+ "\n",
+ "def MMD(x, y, kernel):\n",
+ " \"\"\"Emprical maximum mean discrepancy. The lower the result\n",
+ " the more evidence that distributions are the same.\n",
+ "\n",
+ " Args:\n",
+ " x: first sample, distribution P\n",
+ " y: second sample, distribution Q\n",
+ " kernel: kernel type such as \"multiscale\" or \"rbf\"\n",
+ " \"\"\"\n",
+ " xx, yy, zz = torch.mm(x, x.t()), torch.mm(y, y.t()), torch.mm(x, y.t())\n",
+ " rx = (xx.diag().unsqueeze(0).expand_as(xx))\n",
+ " ry = (yy.diag().unsqueeze(0).expand_as(yy))\n",
+ " \n",
+ " dxx = rx.t() + rx - 2. * xx # Used for A in (1)\n",
+ " dyy = ry.t() + ry - 2. * yy # Used for B in (1)\n",
+ " dxy = rx.t() + ry - 2. * zz # Used for C in (1)\n",
+ " \n",
+ " XX, YY, XY = (torch.zeros(xx.shape).to(device),\n",
+ " torch.zeros(xx.shape).to(device),\n",
+ " torch.zeros(xx.shape).to(device))\n",
+ " \n",
+ " if kernel == \"multiscale\":\n",
+ " \n",
+ " bandwidth_range = [0.2, 0.5, 0.9, 1.3]\n",
+ " for a in bandwidth_range:\n",
+ " XX += a**2 * (a**2 + dxx)**-1\n",
+ " YY += a**2 * (a**2 + dyy)**-1\n",
+ " XY += a**2 * (a**2 + dxy)**-1\n",
+ " \n",
+ " if kernel == \"rbf\":\n",
+ " \n",
+ " bandwidth_range = [10, 15, 20, 50]\n",
+ " for a in bandwidth_range:\n",
+ " XX += torch.exp(-0.5*dxx/a)\n",
+ " YY += torch.exp(-0.5*dyy/a)\n",
+ " XY += torch.exp(-0.5*dxy/a)\n",
+ " \n",
+ " \n",
+ "\n",
+ " return torch.mean(XX + YY - 2. * XY)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d9af1425",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "from scipy.stats import multivariate_normal\n",
+ "from scipy.stats import dirichlet \n",
+ "from torch.distributions.multivariate_normal import MultivariateNormal\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c5bd9f09",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "device = \"cpu\"\n",
+ "\n",
+ "m = 20 # sample size\n",
+ "x_mean = torch.zeros(2)+1\n",
+ "y_mean = torch.zeros(2)\n",
+ "x_cov = 2*torch.eye(2) # IMPORTANT: Covariance matrices must be positive definite\n",
+ "y_cov = 3*torch.eye(2) - 1\n",
+ "\n",
+ "px = MultivariateNormal(x_mean, x_cov)\n",
+ "qy = MultivariateNormal(y_mean, y_cov)\n",
+ "#x = px.sample([m]).to(device)\n",
+ "#y = qy.sample([m]).to(device)\n",
+ "\n",
+ "x = torch.from_numpy(data_one.values).float().to(device)\n",
+ "y = torch.from_numpy(data_two.values).float().to(device)\n",
+ "print(x.t())\n",
+ "print(y.t())\n",
+ "print(type(x))\n",
+ "result = MMD(x, y, kernel=\"multiscale\")\n",
+ "\n",
+ "print(f\"MMD result of X and Y is {result.item()}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "70d7b830",
+ "metadata": {},
+ "source": [
+ "### Load"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "52827106",
+ "metadata": {},
+ "source": [
+ "May have to change the device value and then test."
+ ]
+ },
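+ {
+ "cell_type": "markdown",
+ "id": "load-mmd-sketch-note",
+ "metadata": {},
+ "source": [
+ "A minimal sketch for the load case, assuming the load data-preparation cells above were run (so `data_one` and `data_two` carry a `longterm` column) and reusing the `MMD` function defined earlier; the column choice and the `device` setting are assumptions, so adjust them to the environment."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "load-mmd-sketch-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch (assumption): MMD restricted to the long-term load columns.\n",
+ "device = \"cpu\" # change to \"cuda\" if a GPU is available\n",
+ "x_load = torch.from_numpy(data_one[['longterm']].values).float().to(device)\n",
+ "y_load = torch.from_numpy(data_two[['longterm']].values).float().to(device)\n",
+ "result_load = MMD(x_load, y_load, kernel=\"multiscale\")\n",
+ "print(f\"MMD result for long-term load is {result_load.item()}\")"
+ ]
+ },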
+ {
+ "cell_type": "markdown",
+ "id": "0d6ea3ae",
+ "metadata": {},
+ "source": [
+ "# Root mean square difference"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "24a16d7c",
+ "metadata": {},
+ "source": [
+ "## Interpretation\n",
+ "The closer RMSE is to 0, the more similar the generated data is to the real-data. But RMSE is returned on the same scale as the target we are emulating for and therefore there isn’t a general rule for how to interpret ranges of values.\n",
+ "For CPU usage, If it is more than 30, then we may have to relook at it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "de532612",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import mean_squared_error\n",
+ "rmse = mean_squared_error(data_one['value'], data_two['value'], squared = False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "377e78e9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(rmse)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8ec2a78d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "rmse2 = mean_squared_error(data_one['value'], data_two['value'], squared = False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1430a84e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(rmse2)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3a8a91b1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "rmse3 = mean_squared_error(data_one.sort_values(by='value')['value'], data_two.sort_values(by='value')['value'], squared = False)\n",
+ "print(rmse3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c59358b1",
+ "metadata": {},
+ "source": [
+ "# Mutual Information"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "82a7aab2",
+ "metadata": {},
+ "source": [
+ "## Interpretation\n",
+ "Mutual information describes relationships in terms of uncertainty. The mutual information (MI) between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other. High mutual information indicates a large reduction in uncertainty; low mutual information indicates a small reduction; and zero mutual information between two random variables means the variables are independent. The least possible mutual information between quantities is 0.0. When MI is zero, the quantities are independent: neither can tell you anything about the other. Conversely, in theory there's no upper bound to what MI can be. In practice though values above 2.0 or so are uncommon. (Mutual information is a logarithmic quantity, so it increases very slowly.)\n",
+ "\n",
+ "Should be 1.5+"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "20e1bf93",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn.feature_selection import mutual_info_regression\n",
+ "\n",
+ "def make_mi_scores(X, y, discrete_features):\n",
+ " mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)\n",
+ " mi_scores = pd.Series(mi_scores, name=\"MI Scores\", index=X.columns)\n",
+ " mi_scores = mi_scores.sort_values(ascending=False)\n",
+ " return mi_scores"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "030f953f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "discrete_features = data_one.dtypes == int"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b13c38bc",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "mi_scores = make_mi_scores(data_one, data_two['value'], discrete_features)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7b5ca42b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(mi_scores)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}