Data visualization for newbies – Part 1
This is going to be the first in a collection of blogs by AICoreSpot devoted to the varying data visualization strategies leveraged in several fields of machine learning. Data visualization is a vital step in developing a capable and effective machine learning model. It assists us in understanding the data better, producing improved insights, supporting feature engineering, and, last but not least, making better decisions over the course of modelling and training.
In this blog, we will leverage the seaborn and matplotlib libraries to produce the visualizations. Matplotlib is a MATLAB-like plotting system in Python, while seaborn is a Python visualization library built on top of matplotlib that furnishes a high-level interface for generating statistical graphics. We will look into differing statistical graphical strategies that can assist us in effectively interpreting and comprehending the data. Although all the plots produced with the seaborn library can also be developed with matplotlib alone, we typically prefer seaborn due to its capability to work directly with DataFrames.
We will begin by importing the two libraries. Here is the guide to setting up the matplotlib and seaborn libraries. Observe that we’ll be leveraging matplotlib and seaborn interchangeably depending on the plot.
### Importing the necessary libraries
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Simple Plot
Let’s start by plotting a simple line plot, which is leveraged to plot a mathematical function. A line plot shows the relationship between two variables; to draw one, we can simply call the plot function.
### Creating some sample data with a sinusoidal relationship.
x = np.linspace(0, 4 * np.pi, 200)
y = np.sin(x)
### Creating a figure to plot the graph.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel('X data')
ax.set_ylabel('Y data')
ax.set_title('Relationship between variables X and Y')
plt.show() # display the graph
### if %matplotlib inline has been invoked already, then plt.show() is automatically invoked and the plot is displayed in the same window.
Here, we can observe that the variables ‘x’ and ‘y’ possess a sinusoidal relationship. Typically, the plot() function is leveraged to identify any mathematical relationship between the variables.
Histogram
A histogram is one of the most commonly leveraged data visualization strategies in machine learning. It indicates the distribution of a continuous variable over a given interval. Histograms plot the information by demarcating it into intervals referred to as ‘bins’. They are leveraged to inspect the underlying frequency distribution (e.g., a normal distribution), outliers, skewness, etc.
Let’s generate some data ‘x’ and analyse its distribution and other connected features.
### Let ‘x’ be the data with 1000 random points.
x = np.random.randn(1000)
Let’s plot a histogram to analyse the distribution of ‘x’.
plt.hist(x)
plt.xlabel('Intervals')
plt.ylabel('Frequency')
plt.title('Distribution of the variable x')
plt.show()
The above plot illustrates a normal distribution, i.e., the variable ‘x’ follows a normal distribution. We can also infer that this particular sample is slightly negatively skewed. We typically tune the ‘bins’ parameter to generate a distribution with smooth boundaries. For instance, if we set the number of bins too low, say bins = 5, then a majority of the values accumulate in the same intervals, and as an outcome they generate a distribution that is difficult to interpret.
plt.hist(x, bins=5)
plt.xlabel('Intervals')
plt.ylabel('Frequency')
plt.title('Distribution of the variable x')
plt.show()
Likewise, if we increase the number of bins to a very large value, say bins = 1000, nearly every value functions as a separate bin, and as an outcome the distribution appears too arbitrary.
plt.hist(x, bins=1000)
plt.xlabel('Intervals')
plt.ylabel('Frequency')
plt.title('Distribution of the variable x')
plt.show()
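If you would rather not hand-tune this parameter, matplotlib can also pick a bin count automatically; the sketch below (an addition for illustration) leverages the built-in ‘auto’ strategy, which delegates to NumPy’s bin-edge heuristics.
### Letting matplotlib/NumPy choose a reasonable number of bins.
plt.hist(x, bins='auto')
plt.xlabel('Intervals')
plt.ylabel('Frequency')
plt.title('Distribution of the variable x (automatic binning)')
plt.show()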
Kernel Density Estimation
Prior to diving into KDE, let’s understand what parametric and non-parametric data are.
Parametric data: when we assume that the data is drawn from a specific distribution, so some variant of a parametric test can be applied to it.
Non-parametric data: when we possess no awareness of the population or the underlying distribution.
Kernel Density Estimation is a non-parametric method of representing the probability density function of a random variable. It is leveraged when a parametric distribution doesn’t make much sense for the data and you wish to avoid making assumptions about it.
The kernel density estimator is the estimated PDF of a random variable. For n samples $x_1, \dots, x_n$, it is defined as

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where $K$ is the kernel (a non-negative function that integrates to 1) and $h > 0$ is the bandwidth, a smoothing parameter.
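To make the formula concrete, here is a minimal sketch (an addition for illustration, not part of the original post) that evaluates the estimator directly with a Gaussian kernel; the bandwidth h = 0.3 and the evaluation grid are arbitrary choices.
### A direct implementation of the estimator above with a Gaussian kernel.
def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde_estimate(grid, samples, h=0.3):
    n = len(samples)
    ### Average the kernel contributions of every sample at each grid point.
    return np.array([gaussian_kernel((g - samples) / h).sum() / (n * h) for g in grid])

samples = np.random.randn(200)
grid = np.linspace(-4, 4, 100)
plt.plot(grid, kde_estimate(grid, samples))
plt.show()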
Like a histogram, a KDE plot shows the distribution of observations along one axis, with the estimated density as the height along the other axis.
### We will use the seaborn library to plot KDE.
### Let’s assume random data stored in variable ‘x’.
fig, ax = plt.subplots()
### Generating random data
x = np.random.rand(200)
sns.kdeplot(x, shade=True, ax=ax)
plt.show()
Distplot brings together the functions of the histogram and the KDE plot in one figure.
### Generating a random sample
x = np.random.random_sample(1000)
### Plotting the distplot
sns.distplot(x, bins=20)
Therefore, the distplot function plots both the histogram and the KDE for the sample data in the same figure. You can tune the parameters of distplot to show only the histogram, only the KDE, or both. Distplot is handy when you want to visualize how close your assumption about the distribution of the data is to the actual distribution.
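For instance, a minimal sketch of that parameter tuning, leveraging distplot’s hist and kde flags:
### Showing only the histogram by switching off the KDE curve.
sns.distplot(x, bins=20, kde=False)
plt.show()
### Showing only the KDE curve by switching off the histogram.
sns.distplot(x, hist=False)
plt.show()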
Scatter Plot
Scatter plots are leveraged to determine the relationship between two variables. They illustrate how much one variable is influenced by another. It is one of the most typically leveraged data visualization strategies and assists in drawing good insights when contrasting two variables. The relationship between two variables is referred to as correlation. If the data points fit a line or a curve with a positive slope, then the two variables are said to display positive correlation. If the line or curve has a negative slope, then the variables are said to be negatively correlated.
A perfect positive correlation corresponds to a value of 1 and a perfect negative correlation corresponds to a value of -1. The nearer the value is to 1 or -1, the stronger the relationship between the variables. The nearer the value is to 0, the weaker the correlation.
For our example, let’s define three variables: ‘x’, ‘y’, and ‘z’, where ‘x’ and ‘z’ are randomly generated data and y is defined as y = x(z + x).
We will leverage a scatter plot to identify the relationship between the variables ‘x’ and ‘y’.
### Let’s define the variables we want to find the relationship between.
x = np.random.rand(500)
z = np.random.rand(500)
### Defining the variable ‘y’
y = x * (z + x)
fig, ax = plt.subplots()
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title('Scatter plot between X and Y')
plt.scatter(x, y, marker='.')
plt.show()
From the image above, we can observe that the data points lie very near to one another, and that if we fit a curve through the points, it will possess a positive slope. Therefore, we can infer that there is a strong positive correlation between the values of the variable ‘x’ and the variable ‘y’.
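We can also verify this numerically; the snippet below is an addition for illustration, leveraging NumPy’s corrcoef, which returns the matrix of Pearson correlation coefficients.
### The off-diagonal entry is the Pearson correlation between x and y;
### a value near 1 confirms a strong positive correlation.
print(np.corrcoef(x, y)[0, 1])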
Additionally, we can observe that the curve that best fits the graph is quadratic in nature, and we can confirm this from the definition of the variable ‘y’.
Joint Plot
Jointplot is specific to the seaborn library and can be leveraged to quickly visualize and analyse the relationship between two variables along with their individual distributions on the same plot.
Let’s begin by leveraging jointplot to produce the scatter plot.
### Defining the data.
mean, covar = [0, 1], [[1, 0], [0, 50]]
### Drawing random samples from a multivariate normal distribution.
### Two random variables are created, each containing 500 values, with the given mean and covariance.
data = np.random.multivariate_normal(mean, covar, 500)
### Storing the variables in a dataframe.
df = pd.DataFrame(data=data, columns=['X', 'Y'])
Next, we can leverage the jointplot to identify the best line or curve that fits the plot.
sns.jointplot(df.X, df.Y, kind='reg')
plt.show()
Aside from this, jointplot can also be leveraged to plot a ‘kde’, a ‘hex plot’, or a ‘residual plot’.
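For instance, a quick sketch of two of the other kinds (the kind values 'kde' and 'hex' are seaborn’s own options for jointplot):
### A two-dimensional KDE of the same pair of variables.
sns.jointplot(df.X, df.Y, kind='kde')
plt.show()
### A hexagonal-bin plot, handy for larger samples.
sns.jointplot(df.X, df.Y, kind='hex')
plt.show()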
PairPlot
We can leverage a scatter plot to plot the relationship between two variables. But what if the dataset possesses more than two variables, which is typically the scenario? It can be a monotonous activity to visualize the relationship of every variable with every other variable.
The seaborn pairplot function does this for us in merely one line of code. It is leveraged to plot several pairwise bivariate (two-variable) distributions in a dataset. It develops a matrix and plots the relationship for every pair of columns. It also draws a univariate distribution for every variable on the diagonal axes.
### Loading a dataset from the sklearn toy datasets
from sklearn.datasets import load_linnerud
### Loading the data
linnerud_data = load_linnerud()
### Extracting the column data
data = linnerud_data.data
Sklearn returns the data in the form of a NumPy array and not a DataFrame, so we record the information in a DataFrame.
### Creating a dataframe
data = pd.DataFrame(data=data, columns=linnerud_data.feature_names)
### Plotting a pairplot
sns.pairplot(data=data)
plt.show()
So, in the graph above, we can view the relationship of each variable with every other variable and therefore infer which variables possess the most correlation.
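To back up that visual reading with numbers, the snippet below (an addition for illustration) leverages pandas’ built-in corr method to print the pairwise Pearson correlations for the same DataFrame.
### Pairwise Pearson correlation matrix for the columns plotted above.
print(data.corr())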
Conclusion
Visualizations play a critical part in data analysis and exploration. In this blog post, we looked at differing varieties of plots leveraged in the data analysis of continuous variables.