Principal Component Analysis for Visualization

Principal component analysis (PCA) is an unsupervised machine learning technique. Perhaps its most popular use is dimensionality reduction. Besides using PCA as a data preparation technique, we can also use it to help visualize data. A picture is worth a thousand words, as the saying goes. With the data visualized, it is easier to gain insight and decide on the next step in our machine learning models.

In this guide, you will discover how to visualize data using PCA, and how to use visualization to help determine the parameter for dimensionality reduction.

After completing this guide, you will know:

• How to visualize high-dimensional data
• What explained variance is in PCA
• How to visually observe the explained variance from the result of PCA on high-dimensional data

Tutorial Overview

This guide is divided into two parts:

• Scatter plot of high dimensional data
• Visualizing the explained variance

Prerequisites

For this guide, we assume that you are already familiar with:

• Calculating Principal Component Analysis (PCA) from Scratch in Python
• Principal Component Analysis for Dimensionality Reduction in Python

Scatter plot of high dimensional data

Visualization is a crucial step in gaining insight from data. From a visualization we can learn whether a pattern can be observed, and hence estimate which machine learning model is suitable.

It is easy to depict things in two dimensions. A scatter plot with x- and y-axes is normally two-dimensional. Depicting things in three dimensions is more challenging but not impossible. Matplotlib, for instance, can plot in 3D. The only problem is that on paper or on screen we can look at a 3D plot from only one viewpoint or projection at a time. In matplotlib, this is controlled by the elevation and azimuth angles. Depicting things in four or five dimensions is impossible because we live in a three-dimensional world and have no idea what things in such high dimensions would look like.

This is where a dimensionality reduction technique such as PCA comes into play. We can reduce the dimension to two or three so that we can visualize it. Let's start with an example.
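As a rough illustration of the viewpoint issue, the sketch below draws the same 3D points from two camera angles using matplotlib's view_init. The random points, the two angle pairs, and the Agg backend (chosen so the code runs without a display) are assumptions made for this example and have nothing to do with the wine data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 3D points, used only to have something to draw
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))

fig = plt.figure(figsize=(10, 4))
views = [(30, 45), (60, 120)]  # (elevation, azimuth) in degrees, arbitrary choices
axes = []
for i, (elev, azim) in enumerate(views, start=1):
    ax = fig.add_subplot(1, 2, i, projection="3d")
    ax.scatter(points[:, 0], points[:, 1], points[:, 2])
    ax.view_init(elev=elev, azim=azim)  # set the camera angle for this subplot
    ax.set_title(f"elev={elev}, azim={azim}")
    axes.append(ax)

fig.canvas.draw()  # render once; in an interactive session use plt.show() instead
```

Changing the viewpoint helps when there are three features to show, but it offers no way to look at a fourth or fifth dimension.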

We begin with the wine dataset, which is a classification dataset with 13 features and 3 classes. There are a total of 178 samples.
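The shapes printed below can be reproduced with a few lines. This is a minimal sketch, assuming scikit-learn is installed; it uses the same load_wine call and variable names (winedata, X, y) as the complete listing later in this section.

```python
from sklearn.datasets import load_wine

# Load the wine dataset bundled with scikit-learn
winedata = load_wine()
X, y = winedata["data"], winedata["target"]

# Feature matrix shape, then label vector shape
print(X.shape)
print(y.shape)
```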

(178, 13)

(178,)

Among the thirteen features, we can pick any two and plot them with matplotlib (the classes are color-coded using the c argument):

    ...
    import matplotlib.pyplot as plt
    plt.scatter(X[:,1], X[:,2], c=y)
    plt.show()

Or we can pick any three and show them in 3D:

    ...
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    ax.scatter(X[:,1], X[:,2], X[:,3], c=y)
    plt.show()

But these don't reveal much about what the data looks like, as the majority of the features are not shown. We now resort to principal component analysis:

    ...
    from sklearn.decomposition import PCA
    pca = PCA()
    Xt = pca.fit_transform(X)
    plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
    plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names']))
    plt.show()

Here we transform the input data X by PCA into Xt. We consider only the first two columns, which contain the most information, and plot them in two dimensions. We can see that the purple class is quite distinct, but there is still some overlap between classes. If we scale the data before PCA, the result is different:

    ...
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    pca = PCA()
    pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)])
    Xt = pipe.fit_transform(X)
    plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
    plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names']))
    plt.show()

Because PCA is sensitive to scale, normalizing each feature with StandardScaler gives a better result. Here the classes are more distinct. Looking at this plot, we can be confident that a simple model such as an SVM can classify this dataset with high accuracy.

Putting these together, the following is the complete code to generate the visualizations:
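To see why scale matters, here is a small sketch that uses NumPy only. It builds two correlated synthetic features, one roughly 1000 times larger than the other (made-up data, not the wine dataset), and finds the first principal axis from the covariance matrix before and after standardizing. The helper first_principal_axis is a hypothetical name introduced for this example.

```python
import numpy as np

def first_principal_axis(X):
    """Return the unit eigenvector of the covariance matrix with the largest eigenvalue."""
    cov = np.cov(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues come in ascending order
    return eigenvectors[:, -1]

rng = np.random.default_rng(0)
z = rng.normal(size=500)
# Two correlated hypothetical features on very different scales
X = np.column_stack([
    1000.0 * (z + 0.5 * rng.normal(size=500)),
    z + 0.5 * rng.normal(size=500),
])

axis_raw = first_principal_axis(X)

# Standardize: zero mean and unit variance per feature
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
axis_scaled = first_principal_axis(X_scaled)

print("PC1 axis, raw data:   ", np.round(axis_raw, 3))
print("PC1 axis, scaled data:", np.round(axis_scaled, 3))
```

Without scaling, the first axis points almost entirely along the large-valued feature. After standardizing, both features contribute equally to it.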

    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    import matplotlib.pyplot as plt

    # Load dataset
    winedata = load_wine()
    X, y = winedata['data'], winedata['target']
    print("X shape:", X.shape)
    print("y shape:", y.shape)

    # Show any two features
    plt.figure(figsize=(8,6))
    plt.scatter(X[:,1], X[:,2], c=y)
    plt.xlabel(winedata["feature_names"][1])
    plt.ylabel(winedata["feature_names"][2])
    plt.title("Two particular features of the wine dataset")
    plt.show()

    # Show any three features
    fig = plt.figure(figsize=(10,8))
    ax = fig.add_subplot(projection='3d')
    ax.scatter(X[:,1], X[:,2], X[:,3], c=y)
    ax.set_xlabel(winedata["feature_names"][1])
    ax.set_ylabel(winedata["feature_names"][2])
    ax.set_zlabel(winedata["feature_names"][3])
    ax.set_title("Three particular features of the wine dataset")
    plt.show()

    # Show first two principal components without scaler
    pca = PCA()
    plt.figure(figsize=(8,6))
    Xt = pca.fit_transform(X)
    plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
    plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names']))
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.title("First two principal components")
    plt.show()

    # Show first two principal components with scaler
    pca = PCA()
    pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)])
    plt.figure(figsize=(8,6))
    Xt = pipe.fit_transform(X)
    plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
    plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names']))
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.title("First two principal components after scaling")
    plt.show()

If we apply the same method to a different dataset, such as the handwritten digits dataset bundled with scikit-learn (a small MNIST-style dataset), the scatter plot does not show distinct boundaries, and therefore a more complex model such as a neural network is needed to classify it:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    import matplotlib.pyplot as plt

    digitsdata = load_digits()
    X, y = digitsdata['data'], digitsdata['target']
    pca = PCA()
    pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)])
    plt.figure(figsize=(8,6))
    Xt = pipe.fit_transform(X)
    plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
    plt.legend(handles=plot.legend_elements()[0], labels=list(digitsdata['target_names']))
    plt.show()

Visualizing the explained variance

PCA in essence rearranges the features by their linear combinations. Hence it is called a feature extraction technique. One characteristic of PCA is that the first principal component holds the most information about the dataset. The second principal component is more informative than the third, and so on.

To illustrate this idea, we can remove the principal components from the original dataset one step at a time and see what the dataset looks like. Let's consider a dataset with fewer features and show two of the features in a plot:
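A short NumPy sketch can make the ordering concrete. It assumes synthetic data with four features of different spread and obtains the principal axes from a singular value decomposition (one common way to compute PCA, not necessarily the exact routine scikit-learn runs). It then shows that each principal component is a linear combination of the original features and that the variance decreases from the first component to the last.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 200 samples, 4 features with different spreads
X = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal axes (unit vectors), ordered by singular value
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Each principal component is a linear combination of the original features
Xt = X_centered @ Vt.T

variances = Xt.var(axis=0, ddof=1)
print("Variance of each principal component:", np.round(variances, 3))
```

The total variance is unchanged by the transformation; it is only redistributed so that the earlier components carry more of it.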

    from sklearn.datasets import load_iris
    import matplotlib.pyplot as plt
    irisdata = load_iris()
    X, y = irisdata['data'], irisdata['target']
    plt.figure(figsize=(8,6))
    plt.scatter(X[:,0], X[:,1], c=y)
    plt.show()

The iris dataset has only four features. The features are on comparable scales, so we can skip the scaler. With four features, PCA can produce at most four principal components:

    ...
    pca = PCA().fit(X)
    print(pca.components_)

    [[ 0.36138659 -0.08452251  0.85667061  0.3582892 ]
     [ 0.65658877  0.73016143 -0.17337266 -0.07548102]
     [-0.58202985  0.59791083  0.07623608  0.54583143]
     [-0.31548719  0.3197231   0.47983899 -0.75365743]]

For example, the first row is the first principal axis, on which the first principal component is created. For any data point p with features p = (a, b, c, d), since the principal axis is denoted by the vector v = (0.36, −0.08, 0.86, 0.36), the first principal component of this data point has the value 0.36 × a − 0.08 × b + 0.86 × c + 0.36 × d on the principal axis. Using the vector dot product, this value can be denoted by

p · v
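The arithmetic can be checked with a few lines of NumPy. In this sketch the axis v is copied from the first row of the printed components, while the data point p is a hypothetical example chosen for illustration. Note that scikit-learn subtracts the feature means before projecting, so the value it reports for a sample is this dot product applied to the centered point.

```python
import numpy as np

# First principal axis, copied from the first row of pca.components_ above
v = np.array([0.36138659, -0.08452251, 0.85667061, 0.3582892])

# A hypothetical data point p = (a, b, c, d)
p = np.array([5.1, 3.5, 1.4, 0.2])

# Written out term by term
a, b, c, d = p
manual = v[0] * a + v[1] * b + v[2] * c + v[3] * d

# The same value as a vector dot product
dot = p @ v

print(manual, dot)
```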

Therefore, with the dataset X as a 150 × 4 matrix (150 data points, each with 4 features), we can map each data point to its value on this principal axis by matrix-vector multiplication:

X × v

The result is a vector of length 150. If we then remove from each data point the corresponding value along the principal axis vector, we get

X − (X × v) × vᵀ

where the transposed vector vᵀ is a row and X × v is a column. The product (X × v) × vᵀ follows matrix-matrix multiplication, and the result is a 150 × 4 matrix, the same dimensions as X.
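The shapes involved are easy to confirm numerically. The following is a minimal sketch that assumes a random 150 × 4 matrix and an arbitrary unit vector in place of the real dataset and principal axis; only the shapes and the algebra carry over.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))        # hypothetical stand-in for a 150 x 4 dataset
v = np.array([1.0, 2.0, -1.0, 0.5])
v = v / np.linalg.norm(v)            # a principal axis is a unit vector

values = X @ v                       # matrix-vector product, shape (150,)
component = np.outer(values, v)      # column times row, shape (150, 4)
residual = X - component             # X - (X v) v^T, same shape as X

print(values.shape, component.shape, residual.shape)
# After the removal, nothing is left along v
print(np.allclose(residual @ v, 0.0))
```

Here np.outer plays the role of the column-times-row product; reshaping to (150, 1) and (1, 4) and using the @ operator gives the same matrix.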

If we plot the first two features of X − (X × v) × vᵀ, it looks as follows:

    ...
    # Remove PC1
    Xmean = X - X.mean(axis=0)
    value = Xmean @ pca.components_[0]
    pc1 = value.reshape(-1,1) @ pca.components_[0].reshape(1,-1)
    Xremove = X - pc1
    plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
    plt.show()

The numpy array Xmean shifts the features of X so that they are centered at zero, which PCA requires. The array value is then computed by matrix-vector multiplication; it holds the magnitude of each data point projected onto the principal axis. Multiplying this value by the principal axis vector gives the array pc1. Removing this from the original dataset X gives a new array Xremove. In the plot we observe that the points in the scatter plot are squeezed together and the cluster of each class is less distinct than before. This means we removed a lot of information by removing the first principal component. If we repeat the same process, the points are squeezed together further:

    ...
    # Remove PC2
    value = Xmean @ pca.components_[1]
    pc2 = value.reshape(-1,1) @ pca.components_[1].reshape(1,-1)
    Xremove = Xremove - pc2
    plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
    plt.show()

This looks like a straight line, but it actually is not. If we repeat the process once more, all the points collapse onto a straight line:

    ...
    # Remove PC3
    value = Xmean @ pca.components_[2]
    pc3 = value.reshape(-1,1) @ pca.components_[2].reshape(1,-1)
    Xremove = Xremove - pc3
    plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
    plt.show()

The points all fall on a straight line because we removed three principal components from data that has only four features. Hence our data matrix becomes rank 1. You can try repeating this process once more, and the result is that all points collapse into a single point. The amount of information removed at each step as we remove the principal components can be found from the corresponding explained variance ratio of the PCA.
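The rank argument can be checked numerically. The sketch below is an illustration under assumptions: it uses random four-feature data rather than iris, takes the principal axes from a singular value decomposition, and works on the centered matrix (in the listings above the mean is left in, so the points end up on a line that does not pass through the origin). The tolerance passed to matrix_rank is an arbitrary small number chosen to ignore rounding noise.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4))          # hypothetical 4-feature dataset
X_centered = X - X.mean(axis=0)

# Rows of Vt play the role of pca.components_
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

residual = X_centered.copy()
ranks = []
for axis in Vt:
    # Remove one principal component from what is left
    residual = residual - np.outer(X_centered @ axis, axis)
    ranks.append(int(np.linalg.matrix_rank(residual, tol=1e-8)))

print("Rank after removing 1, 2, 3, 4 components:", ranks)
```

Each removal lowers the rank by one, so three removals leave a rank-1 matrix (a line) and four leave nothing but zeros (a single point). The explained variance ratios mentioned above are stored on the fitted PCA object: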

    ...
    print(pca.explained_variance_ratio_)

[0.92461872 0.05306648 0.01710261 0.00521218]

Here we can see that the first component explains 92.5% of the variance and the second component explains 5.3%. If we remove the first two principal components, the remaining variance is only 2.2%, which is why the plot after removing two components looks visually like a straight line. In fact, when we check the plots above, not only are the points squeezed together, but the ranges on the x- and y-axes also become smaller as we remove the components.

In terms of machine learning, we can consider using only a single feature for classification in this dataset, namely the first principal component. We should expect to achieve no less than 90% of the accuracy obtained using the full set of features:
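The same numbers can be turned into a rule for choosing how many components to keep. This is a small sketch using the ratios printed above; the 0.95 target is a hypothetical threshold picked for illustration, not a recommendation.

```python
import numpy as np

# Explained variance ratios printed above for the iris dataset
ratios = np.array([0.92461872, 0.05306648, 0.01710261, 0.00521218])

cumulative = np.cumsum(ratios)
remaining_after_two = ratios[2:].sum()

# Smallest number of components whose cumulative ratio reaches the target
target = 0.95  # hypothetical threshold
n_components = int(np.searchsorted(cumulative, target) + 1)

print("Cumulative ratios:", np.round(cumulative, 4))
print("Variance left after removing PC1 and PC2:", round(float(remaining_after_two), 4))
print("Components needed for", target, "of the variance:", n_components)
```

Plotting the cumulative ratios against the number of components is a common way to read off this choice visually. scikit-learn's PCA also accepts a fraction between 0 and 1 as n_components to make a similar selection.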

    ...
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score
    from sklearn.svm import SVC

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    clf = SVC(kernel="linear", gamma='auto').fit(X_train, y_train)
    print("Using all features, accuracy: ", clf.score(X_test, y_test))
    print("Using all features, F1: ", f1_score(y_test, clf.predict(X_test), average="macro"))

    mean = X_train.mean(axis=0)
    X_train2 = X_train - mean
    X_train2 = (X_train2 @ pca.components_[0]).reshape(-1,1)
    clf = SVC(kernel="linear", gamma='auto').fit(X_train2, y_train)
    X_test2 = X_test - mean
    X_test2 = (X_test2 @ pca.components_[0]).reshape(-1,1)
    print("Using PC1, accuracy: ", clf.score(X_test2, y_test))
    print("Using PC1, F1: ", f1_score(y_test, clf.predict(X_test2), average="macro"))

    Using all features, accuracy:  1.0
    Using all features, F1:  1.0
    Using PC1, accuracy:  0.96
    Using PC1, F1:  0.9645191409897292
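These scores change from run to run because train_test_split shuffles the data randomly, and the principal axis used above was fitted on the whole dataset before splitting. A hedged alternative, sketched below, fixes the split with an arbitrary random_state and wraps PCA and the SVM in a scikit-learn Pipeline so that the axis is learned from the training portion only. This is one possible arrangement, not the method used in the listing above, and its scores will not match the numbers shown exactly.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# random_state is an arbitrary choice that makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# PCA is fitted on the training data only, then the SVM sees just PC1
pipe = Pipeline([
    ("pca", PCA(n_components=1)),
    ("svm", SVC(kernel="linear")),
])
pipe.fit(X_train, y_train)

accuracy = pipe.score(X_test, y_test)
print("Using PC1 in a pipeline, accuracy:", accuracy)
```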

Another use of the explained variance is in compression. Given that the explained variance of the first principal component is large, if we need to store the dataset, we can store only the projected values on the first principal axis (X × v), together with the vector v of the principal axis. We can then approximately reproduce the original dataset by multiplying them:

X ≈ (X × v) × vᵀ

In this way, we need storage for only one value per data point instead of four values for four features. The approximation is more accurate if we store the projected values on multiple principal axes and add up multiple principal components.

Putting these together, the following is the complete code to generate the visualizations:
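The trade-off between storage and accuracy can be sketched in NumPy. The example assumes synthetic data whose variance is concentrated in a few directions, and it stores the feature means alongside the axes (the formula above leaves the centering implicit). It reconstructs the data from the first k principal components and reports the mean squared error for each k.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data whose variance is concentrated in a few directions
X = rng.normal(size=(150, 4)) * np.array([4.0, 1.0, 0.5, 0.1])
mean = X.mean(axis=0)
X_centered = X - mean

_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

errors = []
for k in range(1, 5):
    axes = Vt[:k]                        # first k principal axes, shape (k, 4)
    projected = X_centered @ axes.T      # k stored values per data point
    X_approx = projected @ axes + mean   # sum of the first k principal components
    errors.append(float(np.mean((X - X_approx) ** 2)))

print("Mean squared reconstruction error for k = 1..4:", np.round(errors, 6))
```

The error shrinks as more components are kept and vanishes (up to rounding) when all four are used, at which point nothing has been compressed.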

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.decomposition import PCA
    from sklearn.metrics import f1_score
    from sklearn.svm import SVC
    import matplotlib.pyplot as plt

    # Load iris dataset
    irisdata = load_iris()
    X, y = irisdata['data'], irisdata['target']
    plt.figure(figsize=(8,6))
    plt.scatter(X[:,0], X[:,1], c=y)
    plt.xlabel(irisdata["feature_names"][0])
    plt.ylabel(irisdata["feature_names"][1])
    plt.title("Two features from the iris dataset")
    plt.show()

    # Show the principal components
    pca = PCA().fit(X)
    print("Principal components:")
    print(pca.components_)

    # Remove PC1
    Xmean = X - X.mean(axis=0)
    value = Xmean @ pca.components_[0]
    pc1 = value.reshape(-1,1) @ pca.components_[0].reshape(1,-1)
    Xremove = X - pc1
    plt.figure(figsize=(8,6))
    plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
    plt.xlabel(irisdata["feature_names"][0])
    plt.ylabel(irisdata["feature_names"][1])
    plt.title("Two features from the iris dataset after removing PC1")
    plt.show()

    # Remove PC2
    Xmean = X - X.mean(axis=0)
    value = Xmean @ pca.components_[1]
    pc2 = value.reshape(-1,1) @ pca.components_[1].reshape(1,-1)
    Xremove = Xremove - pc2
    plt.figure(figsize=(8,6))
    plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
    plt.xlabel(irisdata["feature_names"][0])
    plt.ylabel(irisdata["feature_names"][1])
    plt.title("Two features from the iris dataset after removing PC1 and PC2")
    plt.show()

    # Remove PC3
    Xmean = X - X.mean(axis=0)
    value = Xmean @ pca.components_[2]
    pc3 = value.reshape(-1,1) @ pca.components_[2].reshape(1,-1)
    Xremove = Xremove - pc3
    plt.figure(figsize=(8,6))
    plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
    plt.xlabel(irisdata["feature_names"][0])
    plt.ylabel(irisdata["feature_names"][1])
    plt.title("Two features from the iris dataset after removing PC1 to PC3")
    plt.show()

    # Print the explained variance ratio
    print("Explained variance ratios:")
    print(pca.explained_variance_ratio_)

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    # Run classifier on all features
    clf = SVC(kernel="linear", gamma='auto').fit(X_train, y_train)
    print("Using all features, accuracy: ", clf.score(X_test, y_test))
    print("Using all features, F1: ", f1_score(y_test, clf.predict(X_test), average="macro"))

    # Run classifier on PC1
    mean = X_train.mean(axis=0)
    X_train2 = X_train - mean
    X_train2 = (X_train2 @ pca.components_[0]).reshape(-1,1)
    clf = SVC(kernel="linear", gamma='auto').fit(X_train2, y_train)
    X_test2 = X_test - mean
    X_test2 = (X_test2 @ pca.components_[0]).reshape(-1,1)
    print("Using PC1, accuracy: ", clf.score(X_test2, y_test))
    print("Using PC1, F1: ", f1_score(y_test, clf.predict(X_test2), average="macro"))

Books

• Deep Learning

APIs

• scikit-learn toy datasets
• scikit-learn iris dataset
• scikit-learn wine dataset
• matplotlib scatter API
• The mplot3D toolkit

Conclusion

In this guide, you discovered how to visualize data using principal component analysis.