Principal Component Analysis for Visualization
Principal component analysis (PCA) is an unsupervised machine learning technique. Perhaps its most common use is dimensionality reduction. Besides using PCA as a data preparation technique, we can also use it to help visualize data. A picture is worth a thousand words, as they say. With the data visualized, it is easier for us to gain insight and decide on the next step in our machine learning models.
In this tutorial, you will discover how to visualize data using PCA, as well as how to use visualization to help determine the parameters for dimensionality reduction.
After completing this tutorial, you will know:
- How to visualize high dimensional data
- What explained variance is in PCA
- How to visually observe the explained variance from the results of PCA on high dimensional data
Tutorial Overview
This tutorial is divided into two parts; they are:
- Scatter plot of high dimensional data
- Visualizing the explained variance
Prerequisites
For this tutorial, we assume that you are already familiar with:
- Calculating Principal Component Analysis (PCA) from Scratch in Python
- Principal Component Analysis for Dimensionality Reduction in Python
Scatter plot of high dimensional data
Visualization is a crucial step in gaining insight from data. It can tell us whether a pattern exists in the data and hence give us an idea of which machine learning model is appropriate.
It is easy to depict things in two dimensions; a scatter plot with x- and y-axes is two-dimensional. Depicting things in 3D is a bit more challenging but not impossible. In matplotlib, for example, we can plot in 3D. The only problem is that on paper or on screen, we can look at a 3D plot from only one viewport or projection at a time. In matplotlib, this view is controlled by the elevation and azimuth angles. Depicting things in four or five dimensions is impossible, because we live in a 3D world and have no idea what things in such high dimensions would look like.
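To make the viewpoint limitation concrete, here is a minimal sketch of our own (not part of the tutorial's code, using made-up random data) that draws the same 3D scatter from two different elevation and azimuth angles via matplotlib's view_init():

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 3D point cloud, only for demonstrating viewpoints
rng = np.random.default_rng(42)
points = rng.normal(size=(100, 3))

fig = plt.figure(figsize=(10, 4))
for i, (elev, azim) in enumerate([(30, 45), (60, 120)], start=1):
    ax = fig.add_subplot(1, 2, i, projection='3d')
    ax.scatter(points[:, 0], points[:, 1], points[:, 2])
    ax.view_init(elev=elev, azim=azim)   # set the viewing angle for this panel
    ax.set_title(f"elev={elev}, azim={azim}")
plt.show()

Each panel shows the same data, but from a different angle; no single projection tells the whole story.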
This is where a dimensionality reduction technique such as PCA comes into play. We can reduce the dimension to two or three so we can visualize it. Let's start with an example.
We begin with the wine dataset, which is a classification dataset with 13 features and 3 classes. There are a total of 178 samples.
from sklearn.datasets import load_wine

winedata = load_wine()
X, y = winedata['data'], winedata['target']
print(X.shape)
print(y.shape)
(178, 13)
(178,)
Among the thirteen features, we can pick any two and plot them with matplotlib (we color-code the different classes using the c argument):
...
import matplotlib.pyplot as plt

plt.scatter(X[:,1], X[:,2], c=y)
plt.show()
Or we can also pick any three features and show them in 3D:
...
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(projection='3d')
ax.scatter(X[:,1], X[:,2], X[:,3], c=y)
plt.show()
How these don’t unveil much of how the information appears like, as majority of the features are not displayed. We now resort to principal component analysis:
1 2 3 4 5 6 7 | … from sklearn.decomposition import PCA pca = PCA() Xt = pca.fit_transform(X) plot = plt.scatter(Xt[:,0], Xt[:,1], c=y) plt.legend(handles=plot.legend_elements()[0], labels=list(winedata[‘target_names’])) plt.show() |
Here we transform the input data X by PCA into Xt. We consider only the first two columns, which contain the most information, and plot them in two dimensions. We can see that the purple class is quite distinct, but there is still some overlap. If we scale the data before PCA, however, the result will be different:
...
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pca = PCA()
pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)])
Xt = pipe.fit_transform(X)
plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names']))
plt.show()
Because PCA is sensitive to the scale, if we normalize each feature with StandardScaler we see a better result: the different classes are now more distinct. Looking at this plot, we are confident that a simple model such as an SVM can classify this dataset with high accuracy.
Putting these together, the following is the complete code to generate the visualizations:
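As a quick sanity check of that impression, the following is a minimal sketch of our own (an assumption added here, not part of the original example) that cross-validates a linear SVM on the first two scaled principal components of the wine data:

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

winedata = load_wine()
X, y = winedata['data'], winedata['target']

# Scale, keep only two principal components, then fit a linear SVM
model = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('svc', SVC(kernel='linear')),
])
scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
print("Cross-validated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

If the 2D scatter plot is as separable as it looks, the cross-validated accuracy should be high even with only two components.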
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# Load dataset
winedata = load_wine()
X, y = winedata['data'], winedata['target']
print("X shape:", X.shape)
print("y shape:", y.shape)

# Show any two features
plt.figure(figsize=(8,6))
plt.scatter(X[:,1], X[:,2], c=y)
plt.xlabel(winedata["feature_names"][1])
plt.ylabel(winedata["feature_names"][2])
plt.title("Two particular features of the wine dataset")
plt.show()

# Show any three features
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(projection='3d')
ax.scatter(X[:,1], X[:,2], X[:,3], c=y)
ax.set_xlabel(winedata["feature_names"][1])
ax.set_ylabel(winedata["feature_names"][2])
ax.set_zlabel(winedata["feature_names"][3])
ax.set_title("Three particular features of the wine dataset")
plt.show()

# Show first two principal components without scaler
pca = PCA()
plt.figure(figsize=(8,6))
Xt = pca.fit_transform(X)
plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names']))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("First two principal components")
plt.show()

# Show first two principal components with scaler
pca = PCA()
pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)])
plt.figure(figsize=(8,6))
Xt = pipe.fit_transform(X)
plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
plt.legend(handles=plot.legend_elements()[0], labels=list(winedata['target_names']))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("First two principal components after scaling")
plt.show()
If we apply the same method to a different dataset, such as the MNIST handwritten digits, the scatter plot does not show a distinct boundary, and therefore a more complex model such as a neural network is needed to classify the data:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

digitsdata = load_digits()
X, y = digitsdata['data'], digitsdata['target']
pca = PCA()
pipe = Pipeline([('scaler', StandardScaler()), ('pca', pca)])
plt.figure(figsize=(8,6))
Xt = pipe.fit_transform(X)
plot = plt.scatter(Xt[:,0], Xt[:,1], c=y)
plt.legend(handles=plot.legend_elements()[0], labels=list(digitsdata['target_names']))
plt.show()
Visualizing the explained variance
PCA, in essence, rearranges the features into new features formed from their linear combinations; hence it is called a feature extraction technique. One characteristic of PCA is that the first principal component holds the most information about the dataset. The second principal component is more informative than the third, and so on.
To illustrate this idea, we can remove the principal components from the original dataset step by step and see what the dataset looks like afterwards. Let's consider a dataset with fewer features and show two of its features in a plot:
from sklearn.datasets import load_iris

irisdata = load_iris()
X, y = irisdata['data'], irisdata['target']
plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=y)
plt.show()
The iris dataset has only four features. The features are on comparable scales, so we can skip the scaler. With 4-feature data, PCA can produce at most 4 principal components:
...
pca = PCA().fit(X)
print(pca.components_)

[[ 0.36138659 -0.08452251  0.85667061  0.3582892 ]
 [ 0.65658877  0.73016143 -0.17337266 -0.07548102]
 [-0.58202985  0.59791083  0.07623608  0.54583143]
 [-0.31548719  0.3197231   0.47983899 -0.75365743]]
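Before interpreting the rows, it may help to confirm that these principal axes are orthonormal unit vectors. The short sketch below is our own check (not part of the tutorial code); it verifies that the matrix of components times its transpose is close to the identity:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris()['data']
pca = PCA().fit(X)

# Each row of components_ is a principal axis; the rows are mutually
# orthogonal unit vectors, so the Gram matrix should be the identity
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.eye(4)))   # expected: True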
For instance, the first row is the first principal axis on which the first principal component is created. For any data point p with features p = (a, b, c, d), since the principal axis is denoted by the vector v = (0.36, -0.08, 0.86, 0.36), the first principal component of this data point has the value 0.36 x a - 0.08 x b + 0.86 x c + 0.36 x d on the principal axis. Using the vector dot product, this value can be written as
p · v
Thus, with the dataset X as a 150 x 4 matrix (150 data points, each with 4 features), we can map every data point onto this principal axis by matrix-vector multiplication:
X x v
And the result is a vector of length 150. Now, if we remove from every data point its corresponding value along the principal axis vector, that would be
X - (X x v) x vT
where the transposed vector vT is a row vector and X x v is a column vector. The product (X x v) x vT follows matrix-matrix multiplication, and the result is a 150 x 4 matrix of the same dimension as X.
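As a small numerical check of this reasoning (our own sketch, under the assumption that the iris data from above is used), we can confirm that the projection of the centered data onto v matches the first column of pca.transform(X), and that removing that component lowers the rank of the centered data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris()['data']
pca = PCA().fit(X)

Xmean = X - X.mean(axis=0)     # center the data first
v = pca.components_[0]         # first principal axis
proj = Xmean @ v               # value of PC1 for every data point

# The manual projection should match what pca.transform() reports for PC1
print(np.allclose(proj, pca.transform(X)[:, 0]))   # expected: True

# Subtracting the PC1 contribution drops the rank of the centered data from 4 to 3
residual = Xmean - np.outer(proj, v)
print(np.linalg.matrix_rank(Xmean), np.linalg.matrix_rank(residual))   # expected: 4 3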
If we plot the first two features of the resulting matrix X - (X x v) x vT, it looks like this:
...
# Remove PC1
Xmean = X - X.mean(axis=0)
value = Xmean @ pca.components_[0]
pc1 = value.reshape(-1,1) @ pca.components_[0].reshape(1,-1)
Xremove = X - pc1
plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
plt.show()
The numpy array Xmean shifts the features of X so that they are centered at zero, which is required by PCA. The array value is then computed by matrix-vector multiplication; it holds the magnitude of each data point mapped onto the principal axis. Multiplying this value by the principal axis vector gives the array pc1. Removing pc1 from the original dataset X, we get the new array Xremove. In the plot, we observe that the points have collapsed together and the cluster of each class is less distinct than before. This means we removed a lot of information by removing the first principal component. If we repeat the same process again, the points collapse further:
...
# Remove PC2
value = Xmean @ pca.components_[1]
pc2 = value.reshape(-1,1) @ pca.components_[1].reshape(1,-1)
Xremove = Xremove - pc2
plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
plt.show()
This looks like a straight line, but in fact it is not quite. If we repeat the process once more, all the points collapse onto a straight line:
...
# Remove PC3
value = Xmean @ pca.components_[2]
pc3 = value.reshape(-1,1) @ pca.components_[2].reshape(1,-1)
Xremove = Xremove - pc3
plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
plt.show()
The points all fall on a straight line because we removed three principal components from data that has only four features; hence our data matrix becomes rank 1. You can try repeating this process once more, and the result would be all points collapsing into a single point. The amount of information removed at each step, as we removed the principal components, can be found from the corresponding explained variance ratio of the PCA:
...
print(pca.explained_variance_ratio_)
[0.92461872 0.05306648 0.01710261 0.00521218]
Here we can see that the first component explains 92.5% of the variance and the second component explains 5.3%. If we remove the first two principal components, the remaining variance is only 2.2%; hence, visually, the plot after removing two components looks like a straight line. In fact, when we check the plots above, not only do we see that the points have collapsed together, but the ranges of the x- and y-axes also become smaller as we remove the components.
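Relatedly, a common way to decide how many components to keep is to plot the cumulative explained variance ratio. The following is a minimal sketch of our own (not part of the original example) on the iris data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris()['data']
pca = PCA().fit(X)

# Cumulative share of the variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance ratio")
plt.title("Choosing the number of components from explained variance")
plt.show()

The "elbow" of this curve suggests how many components are enough for dimensionality reduction.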
In terms of machine learning, we can consider using only a single feature for classification in this dataset, namely the first principal component. We should expect to achieve no less than 90% of the original accuracy compared with using the full set of features:
...
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = SVC(kernel="linear", gamma='auto').fit(X_train, y_train)
print("Using all features, accuracy: ", clf.score(X_test, y_test))
print("Using all features, F1: ", f1_score(y_test, clf.predict(X_test), average="macro"))

mean = X_train.mean(axis=0)
X_train2 = X_train - mean
X_train2 = (X_train2 @ pca.components_[0]).reshape(-1,1)
clf = SVC(kernel="linear", gamma='auto').fit(X_train2, y_train)
X_test2 = X_test - mean
X_test2 = (X_test2 @ pca.components_[0]).reshape(-1,1)
print("Using PC1, accuracy: ", clf.score(X_test2, y_test))
print("Using PC1, F1: ", f1_score(y_test, clf.predict(X_test2), average="macro"))
Using all features, accuracy:  1.0
Using all features, F1:  1.0
Using PC1, accuracy:  0.96
Using PC1, F1:  0.9645191409897292
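An alternative way to build an equivalent single-feature classifier, sketched below under our own assumptions, is to let a Pipeline with PCA(n_components=1) handle the centering and projection instead of doing the matrix arithmetic by hand. Note that the pipeline refits PCA on the training split only, which differs slightly from the approach above of fitting PCA on the whole dataset:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# PCA(n_components=1) keeps only PC1; SVC then classifies in that 1D space
clf = Pipeline([('pca', PCA(n_components=1)), ('svc', SVC(kernel='linear'))])
clf.fit(X_train, y_train)
print("Using PC1 via a pipeline, accuracy:", clf.score(X_test, y_test))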
Another use of understanding the explained variance is compression. Given that the explained variance of the first principal component is large, if we need to store the dataset, we can store only the projected values on the first principal axis (X x v), together with the vector v of the principal axis. Then we can approximately reproduce the original dataset by multiplying them:
X ≈ (X x v) x vT
In this way, we need storage for only one value per data point instead of four values for four features. The approximation is more accurate if we store the projected values on several principal axes and add up several principal components.
Putting these together, the following is the complete code to generate the visualizations:
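The sketch below is our own illustration of this compression idea (assuming the iris data from above), using PCA(n_components=1) and its inverse_transform() to rebuild an approximation and report the reconstruction error:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris()['data']

# Keep only the projection onto the first principal axis: one value per data point
pca = PCA(n_components=1).fit(X)
compressed = pca.transform(X)            # shape (150, 1)

# Rebuild an approximation of the original 4-feature data from that single value
reconstructed = pca.inverse_transform(compressed)   # shape (150, 4)
print("Mean absolute reconstruction error:", np.abs(X - reconstructed).mean())

Storing more principal axes reduces the reconstruction error at the cost of more values per data point.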
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# Load iris dataset
irisdata = load_iris()
X, y = irisdata['data'], irisdata['target']
plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=y)
plt.xlabel(irisdata["feature_names"][0])
plt.ylabel(irisdata["feature_names"][1])
plt.title("Two features from the iris dataset")
plt.show()

# Show the principal components
pca = PCA().fit(X)
print("Principal components:")
print(pca.components_)

# Remove PC1
Xmean = X - X.mean(axis=0)
value = Xmean @ pca.components_[0]
pc1 = value.reshape(-1,1) @ pca.components_[0].reshape(1,-1)
Xremove = X - pc1
plt.figure(figsize=(8,6))
plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
plt.xlabel(irisdata["feature_names"][0])
plt.ylabel(irisdata["feature_names"][1])
plt.title("Two features from the iris dataset after removing PC1")
plt.show()

# Remove PC2
value = Xmean @ pca.components_[1]
pc2 = value.reshape(-1,1) @ pca.components_[1].reshape(1,-1)
Xremove = Xremove - pc2
plt.figure(figsize=(8,6))
plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
plt.xlabel(irisdata["feature_names"][0])
plt.ylabel(irisdata["feature_names"][1])
plt.title("Two features from the iris dataset after removing PC1 and PC2")
plt.show()

# Remove PC3
value = Xmean @ pca.components_[2]
pc3 = value.reshape(-1,1) @ pca.components_[2].reshape(1,-1)
Xremove = Xremove - pc3
plt.figure(figsize=(8,6))
plt.scatter(Xremove[:,0], Xremove[:,1], c=y)
plt.xlabel(irisdata["feature_names"][0])
plt.ylabel(irisdata["feature_names"][1])
plt.title("Two features from the iris dataset after removing PC1 to PC3")
plt.show()

# Print the explained variance ratio
print("Explained variance ratios:")
print(pca.explained_variance_ratio_)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Run classifier on all features
clf = SVC(kernel="linear", gamma='auto').fit(X_train, y_train)
print("Using all features, accuracy: ", clf.score(X_test, y_test))
print("Using all features, F1: ", f1_score(y_test, clf.predict(X_test), average="macro"))

# Run classifier on PC1
mean = X_train.mean(axis=0)
X_train2 = X_train - mean
X_train2 = (X_train2 @ pca.components_[0]).reshape(-1,1)
clf = SVC(kernel="linear", gamma='auto').fit(X_train2, y_train)
X_test2 = X_test - mean
X_test2 = (X_test2 @ pca.components_[0]).reshape(-1,1)
print("Using PC1, accuracy: ", clf.score(X_test2, y_test))
print("Using PC1, F1: ", f1_score(y_test, clf.predict(X_test2), average="macro"))
Further Reading
Books
- Deep Learning
APIs
- scikit-learn toy datasets
- scikit-learn iris dataset
- scikit-learn wine dataset
- matplotlib scatter API
- The mplot3D toolkit
Conclusion
In this tutorial, you discovered how to visualize data using principal component analysis.
Specifically, you learned:
- How to visualize a high dimensional dataset in 2D using PCA
- How to use the plot in PCA dimensions to help choose an appropriate machine learning model
- How to observe the explained variance ratio of PCA
- What the explained variance ratio means for machine learning