Feature Selection for Machine Learning in Python
The data features that you use to train your machine learning models have a large influence on the performance you can achieve.
Irrelevant or partially relevant features can negatively impact model performance.
In this blog post by AICoreSpot, you will discover automatic feature selection techniques that you can use to prepare your machine learning data in Python with scikit-learn.
Feature Selection
Feature selection is a process where you automatically select the features in your data that contribute most to the prediction variable or output in which you are interested.
Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms such as linear and logistic regression.
Three benefits of performing feature selection before modelling your data are:
- Reduces overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves accuracy: Less misleading data means modelling accuracy improves.
- Reduces training time: Less data means algorithms train faster.
Feature Selection for Machine Learning
This section provides four feature selection recipes for machine learning in Python.
Each recipe was designed to be complete and standalone, so you can copy and paste it directly into your project and use it immediately.
The recipes use the Pima Indians onset of diabetes dataset to demonstrate each feature selection technique. This is a binary classification problem where all of the attributes are numeric.
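Before diving into the recipes, here is a quick sketch of loading this dataset and taking a first look at it. The URL is the same one used in the recipes below; printing the shape and the first rows is just an illustrative addition, not part of the original recipes.
# Load the Pima Indians onset of diabetes dataset and take a quick look
from pandas import read_csv
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
# 768 rows, 8 numeric input attributes plus the binary class attribute
print(dataframe.shape)
print(dataframe.head())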
1. Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable.
The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.
Many different statistical tests can be used with this selection method. For example, the ANOVA F-test is appropriate for numerical input variables and a categorical output variable, as we have in the Pima dataset. It is available via the f_classif() function. We will select the four best features using this method in the example below:
# Feature Selection with Univariate Statistical Tests
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, Y)
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
You can see the scores for each attribute and the four attributes chosen (those with the highest scores): the features with indexes 0 (preg), 1 (plas), 5 (mass), and 7 (age).
[  39.67  213.162    3.257    4.304   13.281   71.772   23.871   46.141]
[[  6.  148.   33.6  50. ]
 [  1.   85.   26.6  31. ]
 [  8.  183.   23.3  32. ]
 [  1.   89.   28.1  21. ]
 [  0.  137.   43.1  33. ]]
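SelectKBest can also be paired with other scoring functions. As a variation, here is a minimal sketch using the chi-squared test, which also suits non-negative numerical inputs with a categorical target; the chi2() function is assumed here in place of f_classif(), and the same local file and column names as above are assumed.
# Feature selection with the chi-squared test (a sketch, assuming the same
# local CSV file and column names as the example above)
from pandas import read_csv
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# chi2 requires non-negative feature values, which holds for this dataset
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores and the selected features
print(fit.scores_)
print(fit.transform(X)[0:5,:])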
2. Recursive Feature Elimination
Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain.
It uses the model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute.
The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much, as long as it is skillful and consistent.
# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction (max_iter raised to avoid a convergence warning on the unscaled data)
model = LogisticRegression(solver='lbfgs', max_iter=1000)
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
You can see that RFE chose the top three features as preg, mass, and pedi. These are marked True in the support_ array and given a ranking of 1 in the ranking_ array.
Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
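If you would rather not fix the number of features up front, scikit-learn also provides the RFECV class, which chooses it via cross-validation. Here is a minimal sketch under the same dataset URL and column names as above; the cv=5 setting is an assumed choice, not part of the original recipe.
# RFE with the number of features chosen by cross-validation (a sketch,
# assuming the same dataset URL and column names as the example above)
from pandas import read_csv
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# cross-validated recursive feature elimination (cv=5 is an assumed choice)
model = LogisticRegression(solver='lbfgs', max_iter=1000)
rfecv = RFECV(estimator=model, cv=5)
rfecv.fit(X, Y)
print("Optimal number of features: %d" % rfecv.n_features_)
print("Selected Features: %s" % rfecv.support_)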
3. Principal Component Analysis
Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.
Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result.
In the example below, we use PCA and select three principal components.
# Feature Extraction with PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
You can see that the transformed dataset (three principal components) bears little resemblance to the source data.
Explained Variance: [ 0.88854663  0.06159078  0.02579012]
[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [  2.26488861e-02   9.72210040e-01   1.41909330e-01  -5.78614699e-02
   -9.46266913e-02   4.69729766e-02   8.16804621e-04   1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]
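Note that the recipe above only fits the PCA and reports the components. To actually obtain the compressed three-column dataset for modelling, you would apply the fitted transform to X; a minimal sketch, assuming the same dataset URL and column names as above:
# Obtaining the compressed dataset from a fitted PCA (a sketch, assuming
# the same dataset URL and column names as the example above)
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
X = dataframe.values[:,0:8]
# project the 8 input attributes onto 3 principal components
pca = PCA(n_components=3)
features = pca.fit_transform(X)
print(features.shape)
print(features[0:5,:])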
4. Feature Importance
Bagged decision trees such as Random Forest and Extra Trees can be used to estimate the importance of features.
In the example below we construct an ExtraTreesClassifier for the Pima Indians onset of diabetes dataset.
# Feature Importance with Extra Trees Classifier
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X, Y)
print(model.feature_importances_)
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
You can see that we are given an importance score for each attribute, where the larger the score, the more important the attribute. The scores suggest the importance of plas, age, and mass.
[ 0.11070069 0.2213717 0.08824115 0.08068703 0.07281761 0.14548537 0.12654214 0.15415431]
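To turn these importance scores into an actual feature subset, one option is scikit-learn's SelectFromModel, which keeps the features whose importance exceeds a threshold (the mean importance by default). A minimal sketch, assuming the same dataset URL and column names as above:
# Selecting features from a fitted Extra Trees model (a sketch, assuming
# the same dataset URL and column names as the example above)
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# fit the ensemble, then keep features above the mean importance
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X, Y)
selector = SelectFromModel(model, prefit=True)
X_selected = selector.transform(X)
print(X_selected.shape)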
Conclusion
In this blog post, you discovered feature selection for preparing machine learning data in Python with scikit-learn.
You learned about four different automatic feature selection techniques:
- Univariate Selection.
- Recursive Feature Elimination.
- Principal Component Analysis.
- Feature Importance.