How to Calculate Feature Importance with Python
Feature importance refers to a family of techniques that assign a score to input features based on how useful they are at predicting a target variable.
There are many types and sources of feature importance scores; popular examples include statistical correlation scores, coefficients calculated as part of linear models, decision trees, and permutation importance scores.
Feature importance scores play an important part in a predictive modelling project. They furnish insight into the data and into the model, and they provide the basis for dimensionality reduction and feature selection, which can improve the efficiency and effectiveness of a predictive model on the problem.
In this tutorial by AICoreSpot, you will discover feature importance scores for machine learning in Python.
After finishing this tutorial, you will know:
- The part of feature importance in a predictive modelling problem
- How to calculate and review feature importance from linear models and decision trees
- How to calculate and review permutation feature importance scores
Overview
This tutorial is divided into six parts; they are:
- Feature Importance
- Preparation
- Check Scikit-learn Version
- Test Datasets
- Coefficients as Feature Importance
- Linear Regression Feature Importance
- Logistic Regression Feature Importance
- Decision Tree Feature Importance
- CART Feature Importance
- Random Forest Feature Importance
- XGBoost Feature Importance
- Permutation Feature Importance
- Permutation Feature Importance for Regression
- Permutation Feature Importance for Classification
- Feature Selection with Importance
Feature Importance
Feature importance refers to a class of techniques that assign scores to the input features of a predictive model, indicating the relative importance of each feature when making a prediction.
Feature importance scores can be calculated for problems that involve predicting a numerical value, referred to as regression, and problems that involve predicting a class label, referred to as classification.
The scores are useful and can be leveraged in a range of situations in a predictive modelling problem, such as:
- Improved comprehension of the data
- Improved understanding of a model
- Minimizing the number of input features
Feature importance scores can furnish insight into the dataset. The relative scores can highlight which features may be most relevant to the target and, conversely, which features are the least relevant. This can be interpreted by a domain specialist and could be leveraged as the foundation for collecting more or different data.
Feature importance scores can furnish insight into the model. Most importance scores are estimated by a predictive model that has been fit on the dataset. Inspecting the importance scores furnishes insight into that particular model and which features are the most and least critical to the model when rendering a prediction. This is a form of model interpretation that can be performed for models that support it.
Feature importance can be leveraged to improve a predictive model. This can be accomplished by leveraging the importance scores to choose the features to delete (lowest scores) or the features to retain (highest scores). This is a form of feature selection that can simplify the problem being modelled, speed up the modelling procedure (deleting features is referred to as dimensionality reduction), and, in some scenarios, improve the performance of the model.
Often, we desire to quantify the strength of the relationship between the predictors and the outcome. Ranking predictors in this fashion can be very useful when sifting through large amounts of data.
Feature importance scores can be input to a wrapper model, like the SelectFromModel class, to execute feature selection.
There are several ways to calculate feature importance scores and several models that can be leveraged for this purpose.
Perhaps the simplest way is to calculate simple coefficient statistics between each feature and the target variable; a brief sketch of this idea follows.
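As a hedged illustration of that simple approach (our own sketch, not one of the tutorial's worked examples), the snippet below scores each input of a synthetic regression problem with scikit-learn's f_regression() univariate statistic, which is derived from the correlation between each feature and the target.
# score each feature with a simple univariate statistic (illustrative sketch)
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression
# define a synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# F-statistic between every feature and the target
scores, _ = f_regression(X, y)
# report the score for every feature
for i, v in enumerate(scores):
	print('Feature: %0d, Score: %.5f' % (i, v))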
In this guide, we will look at three primary types of more sophisticated feature importance; they are:
- Feature importance from model coefficients
- Feature importance from decision trees
- Feature importance from permutation testing
Preparation
Prior to diving in, let's confirm our environment and prepare some test datasets.
Check Scikit-Learn version
To start with, confirm that you have a modern version of the scikit-learn library installed.
This is critical, as a few of the models we will look at in this guide require a recent version of the library.
You can verify the version of the library you have installed with the following code example:
# check scikit-learn version
import sklearn
print(sklearn.__version__)
Running the example will print the version of the library. At the time of writing, this is about version 0.22.
You need to be on this version of scikit-learn or higher.
0.22.1
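If the printed version is older, you can usually upgrade the library with pip (the exact command may differ depending on how your environment is set up), for example:
sudo pip install --upgrade scikit-learn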
Test Datasets
Next, let's define some test datasets that we can leverage as the basis for illustrating and exploring feature importance scores.
Each test problem has five important and five unimportant features, and it may be interesting to see which methods are consistent at finding or differentiating the features based on their importance.
Classification Dataset
We will leverage the make_classification() function to develop a test binary classification dataset.
The dataset will possess 1,000 instances, with 10 input features, five of which will be informative and the remaining five will be redundant. We will fix the random number seed to make sure we obtain the same instances every time the code is executed.
An instance of creating and summarizing the dataset is listed below:
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)
Executing the instance develops the dataset and validates the expected number of samples and features.
(1000, 10) (1000,)
Regression Dataset
We will leverage the make_regression() function to develop a test regression dataset.
Like the classification dataset, the regression dataset will possess 1,000 instances, with 10 input features, five of which will be informative and the remaining five irrelevant.
# test regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)
Executing the instance creates the dataset and validates the expected number of samples and features.
(1000, 10) (1000,)
Now, let’s take a deeper look at coefficients as importance scores.
Coefficients as Feature Importance
Linear machine learning algorithms fit a model where the forecast is a weighted sum of the input values, for example yhat = b0 + b1*x1 + b2*x2 + ... + bk*xk.
Examples include linear regression, logistic regression, and extensions that add regularization, such as ridge regression and the elastic net.
Each of these algorithms finds a set of coefficients to leverage in the weighted sum in order to make a forecast. These coefficients can be leveraged directly as a crude type of feature importance score.
Let's delve deeper and look at leveraging coefficients as feature importance for classification and regression. We will fit a model on the dataset to find the coefficients, then summarize the importance scores for every input feature, and ultimately develop a bar chart to get an idea of the relative importance of the features.
Linear Regression Feature Importance
We can fit a LinearRegression model on the regression dataset and retrieve the coef_ property that consists of the coefficients found for every input variable.
These coefficients can furnish the basis for a crude feature importance score. This assumes that the input variables have the same scale or have been scaled prior to fitting the model.
The complete instance of linear regression coefficients for feature importance is listed below:
# linear regression feature importance
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define the model
model = LinearRegression()
# fit the model
model.fit(X, y)
# get importance
importance = model.coef_
# summarize feature importance
for i,v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Executing the instance fits the model, then reports the coefficient value for every feature.
Your results may demonstrate variance given the stochastic nature of the algorithm or assessment procedure, or differences in numerical accuracy. Consider executing the instance a few times and comparing the average outcome.
The scores indicate that the model identified the five critical features and marked all other features with a zero coefficient, basically deleting them from the model.
Feature: 0, Score: 0.00000
Feature: 1, Score: 12.44483
Feature: 2, Score: -0.00000
Feature: 3, Score: -0.00000
Feature: 4, Score: 93.32225
Feature: 5, Score: 86.50811
Feature: 6, Score: 26.74607
Feature: 7, Score: 3.28535
Feature: 8, Score: -0.00000
Feature: 9, Score: 0.00000
A bar chart is then developed for the feature importance scores.
This strategy might also be leveraged with Ridge and ElasticNet models.
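As a hedged sketch of that idea (not one of the tutorial's worked examples), the snippet below standardizes the inputs with a StandardScaler inside a Pipeline before fitting a Ridge model, then reads the coefficients off the fitted estimator; the same pattern should apply to ElasticNet.
# ridge regression coefficients as importance scores, with scaled inputs (illustrative sketch)
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# standardize the inputs, then fit the linear model
pipeline = Pipeline([('scaler', StandardScaler()), ('model', Ridge())])
pipeline.fit(X, y)
# coefficients of the fitted Ridge model act as a crude importance score
importance = pipeline.named_steps['model'].coef_
for i, v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i, v))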
Logistic Regression Feature Importance
We can fit a LogisticRegression model on the classification dataset and retrieve the coef_ property that consists of the coefficients found for every input variable.
The coefficients can furnish the basis for a crude feature importance score. This assumes that the input variables have the same scale or have been scaled prior to fitting the model.
The complete instance of logistic regression coefficients for feature importance is listed below:
# logistic regression for feature importance
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define the model
model = LogisticRegression()
# fit the model
model.fit(X, y)
# get importance
importance = model.coef_[0]
# summarize feature importance
for i,v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Executing the instance fits the model, then reports the coefficient value for every feature.
Your outcomes may demonstrate variance given the stochastic nature of the algorithm or assessment procedure, or differences in numerical accuracy. Consider executing the instance a few times and comparing the average outcome.
Remember this is a classification issue with classes 0 and 1. Observe that the coefficients are both positive and negative. The positive scores suggest a feature that forecasts class 1, whereas the negative scores suggest a feature that forecasts class 0.
No overt pattern of critical and non-critical features can be identified from these coefficients, at least as far as can be told.
Feature: 0, Score: 0.16320
Feature: 1, Score: -0.64301
Feature: 2, Score: 0.48497
Feature: 3, Score: -0.46190
Feature: 4, Score: 0.18432
Feature: 5, Score: -0.11978
Feature: 6, Score: -0.40602
Feature: 7, Score: 0.03772
Feature: 8, Score: -0.51785
Feature: 9, Score: 0.26540
A bar chart is then developed for the feature importance scores.
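When the sign of a coefficient is not of interest, one option (our own suggestion, not part of the worked example above) is to rank the features by the absolute size of their coefficients, for instance:
# rank features of the logistic regression model by absolute coefficient size (illustrative sketch)
from numpy import absolute, argsort
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# define dataset and fit the model as before
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
model = LogisticRegression()
model.fit(X, y)
# feature indices ordered from largest to smallest absolute coefficient
ranking = argsort(absolute(model.coef_[0]))[::-1]
print(ranking)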
Now that we have seen the use of coefficients as importance scores, let's look at the more common example of decision-tree-based importance scores.
Decision Tree Feature Importance
Decision tree algorithms such as classification and regression trees (CART) offer importance scores on the basis of the reduction in the criterion leveraged to choose split points, like Gini or entropy.
The same strategy can be deployed for ensembles of decision trees, such as the random forest and stochastic gradient boosting algorithms.
Let’s observe a worked example of each.
CART Feature Importance
We can leverage the CART algorithm for feature importance implemented in scikit-learn as the DecisionTreeRegressor and DecisionTreeClassifier classes.
After being fit, the model furnishes a feature_importances_ property that can be accessed to retrieve the relative importance scores for every input feature.
Let’s observe an instance of this for classification and regression.
CART Regression Feature Importance
The complete instance of fitting a DecisionTreeRegressor and summarizing the calculated feature importance scores is listed below.
# decision tree for feature importance on a regression problem
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from matplotlib import pyplot
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define the model
model = DecisionTreeRegressor()
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Executing the instance fits the model, then reports the importance score for every feature.
Your outcomes may demonstrate variance given the stochastic nature of the algorithm or assessment procedure, or differences in numerical accuracy. Consider executing the instance a few times and comparing the average outcome.
The outcomes indicate perhaps three of the ten features as being critical to prediction.
Feature: 0, Score: 0.00294
Feature: 1, Score: 0.00502
Feature: 2, Score: 0.00318
Feature: 3, Score: 0.00151
Feature: 4, Score: 0.51648
Feature: 5, Score: 0.43814
Feature: 6, Score: 0.02723
Feature: 7, Score: 0.00200
Feature: 8, Score: 0.00244
Feature: 9, Score: 0.00106
A bar chart is then produced for the feature importance scores.
CART Classification Feature Importance
The complete instance of fitting a DecisionTreeClassifier and summarizing the calculated feature importance scores is listed below:
# decision tree for feature importance on a classification problem
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define the model
model = DecisionTreeClassifier()
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Executing the instance fits the model, then reports the importance score for every feature.
Your outcomes may demonstrate variance given the stochastic nature of the algorithm or assessment procedure, or differences in numerical accuracy. Consider executing the instance a few times and comparing the average outcome.
The outcomes indicate perhaps four of the ten features as being critical to prediction.
Feature: 0, Score: 0.01486
Feature: 1, Score: 0.01029
Feature: 2, Score: 0.18347
Feature: 3, Score: 0.30295
Feature: 4, Score: 0.08124
Feature: 5, Score: 0.00600
Feature: 6, Score: 0.19646
Feature: 7, Score: 0.02908
Feature: 8, Score: 0.12820
Feature: 9, Score: 0.04745
A bar chart is then developed for the feature importance scores.
Random Forest Regression Feature Importance
The complete example of fitting a RandomForestRegressor and summarizing the calculated feature importance scores is listed below.
# random forest for feature importance on a regression problem
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from matplotlib import pyplot
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define the model
model = RandomForestRegressor()
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Running the instance fits the model, then reports the importance score for every feature.
Your outcomes may demonstrate variance given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical accuracy. Consider executing the instance a few times and comparing the average outcome.
The outcomes indicate perhaps two or three of the ten features as being critical to prediction.
Feature: 0, Score: 0.00280
Feature: 1, Score: 0.00545
Feature: 2, Score: 0.00294
Feature: 3, Score: 0.00289
Feature: 4, Score: 0.52992
Feature: 5, Score: 0.42046
Feature: 6, Score: 0.02663
Feature: 7, Score: 0.00304
Feature: 8, Score: 0.00304
Feature: 9, Score: 0.00283
A bar chart is then generated for the feature importance scores.
Random Forest Classification Feature Importance
The complete example of fitting a RandomForestClassifier and summarizing the calculated feature importance scores is listed below.
# random forest for feature importance on a classification problem
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define the model
model = RandomForestClassifier()
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Executing the instance fits the model, then reports the importance score for every feature.
Your outcomes may demonstrate variance given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical accuracy. Consider executing the instance a few times and comparing the average outcome.
The outcome indicates perhaps two or three of the ten features as being critical to prediction.
Feature: 0, Score: 0.06523
Feature: 1, Score: 0.10737
Feature: 2, Score: 0.15779
Feature: 3, Score: 0.20422
Feature: 4, Score: 0.08709
Feature: 5, Score: 0.09948
Feature: 6, Score: 0.10009
Feature: 7, Score: 0.04551
Feature: 8, Score: 0.08830
Feature: 9, Score: 0.04493
A bar chart is subsequently developed for the feature importance scores.
XGBoost Feature Importance
XGBoost is a library that furnishes an efficient and effective implementation of the stochastic gradient boosting algorithm.
This algorithm can be leveraged with scikit-learn through the XGBRegressor and the XGBClassifier classes.
Upon being fit, the model furnishes a feature_importances_ property that can be accessed to retrieve the relative importance scores for every input feature.
This algorithm is also furnished through scikit-learn through the GradientBoostingClassifier and GradientBoostingRegressor classes and the same strategy to feature selection can be leveraged.
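As a hedged sketch of that scikit-learn route (assuming default hyperparameters), the snippet below fits a GradientBoostingClassifier on the test classification dataset and reads its feature_importances_ property in the same way as the other tree ensembles.
# gradient boosting feature importance with scikit-learn's implementation (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define and fit the model
model = GradientBoostingClassifier()
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()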
To continue with XGBoost, first install the library, for instance with pip:
sudo pip install xgboost
Then confirm that the library was installed correctly and works by checking the version number.
# check xgboost version
import xgboost
print(xgboost.__version__)
Executing the instance, you should observe the following version number or higher.
0.90
Let’s observe an instance of XGBoost for Feature Importance on regression and classification problems.
XGBoost Regression Feature Importance
The complete instance of fitting an XGBRegressor and summarizing the calculated feature importance scores is listed below:
# xgboost for feature importance on a regression problem
from sklearn.datasets import make_regression
from xgboost import XGBRegressor
from matplotlib import pyplot
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define the model
model = XGBRegressor()
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Running the instance fits the model, then reports the importance score for every feature.
Your outcomes may demonstrate variance given the stochastic nature of the algorithm or evaluation process, or differences in numerical accuracy. Consider executing the instance a few times and comparing the average outcome.
The results indicate perhaps two or three of the ten features as being critical to prediction.
Feature: 0, Score: 0.00060
Feature: 1, Score: 0.01917
Feature: 2, Score: 0.00091
Feature: 3, Score: 0.00118
Feature: 4, Score: 0.49380
Feature: 5, Score: 0.42342
Feature: 6, Score: 0.05057
Feature: 7, Score: 0.00419
Feature: 8, Score: 0.00124
Feature: 9, Score: 0.00491
A bar chart is then developed for the feature importance scores.
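One caveat worth noting: xgboost can define importance in more than one way (for example 'weight', 'gain', and 'cover'). If you care about a specific definition, the importance_type argument of the wrapper classes selects which one feature_importances_ reports, and xgboost's own plot_importance() helper can chart the scores directly. A hedged sketch, assuming default hyperparameters otherwise:
# xgboost importance under an explicit definition (illustrative sketch)
from sklearn.datasets import make_regression
from xgboost import XGBRegressor, plot_importance
from matplotlib import pyplot
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# report 'gain'-based importance through feature_importances_
model = XGBRegressor(importance_type='gain')
model.fit(X, y)
print(model.feature_importances_)
# xgboost's built-in importance plot
plot_importance(model)
pyplot.show()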
XGBoost Classification Feature Importance
The complete instance of fitting an XGBClassifier and summarization of the calculated feature importance scores is listed below.
# xgboost for feature importance on a classification problem
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define the model
model = XGBClassifier()
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i,v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Executing the instance fits the model and then reports the importance score for every feature.
Your outcomes may demonstrate variance given the stochastic nature of the algorithm or assessment process, or differences in numerical accuracy. Consider executing the instance a few times and comparing the average outcome.
The results indicate perhaps seven of the ten features as being critical to prediction.
Feature: 0, Score: 0.02464
Feature: 1, Score: 0.08153
Feature: 2, Score: 0.12516
Feature: 3, Score: 0.28400
Feature: 4, Score: 0.12694
Feature: 5, Score: 0.10752
Feature: 6, Score: 0.08624
Feature: 7, Score: 0.04820
Feature: 8, Score: 0.09357
Feature: 9, Score: 0.02220
A bar chart is then developed for the feature importance scores.
Permutation Feature Importance
Permutation feature importance is a technique for calculating relative importance scores that is independent of the model leveraged. It measures how much a model's score drops when the values of a single feature are randomly shuffled, and it is available in scikit-learn through the permutation_importance() function.
Permutation Feature Importance for Regression
The complete instance of fitting a KNeighborsRegressor and summarizing the calculated permutation feature importance scores is listed below.
# permutation feature importance with knn for regression
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.inspection import permutation_importance
from matplotlib import pyplot
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define the model
model = KNeighborsRegressor()
# fit the model
model.fit(X, y)
# perform permutation importance
results = permutation_importance(model, X, y, scoring='neg_mean_squared_error')
# get importance
importance = results.importances_mean
# summarize feature importance
for i,v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Running the instance fits the model, then reports the importance score for every feature.
Your outcomes may demonstrate variance given the stochastic nature of the algorithm or assessment procedure, or differences in numerical accuracy. Consider executing the instance a few times and comparing the average outcome.
The results indicate perhaps two or three of the ten features as being critical to prediction.
Feature: 0, Score: 175.52007
Feature: 1, Score: 345.80170
Feature: 2, Score: 126.60578
Feature: 3, Score: 95.90081
Feature: 4, Score: 9666.16446
Feature: 5, Score: 8036.79033
Feature: 6, Score: 929.58517
Feature: 7, Score: 139.67416
Feature: 8, Score: 132.06246
Feature: 9, Score: 84.94768
A bar chart is then produced for the feature importance scores.
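Because the scores come from random shuffles, the permutation_importance() function also accepts n_repeats and random_state arguments and reports the spread of the repeated scores through importances_std. A hedged sketch of making the regression example above repeatable:
# repeatable permutation importance with more shuffles per feature (illustrative sketch)
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.inspection import permutation_importance
# define dataset and fit the model as before
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
model = KNeighborsRegressor()
model.fit(X, y)
# shuffle each feature ten times with a fixed seed
results = permutation_importance(model, X, y, scoring='neg_mean_squared_error', n_repeats=10, random_state=1)
# mean and standard deviation of the score drop for every feature
print(results.importances_mean)
print(results.importances_std)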
Permutation Feature Importance For Classification
The complete instance of fitting a KNeighborsClassifier and summarizing the calculated permutation feature importance scores is listed below:
# permutation feature importance with knn for classification
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import permutation_importance
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define the model
model = KNeighborsClassifier()
# fit the model
model.fit(X, y)
# perform permutation importance
results = permutation_importance(model, X, y, scoring='accuracy')
# get importance
importance = results.importances_mean
# summarize feature importance
for i,v in enumerate(importance):
	print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Running the instance fits the model, then reports the importance score for every feature.
Your outcomes may demonstrate variance given the stochastic nature of the algorithm or evaluation process, or differences in numerical accuracy. Consider executing the instance a few times and comparing the average outcome.
The outcomes indicate perhaps two or three of the ten features as being critical to prediction.
Feature: 0, Score: 0.04760
Feature: 1, Score: 0.06680
Feature: 2, Score: 0.05240
Feature: 3, Score: 0.09300
Feature: 4, Score: 0.05140
Feature: 5, Score: 0.05520
Feature: 6, Score: 0.07920
Feature: 7, Score: 0.05560
Feature: 8, Score: 0.05620
Feature: 9, Score: 0.03080
A bar chart is then generated for the feature importance scores.
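To make the mechanics of permutation importance concrete, here is a minimal manual sketch of the underlying idea (our own illustration, not scikit-learn's implementation): shuffle one column at a time and record how far the model's accuracy falls.
# manual permutation importance for the classification model (illustrative sketch)
from numpy.random import default_rng
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# define dataset and fit the model as before
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
model = KNeighborsClassifier()
model.fit(X, y)
# baseline accuracy with the original data
baseline = accuracy_score(y, model.predict(X))
rng = default_rng(1)
for i in range(X.shape[1]):
	# shuffle a single column and measure the drop in accuracy
	X_permuted = X.copy()
	rng.shuffle(X_permuted[:, i])
	drop = baseline - accuracy_score(y, model.predict(X_permuted))
	print('Feature: %0d, Score: %.5f' % (i, drop))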
Feature Selection with Importance
Feature importance scores can be leveraged to help interpret the data, but they can also be leveraged directly to help rank and select the features that are most useful to a predictive model.
Remember, our synthetic dataset possesses 1,000 instances, each with 10 input variables, five of which are redundant and five of which are important to the outcome. We can leverage feature importance scores to help select the five variables that are relevant and use only them as inputs to a predictive model.
To start with, we can split the dataset into train and test sets, train a model on the training dataset, make forecasts on the test set, and assess the outcome leveraging classification accuracy.
We will leverage a logistic regression model as the predictive model.
This furnishes a baseline for comparison when we remove some features leveraging feature importance scores.
The complete instance of assessing a logistic regression model leveraging all features as input on our synthetic dataset is listed below.
# evaluation of a model using all features
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# define the dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))
Running the instance fits the logistic regression model on the training dataset and assesses it on the test set.
Your outcomes may demonstrate variance given the stochastic nature of the algorithm or assessment procedure, or differences in numerical accuracy. Consider executing the instance a few times and comparing the average outcome.
In this scenario, we can observe that the model accomplished a classification accuracy of approximately 84.55 percent leveraging all features within the dataset.
Accuracy: 84.55
Given how we developed the dataset, we would expect similar or better outcomes with half the number of input variables.
We can leverage the SelectFromModel class to define both the model we desire to use to calculate importance scores, a RandomForestClassifier in this scenario, and the number of features to select, five in this scenario.
…
# configure to select a subset of features
fs = SelectFromModel(RandomForestClassifier(n_estimators=200), max_features=5)
We can fit the feature selection method on the training dataset.
This will calculate the importance scores that can be leveraged to rank all input features. We can then apply the method as a transform to select a subset of the five most important features from the dataset. This transform is applied to the training dataset and the test set.
…
# learn relationship from training data
fs.fit(X_train, y_train)
# transform train input data
X_train_fs = fs.transform(X_train)
# transform test input data
X_test_fs = fs.transform(X_test)
Tying all of this together, the complete instance of leveraging random forest feature importance for feature selection is listed below:
# evaluation of a model using 5 features chosen with random forest importance
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# feature selection
def select_features(X_train, y_train, X_test):
	# configure to select a subset of features
	fs = SelectFromModel(RandomForestClassifier(n_estimators=1000), max_features=5)
	# learn relationship from training data
	fs.fit(X_train, y_train)
	# transform train input data
	X_train_fs = fs.transform(X_train)
	# transform test input data
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs
# define the dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# fit the model
model = LogisticRegression(solver='liblinear')
model.fit(X_train_fs, y_train)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))
Running the instance first performs feature selection on the dataset, then fits and assesses the logistic regression model as prior.
Your outcomes may demonstrate variance given the stochastic nature of the algorithm or assessment procedure, or differences in numerical accuracy. Consider executing the instance a few times and comparing the average outcome.
In this scenario, we can observe that the model accomplishes the same performance on the dataset, even though with half the number of input features. As one would expect, the feature importance scores calculated by the random forest enabled us to accurately rank the input features and delete those that were not relevant to the target variable.
Accuracy: 84.55
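If you also want to see which columns the transform retained, the fitted SelectFromModel object exposes the standard scikit-learn get_support() method; a quick hedged sketch, continuing from the example above:
…
# indexes of the input features retained by the transform
print(fs.get_support(indices=True))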
Conclusion
In this article by AICoreSpot, you learned about feature importance scores for machine learning in Python.
Particularly, you learned:
- The part of feature importance in a predictive modelling problem
- How to calculate and review feature importance from linear models and decision trees
- How to calculate and review permutation feature importance scores.