How to calculate Feature Importance leveraging Python

Feature importance refers to a group of techniques that assign a score to input features based on how useful they are at predicting a target variable.

There are many types and sources of feature importance scores; popular examples include statistical correlation scores, coefficients calculated as part of linear models, decision trees, and permutation importance scores.

Feature importance scores play an important part in a predictive modelling project: they provide insight into the data, insight into the model, and a basis for dimensionality reduction and feature selection that can improve the efficiency and effectiveness of a predictive model on the problem.

In this blog post by AICoreSpot, which serves as a tutorial, you will find out about feature importance scores for machine learning in Python.

After finishing this tutorial, you will be aware of:

• The role of feature importance in a predictive modelling problem
• How to calculate and review feature importance from linear models and decision trees
• How to calculate and review permutation feature importance scores

Overview

This tutorial is divided into six parts; they are as follows:

• Feature Importance
• Preparation
    • Check Scikit-learn Version
    • Test Datasets
• Coefficients as Feature Importance
    • Linear Regression Feature Importance
    • Logistic Regression Feature Importance
• Decision Tree Feature Importance
    • CART Feature Importance
    • Random Forest Feature Importance
    • XGBoost Feature Importance
• Permutation Feature Importance
    • Permutation Feature Importance for Regression
    • Permutation Feature Importance for Classification
• Feature Selection with Importance

Feature Importance

Feature importance refers to a group of techniques for assigning scores to the input features of a predictive model that indicate the relative importance of each feature when making a prediction.

Feature importance scores can be calculated for problems that involve predicting a numerical value, referred to as regression, and for problems that involve predicting a class label, referred to as classification.

The scores are useful and can be leveraged in a range of situations in a predictive modelling problem, such as:

• Improved comprehension of the data
• Improved understanding of a model
• Minimizing the number of input features

Feature importance scores can provide insight into the dataset. The relative scores can highlight which features may be most relevant to the target and, conversely, which features are least relevant. This can be interpreted by a domain specialist and could be used as the basis for collecting more or different data.

Feature importance scores can provide insight into the model. Most importance scores are calculated by a predictive model that has been fit on the dataset. Inspecting the importance scores provides insight into that particular model and which features are the most and least important to the model when making a prediction. This is a type of model interpretation that can be performed for models that support it.

Feature importance can be leveraged to improve a predictive model. This can be achieved by using the importance scores to choose which features to delete (lowest scores) or which features to keep (highest scores). This is a type of feature selection and can simplify the problem that is being modelled, speed up the modelling process (removing features is referred to as dimensionality reduction), and in some cases, improve the performance of the model.

Often, we want to quantify the strength of the relationship between the predictors and the outcome. Ranking predictors in this way can be very useful when sifting through large amounts of data.

Feature importance scores can be fed to a wrapper model, like the SelectFromModel class, to perform feature selection.

There are several ways to calculate feature importance scores and several models that can be leveraged for this purpose.

Probably the easiest way is to calculate simple coefficient statistics, such as correlation coefficients, between each feature and the target variable.
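As a rough illustration of the simple statistics mentioned above, the sketch below computes a correlation coefficient between each input feature and the target. The choice of Pearson correlation and the use of the make_regression() test data are assumptions made here for illustration; other statistics may suit other data types better.

```python
# sketch: Pearson correlation between each feature and the target
import numpy as np
from sklearn.datasets import make_regression

# synthetic regression data (assumed here; any numeric X and y would do)
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# np.corrcoef returns a 2x2 matrix; element [0, 1] is the feature/target correlation
correlations = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])

# summarize the correlation for each feature
for i, r in enumerate(correlations):
    print('Feature: %d, Correlation: %.5f' % (i, r))
```

Correlation only captures linear, one-feature-at-a-time relationships, which is one reason the model-based scores covered in the rest of this tutorial can be more informative.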

In this guide, we will look at three main types of more advanced feature importance; they are as follows:

• Feature importance from model coefficients
• Feature importance from decision trees
• Feature importance from permutation testing

Preparation

Prior to diving in, let’s validate our environment and prep some test datasets.

Check Scikit-Learn version

To start with, confirm that you have a recent version of the scikit-learn library installed.

This is important because a few of the models we will look into in this guide require a recent version of the library.

You can check the version of the library you have installed with the following code example:

# check scikit-learn version

import sklearn

print(sklearn.__version__)

Running the example will print the version of the library. At the time of writing, this is about version 0.22.

You need to be on this version of scikit-learn or higher.

0.22.1

Test Datasets

Next, let’s define a few test datasets that we can leverage as the basis for illustrating and exploring feature importance scores.

Each test problem has five important and five unimportant features, and it may be interesting to see which methods are consistent at finding or differentiating the features based on their importance.

Classification Dataset

We will leverage the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples with 10 input features, five of which will be informative, and the other five will be redundant. We will fix the random number seed to make sure we get the same examples every time the code is run.

An example of creating and summarizing the dataset is provided below:

# test classification dataset

from sklearn.datasets import make_classification

# define dataset

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# summarize the dataset

print(X.shape, y.shape)

Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)

Regression Dataset

We will leverage the make_regression() function to create a test regression dataset.

Like the classification dataset, the regression dataset will have 1,000 examples with 10 input features, five of which will be informative and the remaining five uninformative.

# test regression dataset

from sklearn.datasets import make_regression

# define dataset

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# summarize the dataset

print(X.shape, y.shape)

Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)
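Because the regression data is synthetic, it can be useful to know which columns are truly informative before comparing importance methods. The sketch below is one way to check this, using the coef=True argument of make_regression() to also return the ground-truth coefficients used to generate the target; the variable names are illustrative.

```python
# sketch: recover the ground-truth coefficients of the synthetic regression data
import numpy as np
from sklearn.datasets import make_regression

# coef=True additionally returns the coefficients of the underlying linear model
X, y, coef = make_regression(n_samples=1000, n_features=10, n_informative=5, coef=True, random_state=1)

# features with a non-zero true coefficient are the informative ones
informative = np.flatnonzero(coef)

print('True coefficients:', np.round(coef, 5))
print('Informative feature indices:', informative)
```

These indices give a reference point for judging how well each importance method in the following sections separates informative from uninformative features.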

Now, let’s take a deeper look at coefficients as importance scores.

Coefficients as Feature Importance

Linear machine learning algorithms fit a model where the prediction is the weighted sum of the input values.

Examples include linear regression, logistic regression, and extensions that add regularization, like ridge regression and the elastic net.

Each of these algorithms finds a set of coefficients to use in the weighted sum in order to make a prediction. These coefficients can be leveraged directly as a crude type of feature importance score.

Let’s take a closer look at using coefficients as feature importance for classification and regression. We will fit a model on the dataset to find the coefficients, then summarize the importance scores for each input feature, and finally create a bar chart to get an idea of the relative importance of the features.

Linear Regression Feature Importance

We can fit a linear regression model on the regression dataset and retrieve the coef_ property, which contains the coefficients found for each input variable.

These coefficients can provide the basis for a crude feature importance score. This assumes that the input variables have the same scale or have been scaled prior to fitting the model.

The complete example of linear regression coefficients for feature importance is listed below:

# linear regression feature importance

from sklearn.datasets import make_regression

from sklearn.linear_model import LinearRegression

from matplotlib import pyplot

# define dataset

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# define the model

model = LinearRegression()

# fit the model

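model.fit(X, y)

# get importance

importance = model.coef_

# summarize feature importance

for i,v in enumerate(importance):

    print('Feature: %0d, Score: %.5f' % (i,v))

# plot feature importance

pyplot.bar([x for x in range(len(importance))], importance)

pyplot.show()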

Running the example fits the model, then reports the coefficient value for each feature.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The scores indicate that the model found the five important features and marked all other features with a zero coefficient, essentially removing them from the model.

Feature: 0, Score: 0.00000

Feature: 1, Score: 12.44483

Feature: 2, Score: -0.00000

Feature: 3, Score: -0.00000

Feature: 4, Score: 93.32225

Feature: 5, Score: 86.50811

Feature: 6, Score: 26.74607

Feature: 7, Score: 3.28535

Feature: 8, Score: -0.00000

Feature: 9, Score: 0.00000

A bar chart is then created for the feature importance scores. This approach may also be used with Ridge and ElasticNet models.
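To illustrate both the scaling assumption and the regularized variants mentioned above, the sketch below standardizes the inputs and then reads the coefficients from Ridge and ElasticNet models. The pipeline layout and the alpha and l1_ratio values are assumptions chosen for illustration, not tuned settings.

```python
# sketch: coefficients from regularized linear models fit on standardized inputs
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# candidate models (hyperparameter values are illustrative only)
models = {'ridge': Ridge(alpha=1.0), 'elasticnet': ElasticNet(alpha=1.0, l1_ratio=0.5)}

coefficients = {}
for name, estimator in models.items():
    # scale inputs first so that coefficient magnitudes are comparable
    pipeline = make_pipeline(StandardScaler(), estimator)
    pipeline.fit(X, y)
    # the fitted linear model is the last step of the pipeline
    coefficients[name] = pipeline[-1].coef_
    for i, v in enumerate(coefficients[name]):
        print('%s Feature: %d, Score: %.5f' % (name, i, v))
```

Placing the scaler inside the pipeline keeps the scaling and the model together, so the coefficients always refer to the standardized features.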

Logistic Regression Feature Importance

We can fit a logistic regression model on the classification dataset and retrieve the coef_ property, which contains the coefficients found for each input variable.

The coefficients can provide the basis for a crude feature importance score. This assumes that the input variables have the same scale or have been scaled prior to fitting the model.

The complete example of logistic regression coefficients for feature importance is listed below:

# logistic regression for feature importance

from sklearn.datasets import make_classification

from sklearn.linear_model import LogisticRegression

from matplotlib import pyplot

# define dataset

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# define the model

model = LogisticRegression()

# fit the model

model.fit(X, y)

# get importance

importance = model.coef_[0]

# summarize feature importance

for i,v in enumerate(importance):

    print('Feature: %0d, Score: %.5f' % (i,v))

# plot feature importance

pyplot.bar([x for x in range(len(importance))], importance)

pyplot.show()

Running the example fits the model, then reports the coefficient value for each feature.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Recall that this is a classification problem with classes 0 and 1. Notice that the coefficients are both positive and negative. The positive scores indicate a feature that predicts class 1, whereas the negative scores indicate a feature that predicts class 0.
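Since the sign only encodes the direction of the effect, one possible way to rank features is by the absolute value of each coefficient. The sketch below shows this idea; ranking by magnitude is an assumption made here for illustration and still relies on the inputs sharing a common scale.

```python
# sketch: rank logistic regression coefficients by absolute value
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# fit the model
model = LogisticRegression()
model.fit(X, y)

# coef_ has shape (1, n_features) for a binary problem, so take the first row
coefficients = model.coef_[0]
magnitudes = np.abs(coefficients)

# feature indices ordered from largest to smallest magnitude
ranking = np.argsort(magnitudes)[::-1]
for i in ranking:
    print('Feature: %d, Coefficient: %.5f, Magnitude: %.5f' % (i, coefficients[i], magnitudes[i]))
```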

No clear pattern of important and unimportant features can be identified from these results, at least from what can be seen.

Feature: 0, Score: 0.16320

Feature: 1, Score: -0.64301

Feature: 2, Score: 0.48497

Feature: 3, Score: -0.46190

Feature: 4, Score: 0.18432

Feature: 5, Score: -0.11978

Feature: 6, Score: -0.40602

Feature: 7, Score: 0.03772

Feature: 8, Score: -0.51785

Feature: 9, Score: 0.26540

A bar chart is then created for the feature importance scores. Now that we have seen the use of coefficients as importance scores, let’s look at the more common example of decision-tree-based importance scores.

Decision Tree Feature Importance

Decision tree algorithms such as classification and regression trees (CART) provide importance scores based on the reduction in the criterion used to choose split points, like Gini or entropy.

The same approach can be used for ensembles of decision trees, like the random forest and stochastic gradient boosting algorithms.

Let’s look at a worked example of each.

CART Feature Importance

We can leverage the CART algorithm for feature importance, implemented in scikit-learn as the DecisionTreeRegressor and DecisionTreeClassifier classes.

Once fit, the model provides a feature_importances_ property that can be accessed to retrieve the relative importance scores for each input feature.
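As a quick check of what "relative" means here, the sketch below fits a decision tree and confirms that the scores in feature_importances_ are non-negative and sum to one, so each score can be read as a share of the total importance. The random_state argument is an addition made here so the sketch is repeatable.

```python
# sketch: decision tree importance scores are relative and sum to one
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# fit the model (random_state fixed here only for repeatability)
model = DecisionTreeClassifier(random_state=1)
model.fit(X, y)

# one score per input feature
importance = model.feature_importances_
print('Number of scores: %d' % len(importance))
print('Sum of scores: %.5f' % importance.sum())
```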

Let’s observe an instance of this for classification and regression.

CART Regression Feature Importance

The complete instance of fitting a DecisionTreeRegressor and summarizing the calculated feature importance scores is listed below.

# decision tree for feature importance on a regression problem

from sklearn.datasets import make_regression

from sklearn.tree import DecisionTreeRegressor

from matplotlib import pyplot

# define dataset

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# define the model

model = DecisionTreeRegressor()

# fit the model

model.fit(X, y)

# get importance

importance = model.feature_importances_

# summarize feature importance

for i,v in enumerate(importance):

    print('Feature: %0d, Score: %.5f' % (i,v))

# plot feature importance

pyplot.bar([x for x in range(len(importance))], importance)

pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The results suggest perhaps three of the ten features as being important to prediction.

Feature: 0, Score: 0.00294

Feature: 1, Score: 0.00502

Feature: 2, Score: 0.00318

Feature: 3, Score: 0.00151

Feature: 4, Score: 0.51648

Feature: 5, Score: 0.43814

Feature: 6, Score: 0.02723

Feature: 7, Score: 0.00200

Feature: 8, Score: 0.00244

Feature: 9, Score: 0.00106

A bar chart is then produced for the feature importance scores.

CART Classification Feature Importance

The complete instance of fitting a DecisionTreeClassifier and summarizing the calculated feature importance scores is listed below:

# decision tree for feature importance on a classification problem

from sklearn.datasets import make_classification

from sklearn.tree import DecisionTreeClassifier

from matplotlib import pyplot

# define dataset

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# define the model

model = DecisionTreeClassifier()

# fit the model

model.fit(X, y)

# get importance

importance = model.feature_importances_

# summarize feature importance

for i,v in enumerate(importance):

    print('Feature: %0d, Score: %.5f' % (i,v))

# plot feature importance

pyplot.bar([x for x in range(len(importance))], importance)

pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The results suggest perhaps four of the ten features as being important to prediction.

Feature: 0, Score: 0.01486

Feature: 1, Score: 0.01029

Feature: 2, Score: 0.18347

Feature: 3, Score: 0.30295

Feature: 4, Score: 0.08124

Feature: 5, Score: 0.00600

Feature: 6, Score: 0.19646

Feature: 7, Score: 0.02908

Feature: 8, Score: 0.12820

Feature: 9, Score: 0.04745

A bar chart is then created for the feature importance scores.

Random Forest Regression Feature Importance

The complete example of fitting a RandomForestRegressor and summarizing the calculated feature importance scores is listed below.

# random forest for feature importance on a regression problem

from sklearn.datasets import make_regression

from sklearn.ensemble import RandomForestRegressor

from matplotlib import pyplot

# define dataset

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# define the model

model = RandomForestRegressor()

# fit the model

model.fit(X, y)

# get importance

importance = model.feature_importances_

# summarize feature importance

for i,v in enumerate(importance):

    print('Feature: %0d, Score: %.5f' % (i,v))

# plot feature importance

pyplot.bar([x for x in range(len(importance))], importance)

pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
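One way to follow the advice about repeated runs is to fit the model with several different seeds and average the scores. The sketch below does this for the random forest; the seed values and the reduced n_estimators setting are arbitrary choices made here to keep the illustration quick.

```python
# sketch: average random forest importance scores over several seeds
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# arbitrary seeds for the repeated fits
seeds = [1, 2, 3, 4, 5]
runs = []
for seed in seeds:
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X, y)
    runs.append(model.feature_importances_)

# rows are runs, columns are features
runs = np.array(runs)
mean_importance = runs.mean(axis=0)
std_importance = runs.std(axis=0)

for i in range(X.shape[1]):
    print('Feature: %d, Mean: %.5f, Std: %.5f' % (i, mean_importance[i], std_importance[i]))
```

A small standard deviation relative to the mean suggests the ranking is stable across runs. The scores reported in this tutorial come from a single run.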

The results suggest perhaps two or three of the ten features as being important to prediction.

Feature: 0, Score: 0.00280

Feature: 1, Score: 0.00545

Feature: 2, Score: 0.00294

Feature: 3, Score: 0.00289

Feature: 4, Score: 0.52992

Feature: 5, Score: 0.42046

Feature: 6, Score: 0.02663

Feature: 7, Score: 0.00304

Feature: 8, Score: 0.00304

Feature: 9, Score: 0.00283

A bar chart is then generated for the feature importance scores.

Random Forest Classification Feature Importance

The complete example of fitting a RandomForestClassifier and summarizing the calculated feature importance scores is listed below.

# random forest for feature importance on a classification problem

from sklearn.datasets import make_classification

from sklearn.ensemble import RandomForestClassifier

from matplotlib import pyplot

# define dataset

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# define the model

model = RandomForestClassifier()

# fit the model

model.fit(X, y)

# get importance

importance = model.feature_importances_

# summarize feature importance

for i,v in enumerate(importance):

    print('Feature: %0d, Score: %.5f' % (i,v))

# plot feature importance

pyplot.bar([x for x in range(len(importance))], importance)

pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The results suggest perhaps two or three of the ten features as being important to prediction.

Feature: 0, Score: 0.06523

Feature: 1, Score: 0.10737

Feature: 2, Score: 0.15779

Feature: 3, Score: 0.20422

Feature: 4, Score: 0.08709

Feature: 5, Score: 0.09948

Feature: 6, Score: 0.10009

Feature: 7, Score: 0.04551

Feature: 8, Score: 0.08830

Feature: 9, Score: 0.04493

A bar chart is then created for the feature importance scores.

XGBoost Feature Importance

XGBoost is a library that furnishes an efficient and effective implementation of the stochastic gradient boosting algorithm.

This algorithm can be leveraged with scikit-learn through the XGBRegressor and the XGBClassifier classes.

Once fit, the model provides a feature_importances_ property that can be accessed to retrieve the relative importance scores for each input feature.

This algorithm is also provided by scikit-learn through the GradientBoostingClassifier and GradientBoostingRegressor classes, and the same approach to feature selection can be used.
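For readers who prefer to stay within scikit-learn, the sketch below shows the same idea with the GradientBoostingRegressor class mentioned above. Default hyperparameters are used, and the random_state argument is an addition made here for repeatability.

```python
# sketch: feature importance from scikit-learn's gradient boosting
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# define and fit the model
model = GradientBoostingRegressor(random_state=1)
model.fit(X, y)

# get importance
importance = model.feature_importances_

# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))
```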

First, install the XGBoost library, for example with pip:

sudo pip install xgboost

Then confirm that the library was installed correctly and works by checking the version number.

# check xgboost version

import xgboost

print(xgboost.__version__)

Running the example, you should see the following version number or higher.

0.90

Let’s observe an instance of XGBoost for Feature Importance on regression and classification problems.

XGBoost Regression Feature Importance

The complete example of fitting an XGBRegressor and summarizing the calculated feature importance scores is listed below:

# xgboost for feature importance on a regression problem

from sklearn.datasets import make_regression

from xgboost import XGBRegressor

from matplotlib import pyplot

# define dataset

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# define the model

model = XGBRegressor()

# fit the model

model.fit(X, y)

# get importance

importance = model.feature_importances_

# summarize feature importance

for i,v in enumerate(importance):

    print('Feature: %0d, Score: %.5f' % (i,v))

# plot feature importance

pyplot.bar([x for x in range(len(importance))], importance)

pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The results suggest perhaps two or three of the ten features as being important to prediction.

Feature: 0, Score: 0.00060

Feature: 1, Score: 0.01917

Feature: 2, Score: 0.00091

Feature: 3, Score: 0.00118

Feature: 4, Score: 0.49380

Feature: 5, Score: 0.42342

Feature: 6, Score: 0.05057

Feature: 7, Score: 0.00419

Feature: 8, Score: 0.00124

Feature: 9, Score: 0.00491

A bar chart is then created for the feature importance scores.

XGBoost Classification Feature Importance

The complete example of fitting an XGBClassifier and summarizing the calculated feature importance scores is listed below.

# xgboost for feature importance on a classification problem

from sklearn.datasets import make_classification

from xgboost import XGBClassifier

from matplotlib import pyplot

# define dataset

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# define the model

model = XGBClassifier()

# fit the model

model.fit(X, y)

# get importance

importance = model.feature_importances_

# summarize feature importance

for i,v in enumerate(importance):

    print('Feature: %0d, Score: %.5f' % (i,v))

# plot feature importance

pyplot.bar([x for x in range(len(importance))], importance)

pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The results suggest perhaps seven of the ten features as being important to prediction.

Feature: 0, Score: 0.02464

Feature: 1, Score: 0.08153

Feature: 2, Score: 0.12516

Feature: 3, Score: 0.28400

Feature: 4, Score: 0.12694

Feature: 5, Score: 0.10752

Feature: 6, Score: 0.08624

Feature: 7, Score: 0.04820

Feature: 8, Score: 0.09357

Feature: 9, Score: 0.02220

A bar chart is then created for the feature importance scores.

Permutation Feature Importance
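To give a feel for what permutation importance measures, the sketch below implements a simplified, single-shuffle version by hand: each column is shuffled in turn and the resulting increase in the model's error is recorded. This is an illustrative approximation under the assumptions stated in the comments, not the scikit-learn implementation, which repeats the shuffle several times and averages the result.

```python
# sketch: a hand-rolled, single-shuffle permutation importance
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# fit a model that has no built-in importance scores
model = KNeighborsRegressor()
model.fit(X, y)

# error of the fitted model on the unmodified data
baseline_error = mean_squared_error(y, model.predict(X))

# fixed seed (an assumption made here) so the shuffles are repeatable
rng = np.random.RandomState(1)
importance = []
for i in range(X.shape[1]):
    X_permuted = X.copy()
    # shuffle one column, breaking its relationship with the target
    X_permuted[:, i] = rng.permutation(X_permuted[:, i])
    permuted_error = mean_squared_error(y, model.predict(X_permuted))
    # importance is the increase in error caused by the shuffle
    importance.append(permuted_error - baseline_error)

for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))
```

Because the approach only needs predictions and a score, it can be applied to models such as k-nearest neighbors that do not expose coefficients or tree-based importances.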

The complete example of fitting a KNeighborsRegressor and summarizing the calculated permutation feature importance scores is listed below.

# permutation feature importance with knn for regression

from sklearn.datasets import make_regression

from sklearn.neighbors import KNeighborsRegressor

from sklearn.inspection import permutation_importance

from matplotlib import pyplot

# define dataset

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# define the model

model = KNeighborsRegressor()

# fit the model

model.fit(X, y)

# perform permutation importance

results = permutation_importance(model, X, y, scoring='neg_mean_squared_error')

# get importance

importance = results.importances_mean

# summarize feature importance

for i,v in enumerate(importance):

    print('Feature: %0d, Score: %.5f' % (i,v))

# plot feature importance

pyplot.bar([x for x in range(len(importance))], importance)

pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The results suggest perhaps two or three of the ten features as being important to prediction.

Feature: 0, Score: 175.52007

Feature: 1, Score: 345.80170

Feature: 2, Score: 126.60578

Feature: 3, Score: 95.90081

Feature: 4, Score: 9666.16446

Feature: 5, Score: 8036.79033

Feature: 6, Score: 929.58517

Feature: 7, Score: 139.67416

Feature: 8, Score: 132.06246

Feature: 9, Score: 84.94768

A bar chart is then produced for the feature importance scores.

Permutation Feature Importance For Classification

The complete example of fitting a KNeighborsClassifier and summarizing the calculated permutation feature importance scores is listed below:

# permutation feature importance with knn for classification

from sklearn.datasets import make_classification

from sklearn.neighbors import KNeighborsClassifier

from sklearn.inspection import permutation_importance

from matplotlib import pyplot

# define dataset

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# define the model

model = KNeighborsClassifier()

# fit the model

model.fit(X, y)

# perform permutation importance

results = permutation_importance(model, X, y, scoring='accuracy')

# get importance

importance = results.importances_mean

# summarize feature importance

for i,v in enumerate(importance):

    print('Feature: %0d, Score: %.5f' % (i,v))

# plot feature importance

pyplot.bar([x for x in range(len(importance))], importance)

pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
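Part of this run-to-run variation can be quantified directly, since permutation_importance() repeats each shuffle several times. The sketch below sets the n_repeats and random_state arguments explicitly (the values are arbitrary choices made here) and reports the standard deviation alongside the mean score.

```python
# sketch: report the spread of permutation importance across repeats
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsClassifier

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# fit the model
model = KNeighborsClassifier()
model.fit(X, y)

# repeat each shuffle 10 times with a fixed seed for repeatability
results = permutation_importance(model, X, y, scoring='accuracy', n_repeats=10, random_state=1)

# mean and standard deviation of the score drop for each feature
for i in range(X.shape[1]):
    print('Feature: %d, Mean: %.5f, Std: %.5f' % (i, results.importances_mean[i], results.importances_std[i]))
```

Features whose mean score is small compared with its standard deviation are hard to distinguish from noise.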

The results suggest perhaps two or three of the ten features as being important to prediction.

Feature: 0, Score: 0.04760

Feature: 1, Score: 0.06680

Feature: 2, Score: 0.05240

Feature: 3, Score: 0.09300

Feature: 4, Score: 0.05140

Feature: 5, Score: 0.05520

Feature: 6, Score: 0.07920

Feature: 7, Score: 0.05560

Feature: 8, Score: 0.05620

Feature: 9, Score: 0.03080

A bar chart is then generated for the feature importance scores.

Feature Selection with Importance

Feature importance scores can be used to help interpret the data, but they can also be used directly to help rank and select the features that are most useful to a predictive model.

Recall that our synthetic dataset has 1,000 examples, each with 10 input variables, five of which are redundant and five of which are important to the outcome. We can leverage feature importance scores to help select the five relevant variables and use only them as inputs to a predictive model.

First, we can split the dataset into train and test sets, train a model on the training set, make predictions on the test set, and evaluate the result using classification accuracy.

We will use a logistic regression model as the predictive model.

This provides a baseline for comparison when we remove some features using feature importance scores.

The complete example of evaluating a logistic regression model using all features as input on our synthetic dataset is listed below.

# evaluation of a model using all features

from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

# define the dataset

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# split into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# fit the model

model = LogisticRegression(solver='liblinear')

model.fit(X_train, y_train)

# evaluate the model

yhat = model.predict(X_test)

# evaluate predictions

accuracy = accuracy_score(y_test, yhat)

print('Accuracy: %.2f' % (accuracy*100))

Running the example first fits the logistic regression model on the training dataset and evaluates it on the test set.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieved a classification accuracy of approximately 84.55 percent using all features in the dataset.

Accuracy: 84.55

Given that we created the dataset, we would expect better or similar results with half the number of input variables.

We can use the SelectFromModel class to define both the model we wish to use to calculate importance scores, RandomForestClassifier in this case, and the number of features to select, five in this case.

# configure to select a subset of features

fs = SelectFromModel(RandomForestClassifier(n_estimators=200), max_features=5)

We can fit the feature selection strategy on the training dataset.

This will calculate the importance scores that can be used to rank all input features. We can then apply the method as a transform to select a subset of the five most important features from the dataset. This transform will be applied to the training dataset and the test set.

# learn relationship from training data

fs.fit(X_train, y_train)

# transform train input data

X_train_fs = fs.transform(X_train)

# transform test input data

X_test_fs = fs.transform(X_test)
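One detail worth checking is how many features are actually kept. With its default threshold, SelectFromModel treats max_features as an upper limit, so it may keep fewer than five features. The sketch below is one way to select strictly by rank by passing threshold=-np.inf, and it uses get_support() to list the chosen columns. The random_state values are additions made here for repeatability.

```python
# sketch: select exactly the top five features and inspect which were chosen
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# define the dataset and split into train and test sets
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# threshold=-np.inf disables the importance threshold, so max_features alone decides
fs = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=1), max_features=5, threshold=-np.inf)
fs.fit(X_train, y_train)

# indices of the columns that were kept
selected = fs.get_support(indices=True)
print('Selected feature indices:', selected)

# apply the selection to the training data
X_train_fs = fs.transform(X_train)
print(X_train_fs.shape)
```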

Tying this all together, the complete example of using random forest feature importance for feature selection is listed below:

# evaluation of a model using 5 features chosen with random forest importance

from sklearn.datasets import make_classification

from sklearn.model_selection import train_test_split

from sklearn.feature_selection import SelectFromModel

from sklearn.ensemble import RandomForestClassifier

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

# feature selection

def select_features(X_train, y_train, X_test):

    # configure to select a subset of features

    fs = SelectFromModel(RandomForestClassifier(n_estimators=1000), max_features=5)

    # learn relationship from training data

    fs.fit(X_train, y_train)

    # transform train input data

    X_train_fs = fs.transform(X_train)

    # transform test input data

    X_test_fs = fs.transform(X_test)

    return X_train_fs, X_test_fs, fs

# define the dataset

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# split into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# feature selection

X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)

# fit the model

model = LogisticRegression(solver='liblinear')

model.fit(X_train_fs, y_train)

# evaluate the model

yhat = model.predict(X_test_fs)

# evaluate predictions

accuracy = accuracy_score(y_test, yhat)

print('Accuracy: %.2f' % (accuracy*100))

Running the example first performs feature selection on the dataset, then fits and evaluates the logistic regression model as before.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieves the same performance on the dataset, although with half the number of input features. As expected, the feature importance scores calculated by random forest allowed us to accurately rank the input features and remove those that were not relevant to the target variable.

Accuracy: 84.55
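A single train/test split gives only one estimate of accuracy. As a possible extension, the sketch below wraps the feature selection and the logistic regression in a Pipeline and scores it with cross-validation, so that the selection is refit on each training fold. The step names, the fold count, and the n_estimators and random_state values are assumptions chosen for illustration.

```python
# sketch: cross-validated evaluation of feature selection plus logistic regression
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# define the dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# feature selection and model in one pipeline, refit together on each fold
pipeline = Pipeline([
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=1), max_features=5)),
    ('model', LogisticRegression(solver='liblinear')),
])

# 5-fold cross-validated accuracy
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=5)
print('Accuracy: %.2f (+/- %.2f)' % (scores.mean() * 100, scores.std() * 100))
```

Keeping the selection step inside the pipeline means the held-out fold never influences which features are chosen.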

Conclusion