Cost-Sensitive Logistic Regression for Imbalanced Classification
Logistic regression does not directly support imbalanced classification.
Instead, the training algorithm used to fit the logistic regression model must be modified to take the skewed class distribution into account. This can be achieved by specifying a class weighting configuration that influences how much the logistic regression coefficients are updated during training.
The weighting can penalize the model less for errors made on examples from the majority class and more for errors made on examples from the minority class. The result is a version of logistic regression that performs better on imbalanced classification tasks, generally referred to as cost-sensitive or weighted logistic regression.
In this tutorial, you will discover cost-sensitive logistic regression for imbalanced classification.
After completing this tutorial, you will know:
- How standard logistic regression does not support imbalanced classification.
- How logistic regression can be modified to weight model error by class weight when fitting the coefficients.
- How to configure class weights for logistic regression and how to grid search different class weight configurations.
Tutorial Overview
This tutorial is divided into four parts; they are:
- Imbalanced Classification Dataset
- Logistic Regression for Imbalanced Classification
- Weighted Logistic Regression with Scikit-learn
- Grid Search Weighted Logistic Regression
Imbalanced Classification Dataset
Before we dive into the modification of logistic regression for imbalanced classification, let's first define an imbalanced classification dataset.
We can use the make_classification() function to define a synthetic imbalanced binary classification dataset. We will generate 10,000 examples with an approximate 1:100 minority-to-majority class ratio.
…
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
Once generated, we can summarize the class distribution to confirm that the dataset was created as expected.
…
# summarize class distribution
counter = Counter(y)
print(counter)
Finally, we can create a scatter plot of the examples and color them by class label to help understand the challenge of classifying examples from this dataset.
…
# scatter plot of examples by class label
for label, _ in counter.items():
	row_ix = where(y == label)[0]
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Tying this together, the complete example of generating the synthetic dataset and plotting the examples is listed below.
# Generate and plot a synthetic imbalanced classification dataset
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# summarize class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
	row_ix = where(y == label)[0]
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Running the example first creates the dataset and summarizes the class distribution.
We can see that the dataset has an approximate 1:100 class distribution, with a little less than 10,000 examples in the majority class and 100 examples in the minority class.
Counter({0: 9900, 1: 100})
Next, a scatter plot of the dataset is created, showing the large mass of examples for the majority class (blue) and a small number of examples for the minority class (orange), with some modest class overlap.
Next, we can fit a standard logistic regression model on the dataset.
We will use repeated cross-validation to evaluate the model, with three repeats of 10-fold cross-validation. Model performance will be reported using the mean ROC area under the curve (ROC AUC) averaged over all repeats and folds.
…
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
Tying this together, the complete example of evaluating standard logistic regression on the imbalanced classification problem is listed below.
# fit a logistic regression model on an imbalanced classification dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
Running the example evaluates the standard logistic regression model on the imbalanced dataset and reports the mean ROC AUC.
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
We can see that the model has skill, achieving a ROC AUC above 0.5, in this case a mean score of 0.985.
Mean ROC AUC: 0.985
This provides a baseline for comparison for any modifications made to the standard logistic regression algorithm.
Logistic Regression for Imbalanced Classification
Logistic regression is an effective model for binary classification tasks, although by default it is not effective at imbalanced classification.
Logistic regression can be modified to be better suited to imbalanced classification.
The coefficients of the logistic regression model are fit using an optimization algorithm that minimizes the negative log likelihood (loss) of the model on the training dataset.
This involves the repeated use of the model to make predictions, followed by an adaptation of the coefficients in a direction that reduces the loss of the model.
The calculation of the loss for a given set of coefficients can be modified to take the class balance into account.
By default, the errors for each class may be considered to have the same weighting, say 1.0. These weightings can be adjusted based on the importance of each class.
- Minimize sum i to n -(w1 * y_i * log(yhat_i) + w0 * (1 - y_i) * log(1 - yhat_i))

Here, w0 is the weighting applied to errors on class 0 examples and w1 the weighting applied to errors on class 1 examples.
The weighting is applied to the loss so that smaller weight values result in a smaller error value, and in turn, smaller updates to the model coefficients. Larger weight values result in a larger error calculation, and in turn, larger updates to the model coefficients, as demonstrated in the sketch below.
- Small Weight: Less importance, smaller updates to the model coefficients.
- Large Weight: More importance, larger updates to the model coefficients.
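To make this concrete, below is a minimal sketch of the weighted negative log likelihood computed with numpy. The function name weighted_log_loss and the toy labels and predictions are illustrative only, not part of any library.

# weighted negative log likelihood (a minimal illustrative sketch)
from numpy import array, log, mean, clip

def weighted_log_loss(y_true, y_pred, w0=1.0, w1=1.0):
	# clip probabilities to avoid log(0)
	y_pred = clip(y_pred, 1e-15, 1 - 1e-15)
	# w1 scales errors on class 1 examples, w0 scales errors on class 0 examples
	loss = -(w1 * y_true * log(y_pred) + w0 * (1 - y_true) * log(1 - y_pred))
	return mean(loss)

# toy labels and predicted probabilities
y_true = array([0, 0, 0, 1])
y_pred = array([0.1, 0.2, 0.3, 0.4])
# unweighted loss vs. a loss that penalizes class 1 errors 100 times more
print(weighted_log_loss(y_true, y_pred))
print(weighted_log_loss(y_true, y_pred, w0=1.0, w1=100.0))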
As such, the modified version of logistic regression is referred to as Weighted Logistic Regression, Class-Weighted Logistic Regression, or Cost-Sensitive Logistic Regression.
The weightings are sometimes referred to as importance weightings.
Although straightforward to implement, the challenge of weighted logistic regression is choosing the weighting to use for each class.
Weighted Logistic Regression with Scikit-learn
The scikit-learn Python machine learning library provides an implementation of logistic regression that supports class weighting.
The LogisticRegression class provides the class_weight argument that can be specified as a model hyperparameter. The class_weight is a dictionary that maps each class label (e.g. 0 and 1) to the weighting to apply in the calculation of the negative log likelihood when fitting the model.
For example, a 1-to-1 weighting for each of classes 0 and 1 can be defined as follows:
…
# define model
weights = {0:1.0, 1:1.0}
model = LogisticRegression(solver='lbfgs', class_weight=weights)
The class weighting can be defined multiple ways, for example:
- Domain expertise: determined by talking to subject matter experts.
- Tuning: determined by a hyperparameter search such as a grid search.
- Heuristic: specified using a general best practice.
A best practice for using the class weighting is to use the inverse of the class distribution present in the training dataset.
For example, the class distribution of the training dataset is a 1:100 ratio for the minority class to the majority class. The inverse of this ratio could be used, with 1 for the majority class and 100 for the minority class; for example:
…
# define model
weights = {0:1.0, 1:100.0}
model = LogisticRegression(solver='lbfgs', class_weight=weights)
We could also define the same ratio using fractions and achieve the same result; for example:
…
# define model
weights = {0:0.01, 1:1.0}
model = LogisticRegression(solver='lbfgs', class_weight=weights)
We can evaluate the logistic regression algorithm with a class weighting using the same evaluation procedure defined in the previous section.
We would expect the class-weighted version of logistic regression to perform better than the standard version without any class weighting.
The complete example is listed below.
# weighted logistic regression model on an imbalanced classification dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# define model
weights = {0:0.01, 1:1.0}
model = LogisticRegression(solver='lbfgs', class_weight=weights)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
Running the example prepares the synthetic imbalanced classification dataset, then evaluates the class-weighted version of logistic regression using repeated cross-validation.
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
The mean ROC AUC score is reported, in this case showing a better score than the unweighted version of logistic regression: 0.989 as compared to 0.985.
Mean ROC AUC: 0.989
The scikit-learn library provides an implementation of the best practice heuristic for the class weighting.
It is implemented via the compute_class_weight() function and is calculated as:
- n_samples / (n_classes * n_samples_with_class)
We can test this calculation manually on our dataset. For example, the dataset has 10,000 examples: 9,900 in class 0 and 100 in class 1.
The weighting for class 0 is calculated as:
- weighting = n_samples / (n_classes * n_samples_with_class)
- weighting = 10000 / (2*9900)
- weighting = 10000 / 19800
- weighting = 0.505
The weighting for class 1 is calculated as:
- weighting = n_samples / (n_classes * n_samples_with_class)
- weighting = 10000 / (2*100)
- weighting = 10000 / 200
- weighting = 50
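As a quick check, this heuristic can also be computed directly with numpy's bincount() function; a small sketch, assuming the same synthetic dataset generated above:

# compute the heuristic class weighting manually (illustrative sketch)
from numpy import bincount
from sklearn.datasets import make_classification
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# n_samples / (n_classes * n_samples_with_class) for each class
counts = bincount(y)
print(len(y) / (len(counts) * counts))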
We can confirm these calculations by calling the compute_class_weight() function and setting the class_weight argument to 'balanced'. For example:
# calculate heuristic class weighting
from numpy import array
from sklearn.utils.class_weight import compute_class_weight
from sklearn.datasets import make_classification
# generate 2 class dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# calculate class weighting (recent scikit-learn versions require keyword arguments)
weighting = compute_class_weight(class_weight='balanced', classes=array([0, 1]), y=y)
print(weighting)
Running the example, we can see that we achieve a weighting of about 0.5 for class 0 and a weighting of 50 for class 1.
These values match our manual calculation.
[ 0.50505051 50. ]
These values also match our heuristic calculation above for the inverse of the class distribution ratio in the training dataset; for example:
- 0.5:50 == 1:100
We can use this heuristic directly with the LogisticRegression class by setting the class_weight argument to 'balanced'. For example:
…
# define model
model = LogisticRegression(solver='lbfgs', class_weight='balanced')
The complete example is listed below.
# weighted logistic regression for class imbalance with heuristic weights
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs', class_weight='balanced')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
Running the example produces the same mean ROC AUC as we achieved by specifying the inverse class ratio manually.
Mean ROC AUC: 0.989
Grid Search Weighted Logistic Regression
Using a class weighting that is the inverse ratio of the training data is just a heuristic.
It is possible that better performance can be achieved with a different class weighting, and this too will depend on the choice of performance metric used to evaluate the model.
In this section, we will grid search a range of different class weightings for weighted logistic regression and discover which results in the best ROC AUC score.
We will try the following weightings for classes 0 and 1:
- {0:100, 1:1}
- {0:10, 1:1}
- {0:1, 1:1}
- {0:1, 1:10}
- {0:1, 1:100}
These can be defined as grid search parameters for the GridSearchCV class as follows:
…
# define grid
balance = [{0:100,1:1}, {0:10,1:1}, {0:1,1:1}, {0:1,1:10}, {0:1,1:100}]
param_grid = dict(class_weight=balance)
We can perform the grid search on these parameters using repeated cross-validation and evaluate model performance using ROC AUC:
…
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv, scoring='roc_auc')
Once executed, we can summarize the best configuration as well as all of the results as follows:
…
# report the best configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# report all configurations
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
	print("%f (%f) with: %r" % (mean, stdev, param))
Tying this together, the example below grid searches five different class weightings for logistic regression on the imbalanced dataset.
We might expect the heuristic class weighting to be the best-performing configuration.
# grid search class weights with logistic regression for imbalance classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# define grid
balance = [{0:100,1:1}, {0:10,1:1}, {0:1,1:1}, {0:1,1:10}, {0:1,1:100}]
param_grid = dict(class_weight=balance)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv, scoring='roc_auc')
# execute the grid search
grid_result = grid.fit(X, y)
# report the best configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# report all configurations
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
	print("%f (%f) with: %r" % (mean, stdev, param))
Running the example evaluates each class weighting using repeated k-fold cross-validation and reports the best configuration and the associated mean ROC AUC score.
Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that the 1:100 majority-to-minority class weighting achieved the best mean ROC AUC score. This matches the configuration for the general heuristic.
It might be interesting to explore even more severe class weightings to see their effect on the mean ROC AUC score.
Best: 0.989077 using {'class_weight': {0: 1, 1: 100}}
0.982498 (0.016722) with: {'class_weight': {0: 100, 1: 1}}
0.983623 (0.015760) with: {'class_weight': {0: 10, 1: 1}}
0.985387 (0.013890) with: {'class_weight': {0: 1, 1: 1}}
0.988044 (0.010384) with: {'class_weight': {0: 1, 1: 10}}
0.989077 (0.006865) with: {'class_weight': {0: 1, 1: 100}}
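Once the search is finished, the best configuration can be used directly to make predictions; a small sketch, assuming the grid_result object from the example above (GridSearchCV refits the best estimator on the full dataset by default):

…
# retrieve the model refit with the best class weighting (refit=True by default)
best_model = grid_result.best_estimator_
# predict the class label for the first example in the dataset
yhat = best_model.predict(X[:1, :])
print('Predicted: %d' % yhat[0])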
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Papers
- Logistic Regression in Rare Events Data, 2001
- The Estimation of Choice Probabilities from Choice Based Samples, 1977.
Books
- Learning from Imbalanced Data Sets, 2018.
- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
APIs
- utils.class_weight.compute_class_weight API
- linear_model.LogisticRegression API
- model_selection.GridSearchCV API
Conclusion
In this tutorial, you discovered cost-sensitive logistic regression for imbalanced classification.
Specifically, you learned:
- How standard logistic regression does not support imbalanced classification.
- How logistic regression can be modified to weight model error by class weight when fitting the coefficients.
- How to configure class weights for logistic regression and how to grid search different class weight configurations.