
Logistic regression does not support imbalanced classification directly.

Instead, the training algorithm used to fit the logistic regression model must be modified to take the skewed distribution into account. This can be achieved by specifying a class weighting configuration that influences how much the logistic regression coefficients are updated during training.

The weighting can penalize the model less for errors made on examples from the majority class and more for errors made on examples from the minority class. The result is a version of logistic regression that performs better on imbalanced classification tasks, generally referred to as cost-sensitive or weighted logistic regression.

In this guide, you will discover cost-sensitive logistic regression for imbalanced classification.

After completing this guide, you will know:

  • How standard logistic regression does not support imbalanced classification.
  • How logistic regression can be modified to weight model error by class weight when fitting the coefficients.
  • How to configure class weights for logistic regression and how to grid search different class weight configurations.

Tutorial Overview

This guide is divided into four parts, which are:

  1. Imbalanced Classification Dataset
  2. Logistic Regression for Imbalanced Classification
  3. Weighted Logistic Regression with Scikit-learn
  4. Grid Search Weighted Logistic Regression

Imbalanced Classification Dataset

Before we dive into the modification of logistic regression for imbalanced classification, let’s first define an imbalanced classification dataset.

We can use the make_classification() function to define a synthetic imbalanced two-class classification dataset. We will generate 10,000 examples with an approximate 1:100 minority to majority class ratio.

# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)

 

Once generated, we can summarize the class distribution to confirm that the dataset was created as we expected.

# summarize class distribution
counter = Counter(y)
print(counter)

 

Finally, we can create a scatter plot of the examples and color them by class label to help understand the challenge of classifying examples from this dataset.

# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

 

Tying this together, the complete example of generating the synthetic dataset and plotting the examples is listed below.

# Generate and plot a synthetic imbalanced classification dataset
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# summarize class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

 

Running the example first creates the dataset and summarizes the class distribution.

We can see that the dataset has an approximate 1:100 class distribution, with a little less than 10,000 examples in the majority class and 100 in the minority class.

Counter({0: 9900, 1: 100})

Next, a scatter plot of the dataset is created showing the large mass of examples for the majority class (blue) and a small number of examples for the minority class (orange), with some modest class overlap.

Next, we can fit a standard logistic regression model on the dataset.

We will use repeated cross-validation to evaluate the model, with three repeats of 10-fold cross-validation. The model performance will be reported as the mean ROC area under curve (ROC AUC) averaged over all repeats and folds.

# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

Tying this together, the complete example of evaluating standard logistic regression on the imbalanced classification problem is listed below.

# fit a logistic regression model on an imbalanced classification dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

Running the example evaluates the standard logistic regression model on the imbalanced dataset and reports the mean ROC AUC.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average result.

We can see that the model has skill, achieving a ROC AUC above 0.5, in this case a mean score of 0.985.

Mean ROC AUC: 0.985

This provides a baseline for comparison for any modifications made to the standard logistic regression algorithm.
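The evaluation procedure above is stratified, which matters on a dataset this skewed: each held-out fold should contain roughly the same share of minority examples as the full dataset. The following is a minimal sketch, not part of the original example, that counts the minority examples in every test fold; the variable name minority_counts is an illustrative choice.

```python
# Sketch: check that stratified folds keep the class ratio of the full dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold

# same synthetic dataset as used in this guide
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# same evaluation procedure as used in this guide
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# count the minority-class examples in each held-out fold
minority_counts = [int((y[test_ix] == 1).sum()) for _, test_ix in cv.split(X, y)]
print('Folds evaluated: %d' % len(minority_counts))
print('Minority examples per test fold: min=%d, max=%d'
    % (min(minority_counts), max(minority_counts)))
```

If the minimum and maximum are close, every fold gives the ROC AUC calculation a comparable number of positive examples to work with.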

Logistic Regression for Imbalanced Classification

Logistic regression is an effective model for binary classification tasks, although by default it is not effective at imbalanced classification.

Logistic regression can be modified to be better suited to imbalanced classification.

The coefficients of the logistic regression algorithm are fit using an optimization algorithm that minimizes the negative log likelihood (loss) for the model on the training dataset.

This involves the repeated use of the model to make predictions, followed by an adaptation of the coefficients in a direction that reduces the loss of the model.

The calculation of the loss for a given set of coefficients can be modified to take the class balance into account.

By default, the errors for each class may be considered to have the same weighting, say 1.0. These weightings can be adjusted based on the importance of each class. The weighted loss can be written as:

  • minimize sum i to n -(w1 * log(yhat_i) * y_i + w0 * log(1 - yhat_i) * (1 - y_i))

Here, w1 is the weight for class 1 (the term that is active when y_i = 1) and w0 is the weight for class 0 (the term that is active when y_i = 0).

The weighting is applied to the loss so that smaller weight values result in a smaller error value and, in turn, smaller updates to the model coefficients. A larger weight value results in a larger error calculation and, in turn, larger updates to the model coefficients.

  • Small weight: Less importance, smaller updates to the model coefficients.
  • Large weight: More importance, larger updates to the model coefficients.
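To make the effect of the weights concrete, here is a minimal NumPy sketch of a class-weighted log loss. It is an illustration under stated assumptions rather than the code scikit-learn runs internally: the function name weighted_log_loss, the eps clipping constant and the toy labels and probabilities are all hypothetical. The names w0 and w1 follow the class_weight dictionary convention, so w1 is applied to examples whose true label is 1 and w0 to examples whose true label is 0.

```python
# Sketch: class-weighted negative log likelihood for binary labels
from numpy import array, clip, log

def weighted_log_loss(y_true, y_prob, w0=1.0, w1=1.0, eps=1e-15):
    # y_prob holds the predicted probability of class 1 for each example
    y_true = array(y_true, dtype=float)
    y_prob = clip(array(y_prob, dtype=float), eps, 1 - eps)
    # w1 scales the error on class 1 examples, w0 the error on class 0 examples
    per_example = -(w1 * y_true * log(y_prob) + w0 * (1 - y_true) * log(1 - y_prob))
    return per_example.sum()

# toy case: three majority examples predicted well, one minority example missed
y_true = [0, 0, 0, 1]
y_prob = [0.1, 0.1, 0.1, 0.1]
unweighted = weighted_log_loss(y_true, y_prob)
weighted = weighted_log_loss(y_true, y_prob, w0=1.0, w1=100.0)
print('Unweighted loss: %.3f' % unweighted)
print('Weighted loss (w1=100): %.3f' % weighted)
```

With w1 set to 100, the single missed minority example dominates the total loss, which is what pushes the optimizer to adjust the coefficients in favor of the minority class.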

As such, the modified version of logistic regression is referred to as Weighted Logistic Regression, Class-Weighted Logistic Regression or Cost-Sensitive Logistic Regression.

The weightings are sometimes referred to as importance weightings.

Although straightforward to implement, the challenge of weighted logistic regression is the choice of the weighting to use for each class.

Weighted Logistic Regression with Scikit-learn

The scikit-learn Python machine learning library provides an implementation of logistic regression that supports class weighting.

The LogisticRegression class provides the class_weight argument, which can be specified as a model hyperparameter. The class_weight is a dictionary that defines each class label (for example 0 and 1) and the weighting to apply in the calculation of the negative log likelihood when fitting the model.

For example, a 1 to 1 weighting for classes 0 and 1 can be defined as follows:

# define model
weights = {0:1.0, 1:1.0}
model = LogisticRegression(solver='lbfgs', class_weight=weights)

The class weighting can be defined in several ways, for example:

  • Domain expertise: determined by talking to subject matter experts.
  • Tuning: determined by a hyperparameter search such as a grid search.
  • Heuristic: specified using a general best practice.

A best practice for the class weighting is to use the inverse of the class distribution present in the training dataset.

For example, the class distribution of the training dataset is a 1:100 ratio for the minority class to the majority class. The inverse of this ratio could be used, with 1 for the majority class and 100 for the minority class; for example:

# define model
weights = {0:1.0, 1:100.0}
model = LogisticRegression(solver='lbfgs', class_weight=weights)

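We could also define the same ratio using fractions. This keeps the same relative weighting between the classes, although in scikit-learn, scaling all of the weights down also shrinks the data term of the loss relative to the default L2 penalty, so the fitted coefficients may differ slightly. For example: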

# define model
weights = {0:0.01, 1:1.0}
model = LogisticRegression(solver='lbfgs', class_weight=weights)

We can evaluate the logistic regression algorithm with a class weighting using the same evaluation procedure defined in the previous section.

We would expect the class-weighted version of logistic regression to perform better than the standard version without any class weighting.

The complete example is listed below:

# weighted logistic regression model on an imbalanced classification dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# define model
weights = {0:0.01, 1:1.0}
model = LogisticRegression(solver='lbfgs', class_weight=weights)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

Running the example prepares the synthetic imbalanced classification dataset, then evaluates the class-weighted version of logistic regression using repeated cross-validation.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average result.

The mean ROC AUC score is reported, in this case showing a better score than the unweighted version of logistic regression: 0.989 compared to 0.985.

Mean ROC AUC: 0.989

The scikit-learn library provides an implementation of the best practice heuristic for the class weighting.

It is implemented in the compute_class_weight() function and is calculated as:

  • n_samples / (n_classes * n_samples_with_class)

We can check this calculation manually on our dataset. For example, we have 10,000 examples in the dataset, 9,900 in class 0 and 100 in class 1.

The weighting for class 0 is calculated as:

  • weighting = n_samples / (n_classes * n_samples_with_class)
  • weighting = 10000 / (2 * 9900)
  • weighting = 10000 / 19800
  • weighting = 0.505

The weighting for class 1 is calculated as:

  • weighting = n_samples / (n_classes * n_samples_with_class)
  • weighting = 10000 / (2*100)
  • weighting = 10000 / 200
  • weighting = 50
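The same arithmetic can be reproduced in a few lines of NumPy. This is a small sketch of the formula n_samples / (n_classes * n_samples_with_class) applied to a hand-built label array with the class counts quoted above (9,900 and 100); the array is constructed directly here, as an assumption, rather than generated with make_classification().

```python
# Sketch: compute the inverse-frequency ("balanced") weights by hand
from numpy import array, bincount

# label array with the class counts used in the manual calculation
y = array([0] * 9900 + [1] * 100)
# number of examples in each class, indexed by class label
counts = bincount(y)
n_samples = len(y)
n_classes = len(counts)
# n_samples / (n_classes * n_samples_with_class), one value per class
weights = n_samples / (n_classes * counts)
for label, weight in enumerate(weights):
    print('class %d: %.4f' % (label, weight))
```

The ratio between the two weights is 9900 / 100, so the minority class is weighted 99 times more heavily than the majority class, which is close to the 1:100 heuristic.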

We can confirm these calculations by calling the compute_class_weight() function and specifying the class_weight as "balanced". For example:

# calculate heuristic class weighting
from numpy import unique
from sklearn.utils.class_weight import compute_class_weight
from sklearn.datasets import make_classification
# generate 2 class dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# calculate class weighting (recent scikit-learn releases require keyword arguments)
weighting = compute_class_weight(class_weight='balanced', classes=unique(y), y=y)
print(weighting)

Running the example, we can see that we get a weighting of about 0.5 for class 0 and a weighting of 50 for class 1.

These values match our manual calculation.

[ 0.50505051 50. ]

The values also match the ratio from our heuristic of inverting the class distribution in the training dataset, for example:

  • 0.5:50 == 1:100

We can use the default class balance directly with the LogisticRegression class by setting the class_weight argument to 'balanced'. For example:

# define model
model = LogisticRegression(solver='lbfgs', class_weight='balanced')

The complete example is listed below:

# weighted logistic regression for class imbalance with heuristic weights
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs', class_weight='balanced')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average result.

Running the example gives the same mean ROC AUC as we achieved by specifying the inverse class ratio manually.

Mean ROC AUC: 0.989
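One way to see what the 'balanced' setting does is to compare it against a model that is given the computed weights as an explicit dictionary. The sketch below is an illustrative check, not part of the original example; the variable names explicit, model_balanced and model_explicit are assumptions. It fits both models on the full dataset and compares their coefficients.

```python
# Sketch: 'balanced' versus passing the computed weights explicitly
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# same synthetic dataset as used in this guide
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# compute the heuristic weights and store them as a class_weight dictionary
classes = np.unique(y)
values = compute_class_weight(class_weight='balanced', classes=classes, y=y)
explicit = {int(c): float(w) for c, w in zip(classes, values)}
# fit one model with the keyword and one with the explicit dictionary
model_balanced = LogisticRegression(solver='lbfgs', class_weight='balanced').fit(X, y)
model_explicit = LogisticRegression(solver='lbfgs', class_weight=explicit).fit(X, y)
print(explicit)
same_coefficients = np.allclose(model_balanced.coef_, model_explicit.coef_, atol=1e-6)
print('Same coefficients: %s' % same_coefficients)
```

If the coefficients agree, the keyword is simply a convenient shorthand for the dictionary produced by compute_class_weight().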

Grid Search Weighted Logistic Regression

Using a class weighting that is the inverse ratio of the training data is just a heuristic.

It is possible that better performance can be achieved with a different class weighting, and this will also depend on the choice of performance metric used to evaluate the model.

In this section, we will grid search a range of different class weightings for weighted logistic regression and discover which results in the best ROC AUC score.

We will try the following weightings for classes 0 and 1:

  • {0:100, 1:1}
  • {0:10, 1:1}
  • {0:1, 1:1}
  • {0:1, 1:10}
  • {0:1, 1:100}

These can be defined as grid search parameters for the GridSearchCV class as follows:

# define grid
balance = [{0:100,1:1}, {0:10,1:1}, {0:1,1:1}, {0:1,1:10}, {0:1,1:100}]
param_grid = dict(class_weight=balance)

We can perform the grid search on these parameters using repeated cross-validation and estimate model performance using ROC AUC.

# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv, scoring='roc_auc')

Once the grid search has been executed, we can summarize the best configuration as well as all of the results as follows:

# report the best configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# report all configurations
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Tying this together, the example below grid searches five different class weights for logistic regression on the imbalanced dataset.

We might expect the heuristic class weighting to be the best performing configuration.

# grid search class weights with logistic regression for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# define grid
balance = [{0:100,1:1}, {0:10,1:1}, {0:1,1:1}, {0:1,1:10}, {0:1,1:100}]
param_grid = dict(class_weight=balance)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv, scoring='roc_auc')
# execute the grid search
grid_result = grid.fit(X, y)
# report the best configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# report all configurations
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Running the example evaluates each class weighting using repeated k-fold cross-validation and reports the best configuration and the associated mean ROC AUC score.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average result.

In this case, we can see that the 1:100 majority to minority class weighting achieved the best mean ROC AUC score. This matches the configuration for the general heuristic.

It might be interesting to explore even more severe class weightings to see their effect on the mean ROC AUC score.

Best: 0.989077 using {'class_weight': {0: 1, 1: 100}}
0.982498 (0.016722) with: {'class_weight': {0: 100, 1: 1}}
0.983623 (0.015760) with: {'class_weight': {0: 10, 1: 1}}
0.985387 (0.013890) with: {'class_weight': {0: 1, 1: 1}}
0.988044 (0.010384) with: {'class_weight': {0: 1, 1: 10}}
0.989077 (0.006865) with: {'class_weight': {0: 1, 1: 100}}
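As a starting point for exploring more severe weightings, the grid can be extended past 1:100. The sketch below is one possible setup, not a result from this guide: the list minority_weights and its values are hypothetical choices, and no scores are quoted because they were not part of the original experiment.

```python
# Sketch: grid search a wider, more severe range of minority class weights
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

# same synthetic dataset as used in this guide
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# hypothetical grid: majority weight fixed at 1, minority weight increased
minority_weights = [1, 10, 100, 500, 1000]
param_grid = dict(class_weight=[{0: 1, 1: w} for w in minority_weights])
# same evaluation procedure and metric as used in this guide
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv, scoring='roc_auc')
grid_result = grid.fit(X, y)
# report the best configuration and every configuration tried
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
results = zip(grid_result.cv_results_['mean_test_score'], grid_result.cv_results_['params'])
for mean_score, param in results:
    print("%f with: %r" % (mean_score, param))
```

Because the best weighting depends on the metric, the scoring argument could also be swapped for another scikit-learn scorer to see whether the preferred configuration changes.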

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

  • Logistic Regression in Rare Events Data, 2001
  • The Estimation of Choice Probabilities from Choice Based Samples, 1977.

Books

  • Learning from Imbalanced Data Sets, 2018.
  • Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

APIs

  • utils.class_weight.compute_class_weight API
  • linear_model.LogisticRegression API
  • model_selection.GridSearchCV API

Conclusion

In this guide, you discovered cost-sensitive logistic regression for imbalanced classification.

Specifically, you learned:

  • How standard logistic regression does not support imbalanced classification.
  • How logistic regression can be modified to weight model error by class weight when fitting the coefficients.
  • How to configure class weights for logistic regression and how to grid search different class weight configurations.
