How to Leverage ROC Curves and Precision-Recall Curves for Classification in Python
It can be more flexible to forecast the probability of an observation belonging to each class in a classification problem instead of forecasting classes directly.
This flexibility comes from the way probabilities can be interpreted using differing thresholds, allowing the operator of the model to trade off concerns in the errors committed by the model, like the number of false positives contrasted with the number of false negatives. This is needed when leveraging models where the cost of one variant of error outweighs the cost of other variants of errors.
Two diagnostic tools that assist in the interpretation of probabilistic forecasts for binary (two-class) classification predictive modelling problems are ROC curves and precision-recall curves.
In this guide, you will find out about ROC curves, precision-recall curves, and when to leverage each to interpret the forecasting of probabilities for binary classification problems.
After going through this tutorial, you will be aware of:
- ROC curves summarize the trade-off between the true positive rate and false positive rate for a predictive model leveraging differing probability thresholds.
- Precision-recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model leveraging differing probability thresholds.
- ROC curves are relevant when the observations are balanced between the classes, whereas precision-recall curves are relevant for imbalanced datasets.
Tutorial Summarization
This tutorial is sub-divided into six portions, which are:
1] Forecasting Probabilities
2] What are ROC Curves?
3] ROC Curves and AUC in Python
4] What are Precision-Recall Curves?
5] Precision-recall curves and AUC in Python
6] When to leverage ROC vs. Precision-recall curves?
Forecasting probabilities
In a classification problem, we might decide to forecast the class values directly.
Alternatively, it can be more flexible to forecast the probabilities for each class instead. The reason for this is to furnish the capability to select and even calibrate the threshold for how to interpret the forecasted probabilities.
For instance, a default might be to leverage a threshold of 0.5, implying that a probability below 0.5 is a negative outcome (0) and a probability of 0.5 or above is a positive outcome (1).
This threshold can be altered to tune the behaviour of the model for a particular problem. An instance would be to reduce one variant of error at the expense of the other.
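For illustration, here is a minimal sketch of applying a custom threshold to predicted probabilities; the names `model` and `X_test` are assumed placeholders for an already-fitted scikit-learn classifier and a feature array.

```python
# Minimal sketch: mapping predicted probabilities to class labels with a
# custom threshold. `model` and `X_test` are assumed placeholder names.
import numpy as np

def predict_with_threshold(model, X_test, threshold=0.5):
    # probability of the positive (1) class for each row
    probs = model.predict_proba(X_test)[:, 1]
    # probabilities at or above the threshold become class 1, otherwise class 0
    return (probs >= threshold).astype(int)

# e.g. lower the threshold to 0.3 to reduce false negatives at the cost of
# more false positives
# yhat = predict_with_threshold(model, X_test, threshold=0.3)
```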
When making a forecast for a binary or two-class classification problem, there are two variants of errors that we could make.
- False Positive: Forecast an event when there was no event.
- False Negative: Forecast no event when, as a matter of fact, there was an event.
By forecasting probabilities and calibrating a threshold, a balance between these two concerns can be selected by the operator of the model.
For instance, in a smog forecasting system, we might be a lot more concerned with having low false negatives than low false positives. A false negative would imply not warning about a smog day when, as a matter of fact, it is a high smog day, causing health issues among members of the public who are not able to take precautions.
A false positive implies the public would take precautionary measures when they didn’t need to.
A typical way to contrast models that forecast probabilities for two-class problems is to leverage a ROC curve.
What are ROC Curves?
A useful utility when forecasting the probability of a binary result is the Receiver Operating Characteristic Curve, or ROC curve.
It is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of differing candidate threshold values between 0.0 and 1.0. To put it in a different way, it plots the false alarm rate vs. the hit rate.
The true positive rate is calculated as the number of true positives divided by the total of the number of true positives and the number of false negatives. It details how good the model is at forecasting the positive class when the actual result is positive.
True Positive Rate = True Positives / (True Positives + False Negatives)
The true positive rate is also referenced to as sensitivity.
Sensitivity = True Positives / (True Positives + False Negatives)
The false positive rate is quantified as the number of false positives divided by the total of the number of false positives and the number of true negatives.
It is also referred to as the false alarm rate as it summarizes how frequently a positive class is forecasted when the actual outcome is negative.
False Positive Rate = False Positives / (False Positives + True Negatives)
The false positive rate is also referred to as the inverted specificity, where specificity is the total number of true negatives divided by the sum of the number of true negatives and false positives.
Specificity = True Negatives / (True Negatives + False Positives)
Where,
False Positive Rate = 1 – Specificity
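As a quick worked sketch of these formulas (the confusion-matrix counts below are made up purely for illustration):

```python
# Made-up confusion-matrix counts, purely for illustration
tp, fn = 90, 10   # actual positives split into true positives / false negatives
tn, fp = 80, 20   # actual negatives split into true negatives / false positives

tpr = tp / (tp + fn)          # true positive rate (sensitivity) = 0.9
fpr = fp / (fp + tn)          # false positive rate = 0.2
specificity = tn / (tn + fp)  # specificity = 0.8
print(tpr, fpr, 1 - specificity)  # note that fpr == 1 - specificity
```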
The ROC Curve is a good utility for a few reasons:
- The curves of differing models can be contrasted directly in general or for differing thresholds.
- The area under the curve (AUC) can be leveraged as a summarization of the model skill.
The shape of the curve contains a lot of information, including what we might care about most for a problem, the expected false positive rate, and the false negative rate.
To make this obvious:
- Smaller values on the x-axis of the plot signify lower false positives and higher true negatives.
- Bigger values on the y-axis of the plot signify higher true positives and lower false negatives.
If you are confused, remember, when we forecast a binary result, it is either a right prediction (true positive) or not (false positive). There is a tension amongst these options, the same with true negative and false negative.
A skilful model will assign a higher probability to a randomly selected real positive occurrence than to a negative occurrence, on average. This is what we imply when we state that the model has skill. Typically, skilful models are indicated by curves that bow up towards the top left of the plot.
A no-skill classifier is one that cannot discriminate between the classes and would forecast a random class or a constant class in all scenarios. A model with no skill is indicated at the point (0.5, 0.5). A model with no skill at every threshold is indicated by a diagonal line from the bottom left of the plot to the top right and has an AUC of 0.5.
A model with ideal skill is indicated at a point (0,1). A model with ideal skill is indicated by a line that travels from the bottom left of the plot to the top left and then across the top to the top right.
An operator might plot the ROC curve for the final model and select a threshold that provides a desirable balance between the false positives and false negatives.
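One common heuristic (though not the only one) for choosing such a threshold is to maximize Youden's J statistic, i.e. the largest gap between the true positive rate and false positive rate. Below is a minimal sketch, assuming arrays `y_true` of 0/1 labels and `probs` of predicted positive-class probabilities.

```python
# Minimal sketch: selecting a threshold from the ROC curve via Youden's J
# statistic (TPR - FPR). `y_true` and `probs` are assumed placeholder names.
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, probs)
best = np.argmax(tpr - fpr)  # index of the threshold with the largest TPR - FPR
print('Best threshold: %.3f' % thresholds[best])
```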
ROC Curves and AUC in Python
We can plot a ROC curve for a model in Python leveraging the roc_curve() scikit-learn function.
The function takes both the true outcomes (0,1) from the evaluation set and the forecasted probabilities for the 1 class. The function returns the false positive rates for every threshold, the true positive rates for every threshold, and the thresholds.
```python
...
# calculate roc curve
fpr, tpr, thresholds = roc_curve(y, probs)
```
The AUC for the ROC can be calculated leveraging the roc_auc_score() function.
Like the roc_curve() function, the AUC function takes both the true outcomes (0,1) from the evaluation set and the forecasted probabilities for the 1 class. It returns the AUC score between 0.0 and 1.0 for no skill and perfect skill respectively.
```python
...
# calculate AUC
auc = roc_auc_score(y, probs)
print('AUC: %.3f' % auc)
```
A complete instance of calculating the ROC curve and ROC AUC for a Logistic Regression model on a small test problem is detailed below.
```python
# roc curve and auc
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# generate a no skill prediction (majority class)
ns_probs = [0 for _ in range(len(testy))]
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
ns_auc = roc_auc_score(testy, ns_probs)
lr_auc = roc_auc_score(testy, lr_probs)
# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Logistic: ROC AUC=%.3f' % (lr_auc))
# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(testy, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(testy, lr_probs)
# plot the roc curve for the model
pyplot.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
pyplot.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
```
Running the instance prints the ROC AUC for the logistic regression model and the no skill classifier that only forecasts 0 for all instances.
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.903
A plot of the ROC curve for the model is also created, displaying that the model has skill.
Your outcomes might demonstrate variance given the stochastic nature of the algorithm or evaluation procedure, or variations in numerical precision. Consider executing the instance a few times and contrasting the average outcome.
What Are Precision-Recall Curves?
There are several ways to assess the skill of a forecasting model.
A strategy in the connected domain of information retrieval (identifying documents on the basis of queries) measures precision and recall.
These measures are also useful within applied machine learning for assessing binary classification models.
Precision is a ratio of the number of true positives divided by the sum of the true positives and false positives. It details how good a model is at forecasting the positive class. Precision is also referred to as the positive predictive value.
Positive Predictive Value = True Positives / (True Positives + False Positives)
Or
Precision = True Positives / (True Positives + False Positives)
Recall is calculated as the ratio of the number of true positives divided by the total of the true positives and the false negatives. Recall is the same as sensitivity.
Recall = True Positives / (True Positives + False Negatives)
Or
Sensitivity = True Positives / (True Positives + False Negatives)
Recall == Sensitivity
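As a minimal sketch, both measures can be computed directly with scikit-learn, assuming arrays `y_true` of true labels and `yhat` of predicted labels:

```python
# Minimal sketch: precision and recall from predicted class labels.
# `y_true` and `yhat` are assumed placeholder arrays of 0/1 values.
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_true, yhat)  # TP / (TP + FP)
recall = recall_score(y_true, yhat)        # TP / (TP + FN)
print('Precision: %.3f, Recall: %.3f' % (precision, recall))
```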
Reviewing both precision and recall is useful in scenarios where there is an imbalance in the observations between the two classes. Particularly, where there are several instances of no event (class 0) and only a few instances of an event (class 1).
The reason for this is that usually the large number of class 0 instances implies we are less interested in the skill of the model at forecasting class 0 correctly, for example, high true negatives.
Critical to the calculation of precision and recall is that the calculations do not leverage the true negatives. They only concern themselves with the right prediction of the minority class, class 1.
A precision-recall curve is a plot of the precision (y-axis) and the recall (x-axis) for differing thresholds, a lot like the ROC curve.
A no-skill classifier is one that cannot discriminate between the classes and would forecast an arbitrary class or a constant class in all scenarios. The no-skill line changes on the basis of the distribution of the positive to negative classes. It is a horizontal line with the value of the ratio of positive scenarios in the dataset. For a balanced dataset, this is 0.5.
While the baseline is fixed with ROC, the baseline of [precision-recall curve] is decided by the ratio of positives (P) and negatives (N) as y = P / (P + N). For example, we have y = 0.5 for a balanced class distribution.
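As a minimal sketch, the no-skill baseline can be computed from the labels alone (assuming an array `y_true` of 0/1 labels):

```python
# Minimal sketch: the no-skill precision baseline is the fraction of positives.
# `y_true` is an assumed placeholder array of 0/1 labels.
import numpy as np

no_skill = np.mean(y_true)  # equals P / (P + N)
print('No-skill baseline: %.3f' % no_skill)
```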
A model with ideal skill is depicted as a point at (1,1). A skilful model is indicated by a curve that bows towards (1,1) above the flat line of no skill.
There are also composite scores that attempt to summarize the precision and recall, two instances include:
- F-measure or F1 score: calculates the harmonic mean of the precision and recall (harmonic mean because the precision and recall are rates).
- Area under curve: like the ROC AUC, summarizes the integral or an approximation of the area under the precision-recall curve.
In terms of model choice, the F-measure summarizes model skill for a particular probability threshold (e.g. 0.5), while the area under curve summarizes the skill of a model across thresholds, like ROC AUC.
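For reference, a minimal worked sketch of the F-measure as the harmonic mean of precision and recall (the values below are made up for illustration):

```python
# Made-up precision and recall values, purely for illustration
precision, recall = 0.8, 0.4
f1 = 2 * (precision * recall) / (precision + recall)
print('F1: %.3f' % f1)  # 0.533, pulled towards the lower of the two rates
```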
This makes the precision-recall curve, a plot of precision vs. recall, and its summary quantities useful utilities for binary classification problems that possess an imbalance in the observations for every class.
Precision-recall curves in Python
Precision and recall can be calculated in scikit-learn.
The precision and recall can be calculated for thresholds leveraging the precision_recall_curve() function that takes the true output values and the probabilities for the positive class as input and returns the precision, recall, and threshold values.
```python
...
# calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(testy, probs)
```
The F-measure can be calculated by calling the f1_score() function that takes the true class values and the predicted/expected class values as arguments.
```python
...
# calculate F1 score
f1 = f1_score(testy, yhat)
```
The area under the precision-recall curve can be approximated by calling the auc() function and passing it the recall (x) and precision (y) values calculated for every threshold.
```python
...
# calculate precision-recall AUC
auc = auc(recall, precision)
```
When plotting precision and recall for every threshold as a curve, it is critical that recall is furnished as the x-axis and precision as the y-axis.
The full instance of calculating precision-recall curves for a Logistic Regression model is detailed below.
```python
# precision-recall curve and f1
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from matplotlib import pyplot
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# predict class values
yhat = model.predict(testX)
lr_precision, lr_recall, _ = precision_recall_curve(testy, lr_probs)
lr_f1, lr_auc = f1_score(testy, yhat), auc(lr_recall, lr_precision)
# summarize scores
print('Logistic: f1=%.3f auc=%.3f' % (lr_f1, lr_auc))
# plot the precision-recall curves
no_skill = len(testy[testy==1]) / len(testy)
pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
pyplot.plot(lr_recall, lr_precision, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
```
Running the instance first prints the F1 and area under curve (AUC) scores for the logistic regression model.
Your outcomes might demonstrate variance given the stochastic nature of the algorithm or assessment procedure, or variations in numerical precision. Consider executing the instance a few times and contrasting the average outcome.
Logistic: f1=0.841 auc=0.898
The precision-recall curve plot is then developed displaying the precision/recall for every threshold for a logistic regression model (orange) contrasted to a no skill model (blue).
When to Leverage ROC vs. Precision-Recall Curves
Typically, the leveraging of ROC curves and precision-recall curves is as follows:
- ROC curves ought to be leveraged when there are roughly equal numbers of observations for every class.
- Precision-recall curves ought to be leveraged when there is a moderate to large class imbalance.
The reasoning behind this recommendation is that ROC curves put forth an optimistic picture of the model on datasets with a class imbalance.
However, ROC curves can put forth an overly optimistic perspective of an algorithm’s performance if there is a massive skew in the class distribution. Precision-recall (PR) curves, often leveraged in information retrieval, have been cited as an alternative to ROC curves for activities with a massive skew in the class distribution.
Some go one step further and indicate that leveraging a ROC curve with an imbalanced dataset might be deceptive and lead to wrong interpretations of the model skill.
The visual interpretability of ROC plots in the context of imbalanced datasets can be deceptive with regard to conclusions about the reliability of classification performance, due to an intuitive but incorrect interpretation of specificity. [Precision-recall curve] plots, on the other hand, can furnish the viewer with a precise forecast of future classification performance owing to the fact that they evaluate the fraction of true positives amongst positive predictions.
The main reason for this optimistic picture is due to the leveraging of true negatives in the False Positive Rate in the ROC curve and the meticulous avoidance of this rate in the Precision-Recall curve.
If the proportion of positive to negative instances changes in an evaluation set, the ROC curves will not alter. Metrics like precision, accuracy, lift and F scores leverage values from both columns of the confusion matrix. As a class distribution changes, these measures will alter as well, even if the basic classifier performance does not. ROC graphs are based upon TP rate and FP rate, in which every dimension is a strict columnar ratio, so they do not depend on class distributions.
We can make this concrete with a short instance.
Below is the same ROC curve instance with an altered problem where there is a ratio of about 100:1 of class=0 to class=1 observations (particularly, Class0=985, Class1=15).
```python
# roc curve and auc on an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99,0.01], random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# generate a no skill prediction (majority class)
ns_probs = [0 for _ in range(len(testy))]
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
ns_auc = roc_auc_score(testy, ns_probs)
lr_auc = roc_auc_score(testy, lr_probs)
# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Logistic: ROC AUC=%.3f' % (lr_auc))
# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(testy, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(testy, lr_probs)
# plot the roc curve for the model
pyplot.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
pyplot.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
```
Running the instance indicates that the model has skill.
Your outcomes might demonstrate variance given the stochastic nature of the algorithm or assessment procedure, or variations in numerical precision. Consider executing the instance a few times and contrasting the average outcome.
No Skill: ROC AUC=0.500
Logistic: ROC AUC=0.716
Indeed, it possesses skill, but all of that skill is measured as making right true negative forecasts and there are a ton of negative predictions to make.
If you go through these predictions, you will observe that the model forecasts the majority class (class 0) in all scenarios on the test set. The score is really misleading.
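A quick way to verify this claim (a sketch reusing the `model` and `testX` variables from the example above) is to count the predicted labels:

```python
# Sketch: count the predicted class labels, reusing `model` and `testX`
# from the example above
from collections import Counter

print(Counter(model.predict(testX)))  # expected to be almost entirely class 0
```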
A plot of the ROC curves provides confirmation of the AUC interpretation of a skilful model for most probability thresholds.
We can also repeat the evaluation of the same model on the same dataset and calculate a precision-recall curve and stats instead.
The full instance is detailed below:
```python
# precision-recall curve and f1 for an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from matplotlib import pyplot
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99,0.01], random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# predict class values
yhat = model.predict(testX)
# calculate precision and recall for each threshold
lr_precision, lr_recall, _ = precision_recall_curve(testy, lr_probs)
# calculate scores
lr_f1, lr_auc = f1_score(testy, yhat), auc(lr_recall, lr_precision)
# summarize scores
print('Logistic: f1=%.3f auc=%.3f' % (lr_f1, lr_auc))
# plot the precision-recall curves
no_skill = len(testy[testy==1]) / len(testy)
pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
pyplot.plot(lr_recall, lr_precision, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
```
Running the instance first prints the F1 and AUC scores.
Your outcomes might demonstrate variance given the stochastic nature of the algorithm or assessment procedure, or variations in numerical precision. Consider running the instance a few times and contrasting the average outcome.
We can observe that the model is penalized for forecasting the majority class in all scenarios. The scores display that the model that looked good according to the ROC curve is as a matter of fact barely skilful when considered leveraging precision and recall that concentrate on the positive class.
Logistic: f1=0.000 auc=0.054
The plot of the precision-recall curve highlights that the model is just a little bit above the no skill line for a majority of thresholds.
This is feasible as the model forecasts probabilities and is uncertain about some scenarios. These get exposed through the differing thresholds assessed in the construction of the curve, flipping some class 0 predictions to class 1, providing some precision but very minimal recall.
Further Reading
This section furnishes additional resources on the subject if you are seeking to delve deeper.
Papers
- A critical investigation of recall and precision as measures of retrieval system performance, 1989.
- The relationship between Precision-Recall and ROC Curves, 2006
- The Precision-Recall Plot is more informative than the ROC plot when evaluating Binary Classifiers on Imbalanced Datasets, 2015
- ROC Graphs: Notes and practical considerations for data mining researchers, 2003.
API
- metrics.roc_curve API
- metrics.roc_auc_score API
- metrics.precision_recall_curve API
- metrics.auc API
- metrics.average_precision_score API
- Precision-Recall, scikit-learn
- Precision, recall, and F-measures, scikit-learn
Article
- Receiver operating characteristic on Wikipedia
- Sensitivity and specificity on Wikipedia
- Precision and recall on Wikipedia
- Information retrieval on Wikipedia
- F1 score on Wikipedia
- ROC and precision-recall with imbalanced datasets, blog.
Conclusion
In this guide, you found out about ROC curves, precision-recall curves, and when to leverage each to interpret the forecasting of probabilities for binary classification problems.
Particularly, you learned:
- ROC curves summarize the trade-off between the true positive rate and false positive rate for a predictive model leveraging differing probability thresholds.
- Precision-recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model leveraging differing probability thresholds.
- ROC curves are relevant when the observations are balanced between the classes, whereas precision-recall curves are relevant for imbalanced datasets.