Classification accuracy is inadequate: More performance measures you can leverage
When you build a model for a classification problem you almost always want to look at the accuracy of that model as the number of correct predictions out of all predictions made.
This is referred to as the classification accuracy.
Cross-validation and repeated cross-validation can be used to assess the robustness of a model for making predictions on unseen data, with classification accuracy and average classification accuracy as the measures.
Once you have a model that you believe can make robust predictions, you need to decide whether it is a good enough model to solve your problem. Classification accuracy alone is typically not enough information to make this decision.
In this blog post, we will look at precision and recall performance measures you can leverage to assess your model for a binary classification problem.
Recurrence of breast cancer
The breast cancer dataset is a standard machine learning dataset. It contains nine attributes describing 286 women who have undergone and survived breast cancer, and whether or not the cancer recurred within five years.
It is a binary classification problem. Of the 286 women, 201 did not suffer a recurrence of breast cancer, leaving the remaining 85 that did.
False negatives are probably worse than false positives for this problem. More detailed screening can clear the false positives, but false negatives are sent home and lost to follow-up evaluation.
Classification accuracy is our starting point. It is the number of correct predictions divided by the total number of predictions made, multiplied by 100 to turn it into a percentage.
All No Recurrence
A model that only predicted no recurrence of breast cancer would achieve an accuracy of (201/286)*100 or 70.28%. We’ll call this our “All No Recurrence” model. This is a high accuracy, but a terrible model. If it were used alone for decision support to inform doctors (impossible, but play along), it would send home 85 women incorrectly thinking their breast cancer was not going to recur (high false negatives).
A model that only predicted recurrence of breast cancer would achieve an accuracy of (85/286)*100 or 29.72%. We’ll call this our “All Recurrence” model. This model has terrible accuracy and would send home 201 women thinking they had a recurrence of breast cancer when they actually did not (high false positives).
CART, or Classification and Regression Trees, is a powerful yet simple decision tree algorithm. On this problem, CART can achieve an accuracy of 69.23%. This is lower than our “All No Recurrence” model, but is this model more valuable?
We can see that classification accuracy alone is not sufficient to select a model for this problem.
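The three accuracy figures above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `accuracy` is hypothetical, and the CART count of correct predictions (10 recurrence plus 188 no recurrence) is taken from the CART confusion matrix counts reported in this post.

```python
def accuracy(correct, total):
    """Classification accuracy as a percentage (hypothetical helper)."""
    return 100.0 * correct / total


TOTAL = 286  # women in the dataset: 201 no recurrence + 85 recurrence

# Number of correct predictions for each model, using counts from the post.
all_no_recurrence = accuracy(201, TOTAL)  # right only on the no recurrence cases
all_recurrence = accuracy(85, TOTAL)      # right only on the recurrence cases
cart = accuracy(10 + 188, TOTAL)          # true positives + true negatives

for name, value in [
    ("All No Recurrence", all_no_recurrence),
    ("All Recurrence", all_recurrence),
    ("CART", cart),
]:
    print(f"{name}: {value:.2f}%")
```

The single number hides which kind of mistake each model makes, which is why the rest of the post breaks the errors down further.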
A clean and unambiguous way to present the prediction results of a classifier is to use a confusion matrix (also called a contingency table).
For a binary classification problem the table has 2 rows and 2 columns. Across the top are the observed class labels and down the side are the predicted class labels. Each cell contains the number of predictions made by the classifier that fall into that cell.
Predicted \ Actual    Positive     Negative
Positive              True +ve     False +ve
Negative              False -ve    True -ve
In this scenario, a perfect classifier would correctly predict 85 recurrence and 201 no recurrence events, which would be entered into the top left cell recurrence/recurrence (true positives) and the bottom right cell no recurrence/no recurrence (true negatives).
Incorrect predictions are clearly broken down into the two other cells. False negatives are recurrence events that the classifier has marked as no recurrence; a perfect classifier produces none of these. False positives are no recurrence events that the classifier has marked as recurrence.
This is a useful table that presents both the class distribution in the data and the classifier's predicted class distribution, with a breakdown of error types.
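The tallying that fills the four cells can be sketched in plain Python. The helper name `confusion_counts` and the label strings are hypothetical choices for illustration. The example data assumes the dataset's class split (85 recurrence, 201 no recurrence) and a classifier that predicts no recurrence for everyone.

```python
def confusion_counts(actual, predicted, positive):
    """Return (tp, fp, fn, tn) for a binary problem (hypothetical helper)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


# Observed labels: 85 recurrence events and 201 no recurrence events.
actual = ["recurrence"] * 85 + ["no recurrence"] * 201
# A model that predicts "no recurrence" for every woman.
predicted = ["no recurrence"] * len(actual)

tp, fp, fn, tn = confusion_counts(actual, predicted, positive="recurrence")
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

Treating recurrence as the positive class matches the layout used in this post, where the first row and first column of each matrix are the recurrence class.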
All No Recurrence Confusion Matrix
The confusion matrix illustrates the large number (85) of False Negatives.
Predicted \ Actual    Recurrence    No Recurrence
Recurrence            0             0
No Recurrence         85            201
All Recurrence Confusion Matrix
The confusion matrix illustrates the large number (201) of false positives.
Predicted \ Actual    Recurrence    No Recurrence
Recurrence            85            201
No Recurrence         0             0
CART Confusion Matrix
This looks like a more valuable classifier because it correctly predicted 10 recurrence events as well as 188 no recurrence events. The model also produces 75 false negatives and 13 false positives.
Predicted \ Actual    Recurrence    No Recurrence
Recurrence            10            13
No Recurrence         75            188
As we can see in this example, accuracy can be misleading. Sometimes it may be desirable to select a model with a lower accuracy because it has greater predictive power on the problem.
For example, in a problem with a large class imbalance, a model can predict the majority class for every example and achieve a high classification accuracy. The problem is that this model is not useful in the problem domain, as we saw in our breast cancer example.
This is called the Accuracy Paradox. For problems like these, additional measures are needed to evaluate a classifier.
Precision is the number of true positives divided by the sum of true positives and false positives. Put another way, it is the number of correct positive predictions divided by the total number of positive predictions made. It is also called the Positive Predictive Value (PPV).
Precision can be thought of as a measure of a classifier's exactness. A low precision indicates a large number of false positives.
- The precision of the All No Recurrence model is 0/(0+0), which is undefined (not a number) and is treated here as 0.
- The precision of the All Recurrence model is 85/(85+201) or 0.30.
- The precision of the CART model is 10/(10+13) or 0.43.
The precision suggests that CART is a better model and that the All Recurrence model is more useful than the All No Recurrence model, even though it has a lower accuracy. The difference in precision between the All Recurrence model and CART can be explained by the large number of false positives predicted by the All Recurrence model.
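As a rough sketch, the precision calculation for the three models looks like this in Python. The function name `precision` is hypothetical, and returning 0 when there are no positive predictions is an assumption that mirrors the "not a number, or 0" convention used in the list above.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP) (hypothetical helper)."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # Assumption: with no positive predictions, 0/0 is reported as 0.
        return 0.0
    return tp / predicted_positive


# (true positives, false positives) taken from the confusion matrix counts.
models = {
    "All No Recurrence": (0, 0),
    "All Recurrence": (85, 201),
    "CART": (10, 13),
}

for name, (tp, fp) in models.items():
    print(f"{name}: precision = {precision(tp, fp):.2f}")
```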
Recall is the number of true positives divided by the sum of true positives and false negatives. Put another way, it is the number of correct positive predictions divided by the number of positive class values in the test data. It is also called Sensitivity or the True Positive Rate.
Recall can be thought of as a measure of a classifier’s completeness. A low recall indicates many false negatives.
- The recall of the All No Recurrence model is 0/(0+85) or 0.
- The recall of the All Recurrence model is 85/(85+0) or 1.
- The recall of the CART model is 10/(10+75) or 0.12.
As you would expect, the All Recurrence model has a perfect recall because it predicts “recurrence” for all examples. The recall for CART is lower than that of the All Recurrence model. This can be explained by the large number (75) of false negatives predicted by the CART model.
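The same recall figures can be checked with a short Python sketch. The function name `recall` is hypothetical, and the guard for a dataset with no positive examples is an added assumption that none of the three models here actually triggers.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN) (hypothetical helper)."""
    actual_positive = tp + fn
    if actual_positive == 0:
        # Assumption: report 0 if the data contains no positive examples.
        return 0.0
    return tp / actual_positive


# (true positives, false negatives) taken from the confusion matrix counts.
models = {
    "All No Recurrence": (0, 85),
    "All Recurrence": (85, 0),
    "CART": (10, 75),
}

for name, (tp, fn) in models.items():
    print(f"{name}: recall = {recall(tp, fn):.2f}")
```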
The F1 score is 2*((precision*recall)/(precision+recall)). It is also called the F score or the F measure. Put another way, the F1 score conveys the balance between precision and recall.
- The F1 for the All No Recurrence model is 2*((0*0)/(0+0)), which is undefined and is treated here as 0.
- The F1 for the All Recurrence model is 2*((0.3*1)/(0.3+1)) or 0.46.
- The F1 for the CART model is 2*((0.43*0.12)/(0.43+0.12)) or 0.19.
If we were looking to select a model based on a balance between precision and recall, the F1 measure suggests that the All Recurrence model is the one to beat and that the CART model is not yet sufficiently competitive.
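To make the F1 arithmetic concrete, here is a small Python sketch that works directly from the confusion matrix counts instead of the rounded precision and recall values. The function name `f1_score_from_counts` is hypothetical, and reporting 0 whenever a denominator is zero is an assumption that follows the convention used for the All No Recurrence model.

```python
def f1_score_from_counts(tp, fp, fn):
    """F1 = 2 * (precision * recall) / (precision + recall) (hypothetical helper)."""
    # Assumption: any 0/0 case is reported as 0 instead of "not a number".
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


# (true positives, false positives, false negatives) for each model.
models = {
    "All No Recurrence": (0, 0, 85),
    "All Recurrence": (85, 201, 0),
    "CART": (10, 13, 75),
}

for name, (tp, fp, fn) in models.items():
    print(f"{name}: F1 = {f1_score_from_counts(tp, fp, fn):.2f}")
```

Working from the raw counts avoids compounding rounding error, and for these three models it gives the same two-decimal values as the hand calculation.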
In this blog post, you learned about the Accuracy Paradox and about problems with a class imbalance, where classification accuracy alone cannot be trusted to select a well-performing model.
Through example, you learned about the confusion matrix as a way of describing the breakdown of errors in predictions for an unseen dataset. You learned about measures that summarize the precision (exactness) and recall (completeness) of a model, and about the F1 score as a description of the balance between the two.