### Evaluating and comparing classifier performance with ROC curves

The most commonly reported measure of classifier performance is accuracy: the percentage of correct classifications obtained.

This metric has the advantage of being easy to understand, and it makes comparing the performance of different classifiers trivial, but it glosses over many of the factors that should be taken into account when honestly assessing the performance of a classifier.

__What is meant by classifier performance?__

Classifier performance is more than just a count of correct classifications.

Consider, for example, the problem of screening for a relatively rare condition such as cancer, with a prevalence of around 10% (not the actual statistics). If a lazy Pap smear screener were to classify every slide they see as “normal”, they would achieve 90% accuracy. Very impressive! But that figure completely ignores the fact that the 10% of women who do have the disease have not been diagnosed at all.

__A Few Performance Metrics__

In a previous blog post, we described some of the other performance metrics that can be applied to the assessment of a classifier. To review:

Most classifiers produce a score, which is then thresholded to decide the classification. If a classifier produces a score between 0.0 (definitely negative) and 1.0 (definitely positive), it is common to consider anything over 0.5 as positive.

However, any threshold applied to a dataset (in which PP is the positive population and NP is the negative population) will produce true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) (Figure 1). We need a method that takes all of these numbers into account.

Once you have numbers for all of these measures, some useful metrics can be calculated:

- Accuracy = (1 − Error) = (TP + TN)/(PP + NP) = Pr(C), the probability of a correct classification.
- Sensitivity = TP/(TP + FN) = TP/PP = the ability of the test to detect the disease in a population of diseased individuals.
- Specificity = TN/(TN + FP) = TN/NP = the ability of the test to correctly rule out the disease in a disease-free population.

Let’s calculate these metrics for some plausible real-world numbers. If we have 100,000 patients, of whom 200 (0.2%) actually have cancer, we might see the test results shown in Table 1:

For this data:

- Sensitivity = TP/(TP + FN) = 160/(160 + 40) = 80.0%
- Specificity = TN/(TN + FP) = 69,860/(69,860 + 29,940) = 70.0%
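These figures can be reproduced with a few lines of Python (a minimal sketch; the function name is ours, and the counts are those from Table 1):

```python
# Hypothetical helper: compute the basic metrics from the four
# confusion-matrix counts (TP, FP, TN, FN) described above.
def classifier_metrics(tp, fp, tn, fn):
    pp = tp + fn           # positive population (PP)
    np_ = fp + tn          # negative population (NP)
    return {
        "accuracy":    (tp + tn) / (pp + np_),  # Pr(C)
        "sensitivity": tp / pp,                 # true positive rate
        "specificity": tn / np_,                # true negative rate
    }

# The Table 1 screening numbers: 160 TP, 29,940 FP, 69,860 TN, 40 FN.
m = classifier_metrics(160, 29_940, 69_860, 40)
print(m)  # sensitivity 0.80, specificity 0.70
```

Note that the accuracy here works out to about 70%, which looks far less flattering than the 80% sensitivity alone.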

In other words, our test will correctly identify 80% of people with the disease, but 30% of healthy people will wrongly test positive. By looking only at the accuracy of the test, potentially critical information is lost.

By considering our incorrect results as well as our correct ones, we gain much greater insight into the performance of the classifier.

One way to overcome the problem of having to select a cutoff is to start with a threshold of 0.0, so that every case is regarded as positive. We correctly classify all of the positive cases, and incorrectly classify all of the negative cases. We can then move the threshold over every value between 0.0 and 1.0, progressively reducing the number of false positives and increasing the number of true negatives.
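The threshold sweep can be sketched as follows (the function name and the toy scores are illustrative, not from the article):

```python
# Build the ROC points described above: for each threshold, classify every
# case with score >= threshold as positive, and record (FPR, TPR).
def roc_points(scores, labels):
    pp = sum(labels)               # positive population (labels are 1/0)
    np_ = len(labels) - pp         # negative population
    points = []
    # Sweep the threshold through every distinct score, plus a value
    # above the maximum so that the (0, 0) end of the curve is included.
    for t in sorted(set(scores)) + [1.1]:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / np_, tp / pp))   # (1 - specificity, sensitivity)
    return sorted(points)

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    1,   0,   0]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

At the lowest threshold the sweep yields the point (1, 1), and above the highest score it yields (0, 0), matching the description above.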

TP (sensitivity) can then be plotted against FP (1 − specificity) for each threshold used. The resulting graph is called a Receiver Operating Characteristic (ROC) curve (Figure 2). ROC curves were first developed for signal detection in radar returns in the 1950s, and have since been applied to a much wider range of problems.

For an ideal classifier the ROC curve goes straight up the Y axis and then along the X axis. A classifier with no discriminating power will lie on the diagonal, while most classifiers fall somewhere in between.

ROC analysis provides tools to select potentially optimal models and to discard suboptimal ones, independently of (and prior to specifying) the cost context or the class distribution.

__Using ROC Curves__

**Threshold Selection**

It is immediately apparent that a ROC curve can be used to select a threshold for a classifier that maximises the true positives while minimising the false positives.
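One common rule of thumb (Youden's J statistic, not named in the article) picks the threshold that maximises the gap between the true positive rate and the false positive rate. A sketch, with illustrative names and data:

```python
# Pick the threshold maximising Youden's J = sensitivity + specificity - 1
# (equivalently TPR - FPR), i.e. the ROC point furthest above the diagonal.
def best_threshold(scores, labels):
    pp = sum(labels)               # positive population
    np_ = len(labels) - pp         # negative population
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pp - fp / np_     # TPR - FPR at this threshold
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

t, j = best_threshold([0.9, 0.8, 0.7, 0.6, 0.4, 0.3], [1, 1, 0, 1, 1, 0])
print(t, j)  # threshold 0.4 gives the largest TPR - FPR for this toy data
```

This treats false positives and false negatives as equally costly; as the next paragraphs explain, many applications deliberately weight them differently.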

However, different types of problems have different optimal classifier thresholds. For a cancer screening test, for example, we may be prepared to put up with a relatively high false positive rate in order to get a high true positive rate: it is most important to identify possible cancer sufferers.

For a follow-up test after treatment, however, a different threshold might be more desirable, since we want to minimise false negatives: we don’t want to tell a patient they’re clear if that is not actually the case.

**Performance Assessment**

ROC curves also give us the ability to assess the performance of the classifier over its entire operating range. The most widely-used measure is the area under the curve (AUC). As you can see from Figure 2, the AUC for a classifier with no power, essentially random guessing, is 0.5, because the curve follows the diagonal. The AUC for that mythical being, the perfect classifier, is 1.0. Most classifiers have AUCs that fall somewhere between these two values.
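Given the (FPR, TPR) points of a ROC curve, the AUC can be estimated with the trapezoidal rule. A minimal sketch (the function name is ours):

```python
# Area under a ROC curve by the trapezoidal rule, given (FPR, TPR) points.
def auc(points):
    pts = sorted(points)           # order by FPR along the X axis
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid between points
    return area

# Random guessing follows the diagonal: AUC = 0.5.
print(auc([(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]))   # 0.5
# A perfect classifier hugs the axes: AUC = 1.0.
print(auc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))   # 1.0
```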

An AUC of less than 0.5 might mean that something interesting is happening. A very low AUC might indicate that the problem has been set up incorrectly: the classifier is finding a relationship in the data which is, essentially, the opposite of that expected. In such a case, inspection of the entire ROC curve might give some clues as to what is going on: have the positives and negatives been mislabelled?
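This “reversed relationship” reading follows from the rank interpretation of the AUC: it equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, so negating every score gives an AUC of 1 − AUC. A small demonstration (names and data are illustrative):

```python
# AUC as the probability that a random positive outscores a random
# negative (ties count as half); this equals the trapezoidal AUC.
def auc_rank(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
a = auc_rank(scores, labels)
flipped = auc_rank([-s for s in scores], labels)  # reversed relationship
print(a, flipped)   # flipped is (approximately) 1 - a
```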

**Classifier Comparison**

The AUC can be used to compare the performance of two or more classifiers. A single threshold can be selected and the classifiers’ performance at that point compared, or the overall performance can be compared by considering the AUC.

Most published reports compare AUCs in absolute terms: “Classifier 1 has an AUC of 0.85, and classifier 2 has an AUC of 0.79, so classifier 1 is clearly better.” It is possible, however, to calculate whether differences in AUC are statistically significant. For full details, see the Hanley & McNeil (1982) paper listed below.
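As a rough sketch of such a test, the standard-error approximation from Hanley & McNeil (1982) can be combined with a simple z statistic. Note the caveat in the comments: this version treats the two AUCs as independent, which is only appropriate when the classifiers are evaluated on different samples.

```python
import math

# Standard error of an AUC value a, following the Hanley & McNeil (1982)
# approximation, with Q1 = a/(2-a) and Q2 = 2a^2/(1+a); n_p and n_n are
# the numbers of positive and negative cases.
def auc_se(a, n_p, n_n):
    q1 = a / (2 - a)
    q2 = 2 * a * a / (1 + a)
    var = (a * (1 - a) + (n_p - 1) * (q1 - a * a)
           + (n_n - 1) * (q2 - a * a)) / (n_p * n_n)
    return math.sqrt(var)

# Crude two-sided z statistic for two AUCs. It ignores the correlation
# term needed when both classifiers are scored on the same cases.
def z_stat(a1, a2, n_p, n_n):
    se = math.sqrt(auc_se(a1, n_p, n_n) ** 2 + auc_se(a2, n_p, n_n) ** 2)
    return (a1 - a2) / se

print(z_stat(0.85, 0.79, 100, 100))
```

With 100 positive and 100 negative cases, the 0.85 vs. 0.79 difference above gives z of roughly 1.4, below the conventional 1.96 cutoff, so the apparent superiority of classifier 1 would not be significant at the 5% level.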

**ROC Curve Analysis Tutorials**

A guide to generating ROC curves using the SigmaPlot software (PDF)

A YouTube tutorial for SPSS from TheRMUoHP Biostatistics Resource Channel

Documentation for the pROC package in R (PDF)

**When to use ROC Curve Analysis**

In this article we have used a biomedical example, and ROC curves are widely used in the biomedical sciences. The technique is, however, applicable to any classifier that produces a score for each case, rather than a binary decision.

Neural networks and many statistical algorithms are examples of suitable classifiers, while methods such as decision trees are less appropriate. Problems that have only two possible outcomes (like the cancer/no cancer example used here) are best suited to this technique.

Any kind of data that can be fed into a suitable classifier can be subjected to ROC curve analysis.

__Further Reading__

A classic paper on the use of ROC curves; old, but still very noteworthy and relevant: Hanley J. A. and B. J. McNeil (1982) “The meaning and use of the area under a receiver operating characteristic (ROC) curve”. Radiology 143(1): 29-36.

And a good, more recent, review article with a focus on medical diagnostics: Hajian-Tilaki K. “Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation”. Caspian Journal of Internal Medicine 2013;4(2):627-635.

And to illustrate the use of ROC curves in financial applications: Petro Lisowsky (2010) “Seeking Shelter: Empirically modelling tax shelters using financial statement information”. The Accounting Review, September 2010, Vol. 85, No. 5, pp. 1693-1720.