### How to Calculate Precision, Recall, F1, and More for Deep Learning Models

Once you fit a deep learning neural network model, you must evaluate its performance on a test dataset.

This is critical, as the reported performance allows you both to choose between candidate models and to communicate to stakeholders how well the model solves the problem.

The Keras deep learning API is quite limited in the metrics you can use to report model performance.

Questions that often come up are:

“How can I calculate the precision and recall for my model?”

And:

“How can I calculate the F1-score or confusion matrix for my model?”

In this guide, you will discover how to calculate metrics to evaluate your deep learning neural network model with a step-by-step example.

After completing this guide, you will know:

- How to use the scikit-learn metrics API to evaluate a deep learning model.
- How to make both the class and probability predictions with a final model that the scikit-learn API requires.
- How to calculate precision, recall, F1-score, ROC AUC, and more with the scikit-learn API for a model.

**Tutorial Overview**

This guide is divided into three parts:

- Binary Classification Problem
- Multilayer Perceptron Model
- How to Calculate Model Metrics

**Binary Classification Problem**

We will use a standard binary classification problem as the basis for this guide, called the “two circles” problem.

It is called the two circles problem because the points, when plotted, form two concentric circles, one for each class. As such, it is an example of a binary classification problem. The problem has two inputs that can be interpreted as x and y coordinates on a graph. Each point belongs to either the inner or the outer circle.

The make_circles() function in the scikit-learn library allows you to generate samples from the two circles problem. The “n_samples” argument specifies the number of samples to generate, divided evenly between the two classes. The “noise” argument specifies how much random statistical noise is added to the inputs or coordinates of each point, making the classification task more challenging. The “random_state” argument specifies the seed for the pseudorandom number generator, ensuring that the same samples are generated each time the code is run.

The example below generates 1,000 samples with a noise level of 0.1 and a seed of 1.

```python
# generate 2d classification dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
```
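As a quick check of the arguments described above, the following sketch inspects what make_circles() returns for these settings. The variable names class_counts and same_samples are illustrative and are not used elsewhere in this guide.

```python
# Sketch: inspect the output of make_circles() for the settings used in this guide
import numpy as np
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)

# X holds the two input coordinates per point, y holds the class label (0 or 1)
print(X.shape, y.shape)

# n_samples is divided evenly between the two classes
class_counts = np.bincount(y)
print(class_counts)

# the same random_state reproduces exactly the same samples
X_again, y_again = make_circles(n_samples=1000, noise=0.1, random_state=1)
same_samples = np.array_equal(X, X_again) and np.array_equal(y, y_again)
print(same_samples)
```

Changing random_state to another integer would give a different, but equally reproducible, set of points.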

Once generated, we can create a plot of the dataset to get an idea of how challenging the classification task is.

The example below generates samples and plots them, coloring each point according to its class, where points belonging to class 0 (outer circle) are colored blue and points belonging to class 1 (inner circle) are colored orange.

```python
# Example of generating samples from the two circles problem
from sklearn.datasets import make_circles
from matplotlib import pyplot
from numpy import where
# generate 2d classification dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
# scatter plot, dots colored by class value
for i in range(2):
	samples_ix = where(y == i)
	pyplot.scatter(X[samples_ix, 0], X[samples_ix, 1])
pyplot.show()
```

Running the example generates the dataset and plots the points on a graph, clearly showing two concentric circles for points belonging to class 0 and class 1.

**Multilayer Perceptron Model**

We will develop a Multilayer Perceptron, or MLP, model to address the binary classification problem.

The model is not optimized for the problem, but it is skillful (better than random).

After the samples for the dataset are generated, we will split them into two equal parts: one for training the model and one for evaluating the trained model.

```python
# split into train and test
n_test = 500
trainX, testX = X[:n_test, :], X[n_test:, :]
trainy, testy = y[:n_test], y[n_test:]
```
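Because this split simply slices the array in half rather than stratifying by class, the two halves are not guaranteed to contain exactly 250 samples of each class. A minimal sketch for checking the balance, where train_counts and test_counts are illustrative names:

```python
# Sketch: count how many samples of each class land in the train and test halves
import numpy as np
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)

# same slicing split as used in this guide
n_test = 500
trainX, testX = X[:n_test, :], X[n_test:, :]
trainy, testy = y[:n_test], y[n_test:]

# number of class 0 and class 1 samples in each half
train_counts = np.bincount(trainy, minlength=2)
test_counts = np.bincount(testy, minlength=2)
print('Train class counts:', train_counts)
print('Test class counts:', test_counts)
```

A mild imbalance is harmless here, but it is worth knowing about when interpreting precision and recall later.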

Next, we can define our MLP model. The model is simple, expecting 2 input variables from the dataset, with a single hidden layer of 100 nodes and a ReLU activation function, followed by an output layer with a single node and a sigmoid activation function.

The model will predict a value between 0 and 1 that will be interpreted as whether the input example belongs to class 0 or class 1.

```python
# define model
model = Sequential()
model.add(Dense(100, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
```

The model will be fit using the binary cross-entropy loss function and the efficient Adam version of stochastic gradient descent. The model will also monitor the classification accuracy metric.

```python
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```
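To make the loss concrete, here is an illustrative NumPy version of binary cross-entropy. It is a sketch of the formula, not Keras's own implementation; the binary_crossentropy() helper and the eps clipping value are assumptions added here to avoid taking log(0).

```python
# Sketch: binary cross-entropy computed by hand with NumPy
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    # clip probabilities away from 0 and 1 so the logarithm stays finite (assumed eps)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

# confident and correct predictions give a small loss
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
# confident but wrong predictions give a large loss
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

The loss penalizes confident mistakes much more heavily than accuracy does, which is why the two are tracked separately during training.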

We will fit the model for 300 training epochs with the default batch size of 32 samples and evaluate the performance of the model at the end of each training epoch on the test dataset.

```python
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=300, verbose=0)
```
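As a rough illustration of what these settings imply, the arithmetic below counts the weight updates performed during training. It assumes the final, partially filled batch of each epoch is still processed.

```python
# Sketch: number of weight updates implied by 500 samples, batch size 32, 300 epochs
import math

n_train = 500      # training samples after the split
batch_size = 32    # Keras default batch size
epochs = 300

# the last batch of each epoch is smaller than 32 but still counts as one update
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)
```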

At the end of training, we will evaluate the final model once more on the train and test datasets and report the classification accuracy.

```python
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
```

Finally, the performance of the model on the train and test sets recorded during training will be graphed using line plots, one for the loss and one for the classification accuracy.

```python
# plot loss during training
pyplot.subplot(211)
pyplot.title('Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy during training
pyplot.subplot(212)
pyplot.title('Accuracy')
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()
```

Tying all of these elements together, the complete code listing for training and evaluating an MLP on the two circles problem is given below.

```python
# multilayer perceptron model for the two circles problem
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from matplotlib import pyplot
# generate dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
# split into train and test
n_test = 500
trainX, testX = X[:n_test, :], X[n_test:, :]
trainy, testy = y[:n_test], y[n_test:]
# define model
model = Sequential()
model.add(Dense(100, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=300, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot loss during training
pyplot.subplot(211)
pyplot.title('Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
# plot accuracy during training
pyplot.subplot(212)
pyplot.title('Accuracy')
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()
```

Running the example fits the model very quickly on the CPU (no GPU is required).

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
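One way to follow this advice is to repeat the fit with different seeds and summarize the scores. The sketch below uses scikit-learn's MLPClassifier as a lightweight stand-in for the Keras model (an assumption made so the example stays short and self-contained); run_once() is a hypothetical helper, and its scores will not match the Keras results exactly.

```python
# Sketch: repeat training with different seeds and report the mean and spread of accuracy
import warnings
import numpy as np
from sklearn.datasets import make_circles
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
trainX, testX = X[:500], X[500:]
trainy, testy = y[:500], y[500:]

def run_once(seed):
    """Fit a stand-in MLP (100 ReLU hidden nodes, Adam) and return test accuracy."""
    model = MLPClassifier(hidden_layer_sizes=(100,), activation='relu',
                          solver='adam', max_iter=300, random_state=seed)
    with warnings.catch_warnings():
        # the optimizer may not fully converge within max_iter; that is fine for this sketch
        warnings.simplefilter('ignore', category=ConvergenceWarning)
        model.fit(trainX, trainy)
    return model.score(testX, testy)

# each seed gives a different weight initialization and batch ordering
scores = [run_once(seed) for seed in range(3)]
print('Accuracy: mean=%.3f, std=%.3f' % (np.mean(scores), np.std(scores)))
```

The same loop structure applies to the Keras model: wrap the define, compile, fit, and evaluate steps in a function and call it several times.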

The model is evaluated, reporting a classification accuracy on the train and test sets of approximately 84% and 85% respectively.

```
Train: 0.838, Test: 0.850
```

A figure is created showing two line plots: one for the learning curves of the loss on the train and test sets and one for the classification accuracy on the train and test sets.

The plots suggest that the model fits the problem well.

**How to Calculate Model Metrics**

Perhaps you need to evaluate your deep learning neural network model using additional metrics that are not supported by the Keras metrics API.

The Keras metrics API is limited, and you may want to calculate metrics such as precision, recall, F1, and more.

One approach to calculating new metrics is to implement them yourself in the Keras API and have Keras calculate them for you during model training and during model evaluation.

This can be technically challenging.

A much simpler alternative is to use your final model to make predictions for the test dataset, then calculate any metric you wish using the scikit-learn metrics API.

Three metrics, in addition to classification accuracy, that are commonly required for a neural network model on a binary classification problem are:

- Precision
- Recall
- F1 score
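To make these three metrics concrete before applying them to the model, here is a small sketch on hand-made labels. The y_true and y_pred lists are invented purely for illustration.

```python
# Sketch: precision, recall, and F1 on a tiny hand-made example
from sklearn.metrics import precision_score, recall_score, f1_score

# 4 positive and 6 negative examples (illustrative labels, not model output)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
# counts: tp=3, fn=1, fp=2, tn=4

# precision: of the 5 predicted positives, 3 are correct
precision = precision_score(y_true, y_pred)
# recall: of the 4 actual positives, 3 are found
recall = recall_score(y_true, y_pred)
# f1: harmonic mean of precision and recall
f1 = f1_score(y_true, y_pred)
print('Precision: %.3f, Recall: %.3f, F1: %.3f' % (precision, recall, f1))
```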

In this part of the guide, we will calculate these three metrics, as well as classification accuracy, using the scikit-learn metrics API. We will also calculate three additional metrics that are less common but may be useful. They are:

- Cohen’s Kappa
- ROC AUC
- Confusion Matrix
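The confusion matrix and Cohen's kappa can be illustrated on a small hand-made example. The labels below are invented for illustration; the point is the layout that scikit-learn uses, with rows for true classes and columns for predicted classes.

```python
# Sketch: confusion matrix layout and Cohen's kappa on a tiny hand-made example
from sklearn.metrics import confusion_matrix, cohen_kappa_score

# illustrative labels, not model output
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# rows are true classes, columns are predicted classes: [[tn, fp], [fn, tp]]
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()
print(matrix)
print('tn=%d, fp=%d, fn=%d, tp=%d' % (tn, fp, fn, tp))

# kappa compares observed agreement with the agreement expected by chance
kappa = cohen_kappa_score(y_true, y_pred)
print('Cohens kappa: %.3f' % kappa)
```

Here the observed agreement is 0.7 and the chance agreement is 0.5, so kappa is lower than accuracy even though both describe the same predictions.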

This is not a complete list of the classification metrics supported by scikit-learn; nevertheless, calculating these metrics will show you how to calculate any metric you may need using the scikit-learn API.

The example in this section will calculate metrics for an MLP model, but the same code for calculating metrics can be used for other models, such as RNNs and CNNs.

We can use the same code from the previous sections for preparing the dataset, as well as defining and fitting the model. To make the example simpler, we will put the code for these steps into simple functions.

First, we can define a function called get_data() that will generate the dataset and split it into train and test sets.

```python
# generate and prepare the dataset
def get_data():
	# generate dataset
	X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
	# split into train and test
	n_test = 500
	trainX, testX = X[:n_test, :], X[n_test:, :]
	trainy, testy = y[:n_test], y[n_test:]
	return trainX, trainy, testX, testy
```

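We can then call the get_data() function to prepare the dataset and the get_model() function to fit and return the model. The get_model() function wraps the model definition, compilation, and fitting steps from the previous section; its definition appears in the complete example at the end of this section.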

```python
# generate data
trainX, trainy, testX, testy = get_data()
# fit model
model = get_model(trainX, trainy)
```

Now that we have a model fit on the training dataset, we can evaluate it using metrics from the scikit-learn metrics API.

First, we must use the model to make predictions. Most of the metric functions require a comparison between the true class values (for example, testy) and the predicted class values (yhat_classes). Older Keras versions expose a predict_classes() function on the model for this; it was removed in TensorFlow 2.6, so on current versions the class values are obtained by thresholding the predicted probabilities at 0.5.

Some metrics, like the ROC AUC, require a prediction of class probabilities (yhat_probs). These can be retrieved by calling the predict() function on the model.

We can make the class and probability predictions with the model.

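```python
# predict probabilities for test set
yhat_probs = model.predict(testX, verbose=0)
# predict crisp classes for test set by thresholding the probabilities at 0.5
# (equivalent to predict_classes(), which was removed in TensorFlow 2.6)
yhat_classes = (yhat_probs > 0.5).astype('int32')
```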

The predictions are returned in a 2D array, with one row for each example in the test dataset and one column for the prediction.

The scikit-learn metrics API expects 1D arrays of actual and predicted values for comparison; therefore, we must reduce the 2D prediction arrays to 1D arrays.

```python
# reduce to 1d array
yhat_probs = yhat_probs[:, 0]
yhat_classes = yhat_classes[:, 0]
```
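The shape handling can be seen in isolation with a small NumPy sketch. The probabilities below are hypothetical values shaped like the output of predict() for a single sigmoid node; the names ending in _1d are illustrative.

```python
# Sketch: turning (n_samples, 1) probabilities into 1D probabilities and class labels
import numpy as np

# hypothetical sigmoid outputs, one row per example and one column for the prediction
yhat_probs = np.array([[0.08], [0.62], [0.97], [0.41]])

# threshold at 0.5 to get crisp class labels, still shaped (n_samples, 1)
yhat_classes = (yhat_probs > 0.5).astype('int32')

# select column 0 to reduce each 2D array to a 1D array
yhat_probs_1d = yhat_probs[:, 0]
yhat_classes_1d = yhat_classes[:, 0]
print(yhat_probs.shape, yhat_probs_1d.shape)
print(yhat_classes_1d)
```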

We are now ready to calculate metrics for our deep learning neural network model. We can begin by calculating the classification accuracy, precision, recall, and F1 scores.

```python
# accuracy: (tp + tn) / (p + n)
accuracy = accuracy_score(testy, yhat_classes)
print('Accuracy: %f' % accuracy)
# precision tp / (tp + fp)
precision = precision_score(testy, yhat_classes)
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(testy, yhat_classes)
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(testy, yhat_classes)
print('F1 score: %f' % f1)
```

Notice that calculating a metric is as simple as choosing the metric of interest and calling the function, passing in the true class values (testy) and the predicted class values (yhat_classes).

We can also calculate some additional metrics, like Cohen’s kappa, ROC AUC, and the confusion matrix.

Notice that the ROC AUC requires the predicted class probabilities (yhat_probs) as an argument instead of the predicted classes (yhat_classes).

```python
# kappa
kappa = cohen_kappa_score(testy, yhat_classes)
print('Cohens kappa: %f' % kappa)
# ROC AUC
auc = roc_auc_score(testy, yhat_probs)
print('ROC AUC: %f' % auc)
# confusion matrix
matrix = confusion_matrix(testy, yhat_classes)
print(matrix)
```
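The reason ROC AUC needs probabilities can be shown with a small sketch: thresholding discards the ranking among examples on the same side of 0.5, which changes the score. The labels and probabilities below are invented for illustration.

```python
# Sketch: ROC AUC computed from probabilities versus from thresholded classes
import numpy as np
from sklearn.metrics import roc_auc_score

# illustrative true labels and predicted probabilities, not model output
y_true = np.array([0, 0, 1, 1, 1, 0])
y_probs = np.array([0.2, 0.6, 0.55, 0.9, 0.7, 0.3])

# crisp classes lose the ordering between, for example, 0.6 and 0.9
y_classes = (y_probs > 0.5).astype(int)

auc_from_probs = roc_auc_score(y_true, y_probs)
auc_from_classes = roc_auc_score(y_true, y_classes)
print('AUC from probabilities: %.4f' % auc_from_probs)
print('AUC from classes: %.4f' % auc_from_classes)
```

Passing crisp classes does not raise an error, so this mistake is easy to make silently.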

Now that we know how to calculate metrics for a deep learning neural network using the scikit-learn API, we can tie all of these elements together into a complete example, listed below.

```python
# demonstration of calculating metrics for a neural network model using sklearn
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from keras.models import Sequential
from keras.layers import Dense

# generate and prepare the dataset
def get_data():
	# generate dataset
	X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
	# split into train and test
	n_test = 500
	trainX, testX = X[:n_test, :], X[n_test:, :]
	trainy, testy = y[:n_test], y[n_test:]
	return trainX, trainy, testX, testy

# define and fit the model
def get_model(trainX, trainy):
	# define model
	model = Sequential()
	model.add(Dense(100, input_dim=2, activation='relu'))
	model.add(Dense(1, activation='sigmoid'))
	# compile model
	model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
	# fit model
	model.fit(trainX, trainy, epochs=300, verbose=0)
	return model

# generate data
trainX, trainy, testX, testy = get_data()
# fit model
model = get_model(trainX, trainy)

# predict probabilities for test set
yhat_probs = model.predict(testX, verbose=0)
# predict crisp classes for test set by thresholding the probabilities at 0.5
# (equivalent to predict_classes(), which was removed in TensorFlow 2.6)
yhat_classes = (yhat_probs > 0.5).astype('int32')
# reduce to 1d array
yhat_probs = yhat_probs[:, 0]
yhat_classes = yhat_classes[:, 0]

# accuracy: (tp + tn) / (p + n)
accuracy = accuracy_score(testy, yhat_classes)
print('Accuracy: %f' % accuracy)
# precision tp / (tp + fp)
precision = precision_score(testy, yhat_classes)
print('Precision: %f' % precision)
# recall: tp / (tp + fn)
recall = recall_score(testy, yhat_classes)
print('Recall: %f' % recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(testy, yhat_classes)
print('F1 score: %f' % f1)

# kappa
kappa = cohen_kappa_score(testy, yhat_classes)
print('Cohens kappa: %f' % kappa)
# ROC AUC
auc = roc_auc_score(testy, yhat_probs)
print('ROC AUC: %f' % auc)
# confusion matrix
matrix = confusion_matrix(testy, yhat_classes)
print(matrix)
```

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example prepares the dataset, fits the model, then calculates and reports the metrics for the model evaluated on the test dataset.

```
Accuracy: 0.842000
Precision: 0.836576
Recall: 0.853175
F1 score: 0.844794
Cohens kappa: 0.683929
ROC AUC: 0.923739
[[206  42]
 [ 37 215]]
```
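As a consistency check, the reported accuracy, precision, recall, F1, and kappa can be reproduced by hand from the four counts in the confusion matrix above, using the formulas from the code comments. The sketch relies on scikit-learn's layout of [[tn, fp], [fn, tp]]; the variable names are illustrative.

```python
# Sketch: reproduce the reported metrics from the confusion matrix counts
tn, fp, fn, tp = 206, 42, 37, 215
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

# Cohen's kappa: observed agreement corrected for agreement expected by chance
p_observed = accuracy
# chance agreement from the row totals (true classes) and column totals (predicted classes)
p_expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / total ** 2
kappa = (p_observed - p_expected) / (1 - p_expected)

print('Accuracy: %f' % accuracy)
print('Precision: %f' % precision)
print('Recall: %f' % recall)
print('F1 score: %f' % f1)
print('Cohens kappa: %f' % kappa)
```

ROC AUC cannot be recovered this way, because it depends on the predicted probabilities rather than on the thresholded counts.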

**Further Reading**

This section provides more resources on the topic if you are looking to go deeper.

**API**

- metrics: Metrics API
- Classification Metrics Guide
- Keras Metrics API
- datasets.make_circles API

**Articles**

- Evaluation of binary classifiers, Wikipedia
- Confusion matrix, Wikipedia
- Precision and recall, Wikipedia

**Conclusion**

In this guide, you discovered how to calculate metrics to evaluate your deep learning neural network model with a step-by-step example.

Specifically, you learned:

- How to use the scikit-learn metrics API to evaluate a deep learning model.
- How to make both the class and probability predictions with a final model that the scikit-learn API requires.
- How to calculate precision, recall, F1-score, ROC AUC, and more with the scikit-learn API for a model.