### How to create a bagging ensemble of deep learning models in Keras

Ensemble learning refers to methods that combine the predictions from multiple models. It is important that the models that make up the ensemble are good in different ways: when their errors differ, combining them can give predictions that are more stable, and often better, than those of any individual member.

One way to achieve differences between models is to train each model on a different subset of the available training data. This happens naturally when resampling methods such as cross-validation are used to estimate model performance on unseen data. The models fit during this estimation process can then be combined into what is referred to as a resampling-based ensemble, such as a cross-validation ensemble or a bootstrap aggregation (or bagging) ensemble.

In this tutorial, you will discover how to develop a suite of different resampling-based ensembles for deep learning neural network models.

After completing this tutorial, you will know:

- How to estimate model performance using random splits and develop an ensemble from the models.
- How to estimate performance using 10-fold cross-validation and develop a cross-validation ensemble.
- How to estimate performance using the bootstrap and combine models using a bagging ensemble.

__Tutorial Overview__

This tutorial is divided into six parts:

1. Data Resampling Ensembles
2. Multi-Class Classification Problem
3. Single Multilayer Perceptron Model
4. Random Splits Ensemble
5. Cross-Validation Ensemble
6. Bagging Ensemble

__Data Resampling Ensembles__

Combining the predictions from multiple models can result in more stable predictions and, in some cases, predictions that perform better than those of any contributing model.

Effective ensembles require members that disagree. Each member must have skill (e.g. perform better than random chance) but, ideally, be good in a different way. Technically, we prefer ensemble members whose predictions, or prediction errors, have low correlation.
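To make the value of uncorrelated errors concrete, the sketch below simulates three hypothetical classifiers on a binary problem, each correct 70% of the time. The 70% figure, the binary setting, and the helper name `majority_vote_accuracy` are assumptions chosen for illustration only; they are not part of the tutorial's experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_members, p_correct = 100_000, 3, 0.7

# Independent members: each is right on its own random 70% of the samples.
independent = rng.random((n_members, n_samples)) < p_correct

# Perfectly correlated members: all three share one right/wrong pattern.
shared = rng.random(n_samples) < p_correct
correlated = np.tile(shared, (n_members, 1))

def majority_vote_accuracy(correct):
    # On a binary problem, the vote is right when more than half the members are right.
    return float(np.mean(correct.sum(axis=0) > correct.shape[0] / 2))

acc_independent = majority_vote_accuracy(independent)
acc_correlated = majority_vote_accuracy(correlated)
print('independent errors: %.3f, correlated errors: %.3f' % (acc_independent, acc_correlated))
```

With independent errors, the vote should land near 0.7^3 + 3 × 0.7^2 × 0.3 = 0.784, while perfectly correlated members gain nothing over a single model.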

One approach to encouraging differences between ensemble members is to use the same learning algorithm on different training datasets. This can be achieved by repeatedly resampling the training dataset and using each resample to train a new model. The models are fit on slightly different views of the training data and, in turn, make different errors, which often leads to more stable and better predictions when they are combined.

We can refer to these methods generally as data resampling ensembles.

A benefit of this approach is that the resampling method may not use all examples in the training dataset. Any examples not used to fit a model can be used as a test dataset to estimate the generalization error of the chosen model configuration.

There are three popular resampling methods that we could use to create a resampling ensemble:

- Random splits: The dataset is repeatedly sampled with a random split of the data into train and test sets.
- K-fold cross-validation: The dataset is split into k equally sized folds and k models are trained; each fold is given a turn as the holdout set while the model is trained on all remaining folds.
- Bootstrap aggregation: Random samples are drawn with replacement, and examples not included in a given sample are used as the test set.
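The three schemes differ only in how row indices are assigned to the train and test sets. The sketch below, which assumes a toy index array of 20 rows and arbitrary seeds, shows one way to draw each kind of split with scikit-learn; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.utils import resample

indices = np.arange(20)  # stand-in for the row indices of a dataset

# Random split: a fresh random 90/10 partition each time it is called.
split_train, split_test = train_test_split(indices, test_size=0.10, random_state=1)

# K-fold: k disjoint test folds that together cover every row exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
fold_tests = [test_ix for _, test_ix in kfold.split(indices)]

# Bootstrap: sample rows with replacement; rows never drawn form the test set.
boot_train = resample(indices, replace=True, n_samples=len(indices), random_state=1)
boot_test = np.setdiff1d(indices, boot_train)

print('random split test rows:', split_test)
print('k-fold test folds:', fold_tests)
print('bootstrap out-of-sample rows:', boot_test)
```

Only the bootstrap training set can contain the same row more than once, which is the property that bagging relies on.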

Perhaps the most widely used resampling ensemble method is bootstrap aggregation, more commonly referred to as bagging. Resampling with replacement allows more difference between the training datasets, biasing each model and, in turn, producing more difference between the predictions of the resulting models.

Resampling ensembles make some specific assumptions about your project:

- That a robust estimate of model performance on unseen data is required; if not, a single train/test split can be used.
- That there is potential for a lift in performance from an ensemble of models; if not, a single model fit on all available data can be used.
- That the computational cost of fitting more than one neural network model on a sample of the training dataset is not prohibitive; if it is, all resources should be put into fitting a single model.

Neural network models are remarkably flexible, so the lift in performance provided by a resampling ensemble is not always achievable: individual models trained on all available data can already perform very well.

As such, the sweet spot for a resampling ensemble is the case where a robust estimate of performance is required and multiple models can be fit to calculate it, but one (or more) of the models created while estimating performance must also serve as the final model (for example, because a new final model cannot be fit on all available training data).

Now that we are familiar with resampling ensemble methods, we can work through an example of applying each method in turn.

__Multi-Class Classification Problem__

We will use a small multi-class classification problem as the basis for demonstrating model resampling ensembles.

The scikit-learn library provides the make_blobs() function, which can be used to create a multi-class classification problem with a prescribed number of samples, input variables, classes, and variance of samples within a class.

We use this problem with 1,000 examples, two input variables (representing the x and y coordinates of the points), and a standard deviation of 2.0 for points within each group. We will use the same random state (the seed for the pseudorandom number generator) to ensure that we always get the same 1,000 points.

    # generate 2d classification dataset
    X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)

The results are the input and output elements of a dataset that we can model. To get a feeling for the complexity of the problem, we can plot each point on a two-dimensional scatter plot and color each point by class value.

The complete example is listed below.

    # scatter plot of blobs dataset
    from sklearn.datasets import make_blobs
    from matplotlib import pyplot
    from pandas import DataFrame
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
    # scatter plot, dots colored by class value
    df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
    colors = {0:'red', 1:'blue', 2:'green'}
    fig, ax = pyplot.subplots()
    grouped = df.groupby('label')
    for key, group in grouped:
        group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
    pyplot.show()

Running the example creates a scatter plot of the entire dataset. We can see that the standard deviation of 2.0 means that the classes are not linearly separable (separable by a line), leaving many ambiguous points.

This is desirable, as it means that the problem is non-trivial and will allow a neural network model to find many different “good enough” candidate solutions, resulting in high variance.

__Single Multilayer Perceptron Model__

We will define a Multilayer Perceptron neural network, or MLP, that learns the problem reasonably well.

The problem is a multi-class classification problem, and we will model it using a softmax activation function on the output layer. This means that the model will predict a vector with 3 elements, giving the probability that the sample belongs to each of the 3 classes. Therefore, the first step is to one hot encode the class values.

    y = to_categorical(y)

Next, we must split the dataset into training and test sets. We will use the test set both to evaluate the performance of the model and to plot its performance during training with a learning curve. We will use 90% of the data for training and 10% for the test set.

We are choosing a large training split because it is a noisy problem and a well-performing model needs as much data as possible to learn the complex classification function.

    # split into train and test
    n_train = int(0.9 * X.shape[0])
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]

Next, we can define and compile the model.

The model expects samples with two input variables. It has a single hidden layer with 50 nodes and a rectified linear activation function, followed by an output layer with 3 nodes and a softmax activation function to predict the probability of each of the 3 classes.

Because the problem is multi-class, we will use the categorical cross-entropy loss function to optimize the model, together with the efficient Adam variant of stochastic gradient descent.

    # define model
    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

The model is fit for 50 training epochs, and we will evaluate the model on both the train and the test sets.

    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=50, verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

Finally, we will plot learning curves of the model accuracy over each training epoch on both the training and test datasets.

    # plot history
    pyplot.plot(history.history['accuracy'], label='train')
    pyplot.plot(history.history['val_accuracy'], label='test')
    pyplot.legend()
    pyplot.show()

The complete example is listed below.

    # develop an mlp for blobs dataset
    from sklearn.datasets import make_blobs
    from keras.utils import to_categorical
    from keras.models import Sequential
    from keras.layers import Dense
    from matplotlib import pyplot
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
    # one hot encode output variable
    y = to_categorical(y)
    # split into train and test
    n_train = int(0.9 * X.shape[0])
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=50, verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # learning curves of model accuracy
    pyplot.plot(history.history['accuracy'], label='train')
    pyplot.plot(history.history['val_accuracy'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example first prints the performance of the final model on the train and test datasets.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieved about 83% accuracy on the training dataset and about 86% accuracy on the test dataset.

The chosen split of the dataset into train and test sets means that the test set is small and not representative of the broader problem. In turn, performance on the test set is not representative of the model; in this case, it is optimistically biased.

    Train: 0.830, Test: 0.860
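To get a rough sense of how noisy an accuracy measured on only 100 test examples can be, the sketch below applies the usual binomial approximation for the standard error of a proportion. The approximation, the helper name `accuracy_std_error`, and the comparison size of 50,000 examples are assumptions introduced here for illustration.

```python
from math import sqrt

def accuracy_std_error(accuracy, n_examples):
    # Standard error of a proportion under a binomial approximation.
    return sqrt(accuracy * (1.0 - accuracy) / n_examples)

se_small = accuracy_std_error(0.86, 100)
se_large = accuracy_std_error(0.86, 50_000)
print('n=100: +/- %.3f, n=50000: +/- %.4f' % (se_small, se_large))
```

Under this approximation, one standard error on the 100-example test set is roughly 3.5 percentage points, which is about the size of the gap between the reported train and test accuracy.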

A line plot is also created showing the learning curves for the model accuracy on the train and test sets over each training epoch.

We can observe that the model has a reasonably stable fit.

__Random Splits Ensemble__

The instability of the model and the small test dataset mean that we don't really know how well this model will perform on new data in general.

We can try a simple resampling method: repeatedly generate new random splits of the dataset into train and test sets and fit new models. Calculating the average performance of the model across the splits gives a better estimate of the model's generalization error.

We can then combine the models trained on the random splits, with the expectation that the performance of the ensemble is likely to be more stable and better than that of the average single model.

We will generate 10 times more sample points from the problem domain and hold them back as an unseen dataset. Evaluating a model on this much larger dataset serves as a proxy for a far more accurate estimate of the generalization error of a model on this problem.

This extra dataset is not a test dataset. Technically, it exists only for the purposes of this demonstration, and we are pretending that this data is unavailable at model training time.

    # generate 2d classification dataset
    dataX, datay = make_blobs(n_samples=55000, centers=3, n_features=2, cluster_std=2, random_state=2)
    X, newX = dataX[:5000, :], dataX[5000:, :]
    y, newy = datay[:5000], datay[5000:]

So now we have 5,000 examples to train our model and estimate its general performance. We also have 50,000 examples that we can use to better approximate the true general performance of a single model or an ensemble.

Next, we need a function to fit and evaluate a single model on a training dataset and return the performance of the fit model on a test dataset. We also need the fit model itself so that we can use it as part of an ensemble. The evaluate_model() function below implements this behaviour.

    # evaluate a single mlp model
    def evaluate_model(trainX, trainy, testX, testy):
        # encode targets
        trainy_enc = to_categorical(trainy)
        testy_enc = to_categorical(testy)
        # define model
        model = Sequential()
        model.add(Dense(50, input_dim=2, activation='relu'))
        model.add(Dense(3, activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        # fit model
        model.fit(trainX, trainy_enc, epochs=50, verbose=0)
        # evaluate the model
        _, test_acc = model.evaluate(testX, testy_enc, verbose=0)
        return model, test_acc

Next, we can create random splits of the training dataset and fit and evaluate models on each split.

We can use the train_test_split() function from the scikit-learn library to create a random split of a dataset into train and test sets. It takes the X and y arrays as arguments, and the “test_size” argument specifies the size of the test dataset as a fraction of the data. We will use 10% of the 5,000 examples as the test set.

We can then call evaluate_model() to fit and evaluate a model. The returned accuracy and model can then be added to lists for later use.

In this example, we will limit the number of splits and, in turn, the number of fit models to 10.

    # multiple train-test splits
    n_splits = 10
    scores, members = list(), list()
    for _ in range(n_splits):
        # split data
        trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.10)
        # evaluate model
        model, test_acc = evaluate_model(trainX, trainy, testX, testy)
        print('>%.3f' % test_acc)
        scores.append(test_acc)
        members.append(model)

After fitting and evaluating the models, we can estimate the expected performance of a given model with the chosen configuration for the domain.

    # summarize expected performance
    print('Estimated Accuracy %.3f (%.3f)' % (mean(scores), std(scores)))

We do not know how many of the models will be useful in the ensemble. It is likely that there will be a point of diminishing returns, after which adding further members no longer changes the performance of the ensemble.

Nevertheless, we can evaluate different ensemble sizes from 1 to 10 and plot their performance on the unseen holdout dataset.

We can also evaluate each model on the holdout dataset and calculate the average of these scores to get a much better approximation of the true performance of the chosen model on the prediction problem.

    # evaluate different numbers of ensembles on hold out set
    single_scores, ensemble_scores = list(), list()
    for i in range(1, n_splits+1):
        ensemble_score = evaluate_n_members(members, i, newX, newy)
        newy_enc = to_categorical(newy)
        _, single_score = members[i-1].evaluate(newX, newy_enc, verbose=0)
        print('> %d: single=%.3f, ensemble=%.3f' % (i, single_score, ensemble_score))
        ensemble_scores.append(ensemble_score)
        single_scores.append(single_score)
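The evaluate_n_members() helper called in this loop is defined in the complete listing, where it sums the class probabilities predicted by each member and takes the argmax. The sketch below isolates that combination step using NumPy only; the hard-coded probability arrays stand in for the output of model.predict(), and the function name `combine_probabilities` is hypothetical.

```python
import numpy as np

def combine_probabilities(member_probs):
    # member_probs: list of arrays, each shaped (n_samples, n_classes)
    stacked = np.array(member_probs)
    # sum the predicted probabilities across ensemble members
    summed = np.sum(stacked, axis=0)
    # pick the class with the largest summed probability for each sample
    return np.argmax(summed, axis=1)

# made-up softmax outputs from three members for two samples
probs_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
probs_b = np.array([[0.3, 0.5, 0.2], [0.1, 0.3, 0.6]])
probs_c = np.array([[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])

labels = combine_probabilities([probs_a, probs_b, probs_c])
print(labels)
```

For the first sample, one member prefers class 1 but the summed probabilities favour class 0; for the second, two confident members outweigh the one that prefers class 1, so the ensemble predicts class 2.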

Finally, we can calculate a more robust estimate of the general performance of an average model on the prediction problem, then plot ensemble size against accuracy on the holdout dataset.

    # plot score vs number of ensemble members
    print('Accuracy %.3f (%.3f)' % (mean(single_scores), std(single_scores)))
    x_axis = [i for i in range(1, n_splits+1)]
    pyplot.plot(x_axis, single_scores, marker='o', linestyle='None')
    pyplot.plot(x_axis, ensemble_scores, marker='o')
    pyplot.show()

Tying all of this together, the complete example is listed below.

    # random-splits mlp ensemble on blobs dataset
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from keras.utils import to_categorical
    from keras.models import Sequential
    from keras.layers import Dense
    from matplotlib import pyplot
    from numpy import mean
    from numpy import std
    import numpy
    from numpy import array
    from numpy import argmax

    # evaluate a single mlp model
    def evaluate_model(trainX, trainy, testX, testy):
        # encode targets
        trainy_enc = to_categorical(trainy)
        testy_enc = to_categorical(testy)
        # define model
        model = Sequential()
        model.add(Dense(50, input_dim=2, activation='relu'))
        model.add(Dense(3, activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        # fit model
        model.fit(trainX, trainy_enc, epochs=50, verbose=0)
        # evaluate the model
        _, test_acc = model.evaluate(testX, testy_enc, verbose=0)
        return model, test_acc

    # make an ensemble prediction for multi-class classification
    def ensemble_predictions(members, testX):
        # make predictions
        yhats = [model.predict(testX) for model in members]
        yhats = array(yhats)
        # sum across ensemble members
        summed = numpy.sum(yhats, axis=0)
        # argmax across classes
        result = argmax(summed, axis=1)
        return result

    # evaluate a specific number of members in an ensemble
    def evaluate_n_members(members, n_members, testX, testy):
        # select a subset of members
        subset = members[:n_members]
        # make prediction
        yhat = ensemble_predictions(subset, testX)
        # calculate accuracy
        return accuracy_score(testy, yhat)

    # generate 2d classification dataset
    dataX, datay = make_blobs(n_samples=55000, centers=3, n_features=2, cluster_std=2, random_state=2)
    X, newX = dataX[:5000, :], dataX[5000:, :]
    y, newy = datay[:5000], datay[5000:]
    # multiple train-test splits
    n_splits = 10
    scores, members = list(), list()
    for _ in range(n_splits):
        # split data
        trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.10)
        # evaluate model
        model, test_acc = evaluate_model(trainX, trainy, testX, testy)
        print('>%.3f' % test_acc)
        scores.append(test_acc)
        members.append(model)
    # summarize expected performance
    print('Estimated Accuracy %.3f (%.3f)' % (mean(scores), std(scores)))
    # evaluate different numbers of ensembles on hold out set
    single_scores, ensemble_scores = list(), list()
    for i in range(1, n_splits+1):
        ensemble_score = evaluate_n_members(members, i, newX, newy)
        newy_enc = to_categorical(newy)
        _, single_score = members[i-1].evaluate(newX, newy_enc, verbose=0)
        print('> %d: single=%.3f, ensemble=%.3f' % (i, single_score, ensemble_score))
        ensemble_scores.append(ensemble_score)
        single_scores.append(single_score)
    # plot score vs number of ensemble members
    print('Accuracy %.3f (%.3f)' % (mean(single_scores), std(single_scores)))
    x_axis = [i for i in range(1, n_splits+1)]
    pyplot.plot(x_axis, single_scores, marker='o', linestyle='None')
    pyplot.plot(x_axis, ensemble_scores, marker='o')
    pyplot.show()

Running the example first fits and evaluates 10 models on 10 different random splits of the dataset into train and test sets.

From these scores, we estimate that the average model fit on the dataset will achieve an accuracy of about 83% with a standard deviation of about 1.9%.

    >0.816
    >0.836
    >0.818
    >0.806
    >0.814
    >0.824
    >0.830
    >0.848
    >0.868
    >0.858
    Estimated Accuracy 0.832 (0.019)

We can then evaluate the performance of each model on the unseen dataset, along with the performance of ensembles of 1 to 10 models.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

From these scores, we can see that a more accurate estimate of the performance of an average model on this problem is about 82%, and that the earlier estimate was optimistic.

    > 1: single=0.821, ensemble=0.821
    > 2: single=0.821, ensemble=0.820
    > 3: single=0.820, ensemble=0.820
    > 4: single=0.820, ensemble=0.821
    > 5: single=0.821, ensemble=0.821
    > 6: single=0.820, ensemble=0.821
    > 7: single=0.820, ensemble=0.821
    > 8: single=0.820, ensemble=0.821
    > 9: single=0.820, ensemble=0.820
    > 10: single=0.820, ensemble=0.821
    Accuracy 0.820 (0.000)

Much of the difference between the accuracy scores lies in fractions of a percent.

A graph is created showing the accuracy of each individual model on the unseen holdout dataset as blue dots, and the performance of an ensemble with a given number of members from 1 to 10 as an orange line and dots.

We can see that using an ensemble of 4 to 8 members, at least in this case, results in an accuracy that is better than that of most of the individual runs (the orange line is above many blue dots).

The graph does show that some individual models can perform better than an ensemble of models (blue dots above the orange line), but we are unable to choose these models. Here, we demonstrate that without additional data (e.g. the out-of-sample dataset), an ensemble of 4 to 8 members gives better performance on average than a randomly selected train-test model.

More repeats (e.g. 30 or 100) may result in a more stable ensemble performance.

__Cross-Validation Ensemble__

A problem with repeated random splits as a resampling method for estimating the average performance of a model is that it is optimistic.

An approach designed to be less optimistic, and widely used as a result, is the k-fold cross-validation method.

The method is less biased because each example in the dataset is used only once in a test dataset to estimate model performance, unlike random train-test splits, where a given example may be used to evaluate a model many times.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is split into. The average of the scores of each model provides a less biased estimate of model performance. A typical value for k is 10.
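The difference in how often a row is used for testing can be checked directly. The sketch below counts test-set appearances per row on a toy array of 25 rows, first for 5-fold cross-validation and then for five random splits. ShuffleSplit is used here as a stand-in for repeated train_test_split() calls, and the array size, k, and seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X_demo = np.arange(50).reshape(25, 2)  # 25 toy rows with 2 features

# k-fold: count how often each row lands in a test fold
kfold_counts = np.zeros(len(X_demo), dtype=int)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X_demo):
    kfold_counts[test_ix] += 1

# repeated random splits: the same row can be tested several times or never
random_counts = np.zeros(len(X_demo), dtype=int)
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=1)
for train_ix, test_ix in splitter.split(X_demo):
    random_counts[test_ix] += 1

print('k-fold test counts:      ', kfold_counts)
print('random split test counts:', random_counts)
```

Every entry in the k-fold counts is exactly 1, whereas the random-split counts typically contain a mix of zeros and values above 1.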

Because neural network models are computationally expensive to train, it is common to use the best performing model during cross-validation as the final model.

Alternatively, the models resulting from the cross-validation process can be combined into a cross-validation ensemble that is likely to perform better on average than a given single model.

We can use the KFold class from scikit-learn to split the dataset into k folds. It takes as arguments the number of splits, whether or not to shuffle the sample, and the seed for the pseudorandom number generator used prior to the shuffle.

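    # prepare the k-fold cross-validation configuration
    n_folds = 10
    # shuffle and random_state are passed by keyword, as recent scikit-learn versions require
    kfold = KFold(n_splits=n_folds, shuffle=True, random_state=1)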

Once the class is instantiated, it can be enumerated to get each split of indexes into the dataset for the train and test sets.

    # cross validation estimation of performance
    scores, members = list(), list()
    for train_ix, test_ix in kfold.split(X):
        # select samples
        trainX, trainy = X[train_ix], y[train_ix]
        testX, testy = X[test_ix], y[test_ix]
        # evaluate model
        model, test_acc = evaluate_model(trainX, trainy, testX, testy)
        print('>%.3f' % test_acc)
        scores.append(test_acc)
        members.append(model)

Once the scores are calculated on each fold, the average of the scores can be used to report the expected performance of the approach.

    # summarize expected performance
    print('Estimated Accuracy %.3f (%.3f)' % (mean(scores), std(scores)))

Now that we have collected the 10 models evaluated on the 10 folds, we can use them to create a cross-validation ensemble. It seems intuitive to use all 10 models in the ensemble; nevertheless, we can evaluate the accuracy of each subset of ensembles from 1 to 10 members as we did in the previous section.

The complete example of analyzing the cross-validation ensemble is listed below.

    # cross-validation mlp ensemble on blobs dataset
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import KFold
    from sklearn.metrics import accuracy_score
    from keras.utils import to_categorical
    from keras.models import Sequential
    from keras.layers import Dense
    from matplotlib import pyplot
    from numpy import mean
    from numpy import std
    import numpy
    from numpy import array
    from numpy import argmax

    # evaluate a single mlp model
    def evaluate_model(trainX, trainy, testX, testy):
        # encode targets
        trainy_enc = to_categorical(trainy)
        testy_enc = to_categorical(testy)
        # define model
        model = Sequential()
        model.add(Dense(50, input_dim=2, activation='relu'))
        model.add(Dense(3, activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        # fit model
        model.fit(trainX, trainy_enc, epochs=50, verbose=0)
        # evaluate the model
        _, test_acc = model.evaluate(testX, testy_enc, verbose=0)
        return model, test_acc

    # make an ensemble prediction for multi-class classification
    def ensemble_predictions(members, testX):
        # make predictions
        yhats = [model.predict(testX) for model in members]
        yhats = array(yhats)
        # sum across ensemble members
        summed = numpy.sum(yhats, axis=0)
        # argmax across classes
        result = argmax(summed, axis=1)
        return result

    # evaluate a specific number of members in an ensemble
    def evaluate_n_members(members, n_members, testX, testy):
        # select a subset of members
        subset = members[:n_members]
        # make prediction
        yhat = ensemble_predictions(subset, testX)
        # calculate accuracy
        return accuracy_score(testy, yhat)

    # generate 2d classification dataset
    dataX, datay = make_blobs(n_samples=55000, centers=3, n_features=2, cluster_std=2, random_state=2)
    X, newX = dataX[:5000, :], dataX[5000:, :]
    y, newy = datay[:5000], datay[5000:]
    # prepare the k-fold cross-validation configuration
    n_folds = 10
    # shuffle and random_state are passed by keyword, as recent scikit-learn versions require
    kfold = KFold(n_splits=n_folds, shuffle=True, random_state=1)
    # cross validation estimation of performance
    scores, members = list(), list()
    for train_ix, test_ix in kfold.split(X):
        # select samples
        trainX, trainy = X[train_ix], y[train_ix]
        testX, testy = X[test_ix], y[test_ix]
        # evaluate model
        model, test_acc = evaluate_model(trainX, trainy, testX, testy)
        print('>%.3f' % test_acc)
        scores.append(test_acc)
        members.append(model)
    # summarize expected performance
    print('Estimated Accuracy %.3f (%.3f)' % (mean(scores), std(scores)))
    # evaluate different numbers of ensembles on hold out set
    single_scores, ensemble_scores = list(), list()
    for i in range(1, n_folds+1):
        ensemble_score = evaluate_n_members(members, i, newX, newy)
        newy_enc = to_categorical(newy)
        _, single_score = members[i-1].evaluate(newX, newy_enc, verbose=0)
        print('> %d: single=%.3f, ensemble=%.3f' % (i, single_score, ensemble_score))
        ensemble_scores.append(ensemble_score)
        single_scores.append(single_score)
    # plot score vs number of ensemble members
    print('Accuracy %.3f (%.3f)' % (mean(single_scores), std(single_scores)))
    x_axis = [i for i in range(1, n_folds+1)]
    pyplot.plot(x_axis, single_scores, marker='o', linestyle='None')
    pyplot.plot(x_axis, ensemble_scores, marker='o')
    pyplot.show()

Running the example first prints the performance of each of the 10 models on each of the folds of the cross-validation.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The average performance of these models is reported as about 82.7%, which appears less optimistic than the random-splits approach used in the previous section.

    >0.834
    >0.870
    >0.818
    >0.806
    >0.836
    >0.804
    >0.820
    >0.830
    >0.828
    >0.822
    Estimated Accuracy 0.827 (0.018)

Next, each of the saved models is evaluated on the unseen holdout set.

The average of these scores is also about 82%, highlighting that, at least in this case, the cross-validation estimate of the general performance of the model was reasonable.

    > 1: single=0.819, ensemble=0.819
    > 2: single=0.820, ensemble=0.820
    > 3: single=0.820, ensemble=0.820
    > 4: single=0.821, ensemble=0.821
    > 5: single=0.820, ensemble=0.821
    > 6: single=0.821, ensemble=0.821
    > 7: single=0.820, ensemble=0.820
    > 8: single=0.819, ensemble=0.821
    > 9: single=0.820, ensemble=0.821
    > 10: single=0.820, ensemble=0.821
    Accuracy 0.820 (0.001)

A graph of single model accuracy (blue dots) and ensemble size versus accuracy (orange line) is created.

As in the previous example, the real difference between the performance of the models lies in fractions of a percent of model accuracy.

The orange line shows that as the number of members increases, the accuracy of the ensemble increases up to a point of diminishing returns.

We can see that, at least in this case, using four or more of the models fit during cross-validation in an ensemble gives better performance than almost all of the individual models.

We can also see that a default strategy of using all models in the ensemble would be effective.

__Bagging Ensemble__

A limitation of random splits and k-fold cross-validation from the perspective of ensemble learning is that the models are very similar, because they are trained on heavily overlapping subsets of the data.

The bootstrap method is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples.

Importantly, samples are constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen. This allows a given observation to be included in a given small sample more than once. This approach to sampling is called sampling with replacement.
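To make sampling with replacement concrete, the following minimal sketch draws a bootstrap sample from ten row indices using only Python's standard library. The dataset size, the seed, and the variable names are illustrative assumptions and are not part of the tutorial's code.

```python
# Sketch: sampling with replacement on a toy index set (illustrative values only)
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
indices = list(range(10))  # stand-in for the row indices of a dataset

# draw as many indices as there are rows; every index stays available for each draw
sample = rng.choices(indices, k=len(indices))

# rows that were never drawn are "out-of-bag" and can serve as a test set
drawn = set(sample)
out_of_bag = [i for i in indices if i not in drawn]

print('bootstrap sample:', sorted(sample))
print('out-of-bag rows:', out_of_bag)
print('distinct rows drawn:', len(drawn), 'of', len(indices))
```

Because an index can be drawn more than once, the sample usually contains fewer distinct rows than the dataset, which is what leaves rows over for testing.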

The method can be used to estimate the performance of neural network models. Examples not selected in a given sample can be used as a test set to estimate the performance of the model.

The bootstrap is a robust method for estimating model performance. It suffers a little from an optimistic bias, but is often almost as accurate as k-fold cross-validation in practice.

The benefit for ensemble learning is that each data sample is biased, because a given example may appear many times in the sample. This, in turn, means that the models trained on those samples will be biased, importantly in different ways. The result can be ensemble predictions that are more accurate.

Generally, the use of the bootstrap method in ensemble learning is referred to as bootstrap aggregation, or bagging.

We can use the resample() function from scikit-learn to select a subsample with replacement. The function takes an array to subsample and the size of the resample as arguments. We will perform the selection on row indices, which we can in turn use to select rows in the X and y arrays.

The size of the sample will be 4,500, or 9/10ths of the data, although the test set may be larger than 10% of the data: because sampling is done with replacement, more than 500 examples may be left unselected.

```python
# multiple train-test splits
n_splits = 10
scores, members = list(), list()
for _ in range(n_splits):
    # select indexes
    ix = [i for i in range(len(X))]
    train_ix = resample(ix, replace=True, n_samples=4500)
    test_ix = [x for x in ix if x not in train_ix]
    # select data
    trainX, trainy = X[train_ix], y[train_ix]
    testX, testy = X[test_ix], y[test_ix]
    # evaluate model
    model, test_acc = evaluate_model(trainX, trainy, testX, testy)
    print('>%.3f' % test_acc)
    scores.append(test_acc)
    members.append(model)
```
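The earlier remark that more than 500 examples may be left unselected can be quantified. Assuming each of the 4,500 draws is uniform over the 5,000 rows, a given row is missed by every draw with probability (1 - 1/5000)^4500, which is about 0.41. The sketch below computes that expectation and compares it with one simulated draw, using only the standard library; the seed and variable names are assumptions made for illustration.

```python
# Sketch: how many rows does a 4,500-draw bootstrap leave out of 5,000?
import random

n_rows, n_draws = 5000, 4500

# probability that one particular row is never picked in n_draws uniform draws
p_missed = (1.0 - 1.0 / n_rows) ** n_draws
expected_oob = n_rows * p_missed

# one simulated bootstrap sample for comparison (seed chosen arbitrarily)
rng = random.Random(7)
drawn = set(rng.choices(range(n_rows), k=n_draws))
simulated_oob = n_rows - len(drawn)

print('expected out-of-bag fraction: %.3f' % p_missed)
print('expected out-of-bag rows: %.0f' % expected_oob)
print('simulated out-of-bag rows: %d' % simulated_oob)
```

Under these assumptions, each model in the loop above is evaluated on roughly 2,000 held-out rows rather than 500.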

It is common to use simple overfit models like unpruned decision trees when using a bagging ensemble learning strategy.

Better performance may be seen with over-constrained and overfit neural networks. Nevertheless, we will use the same MLP from previous sections in this example.

Additionally, it is common to keep adding ensemble members in bagging until the performance of the ensemble plateaus, as bagging does not overfit the dataset. We will again limit the number of members to 10, as in previous examples.

The complete example of bootstrap aggregation for estimating model performance and ensemble learning with a Multilayer Perceptron is listed below.

```python
# bagging mlp ensemble on blobs dataset
from sklearn.datasets import make_blobs
from sklearn.utils import resample
from sklearn.metrics import accuracy_score
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from matplotlib import pyplot
from numpy import mean
from numpy import std
import numpy
from numpy import array
from numpy import argmax

# evaluate a single mlp model
def evaluate_model(trainX, trainy, testX, testy):
    # encode targets
    trainy_enc = to_categorical(trainy)
    testy_enc = to_categorical(testy)
    # define model
    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    model.fit(trainX, trainy_enc, epochs=50, verbose=0)
    # evaluate the model
    _, test_acc = model.evaluate(testX, testy_enc, verbose=0)
    return model, test_acc

# make an ensemble prediction for multi-class classification
def ensemble_predictions(members, testX):
    # make predictions
    yhats = [model.predict(testX) for model in members]
    yhats = array(yhats)
    # sum across ensemble members
    summed = numpy.sum(yhats, axis=0)
    # argmax across classes
    result = argmax(summed, axis=1)
    return result

# evaluate a specific number of members in an ensemble
def evaluate_n_members(members, n_members, testX, testy):
    # select a subset of members
    subset = members[:n_members]
    # make prediction
    yhat = ensemble_predictions(subset, testX)
    # calculate accuracy
    return accuracy_score(testy, yhat)

# generate 2d classification dataset
dataX, datay = make_blobs(n_samples=55000, centers=3, n_features=2, cluster_std=2, random_state=2)
X, newX = dataX[:5000, :], dataX[5000:, :]
y, newy = datay[:5000], datay[5000:]
# multiple train-test splits
n_splits = 10
scores, members = list(), list()
for _ in range(n_splits):
    # select indexes
    ix = [i for i in range(len(X))]
    train_ix = resample(ix, replace=True, n_samples=4500)
    test_ix = [x for x in ix if x not in train_ix]
    # select data
    trainX, trainy = X[train_ix], y[train_ix]
    testX, testy = X[test_ix], y[test_ix]
    # evaluate model
    model, test_acc = evaluate_model(trainX, trainy, testX, testy)
    print('>%.3f' % test_acc)
    scores.append(test_acc)
    members.append(model)
# summarize expected performance
print('Estimated Accuracy %.3f (%.3f)' % (mean(scores), std(scores)))
# evaluate different numbers of ensembles on hold out set
single_scores, ensemble_scores = list(), list()
for i in range(1, n_splits+1):
    ensemble_score = evaluate_n_members(members, i, newX, newy)
    newy_enc = to_categorical(newy)
    _, single_score = members[i-1].evaluate(newX, newy_enc, verbose=0)
    print('> %d: single=%.3f, ensemble=%.3f' % (i, single_score, ensemble_score))
    ensemble_scores.append(ensemble_score)
    single_scores.append(single_score)
# plot score vs number of ensemble members
print('Accuracy %.3f (%.3f)' % (mean(single_scores), std(single_scores)))
x_axis = [i for i in range(1, n_splits+1)]
pyplot.plot(x_axis, single_scores, marker='o', linestyle='None')
pyplot.plot(x_axis, ensemble_scores, marker='o')
pyplot.show()
```

Running the example prints the model performance on the unused examples for each bootstrap sample.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

We can see that, in this case, the estimated performance of the model is less optimistic than with random train-test splits and is quite similar to the result for k-fold cross-validation.

```
>0.829
>0.820
>0.830
>0.821
>0.831
>0.820
>0.834
>0.815
>0.829
>0.827
Estimated Accuracy 0.825 (0.006)
```

Perhaps due to the bootstrap sampling procedure, we can see that the actual performance of each model is slightly worse on the much larger unseen holdout dataset.

This is to be expected given the bias introduced by the bootstrap's sampling with replacement.

```
> 1: single=0.819, ensemble=0.819
> 2: single=0.818, ensemble=0.820
> 3: single=0.820, ensemble=0.820
> 4: single=0.818, ensemble=0.821
> 5: single=0.819, ensemble=0.820
> 6: single=0.820, ensemble=0.820
> 7: single=0.820, ensemble=0.820
> 8: single=0.819, ensemble=0.820
> 9: single=0.820, ensemble=0.820
> 10: single=0.819, ensemble=0.820
Accuracy 0.819 (0.001)
```

The resulting line plot is encouraging.

We can see that after about four members, the bagged ensemble achieves better performance on the holdout dataset than any individual model. This is no doubt helped by the slightly lower average performance of the individual models.
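The ensemble_predictions() function in the complete example combines members by summing their predicted class probabilities and then taking the argmax. As a small illustration of that step, the sketch below applies the same two operations to hand-written probabilities for three hypothetical members, two examples, and three classes. The numbers are invented and no Keras model is involved.

```python
# Sketch: combining member predictions by summing class probabilities
# (probabilities are invented; array shape is members x examples x classes)
import numpy
from numpy import array, argmax

yhats = array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],  # member 1
    [[0.3, 0.4, 0.3], [0.1, 0.3, 0.6]],  # member 2
    [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]],  # member 3
])

# each member's own predicted class per example
member_votes = argmax(yhats, axis=2)

# sum across ensemble members, then argmax across classes
summed = numpy.sum(yhats, axis=0)
ensemble_vote = argmax(summed, axis=1)

print('member votes:\n', member_votes)
print('summed probabilities:\n', summed)
print('ensemble vote:', ensemble_vote)
```

Dividing the sum by the number of members would give the mean probability and the same argmax, so summing is enough when only the predicted class is needed.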

__Extensions__

This section lists some ideas for extending the tutorial that you may wish to explore.

- Single Model: Compare the performance of each ensemble to a single model trained on all available data.
- CV Ensemble Size: Experiment with larger and smaller ensemble sizes for the cross-validation ensemble and compare their performance.
- Bagging Ensemble Limit: Increase the number of members in the bagging ensemble to find the point of diminishing returns.
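One way to approach the Bagging Ensemble Limit idea is to define diminishing returns numerically. The sketch below is a hypothetical heuristic and not part of the tutorial: it returns the best ensemble size seen before `patience` consecutive larger ensembles fail to improve accuracy by more than `tol`. The function name, both parameters, and their default values are assumptions; the scores are the ensemble accuracies printed for the bagging run in this tutorial.

```python
# Sketch: a simple plateau heuristic for ensemble size (hypothetical helper)
def plateau_size(scores, patience=3, tol=0.0005):
    """Return the best ensemble size seen before accuracy stalls for
    `patience` consecutive sizes (or the best size overall if it never stalls)."""
    best_score, best_size, stalled = scores[0], 1, 0
    for size, score in enumerate(scores[1:], start=2):
        if score > best_score + tol:
            # a meaningful improvement resets the stall counter
            best_score, best_size, stalled = score, size, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return best_size

# ensemble accuracies for 1..10 members, copied from the bagging run in this tutorial
ensemble_scores = [0.819, 0.820, 0.820, 0.821, 0.820, 0.820, 0.820, 0.820, 0.820, 0.820]
print('plateau at %d members' % plateau_size(ensemble_scores))
```

With a larger ensemble, the same function could be applied to the longer list of scores; sensible values for `patience` and `tol` depend on how noisy the accuracy estimates are.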

__Further Reading__

This section provides more resources on the topic if you are looking to go deeper.

*Papers*

Neural Network Ensembles, Cross Validation, and Active Learning, 1995

*API*

Getting started with the Keras Sequential model

Keras Core Layers API

scipy.stats.mode API

numpy.argmax API

sklearn.datasets.make_blobs API

sklearn.model_selection.train_test_split API

sklearn.model_selection.KFold API

sklearn.utils.resample API

__Conclusion__

In this guide, you discovered how to develop a suite of different resampling-based ensembles for deep learning neural network models.

Specifically, you learned:

- How to estimate model performance using random splits and develop an ensemble from the models.
- How to estimate performance using 10-fold cross-validation and develop a cross-validation ensemble.
- How to estimate performance using the bootstrap and combine models using a bagging ensemble.