Essence of Bootstrap Aggregation Ensembles
Bootstrap aggregation, or bagging, is a popular ensemble method that fits a decision tree on different bootstrap samples of the training dataset.
It is simple to implement and effective on a wide range of problems, and importantly, modest extensions to the technique result in ensemble methods, such as random forest, that are among the most capable and perform well on a broad range of predictive modelling problems.
As such, we can generalize the bagging method into a framework for ensemble learning and compare and contrast a suite of common ensemble methods that belong to the “bagging family”. We can also use this framework to explore further extensions and how the method can be tailored to a project dataset or a chosen predictive model.
In this guide, you will discover the essence of the bootstrap aggregation approach to machine learning ensembles.
After completing this guide, you will know:
- The bagging ensemble method for machine learning using bootstrap samples and decision trees.
- How to distil the essential elements from the bagging method and how popular extensions such as random forest are directly related to bagging.
- How to devise new extensions to bagging by choosing new procedures for the essential elements of the method.
Tutorial Overview
This tutorial is divided into four parts; they are:
1] Bootstrap Aggregation
2] Essence of Bagging Ensembles
3] Bagging Ensemble Family
- Random Subspace Ensemble
- Random Forest Ensemble
- Extra Trees Ensemble
4] Customized Bagging Ensembles
Bootstrap Aggregation
Bootstrap Aggregation, or bagging for short, is an ensemble machine learning algorithm.
The method involves creating a bootstrap sample of the training dataset for each ensemble member and training a decision tree model on each sample, then combining the predictions directly using a statistic such as the average of the predictions.
Breiman’s bagging (short for Bootstrap Aggregation) algorithm is one of the earliest and simplest, yet effective, ensemble-based algorithms.
The sample of the training dataset is created using the bootstrap method, which involves selecting examples randomly with replacement.
Replacement means that the same example is notionally returned to the pool of candidate rows and may be selected again, or many times, in any single sample of the training dataset. It is also possible that some examples in the training dataset are not selected at all for some bootstrap samples.
Some original examples appear more than once, while some original examples are not present in the sample at all.
The bootstrap method has the desired effect of making each sample of the dataset quite different, or usefully different, for developing an ensemble.
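As a small, hedged illustration of the idea, the sketch below draws a few bootstrap samples from a toy list of values using scikit-learn’s resample() utility; the data and number of samples are arbitrary choices for the example.

```python
# Minimal sketch of bootstrap sampling; the toy data and number of samples
# are arbitrary choices for illustration.
from sklearn.utils import resample

data = [10, 20, 30, 40, 50, 60]

# Rows are drawn with replacement, so some values repeat within a sample
# and others are left out of that sample entirely.
for i in range(3):
    sample = resample(data, replace=True, n_samples=len(data), random_state=i)
    print(f"bootstrap sample {i}: {sample}")
```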
A decision tree is then fitted on each sample of data. Each tree will be slightly different given the differences in its training sample. Typically, the decision tree is configured to have an increased depth or not to use pruning. This makes each tree more specialized to the sample it was trained on and, in turn, further increases the differences between the trees.
Differences between trees are desirable as they increase the “diversity” of the ensemble, which means producing ensemble members that have a lower correlation in their predictions or prediction errors. It is generally accepted that ensembles composed of ensemble members that are skilful and diverse (skilful in different ways, or making different errors) perform better.
The diversity in the ensemble is ensured by the variations within the bootstrapped replicas on which each classifier is trained, as well as by using a relatively weak classifier whose decision boundaries measurably vary with respect to relatively small perturbations in the training data.
A benefit of bagging is that it generally does not overfit the training dataset, and the number of ensemble members can continue to be increased until performance on a holdout dataset stops improving.
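To make the full procedure concrete, below is a minimal sketch of a bagged decision tree ensemble evaluated with cross-validation in scikit-learn; the synthetic dataset and hyperparameter values are illustrative assumptions, not a recommendation.

```python
# Minimal sketch of bagging with decision trees in scikit-learn.
# The synthetic dataset and hyperparameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A made-up binary classification problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Each tree is fitted on a bootstrap sample of the training rows.
# Note: the "estimator" argument was named "base_estimator" before scikit-learn 1.2.
model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=1,
)

# Member predictions are combined by voting for classification.
scores = cross_val_score(model, X, y, cv=10)
print("mean accuracy: %.3f" % scores.mean())
```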
This is a high-level summary of the bagging ensemble method; nevertheless, we can generalize the approach and extract its essential elements.
Essence of Bagging Ensembles
The essence of bagging is about exploiting independent models.
In this way, it may be the closest realization of the “wisdom of the crowd” metaphor, especially if we consider that performance continues to improve with the addition of independent contributors.
Unfortunately, we cannot develop truly independent models, as we only have one training dataset. Instead, the bagging approach approximates independent models using randomness. Specifically, by using randomness in the sampling of the dataset used to train each model, it forces a degree of semi-independence between the models.
Although it is practically impossible to obtain truly independent base learners, as they are generated from the same training dataset, base learners with less dependence can be obtained by introducing randomness into the learning process, and a good generalization ability can be expected of the ensemble.
The structure of the bagging procedure can be divided into three essential elements; they are:
- Different Training Datasets: Create a different sample of the training dataset for each ensemble member.
- High-Variance Models: Train the same high-variance model on each sample of the training dataset.
- Average Predictions: Use statistics to combine the predictions.
We can map the canonical bagging method onto these elements as follows:
- Different Training Datasets: Bootstrap sample.
- High-Variance Models: Decision tree.
- Average Predictions: Mean for regression, mode for classification.
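As a tiny worked example of the last element, the snippet below combines made-up member predictions using the mean for regression and the mode (majority vote) for classification.

```python
# How bagging combines member predictions; the prediction values are made up.
from collections import Counter
import numpy as np

# Hypothetical outputs from five ensemble members for a single input.
regression_preds = [2.1, 1.9, 2.4, 2.0, 2.2]   # numeric predictions
classification_preds = [1, 0, 1, 1, 0]          # predicted class labels

print("regression (mean):", np.mean(regression_preds))                               # 2.12
print("classification (mode):", Counter(classification_preds).most_common(1)[0][0])  # 1
```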
This provides a framework in which we could consider alternative methods for each essential element of the model.
For example, we could change the algorithm to another high-variance technique with somewhat unstable learning behaviour, such as k-nearest neighbours with a small value for the k hyperparameter.
Typically, bagging produces a combined model that outperforms the model built using a single instance of the original data […] this is particularly true for unstable inducers, as bagging can eliminate their instability. In this context, an inducer is considered unstable if perturbations in the learning set can produce significant changes in the constructed classifier.
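As a hedged sketch of the k-nearest neighbours idea above, the example below bags an unstable kNN configuration (k=1) with scikit-learn’s BaggingClassifier; the dataset and settings are illustrative choices.

```python
# Sketch of swapping the high-variance base model: bagged k-nearest neighbours
# with a small k. Dataset and hyperparameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# k=1 makes each member very sensitive to its particular bootstrap sample.
# Note: "estimator" was named "base_estimator" before scikit-learn 1.2.
model = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=1),
    n_estimators=50,
    random_state=1,
)
print("mean accuracy: %.3f" % cross_val_score(model, X, y, cv=10).mean())
```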
We might also change the sampling method from the bootstrap to another sampling scheme or, more generally, to a different approach entirely. In fact, this is the basis for many of the extensions of bagging described in the literature, specifically in attempting to obtain ensemble members that are more independent yet remain skilful.
We know that the combination of independent base learners will lead to a dramatic decrease in errors and, therefore, we want to get base learners as independent as possible.
Let’s take a closer look at other ensemble methods that can be considered part of the bagging family.
Bagging Ensemble Family
A number of ensemble machine learning techniques may be considered descendants of the bagging method.
As such, we can map them onto our framework of essential bagging. This is a helpful exercise as it highlights both the differences between methods and the uniqueness of each technique. Perhaps more importantly, it can also spark ideas for additional variations that you may want to explore on your own predictive modelling project.
Let’s take a closer look at three of the more popular ensemble methods related to bagging.
Random Subspace Ensemble
The random subspace method, or random subspace ensemble, involves selecting random subsets of the features (columns) in the training dataset for each ensemble member.
Each training dataset has all rows, as it is only the columns that are randomly sampled.
- Different Training Datasets: Randomly sample columns.
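One way to approximate a random subspace ensemble, assuming scikit-learn, is to configure BaggingClassifier to keep all rows and sample only the columns; the fraction of columns below is an illustrative choice and the default base estimator is a decision tree.

```python
# Sketch of a random subspace ensemble: all rows, random subset of columns.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

model = BaggingClassifier(
    n_estimators=100,
    bootstrap=False,    # each member sees the whole training dataset (all rows)
    max_features=0.5,   # each member sees a random 50% of the columns
    random_state=1,
)
print("mean accuracy: %.3f" % cross_val_score(model, X, y, cv=10).mean())
```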
Random Forest Ensemble
The random forest method is perhaps one of the most successful and widely used ensemble methods, given its ease of implementation and often excellent performance on a wide range of predictive modelling problems.
The method involves selecting a bootstrap sample of the training dataset and a small random subset of columns to consider when choosing each split point in each ensemble member.
In this way, it is like a combination of bagging with the random subspace method, although the random subspaces are used only in the way the decision trees are constructed.
- Different Training Datasets: Bootstrap sample.
- High-Variance Model: Decision tree with split points chosen from random subsets of columns.
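A minimal random forest sketch using scikit-learn is shown below; the hyperparameter values are illustrative rather than tuned settings.

```python
# Sketch of a random forest: bootstrap rows plus a random subset of columns
# considered at each split point. Hyperparameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

model = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,        # bootstrap sample of rows per tree
    max_features="sqrt",   # random subset of columns considered at each split
    random_state=1,
)
print("mean accuracy: %.3f" % cross_val_score(model, X, y, cv=10).mean())
```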
Extra Trees Ensemble
The extra trees ensemble uses the entire training dataset, although it configures the decision tree algorithm to select the split points at random.
- Different Training Datasets: Whole dataset.
- High-Variance Model: Decision tree with random split points.
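A minimal extra trees sketch follows, again with illustrative settings; by default in scikit-learn, each tree is fitted on the whole training dataset and split thresholds are chosen at random.

```python
# Sketch of an extra trees ensemble: whole dataset per tree, random split points.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

model = ExtraTreesClassifier(
    n_estimators=100,
    bootstrap=False,   # the whole training dataset is used for each tree (the default)
    random_state=1,
)
print("mean accuracy: %.3f" % cross_val_score(model, X, y, cv=10).mean())
```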
Customized Bagging Ensembles
We have briefly reviewed the canonical random subspace, random forest, and extra trees methods, although there is no reason that the methods could not share more implementation details.
In fact, modern implementations of algorithms such as bagging and random forest provide enough configuration options to combine many of these features.
Instead of exhaustively surveying the literature, we can devise our own extensions that map onto the bagging framework. This may inspire you to explore a less common method or devise your own bagging approach targeted at your dataset or choice of model.
There are probably tens or hundreds of extensions of bagging with small modifications to the way the training dataset for each ensemble member is prepared or the specifics of how the model is constructed from the training dataset.
The changes are built around the three main elements of the essential bagging method and often seek better performance by exploring the balance between skilful-enough ensemble members and sufficient diversity among their predictions or prediction errors.
For example, we could change the sampling of the training dataset to a random sample without replacement, instead of a bootstrap sample. This is referred to as “pasting”.
- Different Training Dataset: Random subsample of rows.
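Assuming scikit-learn, pasting can be approximated by turning off the bootstrap and subsampling the rows, as in the sketch below (the sampling fraction is an illustrative choice); the evaluation harness would be the same as in the earlier examples.

```python
# Sketch of "pasting": each member trains on a random subsample of rows
# drawn without replacement. The fraction of rows is an illustrative choice.
from sklearn.ensemble import BaggingClassifier

model = BaggingClassifier(
    n_estimators=100,
    bootstrap=False,   # sample rows without replacement
    max_samples=0.7,   # a random 70% of the rows per member
    random_state=1,
)
```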
We could go further and select a random subsample of rows (as in pasting) and a random subsample of columns (as in random subspaces) for each decision tree. This is referred to as “random patches”.
- Different Training Dataset: Random subsample of rows and columns.
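Similarly, a random patches configuration can be sketched by subsampling both rows and columns without replacement; the fractions below are illustrative.

```python
# Sketch of "random patches": random subsamples of both rows and columns,
# each drawn without replacement. The fractions are illustrative choices.
from sklearn.ensemble import BaggingClassifier

model = BaggingClassifier(
    n_estimators=100,
    bootstrap=False,            # rows sampled without replacement
    max_samples=0.7,            # 70% of the rows per member
    bootstrap_features=False,   # columns sampled without replacement
    max_features=0.5,           # 50% of the columns per member
    random_state=1,
)
```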
We can also devise our own simple extensions of the idea.
For example, it is common to use feature selection techniques to choose a subset of input variables in order to reduce the complexity of a prediction problem (fewer columns) and achieve better performance (less noise). We could imagine a bagging ensemble where each model is fitted on a different “view” of the training dataset, selected by a different feature selection or feature importance method.
- Different Training Dataset: Columns selected by different feature selection methods.
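This idea is not a standard, named algorithm, but as a hedged sketch it could be approximated by giving each member its own feature selection step in a pipeline and combining the members by voting; the selectors, k value, and dataset below are illustrative assumptions.

```python
# Hedged sketch: each member is a pipeline with a different feature selection
# method feeding a decision tree; members are combined by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

members = [
    ("anova", Pipeline([("select", SelectKBest(f_classif, k=8)),
                        ("tree", DecisionTreeClassifier(random_state=1))])),
    ("mutual_info", Pipeline([("select", SelectKBest(mutual_info_classif, k=8)),
                              ("tree", DecisionTreeClassifier(random_state=2))])),
    ("rfe", Pipeline([("select", RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=8)),
                      ("tree", DecisionTreeClassifier(random_state=3))])),
]

ensemble = VotingClassifier(estimators=members, voting="hard")
print("mean accuracy: %.3f" % cross_val_score(ensemble, X, y, cv=10).mean())
```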
It is also common to test a model with several different data transforms as part of a modelling pipeline. This is done because we cannot know beforehand which representation of the training dataset will best expose the unknown underlying structure of the dataset to the learning algorithms. We could imagine a bagging ensemble where each model is fitted on a different transform of the training dataset.
- Different Training Dataset: Data transforms of the raw training dataset.
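Again as a hedged sketch rather than a named method, each member could be a pipeline that applies a different transform before the same model, with the members combined by voting; the transforms and dataset are illustrative choices, and k-nearest neighbours is used as the base model because, unlike a decision tree, it is sensitive to feature scaling.

```python
# Hedged sketch: each member fits the same model on a differently transformed
# copy of the training data; members are combined by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

members = [
    ("minmax", Pipeline([("t", MinMaxScaler()), ("m", KNeighborsClassifier())])),
    ("standard", Pipeline([("t", StandardScaler()), ("m", KNeighborsClassifier())])),
    ("quantile", Pipeline([("t", QuantileTransformer(n_quantiles=100)), ("m", KNeighborsClassifier())])),
]

ensemble = VotingClassifier(estimators=members, voting="hard")
print("mean accuracy: %.3f" % cross_val_score(ensemble, X, y, cv=10).mean())
```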
These are a few perhaps obvious examples of how the essence of the bagging method can be explored, and hopefully they will spark further ideas.