Essence of stacking ensembles for machine learning
Stacked generalization, or stacking, might be a less widespread machine learning ensemble provided that it details a framework more than a particular model.
Probably the reason it has been less popular within mainstream machine learning is that it can be complicated to train a stacking model in the right way, without suffering data leakage. This has implied that the strategy has primarily been leveraged by very skilled specialists in high-stakes environments, like machine learning competitions, and provided new names like blending ensembles.
Nonetheless, modern machine learning frameworks make stacking routine to implement and assess for classification and regression predictive modelling problems. As such, we can review ensemble learning methods connected to stacking via the lens of the stacking framework. This wider family of stacking strategies can also assist to observe how to tailor the configuration of the strategy in the future when looking into our own predictive modelling projects.
In this guide, you will find out about the essence of the stacked generalization strategy to machine learning ensembles.
After going through this guide, you will be aware of:
- The stacking ensemble strategy for machine learning leverages a meta-model to combine forecasts from contributing members
- How to distil the basic elements from the stacking method and how widespread extensions such as blending and the super ensemble are connected
- How to devise new extensions to stacking by choosing new procedures for the basic elements of the method
This tutorial is subdivided into four portions, which are:
1] Stacked generalization
2] Essence of stacking ensembles
3] Stacking ensemble family
- Voting Ensembles
- Weighted Average
- Blending Ensemble
- Super Learner Ensemble
4] Customized Stacking Ensembles
Stacked generalization or stacking for short, is an ensemble machine learning algorithm.
Stacking consists of leveraging a machine learning model to learn how to ideally combine the forecasts from contributing ensemble members.
In voting, ensemble membenars are usually a diverse collection of model variants, like a decision tree, naïve Bayes, and support vector machine. Forecasts are made by averaging the predictions, like choosing the class with the most votes (the statistical mode) or the biggest summed probability.
Unweighted voting only makes logical sense if the learning schemes perform comparably well.
An extension to voting is to weigh the contribution of every ensemble member in the prediction, furnishing a weighted sum prediction. This enables more weight to be placed on models that perform better on average and less on those that don’t feature as good performance but still possess some predictive skill.
The weight allocated to every contributing member must be learned, like the performance of every model on the training dataset or a holdout dataset.
Stacking generalizes this strategy and enables any machine learning model to be leveraged to learn how to ideally combine the predictions from contributing members. The model that brings together the predictions is referenced to as the meta-model, while the ensemble members are referenced to as base-models.
The issue with voting is that it is not clear which classifier to trust. Stacking attempts to learn which classifiers are the reliable ones, leveraging another learning algorithm; the metalearner; to discover how best to bring together the output of the base learners.
In the language obtained from the paper that put forth this strategy, base models are referenced to as level-0 learners, and the meta-model is referenced to as a level-1 model.
Naturally, the stacking of models can continue to any desirable level.
Stacking is a general process where a learner is trained to combine the individual learners. Here, the individual learners are referred to as the first-level learners, while the combiner is referred to as the second-level learner, or meta-learner.
Critically, the way that the meta-model is trained is different to the way the base models are trained.
The input to the meta-model are the predictions made by the base-models, not the raw inputs from the dataset. The target is the same expected target value. The predictions made by the base-models leveraged to train the meta-model are for instances not leveraged to train the base models, implying that they are out of sample.
For instance, the dataset can be split into train, validation and test datasets. Every base-model can then be fitted on the training set and make predictions on the validation dataset. The predictions from the validation set are then leveraged to train the meta-model.
This implies that the meta-model is trained to ideally combine the capabilities of the base-models when they are making out-of-sample predictions, for example, instances not observed during training.
We reserve some instances to form the training information for the level-1 learner and build level-0 classifier from the remaining data. After the level-0 classifiers have been constructed are leveraged to classify the examples in the holdout set, forming the level-1 training data.
After the meta-model is trained, the base models can be re-trained on the combined learning and validation datasets. The entire system can then be assessed on the test set by passing instances first through the base models to gather base-level predictions, then passing those predictions through the meta-model to get final predictions. The system can be leveraged in the same way when making forecasts on new data.
This strategy to training, evaluating, and leveraging a stacking model can be further generalized to work with k-fold cross-validation.
Usually, base models are prepped leveraging differing algorithms, meaning that the ensembles are a heterogenous collection of model types furnishing a desired level of diversity to the forecasts made. Although, this does not need to be the scenario, and differing configurations of the same models can be leveraged or the same model trained on differing datasets.
The first-level learners are typically produced through application of differing learning algorithms, and so, stacked ensembles are often heterogenous.
On classification problems, the stacking ensemble often has better performance when base-models are setup to forecast probabilities rather than crisp class labels, as the included uncertainty in the predictions furnishes additional context for the meta model when learning how to ideally combine the predictions.
A majority of learning schemes are able to output probabilities for each class label rather than making a singular categorical forecast. This can be exploited to enhance the performance of stacking by leveraging the odds to form the level-1 data.
The meta-model is usually a simplistic linear model, like a linear regression for regression problems or a logistic regression model for classification. Also, this does not have to be the scenario, and any machine learning model can be leveraged as the meta learner.
As a majority of the work is already performed by the level-0 learners, the level-1 classifier is essentially just an arbiter and it makes sense to select a rather simple algorithm for this reason […] Simple linear models or trees with linear models at the leaves typically function well.
This is a high-level summarization of the stacking ensemble method, however we can go about generalizing the strategy and extract the basic elements.
Essence of Stacking Ensembles
The essence of stacking is with regards to learning how to combine contributing ensemble members.
In this fashion, we might think of stacking as the assumption that a simple “wisdom of crowds” (e.g. averaging) is good but not optimum and that improved outcomes can be accomplished if we can detect and provide them additional weight to specialists in the crowd.
The specialists and lesser experts are detected on the basis of their skill in new scenarios, e.g. out-of-sample data. This is a critical distinction from simple averaging and voting, even though it puts forth a level of intricacy that makes the strategy a challenge to implement rightly and prevent data leakage, and in turn, wrong and optimistic performance.
Nonetheless, we can observe that stacking is a really general ensemble learning strategy.
Widely conceived, we might think of a weighted average of ensemble models as a generalization and enhancement upon voting ensembles, and stacking as a further generalization of a weighted average model.
As such, the structure of the stacking process can be divided into three basic factors, which are:
- Diverse ensemble members: Develop a diverse set of models that make differing predictions.
- Member assessment: Assess the performance of ensemble members.
- Combine with model: Leverage a model to combine predictions from members.
We can go about mapping canonical stacking onto these factors as follows:
- Diverse Ensemble members: Leverage differing algorithms to fit every contributing model.
- Member Assessment: Assess model performance on out-of-sample predictions.
- Combine with model: Machine learning model to combine forecasts.
This furnishes a framework where we should consider connected ensemble algorithms.
Let’s take a deeper look at other ensemble strategies that might be viewed as a part of the stacking family.
Stacking Ensemble Family
Several ensemble machine learning strategies might be considered precursors or descendants of stacking.
As such, we can map them onto our framework of essential stacking. This is a beneficial exercise as it both highlights the variations between strategies and uniqueness of every strategy. Probably more critically, it might also spark ideas for extra variations that you might want to explore on your own predictive modelling project.
Let’s take a deeper look at other ensemble methods that might be viewed as part of the stacking family.
Voting ensemble are one of easiest ensemble learning strategies.
A voting ensemble usually consists of leveraging a differing algorithm to prep every ensemble member, much like stacking. Rather than learning how to bring together predictions, a simplistic statistic is leveraged.
On regression issues, a voting ensemble might forecast the mean or median of the forecasts from ensemble members. For classification issues, the label with the majority of votes is forecasted, referred to as hard voting, or the label that obtained the biggest sum probability is forecasted, referred to as soft voting.
The critical difference from stacking is that there is no weighting of models on the basis of their performance. All models are assumed to possess the same skill level on average.
Member assessment: Assuming all models are equally skilful.
Combine with model: Simple statistics.
Weighted Average Ensemble
A weighted average might be taken up one step above a voting ensemble.
Like stacking and voting ensembles, a weighted average leverages a diverse grouping of model variants as a contributing members.
Not like voting, a weighted average goes by the assumption that a few contributing members are better than other and weights contributions from models accordingly.
The easiest weighted average ensemble weights every model on the basis of its performance within a training dataset. An enhancement over this naïve approach is to weigh every member on the basis of its performance on a hold-out dataset, like a validation set or out-of-fold forecasts during k-fold cross-validation.
One step beyond might consist of tuning the coefficient weightings for every model leveraging an optimization algorithm and performance on a holdout dataset.
These ongoing improvements of a weighted average model start to resemble a primitive stacking model with a linear model trained to bring together the predictions.
- Member assessment: Member performance on training dataset.
- Combine with model: Machine learning model to combine predictions.
This furnishes a framework where we would take up connected ensemble algorithms.
Let’s take a deeper look at other ensemble strategies that might be viewed a part of the stacking family.
Stacking Ensemble Family
Several ensemble machine learning strategies might be viewed as precursors or descendants of stacking.
As such, we can map them onto our framework of basic stacking. This is a beneficial exercise as it both highlights the variations amongst methods and uniqueness of every technique. Probably more critically, it might also spur ideas for extra variations that you might wish to explore on your own predictive modelling project.
Let’s take a deeper look at four of the typical ensemble strategies connected to stacking.
Voting Ensembles are one of the easiest ensemble learning strategies.
A voting ensemble usually consists of leveraging a differing algorithm to prep every ensemble member, a lot like stacking. Rather than learning how to combine predictions, a simplistic statistic is leveraged.
On regression issues, a voting ensemble may forecast the mean or median of the forecasts from ensemble members. For classification issues, the label with the most votes is forecasted, referred to as hard voting, or the label that received the biggest sum probability is forecasted, referred to as soft voting.
The critical difference from stacking is that there is no weighting of models on the basis of their performance. We assume that all models to possess the same skill level on average.
- Member assessment: assume all models are equally skilful
- Combine with model: Simplistic statistics
Weighted Average Ensemble
A weighted average may be viewed as one step above a voting ensemble.
Like stacking and voting ensembles, a weighted average leverages a diverse collection of model types as contributing members.
Unlike voting, as weighted average goes by the assumption that some contributing members are better than others and weighs contributions from models.
The simplest weighted average ensemble weighs every model on the basis of its performance on a training dataset. An enhancement over this naïve approach is to weigh every member on the basis of its performance on a hold-out dataset, like a validation set or out-of-fold predictions during k-fold cross-validation.
One step beyond might consist of tuning the coefficient weightings for every model leveraging an optimization algorithm and performance within a holdout dataset.
These continued enhancements of a weighted average model started to resemble a primitive stacking model possessing a linear model which receives training to bring together the predictions.
- Member assessment: Member performance on a training dataset
- Combine with model: Weighted average of predictions
Blending is overtly a stacked generalization model with a particular configuration.
A restriction of stacking is that there is no typically accepted configuration. This can make the method a challenge for starters as basically any models can be leveraged as the base-models and meta-model and any resampling method can be leveraged to prep the training dataset for the meta-model.
Blending is a particular stacking ensemble that makes two prescriptions.
The first is to leverage a holdout variation dataset to prep the out-of-sample forecasts leveraged to train the meta-model. The second is to leverage a linear model as the meta-model.
The strategy was born out of the necessities of practitioners operating on machine learning competitions that consist of the development of a very big number of base learner models, probably from differing sources (or units of people) that in turn might be too computationally costly and too much of a challenge to coordinate to validate leveraging the k-fold cross-validation partitions of the dataset.
Member predictions: Out-of-sample predictions on a validation dataset.
Combine with model: Linear model (for example, linear regression or logistic regression)
Provided the proliferation of blending ensembles, stacking has at times come to particularly reference to the leveraging of k-fold cross validation to prep out of sample predictions for the meta-model.
Super Learner Ensemble
Like blending, the super ensemble is a particular configuration of a stacking ensemble.
The meta-model in super learning is prepped leveraging out-of-fold predictions for base learners gathered during k-fold cross validation.
As such, we might view the super learner ensemble as a sibling to blending where the primary difference is the choice of how out-of-sample forecasts are prepped for the meta learner.
- Diverse ensemble members: Leveraging differing algorithms and differing configurations of the similar algorithms.
- Member Assessment: Out of fold predictions on k-fold cross-validation
Customized Stacking Ensembles
We have undertaken review of canonical stacking as a framework for bringing together the predictions from a diverse array of model types.
Stacking is a broad strategy, which can make it difficult to begin leveraging. We can observe how voting ensembles and weighted average ensembles are a simplification of the stacking method and blending ensembles and the super learner ensembles are a particular configuration of stacking.
This review demonstrated that the concentration on differing stacking strategies is on the advancement of the meta-model, like leveraging statistics, a weighted average, or a true machine learning model. The concentration has also been on the manner in which the meta-model is trained, for example, out of sample forecasts from a validation dataset or k-fold cross validation.
An alternative area to look into with stacking may be the diversity of the ensemble members beyond merely leveraging differing algorithms.
Stacking is not prescriptive in the variants of models contrasted to boosting and bagging that both prescribe leveraging decision trees. This enables for a ton of flexibility in customizing and looking into the use of the strategy on a dataset.
For instance, we could imagine fitting a big number of decision trees on bootstrap samples of the training dataset, as we perform in bagging, then evaluating a suite of differing models to learn how to ideally combine the forecasts from the trees.
Diverse Ensemble Members: Decision trees trained on bootstrap samples.
Alternatively, we can visualize a grid looking for a large number of configurations for a singular machine learning model, which is typical on a machine learning project, and keeping all of the fit models. These models could then be leveraged as members in a stacking ensemble.
Diverse Ensemble Members: Alternative configurations of the similar algorithm.
We might also observe the mixture of experts strategy as fitting into the stacking strategy.
Mixture of experts, or MoE in short, is a strategy that overtly partitions an issue into subproblems and trains a model on every subproblem, then leverages the model to learn how to best weigh or bring together the forecasts from specialists.
The critical differences amongst stacking and mixture of experts are the overtly divide and conquer approach of MoE and the more complicated fashion in which predictions are brought together for leveraging a gating network.
Nonetheless, we visualize partitioning an input feature space into a grid of subspaces, training a model on every subspace and leveraging a meta-model that takes the forecasts from the base-models in addition to the raw input sample and learns which base-model to trust or weigh the most conditional on the input data.
Diverse Ensemble Members: Partition input feature space into uniform subspaces.
This could be extended even more to first choose the one model variant that performs well among many for every subspace, keeping only those top-performing experts for every subspace, then learning how to ideally combine their forecasts.
Lastly, we might think of the meta-model as a correction of the base models. We might look into this concept and have several meta-models attempt to rectify overlapping or non-overlapping pools of contributing members and extra layers of models stacked on top of each other. This in-depth stacking of models is at times leveraged in machine learning contests and can become complicated and a challenge to train, but might provide extra benefit on forecasting tasks where better model skill largely outweighs the capability to introspect the model.
We can observe that the generality of the stacking method leaves a ton of room for experimentation and customization, where concepts from boosting and bagging might be integrated directly.
This portion of the blog furnishes additional resources on the subject if you are seeking to delve deeper.
- Stacking Ensemble Machine Learning with Python
- How to Develop a Stacking Ensemble for Deep Learning Neural Networks in Python with Keras
- How to Implement Stacked Generalization (Stacking) from Scratch with Python
- How to Develop Voting Ensembles with Python
- How to Develop Super Learner Ensembles in Python
- Pattern Classification using Ensemble Methods, 2010
- Ensemble methods, 2012
- Ensemble machine learning, 2012
- Data mining: Practical machine learning tools and techniques, 2016
In this guide, you found out about the essence of the stacked generalization strategy to machine learning.
Particularly, you learned:
- The stacking ensemble method for machine learning leverages a meta-model to combine forecasts from contributing members
- How to distill the basic elements from the stacking method and how popular extensions such as blending and the super ensemble are related.
- How to devise new extensions to stacking by choosing fresh procedures for the basic elements of the method.