Combining predictions for ensemble learning
Ensemble methods combine the predictions from several models.
How the predictions are combined is a fundamental aspect of an ensemble method. It depends heavily on the types of models that contribute to the ensemble and on the type of prediction problem being modelled, such as classification or regression.
Nonetheless, there are standard strategies for combining predictions that are simple to implement and often give good or even the best predictive performance.
In this blog article by AICorespot, you will learn that:
- Combining the predictions from contributing models is a key property of an ensemble model.
- Voting strategies are most commonly used when combining predictions for classification.
- Statistical strategies are most commonly used when combining predictions for regression.
Tutorial Overview
This tutorial is divided into three parts, which are:
1. Combining predictions for machine learning
2. Combining classification predictions
- Combining predicted class labels
- Combining predicted class probabilities
3. Combining regression predictions
Combining Predictions for Machine Learning
A key aspect of an ensemble learning method is combining the predictions from several models.
It is through the combination of predictions that the benefit of an ensemble learning method is achieved, namely improved predictive performance. There are many ways that predictions can be combined, so many that it is an entire field of study.
After generating a set of base learners, rather than trying to find the best single learner, ensemble methods resort to combination to achieve a strong generalization ability, where the combination method plays a crucial role.
Standard ensemble machine learning algorithms do prescribe how to combine predictions; nonetheless, it is worth considering the topic in isolation for several reasons, such as:
- Interpreting the forecasts made by conventional ensemble algorithms.
- Manually specifying a custom prediction combination method for an algorithm.
- Developing your own ensemble methods.
Ensemble learning methods are usually not very complicated, and developing your own ensemble method or specifying how predictions are combined is relatively simple and common practice.
How predictions are combined depends on the models that are making the predictions and on the type of prediction problem.
The strategy used in this step depends, in part, on the type of classifiers used as ensemble members. For instance, some classifiers, like support vector machines, provide only discrete-valued label outputs.
For instance, the form of the predictions made by the models will match the type of prediction problem, such as regression for predicting numbers and classification for predicting class labels. In addition, some model types may only be able to predict either a class label or a class probability distribution, while others may support both for a classification task.
We will use this division of prediction type by problem type as the basis for exploring the common strategies for combining predictions from contributing models in an ensemble.
In the next section, we will look at how to combine predictions for classification predictive modelling tasks.
Combining Classification Predictions
Classification refers to predictive modelling problems that involve predicting a class label given an input.
The prediction made by a model might be a crisp class label directly, or it might be a probability that an instance belongs to each class, referred to as the probability of class membership.
Performance on a classification problem is typically measured using accuracy or a related count or ratio of correct predictions. When evaluating predicted probabilities, they might be converted to crisp class labels by choosing a cut-off threshold, or evaluated using specialized metrics like cross-entropy.
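As a rough illustration of the two options just mentioned, the sketch below converts a predicted probability to a crisp label with a cut-off threshold and scores a predicted distribution with cross-entropy. The function names, the default threshold of 0.5, and the clipping constant are assumptions made for this example, not part of any particular library.

```python
import math

def to_label(prob_positive, threshold=0.5):
    """Convert a probability for the positive class into a crisp 0/1 label.

    The 0.5 default is an assumption; the cut-off is problem-specific.
    """
    return 1 if prob_positive >= threshold else 0

def cross_entropy(true_label, probs, eps=1e-15):
    """Cross-entropy for one example: negative log of the probability
    assigned to the true class. Clipping with eps avoids log(0)."""
    p = min(max(probs[true_label], eps), 1.0)
    return -math.log(p)

print(to_label(0.7))                         # 1
print(to_label(0.3))                         # 0
print(cross_entropy(0, [0.75, 0.10, 0.15]))  # lower is better
```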
We will review combining predictions for classification separately for class labels and for probabilities.
Combining Predicted Class Labels
A predicted class label is typically mapped to something meaningful in the problem domain.
For instance, a model might predict a colour like “red” or “green”. Internally though, the model predicts a numerical representation of the class label, such as 0 for “red”, 1 for “green”, and 2 for “blue” in our colour classification example.
Methods for combining class labels are probably easier to reason about if we work with the integer-encoded class labels directly.
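A minimal sketch of the integer encoding described above, using the colour example from the text. The helper names `encode` and `decode` are hypothetical and chosen only for illustration.

```python
# Class names from the colour example; list position is the integer code.
CLASS_NAMES = ["red", "green", "blue"]

# Map each label to its integer code: red -> 0, green -> 1, blue -> 2.
LABEL_TO_INT = {name: code for code, name in enumerate(CLASS_NAMES)}

def encode(labels):
    """Convert label names to integer codes."""
    return [LABEL_TO_INT[name] for name in labels]

def decode(codes):
    """Convert integer codes back to label names."""
    return [CLASS_NAMES[code] for code in codes]

encoded = encode(["green", "green", "red"])
print(encoded)          # [1, 1, 0]
print(decode(encoded))  # ['green', 'green', 'red']
```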
Probably the simplest, most common, and often most effective strategy is to combine the predictions by voting.
Voting is the most widespread and basic combination method for nominal outputs.
Voting typically involves each model that makes a prediction casting a vote for the class it predicted. The votes are tallied and an outcome is then chosen using the tallies in some way.
There are several types of voting, so let’s look at the four most common:
- Plurality voting
- Majority voting
- Unanimous voting
- Weighted voting
Simple voting, called plurality voting, selects the class label with the most votes.
If two or more classes have the same number of votes, then the tie is broken arbitrarily but in a consistent manner, such as sorting the tied class labels and choosing the first, rather than choosing one at random. This matters so that the same model with the same data always makes the same prediction.
Because of ties, it is common to use an odd number of ensemble members in an effort to break ties automatically, as opposed to an even number of ensemble members, where ties may be more likely.
From a statistical perspective, this is called the mode, or the most common value in the collection of predictions.
For instance, consider the predictions made by three models for a three-class colour prediction problem:
- Model 1 predicts “green” or 1.
- Model 2 predicts “green” or 1.
- Model 3 predicts “red” or 0.
The votes are therefore:
- Red votes: 1
- Green votes: 2
- Blue votes: 0
The prediction would be “green”, given that it has the most votes.
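The worked example above can be sketched in a few lines. This is a minimal illustration that assumes integer-encoded labels and breaks ties by taking the smallest tied label, which is one consistent rule among many; the function name `plurality_vote` is hypothetical.

```python
from collections import Counter

def plurality_vote(predictions):
    """Return the label with the most votes.

    Ties go to the smallest tied label, which is equivalent to sorting
    the tied labels and choosing the first.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)

votes = [1, 1, 0]  # Model 1: green, Model 2: green, Model 3: red
print(plurality_vote(votes))  # 1, i.e. "green"
```

Because the tie-break is deterministic, the same set of votes always produces the same prediction.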
Majority voting selects the class label that has more than half the votes. If no class has more than half the votes, then “no prediction” is made. Interestingly, majority voting can be shown to be an optimal method for combining classifiers if they are independent.
If the classifier outputs are independent, then it can be shown that majority voting is the optimal combination rule.
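A small sketch of majority voting follows. Returning `None` to represent “no prediction” is an assumption made for this example, as is the function name.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label with more than half the votes, else None."""
    if not predictions:
        return None
    label, count = Counter(predictions).most_common(1)[0]
    # A strict majority needs more than half of all votes cast.
    return label if count > len(predictions) / 2 else None

print(majority_vote([1, 1, 0]))  # 1: two of three votes
print(majority_vote([0, 1, 2]))  # None: no class exceeds half
```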
Unanimous voting is related to majority voting in that, rather than requiring more than half the votes, the method requires all models to predict the same value; otherwise, no prediction is made.
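Unanimous voting can be sketched in the same style. Again, using `None` for “no prediction” is an assumption of this example.

```python
def unanimous_vote(predictions):
    """Return the shared label if every model agrees, else None."""
    if predictions and len(set(predictions)) == 1:
        return predictions[0]
    return None

print(unanimous_vote([1, 1, 1]))  # 1
print(unanimous_vote([1, 1, 0]))  # None
```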
Weighted voting weights the prediction made by each model in some way. One example would be to weight predictions based on the average performance of the model, such as classification accuracy.
The weight of each classifier can be set proportional to its accuracy on a validation set.
Assigning weights to classifiers can become a project in itself and could involve using an optimization algorithm and a holdout dataset, a linear model, or even another machine learning model entirely.
So how do we assign the weights? If we knew, a priori, which classifiers would work better, we would use only those classifiers. In the absence of such information, a plausible and commonly used strategy is to use the performance of a classifier on a separate validation (or even training) dataset as an estimate of that classifier’s generalization performance.
The idea of weighted voting is that some classifiers are more likely to be accurate than others, and we should reward them by giving them a larger share of the votes.
If we have reason to believe that some of the classifiers are more likely to be correct than others, weighting the decisions of those classifiers more heavily can further improve the overall performance compared to that of plurality voting.
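The sketch below shows one way weighted voting might look, with weights set proportional to validation accuracy. The accuracy values are made up for illustration, and the function name and tie-break rule (smallest tied label) are assumptions.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    Ties go to the smallest tied label so the result is deterministic.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    best = max(totals.values())
    return min(label for label, total in totals.items() if total == best)

predictions = [1, 1, 0]               # green, green, red
val_accuracy = [0.45, 0.45, 0.95]     # hypothetical validation accuracies
# Weights proportional to accuracy, normalized to sum to 1.
weights = [a / sum(val_accuracy) for a in val_accuracy]
print(weighted_vote(predictions, weights))  # 0, i.e. "red"
```

With these hypothetical weights the single strong model outvotes the two weaker ones, whereas plurality voting on the same predictions would select “green”.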
Combining Predicted Class Probabilities
Probabilities summarize the likelihood of an event as a numerical value between 0.0 and 1.0.
When predicted for class membership, a probability is assigned to each class, and together they sum to 1.0; for instance, a model might predict:
- Red: 0.75
- Green: 0.10
- Blue: 0.15
We can see that class “red” has the highest probability, or is the most likely outcome predicted by the model, and that the probabilities across the classes (0.75 + 0.10 + 0.15) sum to 1.0.
How the probabilities are combined depends on the outcome that is required.
For instance, if probabilities are required, then the independently predicted probabilities can be combined directly.
Perhaps the simplest strategy for combining probabilities is to sum the probabilities for each class and pass the summed values through a softmax function. This ensures that the scores are appropriately normalized, meaning the probabilities across the class labels sum to 1.0.
Such outputs, upon proper normalization (like softmax normalization […]), can be interpreted as the degree of support given to that class.
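A minimal sketch of the sum-then-softmax strategy described above. The first row of probabilities is the example from the text; the other two rows are made-up values for two further hypothetical models, and the function names are assumptions.

```python
import math

def softmax(scores):
    """Normalize scores so they are positive and sum to 1.0."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_probabilities(per_model_probs):
    """Sum each class's probability across models, then apply softmax."""
    summed = [sum(column) for column in zip(*per_model_probs)]
    return softmax(summed)

probs = [
    [0.75, 0.10, 0.15],  # model 1 (from the text): red, green, blue
    [0.60, 0.30, 0.10],  # model 2 (hypothetical)
    [0.20, 0.50, 0.30],  # model 3 (hypothetical)
]
combined = combine_probabilities(probs)
print(combined)       # "red" (index 0) has the largest value
print(sum(combined))  # approximately 1.0
```

Note that softmax preserves the ranking of the summed scores but not their relative sizes, so the combined values are best read as normalized support for each class.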
More commonly, we want to predict a class label from the predicted probabilities.
The most common strategy is to use voting, where the predicted probabilities represent the vote made by each model for each class. Votes are then summed and a voting method from the previous section can be used, such as selecting the label with the largest summed probability or the largest mean probability.
- Vote using mean probabilities
- Vote using sum probabilities
- Vote using weighted sum probabilities
Generally, this approach of treating probabilities as votes for choosing a class label is referred to as soft voting.
If all the individual classifiers are treated equally, the simple soft voting method generates the combined output by simply averaging all the individual outputs.
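Soft voting can be sketched as follows, covering both the simple mean and an optional weighted variant. The probabilities for models 2 and 3 and the example weights are hypothetical, as is the function name `soft_vote`.

```python
def soft_vote(per_model_probs, weights=None):
    """Return (label, mean_probs) using (weighted) mean class probabilities.

    With weights=None every model counts equally (simple soft voting).
    Ties go to the lowest class index.
    """
    if weights is None:
        weights = [1.0] * len(per_model_probs)
    total_weight = sum(weights)
    mean_probs = [
        sum(w * p for w, p in zip(weights, column)) / total_weight
        for column in zip(*per_model_probs)
    ]
    label = max(range(len(mean_probs)), key=mean_probs.__getitem__)
    return label, mean_probs

probs = [
    [0.75, 0.10, 0.15],  # model 1 (from the text): red, green, blue
    [0.60, 0.30, 0.10],  # model 2 (hypothetical)
    [0.20, 0.50, 0.30],  # model 3 (hypothetical)
]
label, mean_probs = soft_vote(probs)
print(label)  # 0, i.e. "red"
print(soft_vote(probs, weights=[0.1, 0.1, 0.8])[0])  # 1, i.e. "green"
```

Selecting by the largest mean and by the largest sum give the same label, since the mean is just the sum divided by the number of models.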
Combining Regression Predictions
Regression refers to predictive modelling problems that involve predicting a numeric value given an input.
Performance on a regression problem is typically measured using average error, such as mean absolute error or root mean squared error.
Combining numerical predictions often involves using simple statistical methods, for instance:
- Mean predicted value
- Median predicted value
Both provide the central tendency of the distribution of predictions.
Averaging is the most widespread and basic combination method for numeric outputs.
The mean, also called the average, is the normalized sum of the predictions. The mean predicted value is more appropriate when the distribution of predictions is Gaussian or nearly Gaussian.
For instance, the mean is calculated as the sum of the predicted values divided by the total number of predictions. Suppose three models predicted the following prices:
- Model 1: 99.00
- Model 2: 101.00
- Model 3: 98.00
The mean prediction would be calculated as:
- Mean prediction = (99.00 + 101.00 + 98.00) / 3
- Mean prediction = 298.00 / 3
- Mean prediction = 99.33
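The same calculation can be checked with Python's standard library; this is just a short sketch using the three prices from the example.

```python
from statistics import mean

predictions = [99.00, 101.00, 98.00]  # Model 1, Model 2, Model 3
mean_prediction = mean(predictions)
print(round(mean_prediction, 2))  # 99.33
```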
Because of its simplicity and effectiveness, simple averaging is among the most widely used methods and is the first choice in many real applications.
The median is the middle value if all predictions were ordered, and is also referred to as the 50th percentile. The median predicted value is more appropriate when the distribution of predictions is not known or does not follow a Gaussian probability distribution.
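A short sketch of the median, using the three prices from the earlier example plus a made-up set with one extreme prediction to show how the median and mean can differ.

```python
from statistics import mean, median

predictions = [99.00, 101.00, 98.00]
print(median(predictions))  # 99.0

# Hypothetical case: one model makes an extreme prediction.
with_outlier = [99.00, 101.00, 250.00]
print(median(with_outlier))  # 101.0
print(mean(with_outlier))    # 150.0
```

In the second case the mean is pulled toward the extreme value while the median is not, which is one reason the median may be preferred when the predictions are not roughly Gaussian.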
Depending on the nature of the prediction problem, a conservative prediction might be desired, such as the maximum or the minimum. In addition, the distribution can be summarized to give a measure of uncertainty, such as reporting three values for each prediction:
- Minimum predicted value
- Median predicted value
- Maximum predicted value
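Reporting the three values might look like the sketch below. The function name `summarize` is hypothetical, and the prices are reused from the earlier example.

```python
from statistics import median

def summarize(predictions):
    """Return (minimum, median, maximum) of the ensemble's predictions."""
    return min(predictions), median(predictions), max(predictions)

low, middle, high = summarize([99.00, 101.00, 98.00])
print(low, middle, high)  # 98.0 99.0 101.0
```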
As with classification, the predictions made by each model can be weighted by expected model performance or some other value, and the weighted mean of the predictions can be reported.
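A minimal sketch of a weighted mean for regression predictions. The weights here are made up for illustration; in practice they might be derived from each model's performance on a validation set.

```python
def weighted_mean(predictions, weights):
    """Return the weighted mean of the predictions."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(p * w for p, w in zip(predictions, weights))
    return weighted_sum / total_weight

predictions = [99.00, 101.00, 98.00]
weights = [0.5, 0.3, 0.2]  # hypothetical weights, one per model
print(round(weighted_mean(predictions, weights), 2))  # 99.4
```

With equal weights this reduces to the simple mean.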
Further Reading
This section provides additional resources on the topic if you want to go deeper.
Books
- Pattern Classification Using Ensemble Methods, 2010
- Ensemble Methods, 2012
- Ensemble Machine Learning, 2012
- Ensemble Methods in Data Mining, 2010
Articles
- Ensemble learning, Wikipedia
- Ensemble learning, Scholarpedia
Conclusion
In this blog article, you learned about common strategies for combining predictions for ensemble learning.
Specifically, you learned:
- Combining predictions from contributing models is a key property of an ensemble model.
- Voting strategies are most commonly used when combining predictions for classification.
- Statistical strategies are most commonly used when combining predictions for regression.