Ensemble Learning Algorithm Complexity and Occam’s Razor
Occam’s razor suggests that in machine learning, we should prefer simpler models with fewer coefficients over complex models like ensembles.
Taken at face value, the razor is a heuristic suggesting that more complex hypotheses make more assumptions that, in turn, make them too narrow and prevent them from generalizing well. In machine learning, it suggests that complex models like ensembles will overfit the training dataset and perform poorly on new data.
In practice, ensembles are almost universally the type of model chosen on projects where predictive skill is the most important consideration. Further, empirical results show a continued reduction in generalization error as the complexity of an ensemble learning model is incrementally increased. These findings are at odds with the Occam’s razor principle taken at face value.
In this tutorial, you will discover how to reconcile Occam’s Razor with ensemble machine learning.
After completing this tutorial, you will know:
- Occam’s razor is a heuristic that suggests choosing simpler machine learning models, as they are expected to generalize better.
- The heuristic can be divided into two razors, one of which is true and remains a useful tool, and one of which is false and should be abandoned.
- Ensemble learning algorithms like boosting provide a specific case of how the second razor fails, and how added complexity can result in lower generalization error.
This tutorial is divided into three parts; they are:
1. Occam’s Razor for Model Selection
2. Occam’s Two Razors for Machine Learning
3. Occam’s Razor and Ensemble Learning
Occam’s Razor for Model Selection
Model selection is the procedure of selecting one from among potentially several candidate machine learning models for a predictive modelling project.
It is often straightforward to choose a model based on its expected performance, e.g. choose the model with the highest accuracy or lowest prediction error.
Another critical consideration is to select simpler models over complex models.
Simpler models are typically defined as models that make fewer assumptions or have fewer elements, most commonly characterized as fewer coefficients (e.g. rules, layers, weights, etc.). The rationale for choosing simpler models is tied back to Occam’s Razor.
The idea is that the best scientific theory is the smallest one that explains all the facts.
Occam’s Razor is an approach to problem-solving and is commonly invoked to mean that, if all else is equal, we should prefer the simpler solutions.
Occam’s Razor: if all else is equal, the simplest solution is correct.
It is named for William of Ockham and was proposed to counter ever more elaborate philosophy without equivalent increases in predictive power.
William of Occam’s famous razor states that “Numquam ponenda est pluralitas sine necessitate,” which, approximately translated, means “Entities should not be multiplied beyond necessity”.
It is not so much a rule as a heuristic for problem-solving, and it is commonly invoked in science to prefer simpler hypotheses that make fewer assumptions over more complex hypotheses that make more assumptions.
There is a long-standing tradition in science that, other things being equal, simple theories are preferable to complex ones. This is known as Occam’s Razor after the medieval philosopher William of Occam (or Ockham).
The problem with complex hypotheses that make more assumptions is that they are likely too specific.
They may include details of particular cases that are at hand or easily available, and in turn may not generalize to new cases. That is, the more assumptions a hypothesis makes, the narrower it is expected to be in its application. Conversely, fewer assumptions suggest a more general hypothesis with greater predictive power over more cases.
- Simple hypothesis: fewer assumptions and, in turn, broad applicability.
- Complex hypothesis: more assumptions and, in turn, narrow applicability.
This has implications in machine learning, as we are specifically trying to generalize to new unseen cases from specific observations, referred to as inductive reasoning.
If Occam’s Razor suggests that more complex models don’t generalize well, then in applied machine learning it suggests we should choose simpler models, as they will have lower prediction error on new data.
If this is the case, then how can we justify using an ensemble machine learning algorithm?
By definition, ensemble machine learning algorithms are more complex than a single machine learning model, as they are composed of many individual machine learning models.
Occam’s razor suggests that the added complexity of ensemble learning algorithms means that they will not generalize as well as simpler models fit on the same dataset.
Yet ensemble machine learning algorithms are the dominant solution when predictive skill on new data is the most important concern, such as in machine learning competitions. Ensembles have been studied extensively and have been shown not to overfit the training dataset in study after study.
It has been empirically observed that certain ensemble techniques often do not overfit the model, even when the ensemble contains thousands of classifiers.
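That observation is easy to check directly. The sketch below (assuming scikit-learn is available; the synthetic dataset and member counts are illustrative choices, not taken from any referenced study) grows a bagged tree ensemble and scores it on a holdout set as more members are added; test accuracy typically holds steady or improves rather than collapsing.

```python
# Sketch: holdout accuracy of a bagged tree ensemble as members are added.
# Assumes scikit-learn; the dataset and member counts are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

scores = {}
for n in (10, 100, 1000):
    model = RandomForestClassifier(n_estimators=n, random_state=1)
    model.fit(X_train, y_train)
    # Accuracy on data not seen during training.
    scores[n] = model.score(X_test, y_test)
    print(n, scores[n])
```

Even at a thousand members, the holdout score does not degrade the way a naive reading of the razor would predict.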
How can we resolve this contradiction?
Occam’s Two Razors for Machine Learning
The conflict between the expectation that simpler models generalize better in theory and that complex models like ensembles generalize better in practice was mostly ignored as an inconvenient empirical finding for a long time.
In the late 1990s, the problem was specifically studied by Pedro Domingos and published in the award-winning 1998 paper titled “Occam’s Two Razors: The Sharp and the Blunt,” and the follow-up 1999 journal article titled “The Role of Occam’s Razor in Knowledge Discovery.”
In the work, Domingos defines the problem as two specific, commonly asserted implications of Occam’s Razor in applied machine learning, which he refers to as “Occam’s Two Razors”; they are (taken from the paper):
- First razor: Given two models with the same generalization error, the simpler one should be preferred because simplicity is desirable in itself.
- Second razor: Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error.
Domingos then enumerates many examples for and against each razor, drawn from both theory and empirical studies in machine learning.
The first razor suggests that if two models have the same expected performance on data not seen during training, we should prefer the simpler model. Domingos highlights that this razor holds and provides a good heuristic on machine learning projects.
The second razor suggests that if two models have the same performance on a training dataset, then the simpler one should be chosen because it is expected to generalize better when used to make predictions on new data.
This appears sensible on the surface.
This is the argument behind not adopting ensemble algorithms in a machine learning project: they are much more complex than other models and are expected not to generalize as well. It turns out that this razor cannot be supported by the evidence from the machine learning literature; all of that evidence points to the conclusion that not only is the second razor not true in general, it is also typically false in the kinds of domains KDD has been applied to.
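The failure mode of the second razor can be sketched in a few lines (assuming scikit-learn; the models and the synthetic dataset are illustrative choices): an unpruned decision tree and a random forest both drive training error to zero, yet the far more complex forest typically achieves lower error on held-out data.

```python
# Sketch: two models with (near-)identical training error but different
# generalization. Assumes scikit-learn; the dataset is illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=2)

tree = DecisionTreeClassifier(random_state=2).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=2).fit(X_tr, y_tr)

# Both typically fit the training data perfectly (training error ~ zero)...
print(tree.score(X_tr, y_tr), forest.score(X_tr, y_tr))
# ...but the more complex ensemble usually scores higher on the holdout set.
print(tree.score(X_te, y_te), forest.score(X_te, y_te))
```

Under the second razor, equal training error should favour the single tree; on most runs of this kind, the ensemble generalizes better anyway.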
Occam’s Razor and Ensemble Learning
The finding begins to sound intuitive once you mull it over for a while.
For example, in practice we would not choose a machine learning model based on its performance on the training dataset alone. We intuitively, or perhaps after a lot of experience, tacitly expect the estimate of performance on the training set to be a poor estimate of performance on a holdout dataset.
We have this expectation because the model can overfit the training dataset.
Yet, less intuitively, overfitting the training dataset can lead to better performance on a holdout test set. This has been observed many times in practice, in systematic studies.
A common situation involves plotting the performance of a model on both the training dataset and a holdout test dataset at each iteration of learning, such as training epochs for models that support incremental learning.
If training is allowed to continue for a large number of iterations and the curves are observed, it can often be seen that the error on the training dataset falls to zero. This is to be expected, as we might think that the model will overfit the training dataset given enough resources and time to train. Yet the performance on the test set can continue to improve, even while the performance on the training set remains fixed at zero error.
Occasionally, the generalization error would continue to improve long after the training error had reached zero.
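The learning-curve experiment described above can be sketched with a gradient boosting model, whose `staged_predict` method exposes predictions after each boosting round. This assumes scikit-learn is available; the dataset and hyperparameters are illustrative choices, not those of any cited study.

```python
# Sketch: train vs. test error after each boosting round.
# Assumes scikit-learn; dataset and hyperparameters are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=4)

model = GradientBoostingClassifier(n_estimators=200, random_state=4)
model.fit(X_tr, y_tr)

# Misclassification error on each dataset after every boosting iteration.
train_err = [np.mean(p != y_tr) for p in model.staged_predict(X_tr)]
test_err = [np.mean(p != y_te) for p in model.staged_predict(X_te)]

# Inspect (or plot) the two curves: training error typically falls toward
# zero while test error can keep improving over further iterations.
print(train_err[-1], min(test_err), test_err[-1])
```

Plotting `train_err` and `test_err` against the iteration index reproduces the kind of curves discussed above.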
This behaviour can be seen with ensemble learning algorithms like boosting and bagging, where performance on the holdout dataset continues to improve as additional model members are added to the ensemble.
One very surprising finding is that performing more boosting iterations can reduce the error on new data long after the classification error of the combined classifier on the training data has dropped to zero.
That is, the model’s complexity is incrementally increased, which systematically decreases the error on unseen data, i.e. the generalization error. The additional training cannot improve performance on the training dataset; it has no possible improvement left to make.
Performing more boosting iterations without reducing training error does not explain the training data any better, and it certainly adds complexity to the combined classifier.
This finding directly contradicts the second razor and supports Domingos’ argument for abandoning the second razor.
The first one is largely uncontroversial, while the second one, taken literally, is false.
This issue has been studied and can generally be explained by the ensemble algorithms learning to be more confident in their predictions on the training dataset, and that confidence carrying over to the holdout data.
The apparent contradiction can be resolved by considering the classifier’s confidence in its predictions.
The first razor remains an important heuristic in applied machine learning.
The key aspect of this razor is the predicate of “all other things being equal.” That is, if two models are compared, they must be compared using their generalization error on a holdout dataset or estimated using k-fold cross-validation. If their performance is equal under these circumstances, then the razor can take effect and we can choose the simpler solution.
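As a sketch of that procedure (assuming scikit-learn; the two candidate models and the "close enough" tolerance are my own illustrative choices), two candidates can be compared by k-fold cross-validation, with the simpler one chosen only when their estimated scores are effectively equal:

```python
# Sketch of the first razor: compare models by estimated generalization
# performance (k-fold CV) and prefer the simpler one only when scores
# are close. Assumes scikit-learn; the tolerance is an arbitrary choice.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=3)

simple = LogisticRegression(max_iter=1000)
complex_ = RandomForestClassifier(n_estimators=100, random_state=3)

simple_score = cross_val_score(simple, X, y, cv=10).mean()
complex_score = cross_val_score(complex_, X, y, cv=10).mean()

# "All other things being equal": only then does the razor apply.
if abs(simple_score - complex_score) < 0.01:
    choice = "simple"  # equal skill, so prefer the simpler model
else:
    choice = "simple" if simple_score > complex_score else "complex"
print(choice, simple_score, complex_score)
```

The tolerance of 0.01 is arbitrary; in practice a statistical test over the per-fold scores would be a more principled way to decide that performance is "equal."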
This is not the only way to choose models.
We may choose a simpler model because it is easier to interpret, and this remains valid if model interpretability is a more important project requirement than predictive skill.
Ensemble learning algorithms are almost universally a more complex type of model when the number of model parameters is taken as the measure of complexity. As such, an open problem in machine learning involves alternative measures of complexity.
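One crude version of that parameter-count measure can be sketched as follows (assuming scikit-learn; the counting rules, weights plus bias for the linear model and total tree nodes for the forest, are illustrative choices, not a standard definition):

```python
# Sketch: parameter counts as one crude complexity measure.
# Assumes scikit-learn; the counting rules here are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=5)

linear = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=5).fit(X, y)

# Linear model: one weight per feature plus the intercept.
n_linear = linear.coef_.size + linear.intercept_.size
# Forest: total decision nodes and leaves across all member trees.
n_forest = sum(t.tree_.node_count for t in forest.estimators_)
print(n_linear, n_forest)
```

By this count the ensemble is orders of magnitude more complex, which is exactly why parameter counting alone is an unsatisfying measure given how well ensembles generalize.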
Further Reading
This section provides additional resources on the topic if you are looking to go deeper.
- Occam’s Two Razors: The Sharp and the Blunt, 1998.
- The Role of Occam’s Razor in Knowledge Discovery, 1999.
- Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, 2010.
- Pattern Classification Using Ensemble Methods, 2010.
- Data Mining: Practical Machine Learning Tools and Techniques, 2016.
- Occam’s Razor, Wikipedia
- William of Ockham, Wikipedia
Summary
In this tutorial, you discovered how to reconcile Occam’s Razor with ensemble machine learning.
Specifically, you learned:
- Occam’s razor is a heuristic that suggests choosing simpler machine learning models, as they are expected to generalize better.
- The heuristic can be divided into two razors, one of which is true and remains a useful tool, and one of which is false and should be abandoned.
- Ensemble learning algorithms like boosting provide a specific case of how the second razor fails, and how added complexity can result in lower generalization error.