
An intro to Mixture of Experts and Ensembles

Mixture of experts is an ensemble learning technique developed in the field of neural networks.

It involves decomposing a predictive modelling task into subtasks, training an expert model on each, developing a gating model that learns which expert to trust based on the input to be predicted, and combining the predictions.

Although the technique was initially described using neural network experts and gating models, it can be generalized to use models of any type. As such, it shows a strong similarity to stacked generalization and belongs to the class of ensemble learning methods referred to as meta-learning.

In this guide, you will find out about the mixture of experts approach to ensemble learning.

After completing this guide, you will know:

  • An intuitive approach to ensemble learning involves dividing a task into subtasks and developing an expert on each subtask.
  • Mixture of experts is an ensemble learning method that seeks to explicitly address a predictive modelling problem in terms of subtasks using expert models.
  • The divide and conquer approach is related to the construction of decision trees, and the meta-learner approach is related to the stacked generalization ensemble method.

Tutorial Overview

This tutorial is divided into three parts:

1] Subtasks and experts

2] Mixture of experts

  • Subtasks
  • Expert models
  • Gating Model
  • Pooling Method

3] Relationship with other strategies

  • Mixture of experts and decision trees
  • Mixture of experts and stacking

Subtasks and Experts

Some predictive modelling tasks are remarkably complex, although they may lend themselves to a natural division into subtasks.

For example, consider a one-dimensional function with a complex shape, like an S in two dimensions. We could attempt to devise a model that captures the function completely, but if we know the functional form, the S-shape, we could also divide the problem into three parts: the curve at the top, the curve at the bottom, and the line connecting the curves.

This is a divide and conquer approach to problem-solving, and it underlies many automated approaches to predictive modelling as well as problem-solving more broadly.

This approach can also be explored as the basis for developing an ensemble learning method.

For example, we can divide the input feature space into subspaces based on some domain knowledge of the problem. A model can then be trained on each subspace, becoming in effect an expert on that specific subproblem. A further model then learns which expert to call upon to predict new examples in the future.

The subproblems may or may not overlap, and experts from similar or related subproblems may be able to contribute to examples that are technically outside their expertise.

This approach to ensemble learning underlies a technique referred to as a mixture of experts.
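To make the S-shape example concrete, here is a minimal sketch in plain Python. It assumes a logistic curve as the S-shaped target, hand-picked region boundaries at -2 and 2 standing in for domain knowledge, and a straight line as each expert. None of these choices come from the text above, and the router here is a fixed rule rather than a learned model.

```python
import math

def s_curve(x):
    """Toy S-shaped target: a logistic curve (an assumption for illustration)."""
    return 1.0 / (1.0 + math.exp(-x))

def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x

# Hypothetical region boundaries chosen by "domain knowledge" of the S-shape.
BOUNDS = (-2.0, 2.0)

def region(x):
    """Route an input to a subtask: 0 = bottom curve, 1 = middle line, 2 = top curve."""
    if x < BOUNDS[0]:
        return 0
    if x < BOUNDS[1]:
        return 1
    return 2

xs = [i / 10.0 for i in range(-60, 61)]
data = [(x, s_curve(x)) for x in xs]

# One expert per subspace, each trained only on the rows in its own region.
experts = [fit_line([p for p in data if region(p[0]) == r]) for r in range(3)]

# A single global line for comparison.
global_line = fit_line(data)

def predict_experts(x):
    a, b = experts[region(x)]
    return a * x + b

def predict_global(x):
    a, b = global_line
    return a * x + b

def mse(predict):
    """Mean squared error of a predictor over the toy dataset."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

print(f"global line MSE:   {mse(predict_global):.5f}")
print(f"three experts MSE: {mse(predict_experts):.5f}")
```

Because each line is the least-squares optimum on its own region, the three experts together cannot do worse than the single global line on this data. The interesting part of a mixture of experts is replacing the fixed `region` rule with a model that learns the routing.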

Mixture of Experts

Mixture of Experts, MoE or ME for short, is an ensemble learning strategy that implements the concept of training experts on subtasks of a predictive modelling problem.

In the neural network community, various researchers have examined the decomposition methodology, […] Mixture-of-Experts (ME) methodology that decomposes the input space, such that every expert examines a differing portion of the space […] A gating network is accountable for combining the several experts.

There are four elements to the approach:

  • Divide the task into subtasks
  • Develop an expert for each subtask
  • Use a gating model to decide which expert to use
  • Pool the predictions and the gating model output to make a prediction

The image below, taken from page 94 of the 2012 book “Ensemble Methods”, provides a helpful overview of the architectural elements of the method.


Subtasks

The first step is to divide the predictive modelling problem into subtasks. This often involves using domain knowledge. For example, an image could be divided into separate elements such as background, foreground, objects, colours, lines, and so on.

ME operates in a divide-and-conquer technique where a complicated task is divided up into various simpler and smaller subtasks, and individual learners (referred to as experts) are trained for differing subtasks.

For problems where the division of the task into subtasks is not obvious, a simpler and more generic approach could be used. For example, one could imagine an approach that divides the input feature space by groups of columns, or separates examples in the feature space based on distance measures, on whether they are inliers or outliers for a standard distribution, and much more.
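As one illustration of a generic, distance-based division, the sketch below partitions rows with a plain k-means loop and treats each cluster as a subtask. This is only one possible choice, not a method prescribed by the text; the data and the starting centroids are made up so that the result is easy to follow.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda k: squared_distance(point, centroids[k]))

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm; returns the final centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group;
        # keep the old centroid if a group ends up empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

# Made-up rows forming two well-separated groups.
rows = [(0.1, 0.2), (0.0, -0.1), (0.3, 0.1), (5.0, 5.1), (4.8, 5.3), (5.2, 4.9)]

# Start from one row in each group so the example is deterministic.
centroids = kmeans(rows, centroids=[rows[0], rows[-1]])

# Each row's cluster index is the subtask an expert would be trained on.
subtask_of = [nearest(r, centroids) for r in rows]
print(subtask_of)  # [0, 0, 0, 1, 1, 1]
```

An expert would then be fit on the rows of each cluster, and new inputs could be routed by nearest centroid or by a learned gating model.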

Within ME, a critical problem is how to identify the natural division of the task and then obtain the overall solution from sub-solutions.

Expert Models

Next, an expert is developed for each subtask.

The mixture of experts approach was initially developed and explored within the field of artificial neural networks, so traditionally the experts themselves are neural network models used to predict a numerical value in the case of regression or a class label in the case of classification.

It should be obvious that we can “plug in” any model for the expert. For instance, we can leverage neural networks to represent both the gating functions and the experts. The outcome is known as a mixture density network.

Each expert receives the same input pattern (row) and makes a prediction.

Gating Model

A model is used to interpret the predictions made by each expert and to help decide which expert to trust for a given input. This is called the gating model, or the gating network, given that it is traditionally a neural network model.

The gating network takes as input the same input pattern that was provided to the expert models and outputs the contribution that each expert should make to the prediction for that input.

The weights determined by the gating network are dynamically allocated on the basis of any provided input, as the MoE effectively learns which portion of the feature space is learned by every ensemble member.
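A minimal sketch of such a gate, assuming the simplest possible form: one linear score per expert followed by a softmax. The weights below are hand-picked for illustration rather than learned, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate(x, weights, biases):
    """One linear score per expert, turned into weights that sum to 1."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return softmax(scores)

# Hypothetical, hand-picked parameters for a gate over two experts:
# expert 0 is favoured when the first feature is negative, expert 1 when positive.
W = [[-2.0, 0.0], [2.0, 0.0]]
B = [0.0, 0.0]

for x in ([-1.5, 0.3], [0.0, 0.3], [1.5, 0.3]):
    print(x, [round(g, 3) for g in gate(x, W, B)])
```

The same input row produces different expert weights depending on where it falls in the feature space, which is what makes the allocation dynamic rather than fixed.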

The gating network is key to the approach: in effect, the model learns to choose the type of subtask for a given input and, in turn, the expert to trust to make a strong prediction.

Mixture-of-experts can also be seen as a classifier selection algorithm, where individual classifiers are trained to become experts in some portion of the feature space.

When neural network models are used, the gating network and the experts are trained together so that the gating network learns when to trust each expert to make a prediction. This training procedure was traditionally implemented using expectation maximization (EM). The gating network might have a softmax output that gives a probability-like confidence score for each expert.

In general, the training procedure tries to achieve two goals: for given experts, to find the optimal gating function; and for a given gating function, to train the experts on the distribution specified by the gating function.
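One common way to read these two goals in an EM setting is through per-example "responsibilities": how likely it is that each expert produced the observed target, given the current gate and experts. The sketch below computes them for a regression case, assuming Gaussian noise with a fixed `sigma` around each expert's prediction. That noise model and the function names are assumptions made for illustration, not details from the text.

```python
import math

def gaussian_pdf(y, mean, sigma):
    """Density of a normal distribution with the given mean and sigma at y."""
    z = (y - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def responsibilities(y, expert_means, gate_weights, sigma=1.0):
    """E-step: posterior probability that each expert generated target y.

    h_i = g_i * N(y | mean_i, sigma^2) / sum_j g_j * N(y | mean_j, sigma^2)
    """
    joint = [g * gaussian_pdf(y, m, sigma) for g, m in zip(gate_weights, expert_means)]
    total = sum(joint)
    return [j / total for j in joint]

# Two experts predict 1.0 and 4.0 for some input; the gate currently trusts them equally.
h = responsibilities(y=1.2, expert_means=[1.0, 4.0], gate_weights=[0.5, 0.5])
print([round(v, 3) for v in h])
```

In the following M-step, these values would typically serve as targets for the gating model and as per-example weights when refitting each expert, so the expert that already explains an example well is trained harder on it.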

Pooling Method

Finally, the mixture of expert models must make a prediction, and this is achieved using a pooling or aggregation mechanism. This might be as simple as selecting the expert with the largest output or confidence provided by the gating network.

Alternatively, a weighted sum prediction could be made that explicitly combines the predictions made by each expert and the confidence estimated by the gating network. You might imagine other approaches to making effective use of the predictions and gating network output.

The pooling/combining system might then select a singular classifier with the highest weight, or calculate a weighted sum of the classifier outputs for every class, and pick the class that receives the highest weighted sum.
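The two pooling options can be sketched side by side for a classification case. The expert probabilities and gate weights below are made-up numbers, chosen so that the two methods disagree.

```python
def pool_select(expert_probs, gate_weights):
    """Winner-take-all: use only the expert the gate trusts most."""
    best = max(range(len(gate_weights)), key=lambda i: gate_weights[i])
    return expert_probs[best]

def pool_weighted(expert_probs, gate_weights):
    """Weighted sum of per-class probabilities across experts."""
    n_classes = len(expert_probs[0])
    return [
        sum(g * probs[c] for g, probs in zip(gate_weights, expert_probs))
        for c in range(n_classes)
    ]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Three experts, two classes; numbers are made up for illustration.
expert_probs = [[0.6, 0.4], [0.1, 0.9], [0.2, 0.8]]
gate_weights = [0.4, 0.3, 0.3]

selected = argmax(pool_select(expert_probs, gate_weights))
weighted = argmax(pool_weighted(expert_probs, gate_weights))
print(selected)  # 0: the single most trusted expert prefers class 0
print(weighted)  # 1: the weighted sum across all experts prefers class 1
```

Here the most trusted expert holds only 40% of the gate weight, so the other two experts outvote it under the weighted sum. Which behaviour is preferable depends on how sharply the subtasks are separated.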

Relationship with other Strategies

The mixture of experts method is less popular today, perhaps because it was described in the field of neural networks.

Nevertheless, more than 25 years of advancements and exploration of the technique have occurred, and you can see a great summary in the 2012 paper “Twenty Years of Mixture of Experts”.

Importantly, it is worth considering the broader intent of the technique and exploring how you might use it on your own predictive modelling problems.

For instance:

  • Are there obvious or systematic ways that you can divide your predictive modelling problem into subtasks?
  • Are there specialized methods that you can train on each subtask?
  • Consider developing a model that predicts the confidence of each expert model.

Mixture of Experts and Decision Trees

We can also see a relationship between a mixture of experts and Classification and Regression Trees, often referred to as CART.

Decision trees are fit using a divide and conquer approach to the feature space. Each split is chosen as a constant value for an input feature, and each sub-tree can be considered a sub-model.

Mixture of experts was mostly researched in the neural networks community. In this thread, analysts typically consider a divide-and-conquer technique, attempt to learn a mixture of parametric models jointly and leverage combining rules to obtain an overall solution.

We could take a similar recursive decomposition approach to decomposing the predictive modelling task into subproblems when designing the mixture of experts. This is generally referred to as a hierarchical mixture of experts.

The hierarchical mixtures of experts (HME) procedure can be viewed as a variant of tree-based methods. The main difference is that the tree splits are not hard decisions but rather soft probabilistic ones.

Unlike decision trees, the division of the task into subtasks is often explicit and top-down. Also unlike a decision tree, the mixture of experts attempts to consult all of the expert submodels rather than a single model.

There are other differences between HMEs and the CART implementation of trees. In an HME, a linear (or logistic regression) model is fit in each terminal node, instead of a constant as in CART. The splits can be multiway, not just binary, and the splits are probabilistic functions of a linear combination of inputs, rather than a single input as in the standard use of CART.
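The difference between a hard split and a soft split can be shown with a single two-way node. In this sketch both variants blend the same two linear leaf models, so that only the split behaviour differs (a real CART leaf would hold a constant). All coefficients are hypothetical.

```python
import math

def cart_split(x, feature, threshold):
    """Hard CART-style split: all weight goes left or right based on one feature."""
    go_left = x[feature] <= threshold
    return (1.0, 0.0) if go_left else (0.0, 1.0)

def hme_split(x, coefs, intercept):
    """Soft HME-style split: a logistic function of a linear combination of inputs."""
    score = sum(c * xi for c, xi in zip(coefs, x)) + intercept
    p_right = 1.0 / (1.0 + math.exp(-score))
    return (1.0 - p_right, p_right)

def predict(x, split_weights, left_leaf, right_leaf):
    """Blend two leaf models by the split weights; leaves are (coefs, intercept) pairs."""
    def leaf(model):
        coefs, intercept = model
        return sum(c * xi for c, xi in zip(coefs, x)) + intercept
    w_left, w_right = split_weights
    return w_left * leaf(left_leaf) + w_right * leaf(right_leaf)

x = [0.5, 1.0]
left_leaf = ([1.0, 0.0], 0.0)    # hypothetical linear leaf: y = x0
right_leaf = ([0.0, 2.0], 1.0)   # hypothetical linear leaf: y = 2*x1 + 1

hard = predict(x, cart_split(x, feature=0, threshold=0.0), left_leaf, right_leaf)
soft = predict(x, hme_split(x, coefs=[1.0, -0.5], intercept=0.0), left_leaf, right_leaf)
print(hard, round(soft, 3))  # 3.0 1.75
```

With the hard split, the input falls entirely in the right leaf. With the soft split, the linear score is exactly zero for this input, so both leaves contribute equally.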

Nevertheless, these differences might inspire variations on the approach for a given predictive modelling problem.

For instance:

  • Consider automatic or general approaches to dividing the feature space or problem into subtasks, to help broaden the suitability of the method.
  • Consider exploring both combination methods that trust the best expert and methods that seek a weighted consensus across experts.

Mixture of Experts and Stacking

The application of the technique does not have to be limited to neural network models, and a range of standard machine learning techniques can be used in their place to seek a similar end.

In this way, the mixture of experts method belongs to a broader class of ensemble learning methods that also includes stacked generalization, known as stacking. Like a mixture of experts, stacking trains a diverse ensemble of machine learning models and then learns a higher-order model to best combine the predictions.

We might refer to this class of ensemble learning methods as meta-learning models. That is, models that attempt to learn from the output, or learn how to best combine the output, of other lower-level models.

Meta-learning is a process of learning from learners (classifiers) […] in order to induce a meta classifier, first the base classifiers are trained (stage one), and then the meta classifier (stage two).

Unlike a mixture of experts, stacking models are often all fit on the same training dataset, i.e. there is no decomposition of the task into subtasks. Also unlike a mixture of experts, the higher-level model that combines the predictions from the lower-level models typically does not receive the input pattern provided to the lower-level models and instead takes as input the predictions from each lower-level model.

Nevertheless, there is no reason why hybrid stacking and mixture of experts models cannot be developed that might perform better than either approach in isolation on a given predictive modelling problem.

For instance:

  • Consider treating the lower-level models in stacking as experts trained on different perspectives of the training data. Perhaps this could involve using a softer approach to decomposing the problem into subproblems, where different data transforms or feature selection methods are used for each model.
  • Consider providing the input pattern to the meta-model in stacking, in an effort to make the weighting or contribution of the lower-level models conditional on the specific context of the prediction.
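The second idea amounts to changing what the meta-model sees. The sketch below builds the meta-model's input row with and without the original input pattern. The base models are hypothetical stand-ins for fitted lower-level models, and the `passthrough` name is borrowed from the parameter of the same name on scikit-learn's `StackingClassifier` and `StackingRegressor`, which offer this behaviour.

```python
def meta_features(x, base_models, passthrough=False):
    """Build the meta-model's input row from the base models' predictions.

    With passthrough=True the original input pattern is appended, so the
    meta-model can weight the base models differently in different regions.
    """
    row = [model(x) for model in base_models]
    if passthrough:
        row.extend(x)
    return row

# Hypothetical stand-ins for fitted lower-level models.
base_models = [
    lambda x: 2.0 * x[0],   # "expert" that only looks at feature 0
    lambda x: x[1] - 1.0,   # "expert" that only looks at feature 1
]

x = [0.5, 3.0]
classic = meta_features(x, base_models)
hybrid = meta_features(x, base_models, passthrough=True)
print(classic)  # [1.0, 2.0]
print(hybrid)   # [1.0, 2.0, 0.5, 3.0]
```

With the extra columns, a sufficiently flexible meta-model can behave like a gating model, since its combination of the lower-level predictions can depend on the input itself.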

Further Reading

This section provides additional resources on the topic if you are looking to go deeper.


Twenty Years of Mixture of Experts, 2012


Pattern Classification Using Ensemble Methods, 2010

Ensemble Methods, 2012

Ensemble Machine Learning, 2012

Ensemble Methods in Data Mining, 2010

The Elements of Statistical Learning, 2016

Machine Learning: A Probabilistic Perspective, 2012

Neural Networks for Pattern Recognition, 1995

Deep Learning, 2016


Ensemble learning, Wikipedia

Mixture of experts, Wikipedia


In this guide, you discovered the mixture of experts approach to machine learning.

Specifically, you learned:

  • An intuitive approach to ensemble learning involves dividing a task into subtasks and developing an expert on each subtask.
  • Mixture of experts is an ensemble learning method that seeks to explicitly address a predictive modelling problem in terms of subtasks using expert models.
  • The divide and conquer approach is related to the construction of decision trees, and the meta-learner approach is related to the stacked generalization ensemble method.