Why optimization is critical in the domain of machine learning
Machine learning involves using an algorithm to learn and generalize from historical data in order to make predictions on new data.
This problem can be described as approximating a function that maps examples of inputs to examples of outputs. Approximating a function can be solved by framing the problem as function optimization. This is where a machine learning algorithm defines a parameterized mapping function (e.g. a weighted sum of inputs) and an optimization algorithm is used to find the values of the parameters (e.g. model coefficients) that minimize the error of the function when it is used to map inputs to outputs.
This means that every time we fit a machine learning algorithm on a training dataset, we are solving an optimization problem.
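As a minimal sketch of this idea, the snippet below fits a weighted sum of inputs to a small synthetic dataset by asking a general-purpose optimizer (scipy.optimize.minimize, used here purely for illustration) to find the coefficient values that minimize the mean squared prediction error. The dataset and coefficient values are made up for the example.

```python
# A minimal sketch: learning as optimization of model coefficients.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                                          # synthetic inputs
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)   # synthetic outputs

def prediction_error(coef):
    # Objective: mean squared error of the parameterized mapping (a weighted sum).
    return np.mean((y - X @ coef) ** 2)

result = minimize(prediction_error, x0=np.zeros(3))
print(result.x)  # coefficient values that (approximately) minimize the error
```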
In this blog post by AICoreSpot, you will discover the central role that optimization plays in machine learning.
- Machine learning algorithms perform function approximation, which is solved using function optimization.
- Function optimization is the reason we minimize error, cost, or loss when fitting a machine learning algorithm.
- Optimization is also carried out in data preparation, hyperparameter tuning, and model selection in a predictive modelling project.
This guide is divided into three parts, which are:
- Machine learning and optimization
- Learning as optimization
- Optimization within a machine learning project
  - Data preparation as optimization
  - Hyperparameter tuning as optimization
  - Model selection as optimization
Function optimization is the problem of finding the set of inputs to a target objective function that results in the minimum or maximum of the function. It can be a challenging problem, as the function may have tens, hundreds, thousands, or even millions of inputs, and the structure of the function is often unknown, non-differentiable, and noisy.
Function optimization: Find the set of inputs that results in the minimum or maximum of an objective function.
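To make the definition concrete, here is a small sketch that minimizes a made-up two-input objective function with a global search algorithm (SciPy's differential evolution). The objective and bounds are arbitrary choices for illustration only.

```python
# A minimal sketch of function optimization: find the inputs that minimize
# an objective function using a global search algorithm.
import numpy as np
from scipy.optimize import differential_evolution

def objective(x):
    # A made-up multimodal function of two inputs.
    return np.sin(3.0 * x[0]) * x[0] ** 2 + np.cos(2.0 * x[1]) * x[1] ** 2

bounds = [(-5.0, 5.0), (-5.0, 5.0)]
result = differential_evolution(objective, bounds, seed=1)
print(result.x, result.fun)  # inputs and value at the (approximate) minimum
```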
Machine learning can be described as function approximation: that is, approximating the unknown underlying function that maps examples of inputs to outputs in order to make predictions on new data.
It can be challenging because there is typically a limited number of examples from which we can approximate the function, and the structure of the function being approximated is often nonlinear, noisy, and may even contain contradictions.
Function approximation: Generalize from specific examples to a reusable mapping function for making predictions on new examples.
Function optimization is typically a simpler procedure than function approximation.
At the base of almost all machine learning algorithms is an optimization algorithm.
Additionally, the process of working through a predictive modelling problem involves optimization at several steps in addition to learning a model, including:
- Choosing the hyperparameters of a model
- Choosing the transforms to apply to the data before modelling
- Choosing the modelling pipeline to use as the final model.
Now that we know that optimization plays a central role in machine learning, let's look at some examples of learning algorithms and how they use optimization.
Predictive modelling problems involve making a prediction from an example of input.
A numeric quantity must be predicted in the case of a regression problem, whereas a class label must be predicted in the case of a classification problem.
The problem of predictive modelling is sufficiently challenging that we cannot write code to make the predictions ourselves. Instead, we must apply a learning algorithm to historical data to learn a "program" that maps examples of inputs to examples of outputs.
In statistical learning, a statistical perspective on machine learning, the problem is framed as learning a mapping function (f) given examples of input data (X) and associated output data (y).
y = f(X)
Given new examples of input (Xhat), we must map each example onto the expected output value (yhat) using our learned function (fhat):
yhat = fhat(Xhat)
The learned mapping will not be perfect. No model is perfect, and some prediction error is expected given the difficulty of the problem, the noise in the observed data, and the choice of learning algorithm.
Mathematically, learning algorithms solve the problem of approximating the mapping function by finding the function that results in minimum loss, minimum cost, or minimum prediction error.
The more biased or constrained the choice of mapping function, the easier the optimization is to solve.
Let's look at some examples to make this clear.
A linear regression (for regression problems) is a very constrained model and can be solved analytically using linear algebra. The inputs to the mapping function are the coefficients of the model.
We could use an optimization algorithm, such as a quasi-Newton local search algorithm, but it will almost always be less efficient than the analytical solution.
Linear regression: Function inputs are model coefficients; an optimization problem that can be solved analytically.
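For example, a minimal sketch of the analytical route, assuming a small synthetic regression dataset, is to solve for the coefficients directly with least squares (here via NumPy) rather than running an iterative optimizer.

```python
# A minimal sketch: linear regression coefficients found analytically
# via least squares, with no iterative optimization.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))                                    # synthetic inputs
y = X @ np.array([1.5, -3.0]) + rng.normal(scale=0.1, size=50)  # synthetic outputs

Xb = np.hstack([X, np.ones((50, 1))])       # add a bias column for the intercept
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(coef)  # model coefficients, including the intercept
```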
A logistic regression (for classification problems) is slightly less constrained and must be solved as an optimization problem, although something about the structure of the optimization function being solved is known given the constraints imposed by the model.
This means a local search algorithm such as a quasi-Newton method can be used. We could also use a global search such as stochastic gradient descent, but it will almost always be less efficient.
Logistic regression: Function inputs are model coefficients; an optimization problem that requires an iterative local search algorithm.
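As an illustration, the sketch below fits a logistic regression on a synthetic classification dataset with scikit-learn's L-BFGS solver, a quasi-Newton local search. The dataset and settings are assumptions made for the example.

```python
# A minimal sketch: logistic regression fit by an iterative quasi-Newton
# local search (L-BFGS).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=1)
model = LogisticRegression(solver='lbfgs')   # quasi-Newton local search
model.fit(X, y)                              # solves the optimization problem
print(model.coef_)                           # coefficients found by the optimizer
```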
A neural network model is a very flexible learning algorithm that imposes few constraints. The inputs to the mapping function are the network weights. A local search algorithm cannot be used given that the search space is multimodal and highly nonlinear; instead, a global search algorithm must be used.
A global optimization algorithm is typically used, specifically stochastic gradient descent, with the updates made in a way that is aware of the structure of the model (backpropagation and the chain rule). We could use a global search algorithm that is not aware of the structure of the model, such as a genetic algorithm, but it will almost always be less efficient.
Neural network: Function inputs are model weights; an optimization problem that requires an iterative global search algorithm.
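A minimal sketch, assuming a synthetic dataset and an arbitrarily small network, is shown below using scikit-learn's MLPClassifier with the stochastic gradient descent solver, where updates are computed via backpropagation.

```python
# A minimal sketch: neural network weights found with stochastic gradient
# descent, with updates computed by backpropagation.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
model = MLPClassifier(hidden_layer_sizes=(16,), solver='sgd',
                      learning_rate_init=0.01, max_iter=500, random_state=1)
model.fit(X, y)              # iterative global search over the network weights
print(model.score(X, y))     # training accuracy after optimization
```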
We can see that each algorithm makes different assumptions about the form of the mapping function, which influences the type of optimization problem to be solved.
We can also see that the default optimization algorithm used for each machine learning algorithm is not arbitrary; it represents the most efficient algorithm for solving the specific optimization problem framed by the algorithm, for example, stochastic gradient descent for neural nets rather than a genetic algorithm. Diverging from these defaults requires a good reason.
Not every machine learning algorithm solves an optimization problem. A notable example is the k-nearest neighbours algorithm, which stores the training dataset and performs a lookup for the k best matches to each new example in order to make a prediction.
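To underline the contrast, here is a small sketch (on a synthetic dataset, for illustration only) where "fitting" k-nearest neighbours simply records the training data and each prediction is a lookup of the closest stored examples.

```python
# A minimal sketch: k-nearest neighbours fits by storing the data and
# predicts by lookup, so no optimization problem is solved.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=1)
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)               # effectively just records the training dataset
print(model.predict(X[:3]))   # each prediction looks up the 5 closest rows
```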
Now that we are familiar with learning in machine learning algorithms as optimization, let's look at some related examples of optimization within a machine learning project.
Optimization has a critical role to play within a machine learning project in addition to fitting the learning algorithm on the training dataset.
The step of preparing the data before fitting the model and the step of tuning a chosen model can also be framed as optimization problems. In fact, a complete predictive modelling project can be viewed as one large optimization problem.
Let’s delve deeper and take a closer look at each of these cases.
Data preparation involves transforming raw data into a form that is most appropriate for the learning algorithms.
This may involve scaling values, handling missing values, and changing the probability distribution of variables.
Transforms can be applied to change the representation of the historical data to meet the expectations or requirements of specific learning algorithms. However, sometimes good or best results can be achieved when those expectations are violated, or when a transform unrelated to them is applied to the data.
We can think of choosing the transforms to apply to the training data as a search or optimization problem of best exposing the unknown underlying structure of the data to the learning algorithm.
Data preparation: Function inputs are sequences of data transforms; an optimization problem that requires an iterative global search algorithm.
This optimization problem is typically carried out manually with human-based trial and error. Nonetheless, it is possible to automate the task using a global optimization algorithm where the inputs to the function are the types and order of the transforms applied to the training data.
The number and permutations of data transforms are usually quite limited, and it may be feasible to perform an exhaustive search or a grid search of commonly used sequences.
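As a sketch of this idea, the example below treats the choice of scaling transform as a small grid search over a scikit-learn pipeline. The candidate transforms, model, and synthetic dataset are assumptions made for illustration.

```python
# A minimal sketch: choosing a data transform framed as a grid search
# over a modelling pipeline.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
pipeline = Pipeline([('transform', StandardScaler()),
                     ('model', LogisticRegression())])
param_grid = {'transform': [StandardScaler(), MinMaxScaler(), PowerTransformer()]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # the transform that best exposes the data's structure
```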
Machine learning algorithms possess hyperparameters that can be configured to tailor the algorithm to a particular dataset.
Even though the dynamics of many hyperparameters are understood, the specific effect they will have on the performance of the resulting model on a given dataset is not known. As such, it is standard practice to evaluate a suite of values for the key hyperparameters of a chosen machine learning algorithm.
This is referred to as hyperparameter tuning or hyperparameter optimization.
It is common to use a naive optimization algorithm for this purpose, such as a random search algorithm or a grid search algorithm.
Hyperparameter tuning: Function inputs are algorithm hyperparameters; an optimization problem that requires an iterative global search algorithm.
Nonetheless, it is becoming increasingly common to use an iterative global search algorithm for this optimization problem. A popular choice is a Bayesian optimization algorithm, which is able to simultaneously approximate the target function being optimized (using a surrogate function) during the optimization process.
This is desirable because evaluating a single combination of model hyperparameters is expensive, requiring fitting the model on the complete training dataset one or many times, depending on the choice of model evaluation procedure (for example, repeated k-fold cross-validation).
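A minimal sketch of the naive approach, assuming a synthetic dataset and an arbitrarily chosen model and search space, is a random search over a hyperparameter with repeated k-fold cross-validation as the (expensive) function evaluation.

```python
# A minimal sketch: hyperparameter tuning as a naive random search, with
# repeated k-fold cross-validation as the costly function evaluation.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, RepeatedKFold

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
param_distributions = {'C': loguniform(1e-3, 1e3)}   # hypothetical search space
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
search = RandomizedSearchCV(LogisticRegression(), param_distributions,
                            n_iter=20, cv=cv, random_state=1)
search.fit(X, y)
print(search.best_params_)   # hyperparameter values found by the search
```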
Model selection consists of choosing one from amongst several candidate machine learning models for a predictive modelling problem.
More precisely, it involves choosing the machine learning algorithm or machine learning pipeline that produces a model. The chosen pipeline is then used to train a final model that can be used in the intended application to make predictions on new data.
This process of model selection is often a manual one performed by a machine learning practitioner, involving tasks such as preparing data, evaluating candidate models, tuning well-performing models, and finally choosing the final model.
It can be framed as an optimization problem that subsumes part of, or the whole of, the predictive modelling project.
Model selection: Function inputs are the data transforms, machine learning algorithm, and algorithm hyperparameters; an optimization problem that requires an iterative global search algorithm.
Increasingly, this is the case with automated machine learning (AutoML) algorithms being used to choose an algorithm; an algorithm and its hyperparameters; or the data preparation, algorithm, and hyperparameters together, with minimal user intervention.
Like hyperparameter tuning, it is common to use a global search algorithm that also approximates the objective function, such as Bayesian optimization, given that each function evaluation is expensive.
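As a sketch of framing model selection as one search, the example below searches over data transforms, algorithms, and hyperparameters together in a single scikit-learn grid search. The candidate pipelines and synthetic dataset are assumptions chosen for illustration.

```python
# A minimal sketch: model selection as one search over data transforms,
# algorithms, and their hyperparameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
pipeline = Pipeline([('transform', StandardScaler()),
                     ('model', LogisticRegression())])

# Each grid below is one family of candidate pipelines:
# transform + algorithm + hyperparameters.
param_grid = [
    {'transform': [StandardScaler(), MinMaxScaler()],
     'model': [LogisticRegression()],
     'model__C': [0.1, 1.0, 10.0]},
    {'transform': [StandardScaler(), MinMaxScaler()],
     'model': [RandomForestClassifier(random_state=1)],
     'model__n_estimators': [50, 100]},
]
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # the winning transform, algorithm, and hyperparameters
```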
This automated approach to optimization in machine learning also lies at the heart of sophisticated machine learning as a service (MLaaS) products offered by companies such as Microsoft, Google, and Amazon.
Although fast and effective, such approaches are still unable to outperform hand-crafted models prepared by highly skilled specialists, such as those taking part in machine learning competitions.