Calculus in Machine Learning – Why It's a Good Fit
Calculus is one of the core mathematical concepts in machine learning that enables us to understand the inner workings of different machine learning algorithms.
One of the important applications of calculus in machine learning is the gradient descent algorithm, which, in combination with backpropagation, allows us to train a neural network model.
In this guide, you will discover the integral role of calculus in machine learning.
After completing this guide, you will know:
- Calculus plays an integral role in understanding the inner workings of machine learning algorithms, such as the gradient descent algorithm for minimizing an error function.
- Calculus provides us with the necessary tools to optimize complex objective functions, as well as functions with multidimensional inputs, which are representative of different machine learning applications.
Tutorial Overview
This tutorial is divided into two parts; they are:
- Calculus in machine learning
- Why calculus in machine learning is a good fit
Calculus in Machine Learning
A neural network model, whether shallow or deep, implements a function that maps a set of inputs to expected outputs.
The function implemented by the neural network is learned through a training process, which iteratively searches for a set of weights that best enable the neural network to model the variations in the training data.
One of the simplest types of functions is a linear mapping from a single input to a single output.
Such a linear function can be represented by the equation of a line with a slope, m, and a y-intercept, c:
y = mx + c
Varying each of the two parameters, m and c, generates different linear models that define different input-output mappings.
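As a quick sketch in Python (the parameter values below are made up for illustration), two different choices of m and c map the same inputs to different outputs:

```python
# Two hypothetical (m, c) pairs define two different linear models,
# each mapping the same inputs to different outputs.
def linear_model(x, m, c):
    return m * x + c

inputs = [0.0, 1.0, 2.0, 3.0]
print([linear_model(x, m=2.0, c=1.0) for x in inputs])  # one linear model
print([linear_model(x, m=0.5, c=3.0) for x in inputs])  # a different one
```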
The process of learning the mapping function, therefore, involves approximating these model parameters, or weights, that result in the minimum error between the predicted and target outputs. This error is measured by means of a loss function, cost function, or error function (terms that are typically used interchangeably), and the process of minimizing the loss is referred to as function optimization.
We can apply differential calculus to the function optimization process.
In order to understand better how differential calculus can be applied to function optimization, let us return to our specific example of a linear mapping function.
Say that we have a dataset of single input features, x, and their corresponding target outputs, y. In order to measure the error on the dataset, we will take the sum of squared errors (SSE), computed between the predicted and target outputs, as our loss function.
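As a minimal sketch of this loss (the dataset and weights below are hypothetical, chosen only for illustration):

```python
# Sum of squared errors (SSE) between target and predicted outputs.
def sse(targets, predictions):
    return sum((t - p) ** 2 for t, p in zip(targets, predictions))

x = [1.0, 2.0, 3.0]
y = [3.1, 4.9, 7.2]              # hypothetical target outputs
m, c = 2.0, 1.0                  # one candidate set of weights
y_pred = [m * xi + c for xi in x]
print(sse(y, y_pred))            # the error for this choice of weights
```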
Carrying out a parameter sweep across different values for the model weights, w0 = m and w1 = c, generates individual error profiles that are convex in shape.
Combining the individual error profiles generates a three-dimensional error surface that is also convex in shape. This error surface is contained within a weight space, which is defined by the swept ranges of values for the model weights, w0 and w1.
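A parameter sweep of this kind might look as follows; the grid ranges and dataset are assumptions made for the example:

```python
import numpy as np

# Evaluate the SSE over a grid of candidate weights (w0, w1);
# for a linear model the resulting error surface is convex.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.1, 4.9, 7.2])            # hypothetical dataset

w0, w1 = np.meshgrid(np.linspace(-1, 5, 50), np.linspace(-1, 5, 50))
# Broadcast so the error is computed at every grid point at once.
error = ((w0[..., None] * x + w1[..., None] - y) ** 2).sum(axis=-1)

i, j = np.unravel_index(error.argmin(), error.shape)
print(w0[i, j], w1[i, j])                # grid point with the lowest error
```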
Moving across this weight space is equivalent to moving between different linear models. Our objective is to identify the model that best fits the data among all possible candidates. The best model is characterized by the lowest error on the dataset, which corresponds to the lowest point on the error surface.
A convex or bowl-shaped error surface is incredibly useful for learning a linear function to model a dataset because it means that the learning process can be framed as a search for the lowest point on the error surface. The standard algorithm used to find this lowest point is known as gradient descent.
The gradient descent algorithm, as the optimization algorithm, will seek to reach the lowest point on the error surface by following its gradient downhill. The descent is based on the computation of the gradient, or slope, of the error surface.
This is where differential calculus comes into the picture.
Calculus, and specifically differentiation, is the field of mathematics that deals with rates of change.
More formally, let us denote the function that we would like to optimize by:
Error = f(weights)
By computing the rate of change, or the slope, of the error with respect to the weights, the gradient descent algorithm can decide how to alter the weights in order to keep reducing the error.
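As a minimal sketch of this idea, assuming a simple convex error function of a single weight (both the function and the learning rate are made up for the example):

```python
# The slope of the error at the current weight tells gradient descent
# in which direction, and by how much, to adjust the weight.
def error(w):                # a hypothetical convex error function
    return (w - 3.0) ** 2

def d_error(w):              # its rate of change (derivative)
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1             # assumed initial weight and learning rate
for _ in range(50):
    w -= lr * d_error(w)     # move against the slope to reduce the error
print(w)                     # approaches 3.0, the minimum of the error
```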
Why Calculus in Machine Learning Is a Good Fit
The error function that we have chosen to optimize is relatively simple, because it is convex and characterized by a single global minimum.
Nevertheless, in the context of machine learning, we often need to optimize more complex functions that can make the optimization task very challenging. Optimization can become even more challenging if the input to the function is also multidimensional.
Calculus provides us with the necessary tools to address both of these challenges.
Suppose that we have a more generic function that we wish to minimize, and which takes a real input, x, to produce a real output, y:
y = f(x)
Computing the rate of change at different values of x is useful because it gives us an indication of the changes that we need to apply to x in order to obtain the corresponding changes in y.
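For instance, the rate of change can be approximated numerically with a finite difference; the function below is an arbitrary example:

```python
# Central-difference approximation of the derivative: for small dx,
# the change in y is roughly f'(x) * dx.
def f(x):
    return x ** 3 - 2 * x

def rate_of_change(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-1.0, 0.0, 2.0):
    print(x, rate_of_change(f, x))   # slope of f at different values of x
```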
Since we are minimizing the function, our goal is to reach a point that attains as low a value of f(x) as possible and that is also characterized by a zero rate of change; hence, a global minimum.
Depending on the complexity of the function, this may not necessarily be possible, because there may be several local minima or saddle points in which the optimization algorithm may get caught.
In the context of deep learning, we optimize functions that may have many local minima that are not optimal, and many saddle points that are surrounded by very flat regions.
Hence, in the context of deep learning, we usually accept a suboptimal solution that may not necessarily correspond to a global minimum, so long as it corresponds to a sufficiently low value of f(x).
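A small sketch illustrates the point, using a made-up double-well function whose two minima have different depths; depending on the initialization, gradient descent settles in either the deeper (global) or the shallower (suboptimal) basin:

```python
# f(x) = x**4 - 2*x**2 + 0.3*x has a deeper minimum near x = -1.0
# and a shallower, suboptimal one near x = 1.0.
def d_f(x):
    return 4 * x ** 3 - 4 * x + 0.3   # derivative of f

for x0 in (-1.5, 1.5):                # two hypothetical starting points
    x = x0
    for _ in range(200):
        x -= 0.01 * d_f(x)            # plain gradient descent
    print(x0, "->", round(x, 3))      # ends near -1.04 or 0.96
```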
If the function we are working with takes multiple inputs, calculus also provides us with the concept of partial derivatives; in simpler terms, a method to calculate the rate of change of y with respect to changes in each of the inputs, xi, while holding the remaining inputs constant.
This is why each of the weights is updated independently in the gradient descent algorithm: the weight update rule depends on the partial derivative of the SSE with respect to each weight, and because there is a different partial derivative for each weight, there is a separate weight update rule for each weight.
Hence, if we consider again the minimization of an error function, computing the partial derivative of the error with respect to each specific weight allows each weight to be updated independently of the others.
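The sketch below makes this concrete for the linear model and SSE loss considered earlier; the dataset and learning rate are assumptions for the example:

```python
# Each weight of the model y = w0 * x + w1 has its own update rule,
# derived from its own partial derivative of the SSE.
x = [1.0, 2.0, 3.0]
y = [3.1, 4.9, 7.2]                  # hypothetical dataset
w0, w1, lr = 0.0, 0.0, 0.01

for _ in range(1000):
    err = [w0 * xi + w1 - yi for xi, yi in zip(x, y)]
    d_w0 = sum(2 * e * xi for e, xi in zip(err, x))   # dSSE/dw0
    d_w1 = sum(2 * e for e in err)                    # dSSE/dw1
    w0 -= lr * d_w0                  # each weight is updated independently,
    w1 -= lr * d_w1                  # in proportion to its own gradient
print(w0, w1)                        # approaches the best-fitting slope, intercept
```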
This also means that the gradient descent algorithm may not follow a straight path down the error surface. Instead, each weight will be updated in proportion to the local gradient of the error curve. Hence, one weight may be updated by a larger amount than another, as much as needed for the gradient descent algorithm to reach the function minimum.