Difference Between Back-Propagation and Stochastic Gradient Descent
There is a lot of confusion among beginners about which algorithm is used to train deep learning neural network models.
It is common to hear that neural networks learn using the “back-propagation of error” algorithm or “stochastic gradient descent”. Either term is sometimes used as shorthand for how a neural network is fit on a training dataset, yet in many cases there is deep confusion about what these algorithms actually are, how they are related, and how they work together.
This guide aims to make clear the roles of the stochastic gradient descent and back-propagation algorithms in training neural networks.
In this guide, you will discover the difference between stochastic gradient descent and the back-propagation algorithm.
After reading this guide, you will know:
- Stochastic gradient descent is an optimization algorithm for minimizing the loss of a predictive model on a training dataset.
- Back-propagation is an automatic differentiation algorithm for calculating the gradients of the weights in a neural network graph structure.
- The stochastic gradient descent and back-propagation of error algorithms are used together to train neural network models.
Stochastic Gradient Descent
Gradient descent is an optimization algorithm that finds the set of input values to a target function that yields the minimum value of that function, referred to as the minimum of the function.
As its name suggests, gradient descent involves calculating the gradient of the target function.
You may recall from calculus that the first-order derivative of a function gives the slope of the function at a given point. Read from left to right, a positive derivative indicates the target function is sloping uphill, and a negative derivative indicates it is sloping downhill.
Derivative: Slope of a target function at particular input values to the function.
If the target function takes several input variables, they can be taken together as a vector of variables. Working with vectors and matrices is referred to as linear algebra, and doing calculus with structures from linear algebra is referred to as matrix calculus or vector calculus. In vector calculus, the vector of first-order derivatives (partial derivatives) is referred to as the gradient of the target function.
Gradient: Vector of partial derivatives of a target function with regard to its input variables.
The gradient descent algorithm requires calculating the gradient of the target function at the current values of the input variables. The gradient points uphill, so the negative of the gradient is followed downhill for each input variable, yielding new values that produce a lower evaluation of the target function.
A step size is used to scale the gradient and controls how much each input variable changes with regard to the gradient.
Step size: Learning rate or alpha, a hyperparameter used to control how much each input variable changes with regard to the gradient.
This procedure is repeated until the minimum of the target function is located, a maximum number of candidate solutions has been evaluated, or some other stopping condition is met.
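The procedure above can be sketched in a few lines of code. Here is a minimal illustration of gradient descent on a one-variable target function f(x) = (x - 3)^2; the function, starting point, step size, and iteration count are all illustrative choices:

```python
# A minimal sketch of gradient descent on a one-variable target
# function f(x) = (x - 3)**2, whose minimum is at x = 3.
# The step size and iteration count are illustrative choices.

def f(x):
    return (x - 3.0) ** 2

def f_prime(x):
    # First-order derivative (slope) of the target function.
    return 2.0 * (x - 3.0)

x = 0.0          # initial candidate value for the input variable
step_size = 0.1  # learning rate (alpha)

for _ in range(100):
    gradient = f_prime(x)
    # The gradient points uphill, so step in the opposite direction.
    x = x - step_size * gradient
```

Because the gradient 2(x - 3) points uphill, repeatedly subtracting a scaled gradient moves x very close to the minimum at x = 3.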
Gradient descent can be adapted to minimize the loss function of a predictive model on a training dataset, such as a classification or regression model. This adaptation is referred to as stochastic gradient descent.
Stochastic gradient descent: Extension of the gradient descent optimization algorithm for reducing a loss function of a predictive model on a training dataset.
The target function is taken as the loss or error function on the dataset, like mean squared error for regression or cross-entropy for classification. The parameters of the model are taken as the input variables for the target function.
Loss function: The target function that is being minimized.
Model parameters: The input parameters to the loss function that are being optimized.
The algorithm is referred to as “stochastic” because the gradients of the target function with regard to the input variables are noisy (that is, a probabilistic approximation). The evaluation of the gradient may contain statistical noise that obscures the true underlying gradient signal, caused by the sparseness and noise in the training dataset.
The insight behind stochastic gradient descent is that the gradient is an expectation. The expectation can be approximately estimated using a small sample of examples.
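This idea can be made concrete with a sketch: for a linear regression model with a mean squared error loss, the gradient computed on a small minibatch is a noisy estimate of the gradient computed on the full dataset. The synthetic data, batch size, and model here are assumptions for illustration only:

```python
import numpy as np

# Sketch: the minibatch gradient is a noisy estimate of the
# full-dataset gradient. The linear model, synthetic data, and
# batch size are illustrative assumptions.

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # 1000 examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy targets

w = np.zeros(3)  # current model parameters

def mse_gradient(Xb, yb, w):
    # Gradient of mean squared error with regard to the weights.
    error = Xb @ w - yb
    return 2.0 * Xb.T @ error / len(yb)

# Exact gradient over the whole dataset (expensive for large datasets).
full_grad = mse_gradient(X, y, w)

# Noisy but cheap estimate from a minibatch of 32 examples.
idx = rng.choice(len(X), size=32, replace=False)
batch_grad = mse_gradient(X[idx], y[idx], w)
```

The minibatch gradient points in roughly the same direction as the full gradient, which is what makes the cheap, noisy estimate usable for optimization.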
Stochastic gradient descent can be used to train (optimize) many different model types, such as linear regression and logistic regression, although more efficient optimization algorithms have often been discovered for those models and ought to be used instead.
Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular.
Stochastic gradient descent is the most effective algorithm discovered for training artificial neural networks, where the weights are the model parameters and the target loss function is the prediction error averaged over one example, a subset (batch) of examples, or the complete training dataset.
Nearly all of deep learning is driven by one very critical algorithm: stochastic gradient descent or SGD.
There are several popular extensions to stochastic gradient descent designed to improve the optimization procedure (the same or better loss in fewer iterations), such as Momentum, Root Mean Squared Propagation (RMSProp), and Adaptive Moment Estimation (Adam).
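As one example, the momentum extension can be sketched as follows; the velocity-based update shown is a common formulation, but the coefficients and the quadratic example function are illustrative assumptions:

```python
# A sketch of the momentum extension to stochastic gradient descent.
# The velocity accumulates an exponentially decaying average of past
# gradients; the coefficients and example function are illustrative.

def sgd_momentum_step(w, gradient, velocity, learning_rate=0.1, beta=0.9):
    velocity = beta * velocity + gradient   # smooth noisy gradients
    w = w - learning_rate * velocity        # step using the velocity
    return w, velocity

# Usage on the target function f(w) = w**2, whose gradient is 2 * w.
w, velocity = 5.0, 0.0
for _ in range(200):
    w, velocity = sgd_momentum_step(w, 2.0 * w, velocity)
```

Because the velocity averages over successive gradients, momentum damps the step-to-step noise of stochastic gradients while still driving w toward the minimum.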
A challenge when using stochastic gradient descent to train a neural network is how to calculate the gradient for nodes in the hidden layers of the network, that is, nodes one or more steps away from the output layer of the model.
This requires a specific technique from calculus called the chain rule, and an efficient algorithm that implements the chain rule and can be used to calculate the gradient for any parameter in the network. This algorithm is called back-propagation.
Back-Propagation Algorithm
Back-propagation, also written as “backpropagation” or simply “backprop”, is an algorithm for calculating the gradient of a loss function with regard to the variables of a model.
Back-propagation: Algorithm for calculating the gradient of a loss function with regard to variables of a model.
You may recall from calculus that the first-order derivative of a function for a particular value of an input variable is the rate of change (slope) of the function for that input. When a function has several input variables, they form a vector, and the vector of first-order derivatives (partial derivatives) is referred to as the gradient (that is, vector calculus).
Gradient: Vector of partial derivatives of a target function with regard to particular input values.
Back-propagation is used when training neural network models to calculate the gradient for each weight in the network. The gradient is then used by an optimization algorithm to update the model weights.
The algorithm was developed specifically for calculating the gradients of variables in graph structures, working backward from the output of the graph toward the input, propagating the error in the predicted output, which is used to calculate the gradient for each variable.
The loss function represents the error of the model, so it is also called the error function; the weights are the variables of the function, and the gradients of the error function with regard to the weights are therefore referred to as error gradients.
Error function: Loss function that is minimized when training a neural network.
Weights: Parameters of the network taken as input values to the loss function.
Error gradients: First-order derivatives of the loss function with regard to the parameters.
This gives the algorithm its name “back-propagation”, or sometimes “error back-propagation” or the “back-propagation of error”.
Back-propagation of error: Describes how gradients are calculated recursively backward through the network graph, beginning at the output layer.
The algorithm involves the recursive application of the chain rule from calculus (distinct from the chain rule from probability), which calculates the derivative of a sub-function given the known derivative of its parent function.
Chain rule: Calculus formula for calculating the derivatives of functions leveraging connected functions whose derivatives are known.
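A tiny worked example may help. For a composite function y = f(g(x)) with f(u) = u^2 and g(x) = 3x + 1 (illustrative choices), the chain rule multiplies the local derivatives, and the result can be checked against a finite-difference approximation:

```python
# Worked chain-rule example with illustrative functions:
# y = f(g(x)) with f(u) = u**2 and g(x) = 3 * x + 1,
# so dy/dx = f'(g(x)) * g'(x) = 2 * (3 * x + 1) * 3.

def g(x):
    return 3.0 * x + 1.0

def f(u):
    return u ** 2

x = 2.0
u = g(x)                # forward pass: evaluate the inner function
y = f(u)                # then the outer function

dy_du = 2.0 * u         # local derivative of the outer function
du_dx = 3.0             # local derivative of the inner function
dy_dx = dy_du * du_dx   # chain rule: multiply the local derivatives

# Sanity check against a central finite-difference approximation.
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2.0 * h)
```

At x = 2 this gives dy/dx = 2 × 7 × 3 = 42, and the finite-difference estimate agrees.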
There are other algorithms for implementing the chain rule; however, the back-propagation algorithm is an efficient one for the particular graph structure of a neural network.
It is fair to say that the back-propagation algorithm described here is only one approach to automatic differentiation. It is a special case of a broader class of techniques referred to as reverse mode accumulation.
Although back-propagation was designed to train neural network models, both the back-propagation algorithm specifically and the chain-rule formula that it implements efficiently can be used more generally to calculate the derivatives of functions.
Further, back-propagation is often misunderstood as being specific to multi-layer neural networks, but in principle it can compute the derivatives of any function.
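To make the recursion concrete, here is a hand-written sketch of back-propagation through a tiny network with a single tanh hidden unit; the architecture, activation, values, and gradient check are illustrative assumptions:

```python
import numpy as np

# Hand-written back-propagation through a tiny illustrative network:
# y_hat = w2 * tanh(w1 * x), loss = (y_hat - y)**2.
# The architecture, activation, and values are assumptions for clarity.

x, y = 0.5, 1.0
w1, w2 = 0.3, 0.8

# Forward pass: compute and store the intermediate values.
h = np.tanh(w1 * x)
y_hat = w2 * h
loss = (y_hat - y) ** 2

# Backward pass: apply the chain rule from the output toward the input.
dloss_dyhat = 2.0 * (y_hat - y)
dloss_dw2 = dloss_dyhat * h                  # since y_hat = w2 * h
dloss_dh = dloss_dyhat * w2
dloss_dw1 = dloss_dh * (1.0 - h ** 2) * x    # d tanh(z)/dz = 1 - tanh(z)**2

# Finite-difference check of the error gradient for w1.
eps = 1e-6
loss_plus = (w2 * np.tanh((w1 + eps) * x) - y) ** 2
loss_minus = (w2 * np.tanh((w1 - eps) * x) - y) ** 2
numeric_dw1 = (loss_plus - loss_minus) / (2.0 * eps)
```

The backward pass reuses values stored during the forward pass and works strictly from the output toward the input, which is exactly the recursion the name "back-propagation of error" describes.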
Stochastic Gradient Descent with Back-propagation
Stochastic gradient descent is an optimization algorithm that can be leveraged to train neural network models.
The stochastic gradient descent algorithm requires gradients to be calculated for each variable in the model so that new values for the variables can be computed.
Combined, the back-propagation and stochastic gradient descent algorithms can be used to train a neural network. We might call this “stochastic gradient descent with back-propagation”.
Stochastic gradient descent with back-propagation: A more complete description of the general algorithm used to train a neural network, referencing both the optimization algorithm and the gradient calculation algorithm.
It is common to hear practitioners say that they train their model using back-propagation. Technically, this is incorrect, and even as shorthand it would be wrong: back-propagation is not an optimization algorithm and cannot be used to train a model on its own.
It would be reasonable to say that a neural network is trained or learns using stochastic gradient descent as a shorthand, as it is assumed that the back-propagation algorithm is used to calculate the gradients as part of the optimization procedure.
That being said, a different algorithm can be used to optimize the parameters of a neural network, such as a genetic algorithm that does not require gradients. And if the stochastic gradient descent optimization algorithm is used, a different algorithm can be used to calculate the gradients of the loss function with regard to the model parameters, such as alternative algorithms that implement the chain rule.
Nonetheless, the combination of stochastic gradient descent with back-propagation is widely used because it is the most effective and efficient general approach developed so far for fitting neural network models.
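As a closing sketch, the two algorithms can be combined in a short from-scratch training loop for a small one-hidden-layer network; the data, architecture, and hyperparameters are illustrative assumptions, not a recommended recipe:

```python
import numpy as np

# Sketch of stochastic gradient descent with back-propagation on a
# small one-hidden-layer network, written from scratch. The data,
# architecture, and hyperparameters are illustrative assumptions.

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 2))
y = (X[:, 0] - X[:, 1]).reshape(-1, 1)  # simple target to learn

W1 = rng.normal(scale=0.5, size=(2, 8))  # input-to-hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))  # hidden-to-output weights
learning_rate, batch_size = 0.05, 32

init_loss = float(np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]

    # Forward pass through the network.
    H = np.tanh(Xb @ W1)
    y_hat = H @ W2

    # Back-propagation: error gradients for every weight.
    d_yhat = 2.0 * (y_hat - yb) / batch_size
    dW2 = H.T @ d_yhat
    dH = d_yhat @ W2.T
    dW1 = Xb.T @ (dH * (1.0 - H ** 2))   # tanh derivative

    # Stochastic gradient descent: update weights downhill.
    W1 -= learning_rate * dW1
    W2 -= learning_rate * dW2

final_loss = float(np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))
```

Back-propagation supplies the gradients inside the loop, and stochastic gradient descent uses them to update the weights on each minibatch, so the loss on the dataset falls as training proceeds.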