The Chain Rule of Calculus for Univariate and Multivariate Functions
The chain rule allows us to find the derivative of composite functions.
It is applied extensively by the backpropagation algorithm in order to train feedforward neural networks. By applying the chain rule efficiently, while following a specific order of operations, the backpropagation algorithm computes the gradient of the loss function with respect to every weight of the network.
In this guide, you will discover the chain rule of calculus for univariate and multivariate functions.
After completing this guide, you will know:
- A composite function is the combination of two or more functions
- The chain rule allows us to find the derivative of a composite function
- The chain rule can be generalized to multivariate functions, and illustrated by a tree diagram
- The chain rule is applied extensively by the backpropagation algorithm in order to compute the gradient of the loss function with respect to every weight of the network
Tutorial Overview
This guide is divided into four parts; they are:
- Composite Functions
- The Chain Rule
- The Generalized Chain Rule
- Application within Machine Learning
Prerequisites
For this guide, familiarity with the following topics is assumed:
- Multivariate functions
- The Power Rule
- The gradient of a function
Composite Functions
So far, we have met functions of single and multiple variables (so-called univariate and multivariate functions, respectively). We shall now extend both to their composite forms. We will, ultimately, see how to apply the chain rule in order to find their derivative, but more on this shortly.
A composite function is the combination of two functions.
Consider two functions of a single independent variable, f(x) = 2x – 1 and g(x) = x^3. Their composite function can be defined as follows:
h = g(f(x))
In this operation, g is a function of f. This means that g is applied to the result of applying the function f to x, producing h.
Let's consider a concrete example using the functions above to understand this better.
Suppose that f(x) and g(x) are two systems in cascade, receiving an input x = 5.
Since f(x) is the first system in the cascade (it is the inner function in the composite), its output is computed first:
f(5) = (2 × 5) – 1 = 9
This result is then passed on as input to g(x), the second system in the cascade (it is the outer function in the composite), to produce the net result of the composite function:
g(9) = 9^3 = 729
We could, alternatively, have computed the net result in one go, had we performed the following computation:
h = g(f(5)) = (2(5) – 1)^3 = 729
The composition of functions can also be viewed as a chaining process, to leverage a more familiar term, where the output of a single function feeds into the next one within the chain.
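To make the chaining concrete, here is a minimal Python sketch of the cascade above; the function names f, g and h simply mirror the notation of this section:

```python
# The cascade above: f runs first (inner function), g second (outer function).
def f(x):
    return 2 * x - 1

def g(x):
    return x ** 3

def h(x):
    # The composite function h = g(f(x)).
    return g(f(x))

print(f(5))  # 9
print(g(9))  # 729
print(h(5))  # 729
```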
With composite functions, the order matters.
Bear in mind that the composition of functions is a non-commutative process, which means that swapping the order of f(x) and g(x) in the cascade (or chain) does not, in general, produce the same result. Therefore:
g(f(x)) ≠ f(g(x))
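We can confirm this with the same two functions, evaluating both orderings at x = 5 in a short sketch:

```python
# Composing in the two possible orders gives different results.
def f(x):
    return 2 * x - 1

def g(x):
    return x ** 3

print(g(f(5)))  # (2*5 - 1)**3 = 729
print(f(g(5)))  # 2 * 5**3 - 1 = 249
```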
The composition of functions can also be extended to the multivariate case:
h = g(r, s, t) = g(r(x, y), s(x, y), t(x, y)) = g(f(x, y))
Here, f(x, y) is a vector-valued function of two independent variables (or inputs), x and y. It is made up of three components (for this particular example), r(x, y), s(x, y) and t(x, y), which are also known as the component functions of f.
This means that f(x, y) will map two inputs to three outputs, and will then feed these three outputs into the next system in the chain, g(r, s, t), to produce h.
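As a sketch, we can mimic this two-inputs-to-three-outputs chain in Python; the particular component functions r, s and t below are hypothetical choices, picked only for illustration:

```python
# h = g(f(x, y)): f maps two inputs to three outputs (r, s, t),
# which then feed into g to produce a single output h.
def f(x, y):
    r = x + y        # hypothetical component function r(x, y)
    s = x * y        # hypothetical component function s(x, y)
    t = x - y        # hypothetical component function t(x, y)
    return r, s, t

def g(r, s, t):
    return r + s + t

def h(x, y):
    return g(*f(x, y))

print(h(2.0, 3.0))  # g(5.0, 6.0, -1.0) = 10.0
```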
The Chain Rule
The chain rule allows us to find the derivative of a composite function.
Let's first define how the chain rule differentiates a composite function, and then break it into its individual components to understand it better. If we reconsider the composite function, h = g(f(x)), then its derivative, as given by the chain rule, is:

dh/dx = (dh/du) × (du/dx)

Here, u is the output of the inner function f (hence, u = f(x)), which is then supplied as input to the next function g to produce h (hence, h = g(u)). Notice, therefore, how the chain rule relates the net output, h, to the input, x, through an intermediate variable, u.
Recall that the composite function is defined as follows:
h(x) = g(f(x)) = (2x – 1)^3
The first component of the chain rule, dh/du, tells us to start by finding the derivative of the outer part of the composite function, while ignoring whatever is inside. For this purpose, we shall apply the power rule:
( (2x – 1)^3 )' = 3(2x – 1)^2
The result is then multiplied by the second component of the chain rule, du/dx, which is the derivative of the inner part of the composite function, this time ignoring whatever is outside:
(2x – 1)' = 2
The derivative of the composite function, as defined by the chain rule, is then the following:
h' = 3(2x – 1)^2 × 2 = 6(2x – 1)^2
We have here considered a simple example, but the concept of applying the chain rule to more complex functions remains the same. We shall be seeing more challenging functions in a separate tutorial.
The Generalized Chain Rule
We can generalize the chain rule beyond the univariate case.
Consider the case where x ∈ ℝ^m and u ∈ ℝ^n, which means that the inner function, f, maps m inputs to n outputs, while the outer function, g, receives n inputs to produce an output, h. For i = 1, …, m, the generalized chain rule states:

∂h/∂xi = (∂h/∂u1)(∂u1/∂xi) + (∂h/∂u2)(∂u2/∂xi) + … + (∂h/∂un)(∂un/∂xi)

Or, in its more compact form, as a summation over j = 1, …, n:

∂h/∂xi = Σj (∂h/∂uj)(∂uj/∂xi)

Recall that we use partial derivatives when we are finding the gradient of a function of multiple variables.
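As a numeric sanity check, the sketch below compares the chain-rule sum against a direct finite-difference estimate, using hypothetical choices for f (m = 2 inputs, n = 3 outputs) and g:

```python
def f(x):
    x1, x2 = x
    return [x1 + x2, x1 * x2, x1 - x2]  # component functions u1, u2, u3

def g(u):
    u1, u2, u3 = u
    return u1 * u2 + u3 ** 2

def h(x):
    return g(f(x))

def partial(fn, point, i, eps=1e-6):
    # Central-difference estimate of the i-th partial derivative of fn.
    up, down = list(point), list(point)
    up[i] += eps
    down[i] -= eps
    return (fn(up) - fn(down)) / (2 * eps)

x = [2.0, 3.0]
u = f(x)

for i in range(2):  # over the inputs x1, x2
    # Generalized chain rule: sum over j of (dh/du_j) * (du_j/dx_i).
    chain_sum = sum(
        partial(g, u, j) * partial(lambda p, j=j: f(p)[j], x, i)
        for j in range(3)
    )
    print(round(chain_sum, 4), round(partial(h, x, i), 4))  # should agree
```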
We can also visualize the workings of the chain rule with a tree diagram.
Suppose that we have a composite function of two independent variables, x1 and x2, defined as follows:
h = g(f(x1, x2)) = g(u1(x1, x2), u2(x1, x2))
Here, u1 and u2 act as the intermediate variables. Its tree diagram can be represented as follows:

    h
    ├── u1
    │   ├── x1
    │   └── x2
    └── u2
        ├── x1
        └── x2

In order to derive the formula for each of the inputs, x1 and x2, we can start from h at the root of the tree diagram and follow its branches towards that input, multiplying the partial derivatives along each path and summing over the paths. In this manner, we find that we form the following two formulae:

∂h/∂x1 = (∂h/∂u1)(∂u1/∂x1) + (∂h/∂u2)(∂u2/∂x1)

∂h/∂x2 = (∂h/∂u1)(∂u1/∂x2) + (∂h/∂u2)(∂u2/∂x2)
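We can verify these branch-sums symbolically; in the SymPy sketch below, the intermediate functions u1 and u2 and the outer function g are hypothetical choices, picked only for illustration:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
u1 = x1 ** 2 + x2    # hypothetical intermediate function u1(x1, x2)
u2 = x1 * x2         # hypothetical intermediate function u2(x1, x2)

v1, v2 = sp.symbols('v1 v2')
g = v1 * sp.sin(v2)  # hypothetical outer function g(u1, u2)
h = g.subs({v1: u1, v2: u2})

# dh/dx1 via the tree diagram: sum the products along the two branches.
dh_dx1 = (sp.diff(g, v1).subs({v1: u1, v2: u2}) * sp.diff(u1, x1)
          + sp.diff(g, v2).subs({v1: u1, v2: u2}) * sp.diff(u2, x1))

# The branch-sum agrees with differentiating h directly.
print(sp.simplify(dh_dx1 - sp.diff(h, x1)))  # 0
```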
Notice how the chain rule relates the net output, h, to each of the inputs, xi, through the intermediate variables, uj. This is a concept that the backpropagation algorithm applies extensively to optimize the weights of a neural network.
Application within Machine Learning
Notice how similar the tree diagram is to the typical representation of a neural network (although we usually represent the latter by placing the inputs on the left-hand side and the outputs on the right-hand side). We can apply the chain rule to a neural network through the use of the backpropagation algorithm, in a very similar manner to how we have applied it to the tree diagram above.
An area where the chain rule is used to an extreme is deep learning, where the function value y is computed as a many-level function composition.
A neural network can, in fact, be represented by a massive nested composite function. For example:
y = fK(fK–1( … (f1(x)) … ))
Here, x are the inputs to the neural network (for example, images), while y are the outputs (for example, class labels). Every function, fi, for i = 1, …, K, is characterized by its own weights.
Applying the chain rule to such a composite function allows us to work backwards through all of the hidden layers making up the neural network, and efficiently compute the gradient of the loss function with respect to each weight, wi, of the network, until we arrive at the input.
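To make this concrete, here is a toy backpropagation sketch for a three-function chain; the layers fi(a) = wi × a are hypothetical one-weight "layers", chosen only to keep the chain rule visible:

```python
# Toy chain y = f3(f2(f1(x))) with hypothetical layers fi(a) = wi * a.
# Backpropagation applies the chain rule backwards through the chain to
# obtain the gradient of the loss with respect to every weight wi.

def forward(x, weights):
    activations = [x]
    for w in weights:
        activations.append(w * activations[-1])  # fi(a) = wi * a
    return activations

x, target = 2.0, 1.0
weights = [0.5, -1.0, 0.25]

acts = forward(x, weights)
y = acts[-1]
loss = 0.5 * (y - target) ** 2

# Backward pass: start from dL/dy and chain back, layer by layer.
grad = y - target                     # dL/dy
weight_grads = [0.0] * len(weights)
for i in reversed(range(len(weights))):
    weight_grads[i] = grad * acts[i]  # dL/dwi = (dL/dout) * d(wi*a)/dwi
    grad = grad * weights[i]          # dL/da  = (dL/dout) * wi

print(loss)          # 0.78125
print(weight_grads)  # [0.625, -0.3125, 1.25]
```

In a real network, each fi would be a vector-valued layer with many weights, but the backward pass follows exactly the same pattern of multiplying local derivatives while moving from the output back towards the input.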