Higher-order derivatives
Higher-order derivatives can capture information about a function that first-order derivatives on their own cannot.
First-order derivatives can capture important information, such as the rate of change, but on their own they cannot distinguish between local maxima and minima, where the rate of change is zero for both. Several optimization algorithms address this limitation by exploiting higher-order derivatives, such as Newton’s method, where second-order derivatives are used to reach the local minimum of an optimization function.
In this tutorial, you will discover how to compute higher-order univariate and multivariate derivatives.
After completing this tutorial, you will know:
- How to compute the higher-order derivatives of univariate functions.
- How to compute the higher-order derivatives of multivariate functions.
- How second-order derivatives are exploited in machine learning by second-order optimization algorithms.
Tutorial Overview
This tutorial is divided into three parts; they are:
1] Higher-order Derivatives of Univariate Functions
2] Higher-order Derivatives of Multivariate Functions
3] Application in Machine Learning
Higher-order Derivatives of Univariate Functions
In addition to first-order derivatives, which we have seen can provide important information about a function, such as its instantaneous rate of change, higher-order derivatives can also be considerably useful. For instance, the second derivative can measure the acceleration of a moving object, or it can help an optimization algorithm distinguish between a local maximum and a local minimum.
Computing higher-order (second, third, or higher) derivatives of univariate functions is not that difficult.
The second derivative of a function is simply the derivative of its first derivative, the third derivative is the derivative of the second derivative, the fourth derivative is the derivative of the third, and so on:

f^(n)(x) = d/dx [ f^(n-1)(x) ]
Hence, computing higher-order derivatives simply consists of differentiating the function repeatedly. To do so, we can apply our knowledge of the power rule. Consider the function f(x) = x^3 + 2x^2 - 4x + 1 as an example. Then:
- First derivative: f'(x) = 3x^2 + 4x - 4
- Second derivative: f''(x) = 6x + 4
- Third derivative: f'''(x) = 6
- Fourth derivative: f^(4)(x) = 0
- Fifth derivative: f^(5)(x) = 0, etc.
What we have done here is first apply the power rule to f(x) to obtain its first derivative, f'(x), then apply the power rule to that derivative to obtain the second, and so on. The derivative eventually goes to zero as differentiation is applied repeatedly.
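As a quick check, the following is a minimal sketch, assuming the SymPy library is available, that reproduces the derivatives listed above:

```python
# Minimal sketch: repeated differentiation of f(x) = x^3 + 2x^2 - 4x + 1 with SymPy
from sympy import symbols, diff

x = symbols('x')
f = x**3 + 2*x**2 - 4*x + 1

# diff(f, x, n) returns the nth derivative of f with respect to x
for n in range(1, 6):
    print(f"Order {n}:", diff(f, x, n))

# Order 1: 3*x**2 + 4*x - 4
# Order 2: 6*x + 4
# Order 3: 6
# Order 4: 0
# Order 5: 0
```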
The product and quotient rules also remain valid for obtaining higher-order derivatives, but their computation can become messier and messier as the order increases. The general Leibniz rule simplifies the task in this regard, by generalizing the product rule to:

(fg)^(n) = Σ_{k=0}^{n} [n! / (k!(n-k)!)] f^(k) g^(n-k)
Here, the term n! / (k!(n-k)!) is the binomial coefficient from the binomial theorem, while f^(k) and g^(k) denote the kth derivative of the functions f and g, respectively.
Hence, finding the first and second derivatives (and, therefore, substituting n = 1 and n = 2, respectively) by the general Leibniz rule gives us:

(fg)^(1) = (fg)' = f^(1) g + f g^(1)

(fg)^(2) = (fg)'' = f^(2) g + 2 f^(1) g^(1) + f g^(2)
Notice the familiar first derivative as defined by the product rule. The Leibniz rule can also be used to find higher-order derivatives of rational functions, since the quotient can be expressed as a product of the form f g^(-1).
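As a quick sanity check of the n = 2 case, the sketch below, assuming SymPy and using sin(x) and exp(x) purely as illustrative choices for f and g, compares the directly computed second derivative of a product against the Leibniz expansion:

```python
# Check the general Leibniz rule for n = 2: (fg)'' = f''g + 2f'g' + fg''
from sympy import symbols, sin, exp, diff, simplify

x = symbols('x')
f = sin(x)   # illustrative choice of f
g = exp(x)   # illustrative choice of g

direct = diff(f * g, x, 2)                                             # differentiate the product twice
leibniz = diff(f, x, 2)*g + 2*diff(f, x)*diff(g, x) + f*diff(g, x, 2)  # Leibniz expansion

print(simplify(direct - leibniz))   # prints 0, so both expressions agree
```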
Higher-order Derivatives of Multivariate Functions
The definition of higher-order partial derivatives of multivariate functions is analogous to the univariate case: the nth-order partial derivative, for n > 1, is computed as the partial derivative of the (n-1)th-order partial derivative. For example, taking the second partial derivative of a function with two variables results in four second partial derivatives: two own partial derivatives, fxx and fyy, and two cross partial derivatives, fxy and fyx.
To take a ‘derivative of a derivative’, we must take a partial derivative with respect to x or y, and there are four ways to do so: x then x, x then y, y then x, y then y.
Let’s consider the multivariate function f(x, y) = x^2 + 3xy + 4y^2, for which we would like to find the second partial derivatives. The process starts with finding its first-order partial derivatives:

fx = 2x + 3y

fy = 3x + 8y

The four second-order partial derivatives are then found by repeating the process of taking the partial derivatives of the partial derivatives. The own partial derivatives are the most straightforward to find, since we simply repeat the partial differentiation process, with respect to either x or y, a second time:

fxx = 2

fyy = 8

The cross partial derivative of the previously found fx (that is, the partial derivative taken with respect to x) is found by taking the partial derivative of the result with respect to y, giving us fxy. Similarly, taking the partial derivative of fy with respect to x gives us fyx:

fxy = 3

fyx = 3
It is not by accident that the cross partial derivatives give the same result. This is defined by Clairaut’s theorem, which states that as long as the cross partial derivatives are continuous, they are equal.
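The same computation can be sketched in a few lines, again assuming SymPy:

```python
# Second-order partial derivatives of f(x, y) = x^2 + 3xy + 4y^2
from sympy import symbols, diff

x, y = symbols('x y')
f = x**2 + 3*x*y + 4*y**2

f_xx = diff(f, x, x)   # own partial: twice with respect to x
f_yy = diff(f, y, y)   # own partial: twice with respect to y
f_xy = diff(f, x, y)   # cross partial: first x, then y
f_yx = diff(f, y, x)   # cross partial: first y, then x

print(f_xx, f_yy, f_xy, f_yx)   # 2 8 3 3
print(f_xy == f_yx)             # True, as Clairaut's theorem predicts
```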
Application in Machine Learning
In machine learning, it is mainly the second-order derivative that is used. We mentioned earlier that the second derivative can provide us with information that the first derivative on its own cannot capture. Specifically, it can tell us whether a critical point is a local minimum or a local maximum (based on whether the second derivative is greater or smaller than zero, respectively), whereas the first derivative would otherwise be zero in both cases.
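To make this concrete, the sketch below, again assuming SymPy, applies this second-derivative test to the earlier example f(x) = x^3 + 2x^2 - 4x + 1, finding its critical points and classifying each by the sign of the second derivative:

```python
# Second-derivative test on f(x) = x^3 + 2x^2 - 4x + 1
from sympy import symbols, diff, solve

x = symbols('x')
f = x**3 + 2*x**2 - 4*x + 1

f1 = diff(f, x)     # first derivative: zero at critical points
f2 = diff(f, x, 2)  # second derivative: its sign classifies them

for point in solve(f1, x):          # critical points: x = -2 and x = 2/3
    curvature = f2.subs(x, point)
    if curvature > 0:
        print(point, "local minimum")
    elif curvature < 0:
        print(point, "local maximum")
    else:
        print(point, "test inconclusive")
```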
There are various second-order optimization algorithms that leverage this information, one of which is Newton’s method.
Second-order information, in contrast to first-order information alone, allows us to make a quadratic approximation of the objective function and to estimate the right step size to reach a local minimum.
In the univariate case, Newton’s method uses a second-order Taylor series expansion to perform the quadratic approximation around some point on the objective function. The update rule for Newton’s method, which is obtained by setting the derivative to zero and solving for the root, involves a division by the second derivative:

x_(t+1) = x_t - f'(x_t) / f''(x_t)

If Newton’s method is extended to multivariate optimization, the derivative is replaced by the gradient, while the reciprocal of the second derivative is replaced by the inverse of the Hessian matrix.
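The update rule above can be sketched directly in a few lines of plain Python, applied to the running example; the starting point and the number of iterations are arbitrary illustrative choices:

```python
# Minimal sketch of univariate Newton's method, x_(t+1) = x_t - f'(x_t) / f''(x_t),
# applied to f(x) = x^3 + 2x^2 - 4x + 1 from earlier in the tutorial.

def f1(x):
    return 3 * x**2 + 4 * x - 4   # first derivative

def f2(x):
    return 6 * x + 4              # second derivative

x = 1.0                            # arbitrary starting point near the local minimum
for _ in range(10):
    x = x - f1(x) / f2(x)          # Newton update: divide the first derivative by the second

print(x)   # converges to about 0.6667, i.e. the local minimum at x = 2/3
```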
We will be looking at the Hessian and Taylor approximations, which leverage higher-order derivatives, in other tutorials.