### An intro to gradient descent procedure

Gradient descent procedure is a strategy that places critical importance in machine learning. It is usually leveraged for minimization of error functions within classification and regression issues. It is also leveraged in training neural networks, and deep learning architectures.

In this guide, you will find out about the gradient descent procedure.

After going through this guide, you will be aware of:

- Gradient Descent Method
- Criticality of Gradient Descent Within Machine Learning

__Tutorial Summarization__

This guide is subdivided into two portions, which are:

1] Gradient Descent Procedure

2] Solved instance of gradient descent procedure

__Prerequisites__

For this guide, the prior know-how of the following subjects is assumed:

- A function of several variables
- Partial derivatives and gradient vectors

You can review these subjects in our comprehensive blog section.

__Gradient Descent Procedure__

The gradient descent procedure is an algorithm for identifying the minimum of a function.

Assume we possess a function f(x), where x is a tuple of various variables, that is, x = (x_1, x_2, x_n). Also, assume that the gradient of f(x) is provided by ∇f(x). We wish to identify the value of the variables (x_1, x_2, … x_n) that provide us the minimum of the function. At any iteration t, we will signify the value of the tuple x by x[t]. Therefore x[t][1] is the value of x_1 at iteration t, x[t][2] is the value of x_2 at iteration t, e.t.c

__The Notation__

We possess the following variables.

- t = iteration number
- T = Total iterations
- n = Total variables in the domain of f (also referred to as the dimensionality of x)
- j = Iterator for variable number, e.g., x_j indicates the jth variable
- n = Learning rate
- ∇f(x[t]) = Value of the gradient vector of f at iteration t

__The Training Method__

The stages for the gradient descent algorithm are provided below. This is also referred to as the training method.

- Opt for an arbitrary initial point x_initial and setx[0] = x_initial
- For iterations t=1..T
- Update x(t) = x [t-1] – 𝜂∇f(x[t-1])

It’s as easy as that.

The learning pace is a user defined variable for the gradient descent process. Its value is in the range [0,1]

The above-specified strategy states that at every iteration we have to go about updating the value of x by taking a small step in the path of the negative of the gradient vector. If 𝜂=0 then there will be no alteration in x. If it is equal to 1, then it can be compared to taking a major step in the path of the negative of the gradient of the vector. Normally, 𝜂 is set to a minimal value such as 0.05 or 0.1. It can also be variable during the training procedure. So your algorithm can begin with a large value (e.g. 0.8) and the minimize it to lesser values.

__Instance of Gradient Descent__

Let’s identify the minimum of the following function of dual variables whose graphs and contours are demonstrated in the image underneath:

f(x,y) = x**x + 2y**y

The general form of the gradient vector is provided by:

∇f(x,y) = 2xi + 4yj

Two iterations of the algorithm, T=2 and 𝜂=0.1 are demonstrated below

- Initial t=0
- x[0] = (4,3) # This is just a randomly chosen point

- At t = 1
- x[1] = x[0] – η∇f(x[0])
- x[1] = (4,3) – 0.1*(8,12)
- x[1] = (3.2,1.8)

- At t=2
- x[2] = x[1] – η∇f(x[1])
- x[2] = (3.2,1.8) – 0.1*(6.4,7.2)
- x[2] = (2.56,1.08)

If you persist in executing the above-specified iterations, the process will ultimately wind up at the point where the function is minimum, that is, (0,0)

At iteration t=1, the algorithm is demonstrated in the figure below:

__How many iterations to execute__

Typically, gradient descent is executed till the value of x does not alter or the alteration in x is below a specific threshold. The ceasing criterion can also be a user defined maximal number of iterations – that we gave definition to, prior, as T

__Adding Momentum__

Gradient descent can face issues like:

1] Oscillate amongst two or more points

2] Get trapped in a local minimum

3] Overshoot and miss the minimum point

To tackle the above-specified issues, a momentum term can be included to the update equation of gradient descent algorithm as:

x[t] = x[t-1] – η∇f(x[t-1]) + ?*Δx[t-1]

where Δx[t-1] represents the change in x, i.e.,

Δx[t] = x[t] – x[t-1]

The initial change at t=0 is a zero vector. For this problem Δx[0] = (0,0).

__Regarding Gradient Ascent__

There is a connected Gradient Ascent process, which identifies the maximum of a function. In gradient descent we adhere to the direction of the rate of maximal decrease of a function. Within gradient descent, we adhere to the direction of the rate of maximum decrease of a function. It is the path of the negative gradient vector. While, within gradient ascent we adhere to the path of maximum rate of increase of a function, which is the direction indicated to by the positive gradient vector. We can also author a maximisation issue in terms of a maximisation problem by including a negative sign to f(x) i.e.,

maximize f(x) w.r.t x is equivalent to minimize -f(x) w.r.t x

__Why is the Gradient Descent critical within Machine Learning?__

The gradient descent algorithm is typically deployed within machine learning issues. In several classification and regression activities, the mean square error function is leveraged to fit a model to the data. The gradient descent procedure is leveraged to detect the optimal model parameters that cause the lowest mean square error.

Gradient ascent is leveraged in much the same way, for issues that consist of maximization of a function.

__Extensions__

This section details some concepts and ideas for extension of the guide that you would desire to explore.

- Hessian Matrix – Hessian, by design, is the second order partial derivatives of a function.

Take up a set of y=f(x)y=f(x) on nn equations in nn variables. a Jacobian matrix JJ is

And Hessian is a Jacobian matrix of the derivatives **∂f****/****∂x****1****∂f/∂x1****, ∂f/∂x2∂f/∂x2, …, ∂f/∂xn**

Wolfram – Hessian

Where do we leverage this?

Take up the following:

**fxx****(****x****,****y****)****fxx(x,y)**

fyy(x,y)fyy(x,y)

and **fxy****=****∂****2****f∂x∂y**

There’d be no issue comprehending fxx(x,y)fxx(x,y) and fyy(x,y). It indicates the concavity and convexity of *f *in the direction *x *and *y*. The geometric meaning of fxy(x,y) is a bit difficult to imagine. The object that truly has the geometric meaning is the **Hessian H, **where

Hessian can be leveraged to undertake analysis of the crucial points (gradient =0) within a function. If you’re at a critical point, the determinant of ** H **enables you to determine what the level sets are, if they’re positive definite, negative definite, or indefinite. The eigenvalues of

**t a crucial point, informs us if a point is local maximum (-ve eigenvalues) or local minimum (+ve). The Eigen Vectors provide the direction.**

*H,*- Jacobian – A Jacobian matrix, at times merely referred to as a Jacobian, is a matrix of first-order partial derivatives (in some scenarios, the term “Jacobian” also is in reference to the determinant of the Jacobian matrix.)

For a function: **f**he Jacobian is the following ** m x n **matrix.

Or, in Einstein notation,

Observe that in a few conventions, the Jacobian is the transpose of the above matrix.

Jacobians where m=n are square matrices, and are typically leveraged when modifying coordinates, particularly when taking several integrals and deciding if complicated functions are holomorphic. For instance, a Jacobian indicating a modification in variables from x to x(u,v) and y to y(u,v) in two dimensions is indicated as:

A Jacobian matrix is what is typically meant by the derivative of higher-dimensional functions, indeed differentiability in the components of a Jacobian ensures differentiability in the input function itself. In the scenario of a multivariate function, the Jacobian matrix with regards to the input variables is merely the gradient of the function. The Jacobian is also connected to the Hessian matrix by

__Applications__

Jacobian matrices are good for integrating when modifying coordinate systems. For instance, provided a two dimensional coordinate transformation, the double integral of f(x,y) turns into:

When operating with one independent variable, this turns into:

Which, when leveraged to compute an integral, provides the formula referred to as integration by substitution.

__Instances__

Jacobian matrices are good in integrating when modifying coordinate systems. For instance, provided a two dimensional coordinate transformation, the double integral of f(x,y) becomes:

When operating with a singular independent variable, this becomes:

Which, when leveraged in computing an integral, provides the formula referred to as integration by substitution. Find the area of a circle of radius by transformation from Cartesian coordinates to Polar coordinates.

As , the Jacobian determinant becomes

The integral now turns

Now we can go about integrating it.

As it is currently in polar coordinates, we can add the bonus as

This is to be expected. We have also discovered that the differential of

This same strategy can be leveraged to identify the volume of a sphere in the spherical coordinates. Since

The Jacobian determinant evaluates to:

This implies that for any function in spherical coordinates, the volume element is

The integral is thus,

__Other subjects worth exploring into__

- Derivatives
- Slopes and tangents
- Gradient descent for machine learning
- What is gradient in machine learning
- Partial derivatives and gradient vectors

__Conclusion and summarization__

In this blog post guide by AICoreSpot, you found out about the algorithm with regards to gradient descent. Particularly, you learned:

- Gradient descent procedure
- How to go about applying gradient descent procedure to identify the minimum of a function.
- How to transform a maximisation issue into a minimisation issue