Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to situate the minimum of the function.

A restriction of gradient descent is that it can get stuck in flat areas or bounce around if the objective function gives back noisy gradients. Momentum is a strategy that accelerates the progression of the search to skim across flat regions and smooth out bouncy gradients.

In some scenarios, the acceleration of momentum can cause the search to miss or overshoot the minima at the bottom of basins or valleys. Nesterov Momentum is an extension of momentum that consists of calculating the decaying moving average of the gradients of projected positions within the search space instead of the actual positions themselves.

This has the influence of leveraging the accelerating advantages of momentum while enabling the search to slow down when approaching the optima and minimize the probability of missing or overshooting it.

In this guide, you will find out how to develop the Gradient Descent optimization algorithm with Nesterov Momentum from the ground up.

After going through this guide, you will be aware of:

- Gradient descent being an optimization algorithm that leverages the gradient of the objective function to navigate the search space.
- The convergence of gradient descent optimization algorithm can be accelerated through extension of the algorithm and including Nesterov Momentum.
- How to implement the Nesterov Momentum optimization algorithm from the ground up and apply it to an objective function and assess the outcomes.

__Tutorial Summarization__

This tutorial is subdivided into three portions, which are:

1] Gradient Descent

2] Nesterov Momentum

3] Gradient Descent with Nesterov Momentum

- Two-Dimensional Test Problem
- Gradient descent optimization with Nesterov Momentum
- Visualization of Nesterov Momentum

__Gradient Descent__

Gradient descent is an optimization algorithm.

It is technically referenced to as a first-order optimization algorithm as it overtly leverages the first order derivative of the targeted objective function.

First-order methods are reliant on gradient data to assist in directing the search for a minimum.

The first order derivative, or merely the ‘derivative’ is the rate of change or slope of the target function at a particular point, for example, for a particular input.

If the target functions takes several input variables, it is referenced to as a multivariate function and the input variables can be perceived of as a vector. In turn, the derivative of a multivariate targeted function might also be perceived as a vector and is referenced to typically as the “gradient”.

Gradient: First order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a particular input.

Gradient descent references to a minimization optimization algorithm that follows the negative of the gradient downhill of the targeted function to situate the minimum of the function.

The gradient descent algorithm needs a target function that is being optimized and the derivative function for the objective function. The targeted function(f) gives back a score for a provided grouping of inputs, and the derivative function f’() provides the derivative of the target function for a provided grouping of inputs.

The gradient descent algorithm needs a beginning point (x) in the problem, like an arbitrarily chosen point in the input space.

The derivative is subsequently calculated and a step is undertaken within the input space that is predicted to have the outcome of a downhill movement within the targeted function, under the assumption we are minimizing the target function.

A downhill movement is performed by first calculating how far to shift within the input space, calculated as the steps size (referred to as alpha or the learning rate) multiplied by the gradient. This is then removed from the present point, making sure we shift against the gradient, or down the targeted function.

- x(t+1) = x(t) – step_size * f'(x(t))

The steeper the objective function at a provided point, the bigger the magnitude of the gradient, and in turn, the bigger the step taken within the search space. The size of the step undertaken is scaled leveraging a step size hyperparameter.

Step size (alpha): Hyperparameter that manages how far to shift in the search space against the gradient every iteration of the algorithm.

If the step size is too minimal, the movement within the search space will be minimal, and the search will take a protracted time. If the step size is too big, the search might bounce around the search space and skip over the optima.

Now that we are acquainted with the gradient descent optimization algorithm, let’s observe Nesterov momentum.

__Nesterov Momentum__

Nesterov Momentum is an extension to the gradient descent optimization algorithm.

The strategy was detailed by (and named for) Yurii Nesterov in his 1983 paper entitled “A Method for Solving the Convex Programming Problem with Convergence Rate O(1/k^2)

Ilya Sutskever, et al. are accountable for making widespread the application of Nesterov Momentum with regards to the training of neural networks with stochastic gradient descent detailed in their 2013 research paper, “On the importance of initialization and momentum in deep learning.” They refer to the approach as “Nesterov’s Accelerated Gradient” or NAG for short.

Nesterov Momentum is not usually viewed as a variant of momentum, it indeed happens to be closely connected to classical momentum, varying only in the precise update of the velocity vector.

Conventional momentum consists of maintenance of an extra variable that indicates the last update performed to the variable, an exponentially decaying shifting average of past gradients.

The momentum algorithm accumulates an exponentially decaying moving average of historical gradients and goes on to shift in their direction.

This final update or final change to the variable is the included to the variable scaled by a “momentum” hyperparameter that manages how much of the final change to include, e.g., 0.9 for 90%.

It is simpler to think about this update in terms of dual steps, for example, calculate the modification in the variable leveraging the partial derivative then calculate the fresh value for the variable.

- change(t+1) = (momentum * change(t)) – (step_size * f’(x(t)))
- w(t+1) = x(t) + change(t+1)

We can perceive momentum in terms of a ball rolling down the hill that will speed up and continue traversing in the same direction even in the existence of small hills.

Momentum can be interpreted as a ball rolling down a nearly horizontal incline. The ball naturally gathers momentum as gravity is the reason behind its acceleration, just like the gradient is the reason behind momentum to accumulate in this descent method.

An issue with momentum is that acceleration can at times be the reason behind the search to overshoot the minima at the bottom of a basin or valley floor.

Nesterov Momentum can be viewed of as a modification to momentum to surpass this issue of overshooting the minima.

It consists of first calculating the projected location of the variable leveraging the modification from the last iteration, and leveraging the derivative of the projected position in the calculation of the new position for the variable.

Calculating the gradient of the projected position functions like a correction factor for the acceleration that has been accumulated.

With Nesterov Momentum the gradient is assessed after the current velocity is applied. Therefore one can interpret Nesterov momentum as making an effort to include a correction factor to the conventional method of momentum.

Nesterov Momentum is simple to think about this is terms of the four steps:

1] Project the location of the solutions.

2] Caculate the gradient of the projection

3] Calculate the change in the variable leveraging the partial derivative.

4] Update the variable

Let’s observe these individual steps in comprehensive detail.

To start with, the projected position of the complete solution is calculated leveraging the change calculated in the final iteration of the algorithm.

- Projection(t+1) = x(t) + (momentum * change(t))

We can then calculate the gradient for this new position.

- Gradient(t+1) = f’(projection(t+1))

Now we can calculate the new position of every variable leveraging the gradient of the projection, to start with by calculating the modification in every variable.

- Change(t+1) = (momentum * change(t)) – (step_size * gradient(t+1))

And lastly, calculating the fresh value for every variable leveraging the calculated change.

x(t+1) = x(t) + change(t+1)

In the domain of convex optimization more generally, Nesterov Momentum is known to enhance the rate of convergence of the optimization algorithm (for example, minimize the number of iterations needed to identify the solution)

Much like momentum, NAG is a first-order optimization strategy with improved convergence rate guarantee than gradient descent in specific situations.

Even though the strategy is efficient in training of neural networks, it might not have the same general effect of accelerating convergence.

Unluckily, in the stochastic gradient scenario, Nesterov momentum does not enhance the rate of convergence.

Now that we are acquainted with the Nesterov momentum algorithm, let’s look into how we might go about implementing it and assess its performance.

__Gradient Descent with Nesterov Momentum__

To start with, let’s define an optimization function.

We will leverage a simple two-dimensional functions that squares the input of every dimension and define the range of valid inputs from -1.0 to 1.0

The objective() function below goes about implementing this function:

1 2 3 | # objective function def objective(x, y): return x**2.0 + y**2.0 |

We can develop a 3D plot of the dataset to obtain a feeling for the curvature of the response surface.

The full instance of plotting the objective function is detailed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | # 3d plot of the test function from numpy import arange from numpy import meshgrid from matplotlib import pyplot
# objective function def objective(x, y): return x**2.0 + y**2.0
# define range for input r_min, r_max = -1.0, 1.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection=’3d’) axis.plot_surface(x, y, results, cmap=’jet’) # show the plot pyplot.show() |

Running the instance develops a 3D surface plot of the objective function.

We can see the familiar bowl shape possessing the global minima at f(0,0) = 0.

We can also develop a 2D plot of the function. This will be beneficial later on when we wish to plot the progress of the search.

The instance below develops a contour plot of the objective function.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | # contour plot of the test function from numpy import asarray from numpy import arange from numpy import meshgrid from matplotlib import pyplot
# objective function def objective(x, y): return x**2.0 + y**2.0
# define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap=’jet’) # show the plot pyplot.show() |

Running the instance develops a 2D contour plot of the objective function.

We can observe the classical bowl shape compressed to contours displayed with a color gradient. We will leverage this plot to plot the particular points looked into during the progression of the search.

Now that we possess a test objective function, let’s observe how we might implement the Nesterov Momentum optimization algorithm.

__Gradient Descent Optimization with Nesterov Momentum__

We can go about applying the gradient descent with Nesterov Momentum to the test problem.

To start with, we require a function that calculates the derivatives for this function.

The derivative of x^2 is x * 2 in every dimension and the derivative() function implements this below.

1 2 3 | # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) |

Then, we can go about implementing gradient descent optimization.

Then, we can implement gradient descent optimization.

To start with, we can choose an arbitrary point in the bounds of the problem as a beginning point for the search.

This assumes we have a plethora that defines the bounds of the search with a single row for every dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

1 2 3 | … # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) |

Then, we are required to calculate the projected point from the present position and calculate its derivative.

1 2 3 4 5 | … # calculate the projected solution projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])] # calculate the gradient for the projection gradient = derivative(projected[0], projected[1]) |

We can then develop the new solution, a single variable at a time.

To start with the, the modification in the variable is calculated leveraging the partial derivative and learning rate with the momentum from the final change in the variable. This modification is recorded for the next iteration of the algorithm. Then the modification is leveraged to calculate the new value for the variable.

1 2 3 4 5 6 7 8 9 10 | … # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the change change[i] = (momentum * change[i]) – step_size * gradient[i] # calculate the new position in this variable value = solution[i] + change[i] # store this variable new_solution.append(value) |

This is then performed for every variable for the objective function, then repeated for every iteration of the algorithm.

This new solution can then be assessed leveraging the objective() function and the performance of the search can be reported.

1 2 3 4 5 6 | … # evaluate candidate point solution = asarray(new_solution) solution_eval = objective(solution[0], solution[1]) # report progress print(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval)) |

And that’s it.

We can connect all of this together, into a function referred to as nesterov() that takes on the names of the objective function and the derivative function, an array with the bounds of the domain and hyperparameter values for the cumulative number of algorithm iterations, the learning rate, and the momentum, and returns the final solution and its evaluation.

This complete function is detailed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | # gradient descent algorithm with nesterov momentum def nesterov(objective, derivative, bounds, n_iter, step_size, momentum): # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) # list of changes made to each variable change = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for it in range(n_iter): # calculate the projected solution projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])] # calculate the gradient for the projection gradient = derivative(projected[0], projected[1]) # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the change change[i] = (momentum * change[i]) – step_size * gradient[i] # calculate the new position in this variable value = solution[i] + change[i] # store this variable new_solution.append(value) # evaluate candidate point solution = asarray(new_solution) solution_eval = objective(solution[0], solution[1]) # report progress print(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval)) return [solution, solution_eval] |

Note: We have intentionally leveraged lists and imperative coding style rather than vectorized operations for readability. Feel free to adapt the implementation to a vectorization implementation with NumPy arrays for enhanced performance.

We can then go about defining our hyperparameters can call the nesterov() function to optimize our test objective function.

In this scenario, we will leverage 30 iterations of the algorithm with a learning rate of 0.1 and momentum of 0.3. These hyperparameter values were identified following a bit of trial and error.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | … # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 30 # define the step size step_size = 0.1 # define momentum momentum = 0.3 # perform the gradient descent search with nesterov momentum best, score = nesterov(objective, derivative, bounds, n_iter, step_size, momentum) print(‘Done!’) print(‘f(%s) = %f’ % (best, score)) |

Connecting all of this together, the complete example of gradient descent optimization with Nesterov Momentum is detailed here.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 | # gradient descent optimization with nesterov momentum for a two-dimensional test function from math import sqrt from numpy import asarray from numpy.random import rand from numpy.random import seed
# objective function def objective(x, y): return x**2.0 + y**2.0
# derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0])
# gradient descent algorithm with nesterov momentum def nesterov(objective, derivative, bounds, n_iter, step_size, momentum): # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) # list of changes made to each variable change = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for it in range(n_iter): # calculate the projected solution projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])] # calculate the gradient for the projection gradient = derivative(projected[0], projected[1]) # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the change change[i] = (momentum * change[i]) – step_size * gradient[i] # calculate the new position in this variable value = solution[i] + change[i] # store this variable new_solution.append(value) # evaluate candidate point solution = asarray(new_solution) solution_eval = objective(solution[0], solution[1]) # report progress print(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval)) return [solution, solution_eval]
# seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 30 # define the step size step_size = 0.1 # define momentum momentum = 0.3 # perform the gradient descent search with nesterov momentum best, score = nesterov(objective, derivative, bounds, n_iter, step_size, momentum) print(‘Done!’) print(‘f(%s) = %f’ % (best, score)) |

Running the instance applies the optimization algorithm with Nesterov Momentum to our test issue and reports performance of the search for every iteration of the algorithm.

Your outcomes may demonstrate variance provided the stochastic nature of the algorithm or evaluation procedure, or variations in numerical accuracy. Take up executing the instance a few times and contrast the mean outcomes.

In this scenario, we can observe that a near optimal solution was identified following after probably 15 iterations of the search, with input values near 0.0 and 0.0 evaluating to 0.0.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | >0 f([-0.13276479 0.35251919]) = 0.14190 >1 f([-0.09824595 0.2608642 ]) = 0.07770 >2 f([-0.07031223 0.18669416]) = 0.03980 >3 f([-0.0495457 0.13155452]) = 0.01976 >4 f([-0.03465259 0.0920101 ]) = 0.00967 >5 f([-0.02414772 0.06411742]) = 0.00469 >6 f([-0.01679701 0.04459969]) = 0.00227 >7 f([-0.01167344 0.0309955 ]) = 0.00110 >8 f([-0.00810909 0.02153139]) = 0.00053 >9 f([-0.00563183 0.01495373]) = 0.00026 >10 f([-0.00391092 0.01038434]) = 0.00012 >11 f([-0.00271572 0.00721082]) = 0.00006 >12 f([-0.00188573 0.00500701]) = 0.00003 >13 f([-0.00130938 0.0034767 ]) = 0.00001 >14 f([-0.00090918 0.00241408]) = 0.00001 >15 f([-0.0006313 0.00167624]) = 0.00000 >16 f([-0.00043835 0.00116391]) = 0.00000 >17 f([-0.00030437 0.00080817]) = 0.00000 >18 f([-0.00021134 0.00056116]) = 0.00000 >19 f([-0.00014675 0.00038964]) = 0.00000 >20 f([-0.00010189 0.00027055]) = 0.00000 >21 f([-7.07505806e-05 1.87858067e-04]) = 0.00000 >22 f([-4.91260884e-05 1.30440372e-04]) = 0.00000 >23 f([-3.41109926e-05 9.05720503e-05]) = 0.00000 >24 f([-2.36851711e-05 6.28892431e-05]) = 0.00000 >25 f([-1.64459397e-05 4.36675208e-05]) = 0.00000 >26 f([-1.14193362e-05 3.03208033e-05]) = 0.00000 >27 f([-7.92908415e-06 2.10534304e-05]) = 0.00000 >28 f([-5.50560682e-06 1.46185748e-05]) = 0.00000 >29 f([-3.82285090e-06 1.01504945e-05]) = 0.00000 Done! f([-3.82285090e-06 1.01504945e-05]) = 0.000000 |

__Visualization of Nesterov Momentum__

We can go about plotting the progress of the Nesterov Momentum search on a contour plot of the domain.

This can furnish an intuition with regards to the progress of the search over the iterations of the algorithm.

We must go about updating the nesterov() function to maintain a listing of all solutions identified during the search, then return this list at the conclusion of the search.

The updated variant of the function with these modifications is detailed here:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | # gradient descent algorithm with nesterov momentum def nesterov(objective, derivative, bounds, n_iter, step_size, momentum): # track all solutions solutions = list() # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) # list of changes made to each variable change = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for it in range(n_iter): # calculate the projected solution projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])] # calculate the gradient for the projection gradient = derivative(projected[0], projected[1]) # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the change change[i] = (momentum * change[i]) – step_size * gradient[i] # calculate the new position in this variable value = solution[i] + change[i] # store this variable new_solution.append(value) # store the new solution solution = asarray(new_solution) solutions.append(solution) # evaluate candidate point solution_eval = objective(solution[0], solution[1]) # report progress print(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval)) return solutions |

We can then carry out the search as prior, and this time retrieve the listing of solutions rather than the best final solution.

1 2 3 4 5 6 7 8 9 10 11 12 13 | … # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # define the step size step_size = 0.01 # define momentum momentum = 0.8 # perform the gradient descent search with nesterov momentum solutions = nesterov(objective, derivative, bounds, n_iter, step_size, momentum) |

We can then develop a contour plot of the objective function, as prior:

1 2 3 4 5 6 7 8 9 10 | … # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap=’jet’) |

Lastly, we can plot every solution identified during the search as a white dot linked by a line.

1 2 3 4 | … # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’) |

Connecting all of this together, the complete example of performing the Nesterov Momentum optimization on the test problem and plotting the outcomes on a contour plot is detailed here.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 | # example of plotting the nesterov momentum search on a contour plot of the test function from math import sqrt from numpy import asarray from numpy import arange from numpy.random import rand from numpy.random import seed from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D
# objective function def objective(x, y): return x**2.0 + y**2.0
# derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0])
# gradient descent algorithm with nesterov momentum def nesterov(objective, derivative, bounds, n_iter, step_size, momentum): # track all solutions solutions = list() # generate an initial point solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) # list of changes made to each variable change = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for it in range(n_iter): # calculate the projected solution projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])] # calculate the gradient for the projection gradient = derivative(projected[0], projected[1]) # build a solution one variable at a time new_solution = list() for i in range(solution.shape[0]): # calculate the change change[i] = (momentum * change[i]) – step_size * gradient[i] # calculate the new position in this variable value = solution[i] + change[i] # store this variable new_solution.append(value) # store the new solution solution = asarray(new_solution) solutions.append(solution) # evaluate candidate point solution_eval = objective(solution[0], solution[1]) # report progress print(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval)) return solutions
# seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # define the step size step_size = 0.01 # define momentum momentum = 0.8 # perform the gradient descent search with nesterov momentum solutions = nesterov(objective, derivative, bounds, n_iter, step_size, momentum) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap=’jet’) # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’) # show the plot pyplot.show() |

Running the instance performs the search as prior, except in this scenario, the contour plot of the objective function is developed.

In this scenario, we can observe that a white dot is displayed for every solution identified during the search, beginning above the optima and progressively getting nearer to the optima at the centre of the plot.

__Further Reading__

This portion of the blog furnishes additional resources on the subject if you are looking to delve deeper.

*Papers*

A method for solving the convex programming problem with convergence rate O(1/k^2), 1983.

On the importance of initialization and momentum in deep learning, 2013

*Books*

- Algorithms for optimization, 2019
- Deep learning, 2016.

*APIs*

- numpy.random.rand.API
- numpy.asarray.API
- Matplotlib API

*Articles*

- Gradient descent, Wikipedia
- Stochastic gradient descent, Wikipedia
- An overview of gradient descent optimization algorithms, 2016

__Conclusion__

In this guide, you found out how to develop the gradient descent optimization with Nesterov Momentum from the ground up.

Particularly, you learned:

- Gradient descent being an optimization algorithm that leverages the gradient of the objective function to navigate the search space.
- The convergence of gradient descent optimization algorithm can be quickened by extension of the algorithm and including Nesterov Momentum.
- How to implement the Nesterov Momentum optimization algorithm from the ground up and go about applying it to an objective function and assess the outcomes.

#### Gradient Descent with Nesterov Momentum from the ground up

Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to situate the minimum of the function.

#### The Chain Rule of Calculus – Additional Functions

The chain rule is a critical derivative rule that enables us to operate with composite functions.

#### Iterated local search from the ground up in Python

Iterated local search is a stochastic global optimization algorithm. It consists of the repetitive application of a local search to modified versions of a good solution identified previously.

#### An intro to transduction within machine learning

Transduction or transductive learning are terms you might encounter within applied machine learning. The term is being leveraged with a few applications of recurrent neural networks on sequence prediction issues, like a few problems in the domain of natural language processing.

#### What is Semi-Supervised Learning?

Semi-supervised learning is a learning problem that consists of a minimal number of labelled instances and a big number of unlabelled instances.

#### The chain rule of calculus for univariate and multivariate functions

The chain rule enables us to identify the derivative of composite functions.

#### Essence of stacking ensembles for machine learning

Stacked generalization, or stacking, might be a less widespread machine learning ensemble provided that it details a framework more than a particular model.

#### Combining predictions for ensemble learning

Ensemble methods consist of bringing together the predictions from several models.

#### Gradient Descent Optimization with Nadam from the ground

Gradient descent is an optimization algorithm that adheres to the negative gradient of an objective function in order to situate the minimum of the function.

#### Dynamic ensemble selection (DES) for Classification in Python

Dynamic ensemble selection is an ensemble learning strategy that automatically chooses a subset of ensemble members just in-time when making a prediction.