### Gradient Descent Optimization with Nadam from the ground

Gradient descent is an optimization algorithm that adheres to the negative gradient of an objective function in order to situate the minimum of the function.

A restriction of gradient descent is that the progress of the search can slow down if the gradient becomes flat or large curvature. Momentum can be included to gradient descent that integrates some inertia to updates. This can be subsequently enhanced by integrating the gradient of the projected new position instead of the current position, referred to as Nesterov’s Accelerated Gradient (NAG) or Nesterov momentum.

Another restriction of gradient descent is that a singular step size (learning rate) is leveraged for all input variables. Extensions to gradient descent like the Adaptive Movement Estimation (Adam) algorithm that leverages a separate step size for every input variable but may have the outcome of a step size that swiftly decreases to very minimal values.

Nesterov-accelerated Adaptive Moment Estimation, or the Nadam, is an extension of the Adam algorithm that integrates Nesterov momentum and can have the outcome of better performance of the optimization algorithm.

After going through this guide, you will be aware of:

- Gradient descent is an optimization algorithm that leverages the gradient of the objective function to navigate the search space
- Nadam is an extension of the Adam variant of gradient descent that integrates Neterov momentum.
- How to implement the Nadam optimization algorithm from the ground up and apply it to an objective function and assess the outcomes.

__Tutorial Summarization__

This tutorial is subdivided into three portions, which are:

1] Gradient Descent

2] Nadam Optimization Algorithm

3] Gradient Descent with Nadam

- 2D Test problem
- Gradient descent optimization with Nadam
- Visualization of Nadam optimization

__Gradient Descent__

Gradient descent is an optimization algorithm.

It is technically referenced to as a first-order optimization algorithm as it overtly leverages the first-order derivative of the targeted objective function.

First-order methods are reliant on gradient data to help direct the search for a minimum.

The first-order derivative, or merely the ‘derivative’ is the rate of change or slope of the target function at a particular point, for example, for a particular point.

If the targeted function takes several input variables, it is referenced to as a multivariate function and the input variables can be perceived of as a vector. In turn, the derivative of a multivariate targeted function might also be taken as a vector and is referenced to generally as a gradient.

Gradient: First-order derivative for a multivariate objective function

The derivative or the gradient points in the direction of the steepest ascent of the target function for a particular point.

Gradient descent is a reference to a minimization optimization algorithm that adheres to the negative of the gradient downhill of the target function to situate the minimum of the function.

The gradient descent needs a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a provided grouping of inputs, and the derivative function f’() provides the derivative of the target function for a provided groupings of inputs.

The gradient descent algorithm needs a beginning point (x) in the problem, like an arbitrarily chosen point within the input space.

The derivative is then calculated a step is taken in the input space that is expected to have the outcome of a downhill movement in the targeted function, going by the assumption we are minimizing the target function.

A downhill movement is made by initially calculating how far to shift within the input space, calculated as the steps size (referred to as alpha or the learning rate) multiplied by the gradient. This is then removed from the current point, making sure we shift against the gradient, or down the targeted function.

- x(t) = x(t-1) – step_size*f’(x(t))

The steeper the objective function at a provided point, the bigger the magnitude of the gradient, and in turn, the bigger the step taken within the search space. The size of the step taken is scaled leveraging a step size hyperparameter.

Step Size: Hyperparameter that manages how far to move within the search space against the gradient every iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a protracted time. If the step size is too big, the search might bounce around the search space and skip over the optima.

Now that we are acquainted with the gradient descent optimization algorithm, let’s take a deeper look at the Nadam algorithm.

__Nadam Optimization Aglorithm__

The Nesterov-accelerated Adaptive Movement Estimation, or the Nadam, algorithm is an extension to the Adaptive Movement Estimation (Adam) optimization algorithm to include Nesterov’s Accelerated Gradient (NAG) or Nesterov momentum, which is an enhanced variant of momentum.

More widely, the Nadam algorithm is an extension to the Gradient Descent Optimization algorithm.

The algorithm was detailed in the 2016 paper by Timothy Dozat entitled “Incorporating Nesterov Momentum into Adam”. Even though a version of the paper was authored up in 2015 as a Stanford project report with the identical name.

Momentum adds an exponentially decaying shifting average (first moment) of the gradient to the gradient descent algorithm. This has the impact of smoothing out noisy objective functions and enhancing convergence.

Adam is an extension of gradient descent that includes a first and second moment of the gradient and automatically adapts a learning rate for every parameter that is undergoing optimization. NAG is an extension to momentum where the update is performed leveraging the gradient of the projected update to the parameter instead of the actual current variable value. This has the impact of slowing down the search when the optima is situated instead of overshooting, in some scenarios.

Nadam is an extension to Adam that leverages NAG momentum rather than classical momentum.

We show how to alter Adam’s momentum component to reap benefits of insights from NAG, and then we put forth preliminary evidence indicating that making this substitution enhances the speed of convergence and the quality of the learned models.

Let’s go through every aspect of the algorithm.

Nadam leverages a decaying step size (alpha) and first moment (mu) hyperparameters that can enhance performance. For the purposes of simplicity, we will ignore the aspect for now and assume constant values.

First, we should maintain the first and second moments of the gradient for every parameter being optimized as portion of the search, referenced to as m and n respectively. They are setup to 0.0 at the beginning of the search.

m = 0

n = 0

The algorithm is carried out iteratively over time, t beginning at t=1, and every iteration consists of calculating a new grouping of parameter values x, e.g. going from x(t-1) to x(t).

It’s probably simple to understand the algorithm if we concentrate on updating a single parameter, which generalizes to updating all parameters through vector operations.

To start with, the gradient (partial derivatives) are calculated for the present time step.

- g(t) = f’(x(t-1))

Then, the first moment is updated leveraging the gradient and a hyperparameter “mu”

- m(t) = mu * m(t-1) + (1 – mu) * g(t)

Then the second moment is updated leveraging the “nu” hyperparameter.

- n(t) = nu * n(t-1) + (1 – nu) * g(t)^2

Then, the first moment is bias-corrected leveraging the Nesterov momentum.

- mhat = (mu * m(t) / (1 – mu)) + ((1 – mu) * g(t) / (1 – mu))

The second moment is then bias-corrected.

Bias correction is an aspect of Adam and counters the fact that the first and second moments are initialized to nil at the beginning of the search.

- nhat = nu * n(t) / (1-nu)

Lastly, we can quantify the value for the parameter for this iteration.

- x(t) = x(t-1) – alpha / (sqrt(nhat) + eps) * mhat

Where Alpha is the step size (learning rate) hyperparameter, sqrt() is the square root function, and eps(epsilon) is a small value like 1e-8 added to prevent a divide by zero error.

In review, there are three hyperparameters for the algorithm, which are:

- alpha: Initial step size (learning rate), a typical value is 0.002
- mu: Decay factor for first moment (beta1 in Adam) a typical value is 0.975
- nu: Decay factor for second moment (beta2 in Adam) a typical value is 0.999

Now, let’s observe how we might implement the algorithm from the ground up in Python.

__Gradient Descent with Nadam__

In this portion of the blog, we will look into how a simple 2D function that square the input of every dimension and defines the range of valid inputs from -1.0 to 1.0:

The objective() function below goes about implementing this function:

1 2 3 | # objective function def objective(x, y): return x**2.0 + y**2.0 |

We can create a 3D plot of the dataset to obtain a feeling for the curvature of the response surface.

The total example of plotting the objective function is detailed below:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | # 3d plot of the test function from numpy import arange from numpy import meshgrid from matplotlib import pyplot
# objective function def objective(x, y): return x**2.0 + y**2.0
# define range for input r_min, r_max = -1.0, 1.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection=’3d’) axis.plot_surface(x, y, results, cmap=’jet’) # show the plot pyplot.show() |

Running the instance creates a 3D surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0,0) = 0

We can also develop a 2D plot of the function. This will be beneficial later on when we wish to plot the progress of the search.

The instance below creates a contour plot of the objective function.

# contour plot of the test function

from numpy import asarray

from numpy import arange

from numpy import meshgrid

from matplotlib import pyplot

# objective function

def objective(x, y):

return x**2.0 + y**2.0

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# sample input range uniformly at 0.1 increments

xaxis = arange(bounds[0,0], bounds[0,1], 0.1)

yaxis = arange(bounds[1,0], bounds[1,1], 0.1)

# create a mesh from the axis

x, y = meshgrid(xaxis, yaxis)

# compute targets

results = objective(x, y)

# create a filled contour plot with 50 levels and jet color scheme

pyplot.contourf(x, y, results, levels=50, cmap=’jet’)

# show the plot

pyplot.show()

Running the instance creates a two-dimensional contour plot of the objective function.

We can observe the bowl shape compressed to contours displayed with a colour gradient. We will leverage this plot to plot the particular points looked into during the progression of the search.

Now that we possess a test objective function, let’s look at how we might go about implementing the Nadam optimization algorithm.

__Gradient Descent Optimization with Nadam__

__ __

We can go about applying the gradient descent with Nadam to the test problem.

First, we require a function that calculates the derivative for this function.

The derivative of x^2 is x*2 in every dimension.

f(x) = x^2

f’(x) = x * 2

The derivative() function goes about implementing this below:

1 2 3 | # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) |

Then, we can implement gradient descent optimization with Nadam.

Then, we can choose a random point in the bounds of the issue as a beginning point for the search.

This goes by the assumption that we possess an array that defines the bounds of the search with a single row for every dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

1 2 3 4 | … # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) score = objective(x[0], x[1]) |

Then, we require to initialize the moment vectors.

1 2 3 4 | … # initialize decaying moving averages m = [0.0 for _ in range(bounds.shape[0])] n = [0.0 for _ in range(bounds.shape[0])] |

We then carry out a static number of iterations of the algorithm defined by the “n_iter” hyperparameter.

…

# run iterations of gradient descent

for t in range(n_iter):

…

The first step is to calculate the derivative for the present set of parameters.

1 2 3 | … # calculate gradient g(t) g = derivative(x[0], x[1]) |

Then, we require to perform the Nadam update calculations. We will carry out these calculations a single variable at a time leveraging an imperative programming style for readability.

Practically, it is recommended leveraging NumPy vector operations for efficiency.

1 2 3 4 | … # build a solution one variable at a time for i in range(x.shape[0]): … |

First, we require to calculate the moment vector

1 2 3 | … # m(t) = mu * m(t-1) + (1 – mu) * g(t) m[i] = mu * m[i] + (1.0 – mu) * g[i] |

Then, the second moment vector

1 2 3 | … # n(t) = nu * n(t-1) + (1 – nu) * g(t)^2 n[i] = nu * n[i] + (1.0 – nu) * g[i]**2 |

Then the bias-corrected Nesterov momentum

1 2 3 | … # mhat = (mu * m(t) / (1 – mu)) + ((1 – mu) * g(t) / (1 – mu)) mhat = (mu * m[i] / (1.0 – mu)) + ((1 – mu) * g[i] / (1.0 – mu)) |

The bias-correct second moment.

…

# nhat = nu * n(t) / (1 – nu)

nhat = nu * n[i] / (1.0 – nu)

And lastly updating the parameter

1 2 3 | … # x(t) = x(t-1) – alpha / (sqrt(nhat) + eps) * mhat x[i] = x[i] – alpha / (sqrt(nhat) + eps) * mhat |

This is then repeated for every parameter that is undergoing optimization.

At the conclusion of the iteration, we can assess the new parameter values and repot the performance of the search.

1 2 3 4 5 | … # evaluate candidate point score = objective(x[0], x[1]) # report progress print(‘>%d f(%s) = %.5f’ % (t, x, score)) |

We can connect all of this together into a function referred nadam() that takes the names of the objective and derivative functions, in addition to the algorithm hyperparameters, and returns the ideal solution identified at the conclusion of the search and its assessments.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | # gradient descent algorithm with nadam def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8): # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) score = objective(x[0], x[1]) # initialize decaying moving averages m = [0.0 for _ in range(bounds.shape[0])] n = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(bounds.shape[0]): # m(t) = mu * m(t-1) + (1 – mu) * g(t) m[i] = mu * m[i] + (1.0 – mu) * g[i] # n(t) = nu * n(t-1) + (1 – nu) * g(t)^2 n[i] = nu * n[i] + (1.0 – nu) * g[i]**2 # mhat = (mu * m(t) / (1 – mu)) + ((1 – mu) * g(t) / (1 – mu)) mhat = (mu * m[i] / (1.0 – mu)) + ((1 – mu) * g[i] / (1.0 – mu)) # nhat = nu * n(t) / (1 – nu) nhat = nu * n[i] / (1.0 – nu) # x(t) = x(t-1) – alpha / (sqrt(nhat) + eps) * mhat x[i] = x[i] – alpha / (sqrt(nhat) + eps) * mhat # evaluate candidate point score = objective(x[0], x[1]) # report progress print(‘>%d f(%s) = %.5f’ % (t, x, score)) return [x, score] |

We can then go about defining the bounds of the function and the hyperparameters and refer to the function to carry out the optimization.

In this scenario, we will execute the algorithm for fifty iterations with an initial alpha of 0.02, mu of 0.8 and a nu of 0.999, discovered following a little trial and error.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | … # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # steps size alpha = 0.02 # factor for average gradient mu = 0.8 # factor for average squared gradient nu = 0.999 # perform the gradient descent search with nadam best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu) |

At the conclusion of the run, we will report the ideal solution identified.

1 2 3 4 | … # summarize the result print(‘Done!’) print(‘f(%s) = %f’ % (best, score)) |

Connecting all of this together, the complete instance of Nadam Gradient Descent applied to our test issue is detailed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 | # gradient descent optimization with nadam for a two-dimensional test function from math import sqrt from numpy import asarray from numpy.random import rand from numpy.random import seed
# objective function def objective(x, y): return x**2.0 + y**2.0
# derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0])
# gradient descent algorithm with nadam def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8): # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) score = objective(x[0], x[1]) # initialize decaying moving averages m = [0.0 for _ in range(bounds.shape[0])] n = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(bounds.shape[0]): # m(t) = mu * m(t-1) + (1 – mu) * g(t) m[i] = mu * m[i] + (1.0 – mu) * g[i] # n(t) = nu * n(t-1) + (1 – nu) * g(t)^2 n[i] = nu * n[i] + (1.0 – nu) * g[i]**2 # mhat = (mu * m(t) / (1 – mu)) + ((1 – mu) * g(t) / (1 – mu)) mhat = (mu * m[i] / (1.0 – mu)) + ((1 – mu) * g[i] / (1.0 – mu)) # nhat = nu * n(t) / (1 – nu) nhat = nu * n[i] / (1.0 – nu) # x(t) = x(t-1) – alpha / (sqrt(nhat) + eps) * mhat x[i] = x[i] – alpha / (sqrt(nhat) + eps) * mhat # evaluate candidate point score = objective(x[0], x[1]) # report progress print(‘>%d f(%s) = %.5f’ % (t, x, score)) return [x, score]
# seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # steps size alpha = 0.02 # factor for average gradient mu = 0.8 # factor for average squared gradient nu = 0.999 # perform the gradient descent search with nadam best, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu) print(‘Done!’) print(‘f(%s) = %f’ % (best, score)) |

Running the instance applies the optimization algorithm with Nadam to our test issue and reports the performance of the search for every iteration of the algorithm.

Your outcomes may demonstrate variance provided the stochastic nature of the algorithm or assessment procedure, or differences in numerical accuracy. Take up running the instance a few times and contrast the average outcome.

In this scenario, we can observe that a near-optimal solution was identified after probably 44 iterations of the search, with input values near 0.0 and 0.0 evaluating to 0.0.

1 2 3 4 5 6 7 8 9 10 11 12 13 | … >40 f([ 5.07445337e-05 -3.32910019e-03]) = 0.00001 >41 f([-1.84325171e-05 -3.00939427e-03]) = 0.00001 >42 f([-6.78814472e-05 -2.69839367e-03]) = 0.00001 >43 f([-9.88339249e-05 -2.40042096e-03]) = 0.00001 >44 f([-0.00011368 -0.00211861]) = 0.00000 >45 f([-0.00011547 -0.00185511]) = 0.00000 >46 f([-0.0001075 -0.00161122]) = 0.00000 >47 f([-9.29922627e-05 -1.38760991e-03]) = 0.00000 >48 f([-7.48258406e-05 -1.18436586e-03]) = 0.00000 >49 f([-5.54299505e-05 -1.00116899e-03]) = 0.00000 Done! f([-5.54299505e-05 -1.00116899e-03]) = 0.000001 |

__Visualization of Nadam Optimization__

We can plot the progress of the Nadam search on a contour plot of the domain.

This can furnish an intuition for the progression of the search across the iterations of the algorithm.

We must go about updating the nadam() function to maintain a listing of all solutions identified during the search, then return this list at the conclusion of the search.

The updated variant of the function with these modifications is detailed below:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | # gradient descent algorithm with nadam def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8): solutions = list() # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) score = objective(x[0], x[1]) # initialize decaying moving averages m = [0.0 for _ in range(bounds.shape[0])] n = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(bounds.shape[0]): # m(t) = mu * m(t-1) + (1 – mu) * g(t) m[i] = mu * m[i] + (1.0 – mu) * g[i] # n(t) = nu * n(t-1) + (1 – nu) * g(t)^2 n[i] = nu * n[i] + (1.0 – nu) * g[i]**2 # mhat = (mu * m(t) / (1 – mu)) + ((1 – mu) * g(t) / (1 – mu)) mhat = (mu * m[i] / (1.0 – mu)) + ((1 – mu) * g[i] / (1.0 – mu)) # nhat = nu * n(t) / (1 – nu) nhat = nu * n[i] / (1.0 – nu) # x(t) = x(t-1) – alpha / (sqrt(nhat) + eps) * mhat x[i] = x[i] – alpha / (sqrt(nhat) + eps) * mhat # evaluate candidate point score = objective(x[0], x[1]) # store solution solutions.append(x.copy()) # report progress print(‘>%d f(%s) = %.5f’ % (t, x, score)) return solutions |

We can then carry out the search as prior, and this time retrieve the listing of solutions rather than the ideal final solution.

…

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 50

# steps size

alpha = 0.02

# factor for average gradient

mu = 0.8

# factor for average squared gradient

nu = 0.999

# perform the gradient descent search with nadam

solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)

We can then develop a contour plot of the objective function, as prior.

1 2 3 4 5 6 7 8 9 10 | … # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap=’jet’) |

Lastly, we can plot every solution identified during the search as a white dot linked by a line.

1 2 3 4 | … # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’) |

Connecting all of this together, the complete example of performing the Nadam optimization on the test problem and plotting the outcomes on a contour plot is detailed here.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 | # example of plotting the nadam search on a contour plot of the test function from math import sqrt from numpy import asarray from numpy import arange from numpy import product from numpy.random import rand from numpy.random import seed from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D
# objective function def objective(x, y): return x**2.0 + y**2.0
# derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0])
# gradient descent algorithm with nadam def nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8): solutions = list() # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) score = objective(x[0], x[1]) # initialize decaying moving averages m = [0.0 for _ in range(bounds.shape[0])] n = [0.0 for _ in range(bounds.shape[0])] # run the gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(bounds.shape[0]): # m(t) = mu * m(t-1) + (1 – mu) * g(t) m[i] = mu * m[i] + (1.0 – mu) * g[i] # n(t) = nu * n(t-1) + (1 – nu) * g(t)^2 n[i] = nu * n[i] + (1.0 – nu) * g[i]**2 # mhat = (mu * m(t) / (1 – mu)) + ((1 – mu) * g(t) / (1 – mu)) mhat = (mu * m[i] / (1.0 – mu)) + ((1 – mu) * g[i] / (1.0 – mu)) # nhat = nu * n(t) / (1 – nu) nhat = nu * n[i] / (1.0 – nu) # x(t) = x(t-1) – alpha / (sqrt(nhat) + eps) * mhat x[i] = x[i] – alpha / (sqrt(nhat) + eps) * mhat # evaluate candidate point score = objective(x[0], x[1]) # store solution solutions.append(x.copy()) # report progress print(‘>%d f(%s) = %.5f’ % (t, x, score)) return solutions
# seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 50 # steps size alpha = 0.02 # factor for average gradient mu = 0.8 # factor for average squared gradient nu = 0.999 # perform the gradient descent search with nadam solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap=’jet’) # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’) # show the plot pyplot.show() |

Running the instance carries out the search as prior, except in this scenario, the contour plot of the objective function is developed.

In this scenario, we can observe that a white dot is displayed for every solution identified during the search, beginning above the optima and progressively getting nearer to the optima at the centre of the plot.

__Further Reading__

This portion of the blog post details additional resources on the subject if you are seeking to delve deeper.

*Papers*

- Incorporating Nesterov Momentum into Adam, 2016
- Incorporating Nesterov Momentum into Adam, Stanford Report, 2015
- A method for solving the convex programming problem with convergence rate 0 (1/k^2), 1983
- Adam: A method for stochastic optimization, 2014
- An overview of Gradient Descent Optimization Algorithms, 2016

*Books*

- Algorithms for optimization, 2019
- Deep learning, 2016

*APIs*

- numpy.random.rand API
- numpy.asarray API
- Matplotlib API

*Articles*

- Gradient Descent, Wikipedia
- Stochastic gradient descent, Wikipedia
- Optimization, Timothy Dozat, GitHub

__Conclusion__

In this guide, you found out how to develop the gradient descent optimization with Nadam from the ground up.

Particularly, you learned:

- Gradient descent being an optimization algorithm that leverages the gradient of the objective function to navigate the search space.
- Nadam is an extension of the Adam version of gradient descent that integrates Nesterov Momentum.
- How to implement the Nadam optimization algorithm from the ground up and apply it to an objective function and assess the results.