Gradient descent is an optimization algorithm that adheres to the negative gradient of an objective function in order to situate the minimum of the function.

A restriction of gradient descent is that the progress of the search can slow down if the gradient becomes flat or large curvature. Momentum can be included to gradient descent that integrates some inertia to updates. This can be subsequently enhanced by integrating the gradient of the projected new position instead of the current position, referred to as Nesterov’s Accelerated Gradient (NAG) or Nesterov momentum.

Another restriction of gradient descent is that a singular step size (learning rate) is leveraged for all input variables. Extensions to gradient descent like the Adaptive Movement Estimation (Adam) algorithm that leverages a separate step size for every input variable but may have the outcome of a step size that swiftly decreases to very minimal values.

Nesterov-accelerated Adaptive Moment Estimation, or the Nadam, is an extension of the Adam algorithm that integrates Nesterov momentum and can have the outcome of better performance of the optimization algorithm.

After going through this guide, you will be aware of:

• Gradient descent is an optimization algorithm that leverages the gradient of the objective function to navigate the search space
• How to implement the Nadam optimization algorithm from the ground up and apply it to an objective function and assess the outcomes.

Tutorial Summarization

This tutorial is subdivided into three portions, which are:

• 2D Test problem

Gradient descent is an optimization algorithm.

It is technically referenced to as a first-order optimization algorithm as it overtly leverages the first-order derivative of the targeted objective function.

First-order methods are reliant on gradient data to help direct the search for a minimum.

The first-order derivative, or merely the ‘derivative’ is the rate of change or slope of the target function at a particular point, for example, for a particular point.

If the targeted function takes several input variables, it is referenced to as a multivariate function and the input variables can be perceived of as a vector. In turn, the derivative of a multivariate targeted function might also be taken as a vector and is referenced to generally as a gradient.

Gradient: First-order derivative for a multivariate objective function

The derivative or the gradient points in the direction of the steepest ascent of the target function for a particular point.

Gradient descent is a reference to a minimization optimization algorithm that adheres to the negative of the gradient downhill of the target function to situate the minimum of the function.

The gradient descent needs a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a provided grouping of inputs, and the derivative function f’() provides the derivative of the target function for a provided groupings of inputs.

The gradient descent algorithm needs a beginning point (x) in the problem, like an arbitrarily chosen point within the input space.

The derivative is then calculated a step is taken in the input space that is expected to have the outcome of a downhill movement in the targeted function, going by the assumption we are minimizing the target function.

A downhill movement is made by initially calculating how far to shift within the input space, calculated as the steps size (referred to as alpha or the learning rate) multiplied by the gradient. This is then removed from the current point, making sure we shift against the gradient, or down the targeted function.

• x(t) = x(t-1) – step_size*f’(x(t))

The steeper the objective function at a provided point, the bigger the magnitude of the gradient, and in turn, the bigger the step taken within the search space. The size of the step taken is scaled leveraging a step size hyperparameter.

Step Size: Hyperparameter that manages how far to move within the search space against the gradient every iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a protracted time. If the step size is too big, the search might bounce around the search space and skip over the optima.

Now that we are acquainted with the gradient descent optimization algorithm, let’s take a deeper look at the Nadam algorithm.

The Nesterov-accelerated Adaptive Movement Estimation, or the Nadam, algorithm is an extension to the Adaptive Movement Estimation (Adam) optimization algorithm to include Nesterov’s Accelerated Gradient (NAG) or Nesterov momentum, which is an enhanced variant of momentum.

More widely, the Nadam algorithm is an extension to the Gradient Descent Optimization algorithm.

The algorithm was detailed in the 2016 paper by Timothy Dozat entitled “Incorporating Nesterov Momentum into Adam”. Even though a version of the paper was authored up in 2015 as a Stanford project report with the identical name.

Momentum adds an exponentially decaying shifting average (first moment) of the gradient to the gradient descent algorithm. This has the impact of smoothing out noisy objective functions and enhancing convergence.

Adam is an extension of gradient descent that includes a first and second moment of the gradient and automatically adapts a learning rate for every parameter that is undergoing optimization. NAG is an extension to momentum where the update is performed leveraging the gradient of the projected update to the parameter instead of the actual current variable value. This has the impact of slowing down the search when the optima is situated instead of overshooting, in some scenarios.

Nadam is an extension to Adam that leverages NAG momentum rather than classical momentum.

We show how to alter Adam’s momentum component to reap benefits of insights from NAG, and then we put forth preliminary evidence indicating that making this substitution enhances the speed of convergence and the quality of the learned models.

Let’s go through every aspect of the algorithm.

Nadam leverages a decaying step size (alpha) and first moment (mu) hyperparameters that can enhance performance. For the purposes of simplicity, we will ignore the aspect for now and assume constant values.

First, we should maintain the first and second moments of the gradient for every parameter being optimized as portion of the search, referenced to as m and n respectively. They are setup to 0.0 at the beginning of the search.

m = 0

n = 0

The algorithm is carried out iteratively over time, t beginning at t=1, and every iteration consists of calculating a new grouping of parameter values x, e.g. going from x(t-1) to x(t).

It’s probably simple to understand the algorithm if we concentrate on updating a single parameter, which generalizes to updating all parameters through vector operations.

• g(t) = f’(x(t-1))

Then, the first moment is updated leveraging the gradient and a hyperparameter “mu”

• m(t) = mu * m(t-1) + (1 – mu) * g(t)

Then the second moment is updated leveraging the “nu” hyperparameter.

• n(t) = nu * n(t-1) + (1 – nu) * g(t)^2

Then, the first moment is bias-corrected leveraging the Nesterov momentum.

• mhat = (mu * m(t) / (1 – mu)) + ((1 – mu) * g(t) / (1 – mu))

The second moment is then bias-corrected.

Bias correction is an aspect of Adam and counters the fact that the first and second moments are initialized to nil at the beginning of the search.

• nhat = nu * n(t) / (1-nu)

Lastly, we can quantify the value for the parameter for this iteration.

• x(t) = x(t-1) – alpha / (sqrt(nhat) + eps) * mhat

Where Alpha is the step size (learning rate) hyperparameter, sqrt() is the square root function, and eps(epsilon) is a small value like 1e-8 added to prevent a divide by zero error.

In review, there are three hyperparameters for the algorithm, which are:

• alpha: Initial step size (learning rate), a typical value is 0.002
• mu: Decay factor for first moment (beta1 in Adam) a typical value is 0.975
• nu: Decay factor for second moment (beta2 in Adam) a typical value is 0.999

Now, let’s observe how we might implement the algorithm from the ground up in Python.

In this portion of the blog, we will look into how a simple 2D function that square the input of every dimension and defines the range of valid inputs from -1.0 to 1.0:

The objective() function below goes about implementing this function:

 123 # objective functiondef objective(x, y):return x**2.0 + y**2.0

We can create a 3D plot of the dataset to obtain a feeling for the curvature of the response surface.

The total example of plotting the objective function is detailed below:

 123456789101112131415161718192021222324 # 3d plot of the test functionfrom numpy import arangefrom numpy import meshgridfrom matplotlib import pyplot # objective functiondef objective(x, y):return x**2.0 + y**2.0 # define range for inputr_min, r_max = -1.0, 1.0# sample input range uniformly at 0.1 incrementsxaxis = arange(r_min, r_max, 0.1)yaxis = arange(r_min, r_max, 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a surface plot with the jet color schemefigure = pyplot.figure()axis = figure.gca(projection=’3d’)axis.plot_surface(x, y, results, cmap=’jet’)# show the plotpyplot.show()

Running the instance creates a 3D surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0,0) = 0

We can also develop a 2D plot of the function. This will be beneficial later on when we wish to plot the progress of the search.

The instance below creates a contour plot of the objective function.

# contour plot of the test function

from numpy import asarray

from numpy import arange

from numpy import meshgrid

from matplotlib import pyplot

# objective function

def objective(x, y):

return x**2.0 + y**2.0

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# sample input range uniformly at 0.1 increments

xaxis = arange(bounds[0,0], bounds[0,1], 0.1)

yaxis = arange(bounds[1,0], bounds[1,1], 0.1)

# create a mesh from the axis

x, y = meshgrid(xaxis, yaxis)

# compute targets

results = objective(x, y)

# create a filled contour plot with 50 levels and jet color scheme

pyplot.contourf(x, y, results, levels=50, cmap=’jet’)

# show the plot

pyplot.show()

Running the instance creates a two-dimensional contour plot of the objective function.

We can observe the bowl shape compressed to contours displayed with a colour gradient. We will leverage this plot to plot the particular points looked into during the progression of the search.

Now that we possess a test objective function, let’s look at how we might go about implementing the Nadam optimization algorithm.

First, we require a function that calculates the derivative for this function.

The derivative of x^2 is x*2 in every dimension.

f(x) = x^2

f’(x) = x * 2

The derivative() function goes about implementing this below:

 123 # derivative of objective functiondef derivative(x, y):return asarray([x * 2.0, y * 2.0])

Then, we can choose a random point in the bounds of the issue as a beginning point for the search.

This goes by the assumption that we possess an array that defines the bounds of the search with a single row for every dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

 1234 …# generate an initial pointx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])score = objective(x[0], x[1])

Then, we require to initialize the moment vectors.

 1234 …# initialize decaying moving averagesm = [0.0 for _ in range(bounds.shape[0])]n = [0.0 for _ in range(bounds.shape[0])]

We then carry out a static number of iterations of the algorithm defined by the “n_iter” hyperparameter.

# run iterations of gradient descent

for t in range(n_iter):

The first step is to calculate the derivative for the present set of parameters.

 123 …# calculate gradient g(t)g = derivative(x[0], x[1])

Then, we require to perform the Nadam update calculations. We will carry out these calculations a single variable at a time leveraging an imperative programming style for readability.

Practically, it is recommended leveraging NumPy vector operations for efficiency.

 1234 …# build a solution one variable at a timefor i in range(x.shape[0]):…

First, we require to calculate the moment vector

 123 …# m(t) = mu * m(t-1) + (1 – mu) * g(t)m[i] = mu * m[i] + (1.0 – mu) * g[i]

Then, the second moment vector

 123 …# n(t) = nu * n(t-1) + (1 – nu) * g(t)^2n[i] = nu * n[i] + (1.0 – nu) * g[i]**2

Then the bias-corrected Nesterov momentum

 123 …# mhat = (mu * m(t) / (1 – mu)) + ((1 – mu) * g(t) / (1 – mu))mhat = (mu * m[i] / (1.0 – mu)) + ((1 – mu) * g[i] / (1.0 – mu))

The bias-correct second moment.

# nhat = nu * n(t) / (1 – nu)

nhat = nu * n[i] / (1.0 – nu)

And lastly updating the parameter

 123 …# x(t) = x(t-1) – alpha / (sqrt(nhat) + eps) * mhatx[i] = x[i] – alpha / (sqrt(nhat) + eps) * mhat

This is then repeated for every parameter that is undergoing optimization.

At the conclusion of the iteration, we can assess the new parameter values and repot the performance of the search.

 12345 …# evaluate candidate pointscore = objective(x[0], x[1])# report progressprint(‘>%d f(%s) = %.5f’ % (t, x, score))

We can connect all of this together into a function referred nadam() that takes the names of the objective and derivative functions, in addition to the algorithm hyperparameters, and returns the ideal solution identified at the conclusion of the search and its assessments.

 1234567891011121314151617181920212223242526272829 # gradient descent algorithm with nadamdef nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):# generate an initial pointx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])score = objective(x[0], x[1])# initialize decaying moving averagesm = [0.0 for _ in range(bounds.shape[0])]n = [0.0 for _ in range(bounds.shape[0])]# run the gradient descentfor t in range(n_iter):# calculate gradient g(t)g = derivative(x[0], x[1])# build a solution one variable at a timefor i in range(bounds.shape[0]):# m(t) = mu * m(t-1) + (1 – mu) * g(t)m[i] = mu * m[i] + (1.0 – mu) * g[i]# n(t) = nu * n(t-1) + (1 – nu) * g(t)^2n[i] = nu * n[i] + (1.0 – nu) * g[i]**2# mhat = (mu * m(t) / (1 – mu)) + ((1 – mu) * g(t) / (1 – mu))mhat = (mu * m[i] / (1.0 – mu)) + ((1 – mu) * g[i] / (1.0 – mu))# nhat = nu * n(t) / (1 – nu)nhat = nu * n[i] / (1.0 – nu)# x(t) = x(t-1) – alpha / (sqrt(nhat) + eps) * mhatx[i] = x[i] – alpha / (sqrt(nhat) + eps) * mhat# evaluate candidate pointscore = objective(x[0], x[1])# report progressprint(‘>%d f(%s) = %.5f’ % (t, x, score))return [x, score]

We can then go about defining the bounds of the function and the hyperparameters and refer to the function to carry out the optimization.

In this scenario, we will execute the algorithm for fifty iterations with an initial alpha of 0.02, mu of 0.8 and a nu of 0.999, discovered following a little trial and error.

 123456789101112131415 …# seed the pseudo random number generatorseed(1)# define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# define the total iterationsn_iter = 50# steps sizealpha = 0.02# factor for average gradientmu = 0.8# factor for average squared gradientnu = 0.999# perform the gradient descent search with nadambest, score = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)

At the conclusion of the run, we will report the ideal solution identified.

 1234 …# summarize the resultprint(‘Done!’)print(‘f(%s) = %f’ % (best, score))

Connecting all of this together, the complete instance of Nadam Gradient Descent applied to our test issue is detailed below.

Running the instance applies the optimization algorithm with Nadam to our test issue and reports the performance of the search for every iteration of the algorithm.

Your outcomes may demonstrate variance provided the stochastic nature of the algorithm or assessment procedure, or differences in numerical accuracy. Take up running the instance a few times and contrast the average outcome.

In this scenario, we can observe that a near-optimal solution was identified after probably 44 iterations of the search, with input values near 0.0 and 0.0 evaluating to 0.0.

 12345678910111213 …>40 f([ 5.07445337e-05 -3.32910019e-03]) = 0.00001>41 f([-1.84325171e-05 -3.00939427e-03]) = 0.00001>42 f([-6.78814472e-05 -2.69839367e-03]) = 0.00001>43 f([-9.88339249e-05 -2.40042096e-03]) = 0.00001>44 f([-0.00011368 -0.00211861]) = 0.00000>45 f([-0.00011547 -0.00185511]) = 0.00000>46 f([-0.0001075 -0.00161122]) = 0.00000>47 f([-9.29922627e-05 -1.38760991e-03]) = 0.00000>48 f([-7.48258406e-05 -1.18436586e-03]) = 0.00000>49 f([-5.54299505e-05 -1.00116899e-03]) = 0.00000Done!f([-5.54299505e-05 -1.00116899e-03]) = 0.000001

We can plot the progress of the Nadam search on a contour plot of the domain.

This can furnish an intuition for the progression of the search across the iterations of the algorithm.

We must go about updating the nadam() function to maintain a listing of all solutions identified during the search, then return this list at the conclusion of the search.

The updated variant of the function with these modifications is detailed below:

 1234567891011121314151617181920212223242526272829303132 # gradient descent algorithm with nadamdef nadam(objective, derivative, bounds, n_iter, alpha, mu, nu, eps=1e-8):solutions = list()# generate an initial pointx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])score = objective(x[0], x[1])# initialize decaying moving averagesm = [0.0 for _ in range(bounds.shape[0])]n = [0.0 for _ in range(bounds.shape[0])]# run the gradient descentfor t in range(n_iter):# calculate gradient g(t)g = derivative(x[0], x[1])# build a solution one variable at a timefor i in range(bounds.shape[0]):# m(t) = mu * m(t-1) + (1 – mu) * g(t)m[i] = mu * m[i] + (1.0 – mu) * g[i]# n(t) = nu * n(t-1) + (1 – nu) * g(t)^2n[i] = nu * n[i] + (1.0 – nu) * g[i]**2# mhat = (mu * m(t) / (1 – mu)) + ((1 – mu) * g(t) / (1 – mu))mhat = (mu * m[i] / (1.0 – mu)) + ((1 – mu) * g[i] / (1.0 – mu))# nhat = nu * n(t) / (1 – nu)nhat = nu * n[i] / (1.0 – nu)# x(t) = x(t-1) – alpha / (sqrt(nhat) + eps) * mhatx[i] = x[i] – alpha / (sqrt(nhat) + eps) * mhat# evaluate candidate pointscore = objective(x[0], x[1])# store solutionsolutions.append(x.copy())# report progressprint(‘>%d f(%s) = %.5f’ % (t, x, score))return solutions

We can then carry out the search as prior, and this time retrieve the listing of solutions rather than the ideal final solution.

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 50

# steps size

alpha = 0.02

mu = 0.8

# factor for average squared gradient

nu = 0.999

solutions = nadam(objective, derivative, bounds, n_iter, alpha, mu, nu)

We can then develop a contour plot of the objective function, as prior.

 12345678910 …# sample input range uniformly at 0.1 incrementsxaxis = arange(bounds[0,0], bounds[0,1], 0.1)yaxis = arange(bounds[1,0], bounds[1,1], 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a filled contour plot with 50 levels and jet color schemepyplot.contourf(x, y, results, levels=50, cmap=’jet’)

Lastly, we can plot every solution identified during the search as a white dot linked by a line.

 1234 …# plot the sample as black circlessolutions = asarray(solutions)pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’)

Connecting all of this together, the complete example of performing the Nadam optimization on the test problem and plotting the outcomes on a contour plot is detailed here.

Running the instance carries out the search as prior, except in this scenario, the contour plot of the objective function is developed.

In this scenario, we can observe that a white dot is displayed for every solution identified during the search, beginning above the optima and progressively getting nearer to the optima at the centre of the plot.

This portion of the blog post details additional resources on the subject if you are seeking to delve deeper.

Papers

• Incorporating Nesterov Momentum into Adam, 2016
• Incorporating Nesterov Momentum into Adam, Stanford Report, 2015
• A method for solving the convex programming problem with convergence rate 0 (1/k^2), 1983
• Adam: A method for stochastic optimization, 2014
• An overview of Gradient Descent Optimization Algorithms, 2016

Books

• Algorithms for optimization, 2019
• Deep learning, 2016

APIs

• numpy.random.rand API
• numpy.asarray API
• Matplotlib API

Articles