 Gradient descent is an optimisation algorithm that follows the negative gradient of an objective function in order the situate the minimum of the function.

A restriction of gradient descent is that a singular step size (learning rate) is leveraged for all input variables. Extensions to gradient descent such as the Adaptive Movement Estimation (Adam) algorithm leverage an independent step sizing for every input variable but might have the outcome of a step size that swiftly decreases to very minimal values.

AMSGrad is an extension to the Adam variant of gradient descent that makes an effort to enhance the convergence attributes of the algorithm, avoiding major sudden changes in the learning rate for every input variable.

In this guide, you will find out about how to generate gradient descent optimisation with AMSGrad from scratch.

Upon finishing this guide, you will be aware of:

• Gradient descent being an optimisation algorithm that leverages the gradient of the objective function to go about navigating the search space.
• AMSGrad is an extension of the Adam variant of gradient descent developed to accelerate the optimisation process.
• How to go about implementing the AMSGrad optimisation algorithm from the ground up and go about applying it to an objective function and assess the outcomes.

Tutorial Summarization

This guide is subdivided into three portions, which are:

• 2D Test Problem

Gradient Descent is an optimisation algorithm.

It is technically referenced to as a first-order optimisation algorithm as it overtly leverages the first-order derivative of the targeted objective function.

First order methods are reliant on gradient data to assist in directing the search for a minimum.

The first-order derivative, or merely the ‘derivative’ is the rate of change or slope of the target function at a particular point, for instance, for a particular input.

If the targeted function takes several input variables, it is referenced to as a multivariate function and the input variables can be viewed of as a vector. In turn, the derivative of a multivariate targeted function might also be taken as a vector and is referenced generally as the gradient.

Gradient: First-order derivative for a multivariate objective function.

The derivative of the gradient indicates in the direction of the steepest ascent of the targeted function for a particular input.

Gradient descent is in reference to a minimization optimisation algorithm that follows the negative of the gradient downhill of the targeted function to situate the minimum of the function.

The gradient descent algorithm needs a targeted function that is undergoing optimization and the derivative function for the objective function for the objective function. The targeted function f() gives back a score for a provided grouping of inputs, and the derivative function f’() provides the derivative of the targeted function for a provided grouping of inputs.

The gradient descent algorithm needs a beginning point (x) in the problem, like an arbitrarily chosen point in the input space.

There derivative is then quantified and a step is undertaken within the input space that is forecasted to have the outcome of a downhill movement in the target function, going by the assumption we are minimizing the target function.

A downhill movement is made by initially calculating how far to move within the input space, calculated as the step size (referred to as Alpha or the learning rate) multiplied by the gradient. This is then removed from the present point, making sure we move against the gradient, or down the targeted function.

•  x(t) = x(t-1) – step_size * f’(x(t))

The steeper the objective function at a provided point, at a set point, the bigger the magnitude the gradient possesses, and in turn, the bigger the step undertaken within the search space. The size of the step undertaken is scaled leveraging a step size hyperparameter.

Step Size: Hyperparameter that manages how far to move in the search space against the gradient every iteration of the algorithm.

If the step size happens to be too small, the movement within the search space is bound to be small and the search will subsequently take a protracted amount of time. If the step size is excessively large, the search might bounce around the search space and skip over the optima.

Now that we are acquainted with the gradient descent optimisation algorithm, let’s observe the AMSGrad algorithm.

AMSGrad algorithm is an extension to the Adaptive Movement Estimation (Adam) optimisation algorithm. In a wider sense, is an extension to the Gradient Descent Optimisation algorithm.

The algorithm was detailed in the 2018 paper by J. Sashank, et. al. entitled “On the Convergence of Adam and Beyond.”

Typically, Adam automatically takes up a separate step size (learning rate) for every parameter in the optimization problem.

A restriction of Adam is that it can both reduce the step size when getting near to the optima, which is good, but it also increases the step size in some scenarios, which is a bad thing.

ADAM aggressively improves the learning rate, although, this can be bad to the cumulative performance of the algorithm. By comparison, AMSGRAD neither enhances nor decreases the learning rate and further, decreases which can possibly lead to non-reducing learning rate even if the gradient is big in the future iterations.

AdaGrad is an extension to Adam that maintains a maximum of the second moment vector and leverages it to bias-correct the gradient leveraged to go about updating the parameter, rather than the moment vector itself. This assists to cease the optimization slowing down too swiftly. (For instance, premature convergence)

The critical difference of AMSGRAD with ADAM is that it upkeeps the maximum of all vt until the current time step and leverages this maximum value for normalization of the running average of the gradient rather than vt in ADAM.

Let’s go through every aspect of the algorithm.

To start with, we ought to upkeep a first and second moment vector in addition to a max second moment vector for every parameter being optimized as portion of the search, referenced to as m and v. (Geek letter nu, however, we will leverage v) and vhat respectively.

They are initialized at 0.0 at the beginning of the search.

• m = 0
• v = 0
• vhat = 0

The algorithm is executed iteratively with the passage of time t beginning at t=1, and every iteration consists of a calculating a new grouping of parameter values, x, for instance, going from x(t-1) to x(t).

It is probably simple to comprehend the algorithm if we concentrate on updation of a single parameter, which generalizes to updating all parameters through vector operations.

• g(t) = f’(x(t-1))

Then, the first moment vector is updated leveraging the gradient and a hyperparameter beta1.

• m(t) = beta1(t) * m(t-1) + (1 – beta1(t)) * g(t)

The beta1 hyperparameter can be retained as constant or can be decayed exponentially over the duration of the search like:

• beta1(t) = beta1^t

Or, alternatively,

• beta1(t) = beta1/t

The second moment vector is updated leveraging the square of the gradient and a hyperparameter beta2.

• v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2

Then, the maximum for the second moment vector is updated.

• vhat(t) = max(vhat(t-1), v(t))

Where in, max() calculates the maximum of the furnished values.

The parameter value can then be updated leveraging the calculated terminology and the step size hyperparameter alpha.

• x(t) = x(t-1) – alpha(t) * m(t) / sqrt(vhat(t)))

Where sqrt() is the square root function.

The step size might additionally be maintained constant or decayed exponentially.

In review, there are three hyperparameters for the algorithm, which are:

• alpha: Initial step size (learning rate), a common value is 0.002.
• beta1: Decay factor for first momentum, a common value is 0.9.
• beta2: Decay factor for infinity norm, a common value is 0.999

And there you have it.

For complete derivation of the AMSGrad algorithm within the context of the Adam algorithm, ‘On the Convergence of Adam and Beyond, 2018’ is a highly recommended research paper.

Now, let’s observe how we might go about implementing the algorithm from the ground up in Python.

In this portion of this blog article, we will look into how to go about implementing the gradient descent optimisation algorithm with AMSGrad Momentum.

2D Test Problem

We will leverage a simplistic 2D function that squares the input of every dimension and define the range of valid inputs from -1.0 to 1.0

The objective() function below goes about implementing this:

 123 # objective functiondef objective(x, y):return x**2.0 + y**2.0

We can create a 3D plot of the dataset to obtain a feel for the curvature of the response surface.

The full instance of plotting the objective function is detailed below.

 123456789101112131415161718192021222324 # 3d plot of the test functionfrom numpy import arangefrom numpy import meshgridfrom matplotlib import pyplot # objective functiondef objective(x, y):return x**2.0 + y**2.0 # define range for inputr_min, r_max = -1.0, 1.0# sample input range uniformly at 0.1 incrementsxaxis = arange(r_min, r_max, 0.1)yaxis = arange(r_min, r_max, 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a surface plot with the jet color schemefigure = pyplot.figure()axis = figure.gca(projection=’3d’)axis.plot_surface(x, y, results, cmap=’jet’)# show the plotpyplot.show()

Executing the instance creates a three-dimensional surface plot of the objective function.

We can observe the familiar bowl shape with the global minima at f(0,0) = 0. We can also create a 2D plot of the function. This will be beneficial later on when we wish to plot the progress of the search.

The instance below creates a contour plot of the objective function.

 1234567891011121314151617181920212223 # contour plot of the test functionfrom numpy import asarrayfrom numpy import arangefrom numpy import meshgridfrom matplotlib import pyplot # objective functiondef objective(x, y):return x**2.0 + y**2.0 # define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# sample input range uniformly at 0.1 incrementsxaxis = arange(bounds[0,0], bounds[0,1], 0.1)yaxis = arange(bounds[1,0], bounds[1,1], 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a filled contour plot with 50 levels and jet color schemepyplot.contourf(x, y, results, levels=50, cmap=’jet’)# show the plotpyplot.show()

Executing the instance creates a 2D contour plot of the objective function.

We can observe the bowl shape compressed to contours demonstrated with a colour gradient. We will leverage this plot to go about plotting the particular points looked into during the progress of the search. Now that we possess a test objective function, let’s observe how we might implement the AMSGrad optimization algorithm.

To start with, we require a function that calculates the derivative for this particular function.

The derivative of x^2 is x*2 in every dimension.

• f(x) = x^2
• f’(x) = x * 2

The derivative() function implements this below:

 123 # derivative of objective functiondef derivative(x, y):return asarray([x * 2.0, y * 2.0])

To start with, we can choose an arbitrary point within the bounds of the issue as a beginning point for the search.

This goes by the assumption that we possess an array that defines the bounds of the search with a singular row for every dimension and the first column defines the minimum and the second column gives definition to the maximum of the dimension.

 123 …# generate an initial pointx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])

Then, we require to initialize the moment vectors.

 12345 …# initialize moment vectorsm = [0.0 for _ in range(bounds.shape)]v = [0.0 for _ in range(bounds.shape)]vhat = [0.0 for _ in range(bounds.shape)]

We then execute a static number of iterations of the algorithm defined by the “n_iter” hyperparameter.

 1234 …# run iterations of gradient descentfor t in range(n_iter):…

The first stage is to calculate the derivative for the present grouping of parameters.

 123 …# calculate gradient g(t)g = derivative(x, x)

Then, we require to perform the AMSGrad update calculations. We will perform these calculations a single variable at a time leveraging an imperative programming style for readability.

Practically, it is recommended leveraging NumPy vector operations for efficacy.

 1234 …# build a solution one variable at a timefor i in range(x.shape):…

 123 …# m(t) = beta1(t) * m(t-1) + (1 – beta1(t)) * g(t)m[i] = beta1**(t+1) * m[i] + (1.0 – beta1**(t+1)) * g[i]

Then, the maximum of the second moment vector with the prior iteration and the present value.

 123 …# vhat(t) = max(vhat(t-1), v(t))vhat[i] = max(vhat[i], v[i])

Lastly, we can go about calculating the new value for the variable.

 123 …# x(t) = x(t-1) – alpha(t) * m(t) / sqrt(vhat(t)))x[i] = x[i] – alpha * m[i] / sqrt(vhat[i])

We may wish to add a small value to the denominator to avert a divide by zero error, for instance:

 123 …# x(t) = x(t-1) – alpha(t) * m(t) / sqrt(vhat(t)))x[i] = x[i] – alpha * m[i] / (sqrt(vhat[i]) + 1e-8)

This is then repeated for every parameter that is undergoing optimization.

At the conclusion of the iteration, we can assess the fresh parameter values and report the performance of the search.

 12345 …# evaluate candidate pointscore = objective(x, x)# report progressprint(‘>%d f(%s) = %.5f’ % (t, x, score))

We can bring all of this together into a function called amsgrad() that takes the names of the objective and derivative functions in addition to the algorithm hyperparameters, and gives back the ideal solution identified at the conclusion of the search and its assessment.

 23456789101112131415161718192021222324252627 # gradient descent algorithm with amsgraddef amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):# generate an initial pointx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])# initialize moment vectorsm = [0.0 for _ in range(bounds.shape)]v = [0.0 for _ in range(bounds.shape)]vhat = [0.0 for _ in range(bounds.shape)]# run the gradient descentfor t in range(n_iter):# calculate gradient g(t)g = derivative(x, x)# update variables one at a timefor i in range(x.shape):# m(t) = beta1(t) * m(t-1) + (1 – beta1(t)) * g(t)m[i] = beta1**(t+1) * m[i] + (1.0 – beta1**(t+1)) * g[i]# v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2v[i] = (beta2 * v[i]) + (1.0 – beta2) * g[i]**2# vhat(t) = max(vhat(t-1), v(t))vhat[i] = max(vhat[i], v[i])# x(t) = x(t-1) – alpha(t) * m(t) / sqrt(vhat(t)))x[i] = x[i] – alpha * m[i] / (sqrt(vhat[i]) + 1e-8)# evaluate candidate pointscore = objective(x, x)# report progressprint(‘>%d f(%s) = %.5f’ % (t, x, score))return [x, score]

We can then go about defining the bounds of the function and the hyperparameters and call the function to execute the optimization.

In this scenario, we will execute the algorithm for one hundred iterations with a preliminary learning rate of 0.007, beta of 0.9, and a beta2 of 0.99, discovered following a bit of trial and error.

 123456789101112131415 …# seed the pseudo random number generatorseed(1)# define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# define the total iterationsn_iter = 100# steps sizealpha = 0.007# factor for average gradientbeta1 = 0.9# factor for average squared gradientbeta2 = 0.99# perform the gradient descent search with amsgradbest, score = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

At the conclusion of the run, we will report the ideal solution that was discovered.

 1234 …# summarize the resultprint(‘Done!’)print(‘f(%s) = %f’ % (best, score))

Connecting all of this together, the complete instance of the AMSGrad gradient descent applied to our test issue is detailed below.

from math import sqrt

from numpy import asarray

from numpy.random import rand

from numpy.random import seed

# objective function

def objective(x, y):

return x**2.0 + y**2.0

# derivative of objective function

def derivative(x, y):

return asarray([x * 2.0, y * 2.0])

def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):

# generate an initial point

x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])

# initialize moment vectors

m = [0.0 for _ in range(bounds.shape)]

v = [0.0 for _ in range(bounds.shape)]

vhat = [0.0 for _ in range(bounds.shape)]

for t in range(n_iter):

g = derivative(x, x)

# update variables one at a time

for i in range(x.shape):

# m(t) = beta1(t) * m(t-1) + (1 – beta1(t)) * g(t)

m[i] = beta1**(t+1) * m[i] + (1.0 – beta1**(t+1)) * g[i]

# v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2

v[i] = (beta2 * v[i]) + (1.0 – beta2) * g[i]**2

# vhat(t) = max(vhat(t-1), v(t))

vhat[i] = max(vhat[i], v[i])

# x(t) = x(t-1) – alpha(t) * m(t) / sqrt(vhat(t)))

x[i] = x[i] – alpha * m[i] / (sqrt(vhat[i]) + 1e-8)

# evaluate candidate point

score = objective(x, x)

# report progress

print(‘>%d f(%s) = %.5f’ % (t, x, score))

return [x, score]

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 100

# steps size

alpha = 0.007

beta1 = 0.9

# factor for average squared gradient

beta2 = 0.99

best, score = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

print(‘Done!’)

print(‘f(%s) = %f’ % (best, score))

Executing the instance applies the optimisation algorithm with AMSGrad to our test issue and goes about reporting the performance of the search for every iteration of the algorithm.

Your outcomes may demonstrate variance provided the stochastic nature of the algorithm or evaluation process, or variations in numerical accuracy. Take up executing the instance a few times and contrast the average outcome.

In this scenario, we can observe that a near-optimum solution was identified after probably 90 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0

 12345678910111213 …>90 f([-5.74880707e-11 2.16227707e-03]) = 0.00000>91 f([-4.53359947e-11 2.03974264e-03]) = 0.00000>92 f([-3.57526928e-11 1.92415218e-03]) = 0.00000>93 f([-2.81951584e-11 1.81511216e-03]) = 0.00000>94 f([-2.22351711e-11 1.71225138e-03]) = 0.00000>95 f([-1.75350316e-11 1.61521966e-03]) = 0.00000>96 f([-1.38284262e-11 1.52368665e-03]) = 0.00000>97 f([-1.09053366e-11 1.43734076e-03]) = 0.00000>98 f([-8.60013947e-12 1.35588802e-03]) = 0.00000>99 f([-6.78222208e-12 1.27905115e-03]) = 0.00000Done!f([-6.78222208e-12 1.27905115e-03]) = 0.000002

We can go about plotting the progression of the AMSGrad search on a contour plot of the domain.

This can furnish an intuition for the progression of the search over the iterations of the algorithm.

We must go about updating the amsgrad() function to upkeep a listing of all solutions identified during the search, then return this listing at the conclusion of the search.

The updated variant of the function with these modifications is detailed below.

 123456789101112131415161718192021222324252627282930 # gradient descent algorithm with amsgraddef amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):solutions = list()# generate an initial pointx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])# initialize moment vectorsm = [0.0 for _ in range(bounds.shape)]v = [0.0 for _ in range(bounds.shape)]vhat = [0.0 for _ in range(bounds.shape)]# run the gradient descentfor t in range(n_iter):# calculate gradient g(t)g = derivative(x, x)# update variables one at a timefor i in range(x.shape):# m(t) = beta1(t) * m(t-1) + (1 – beta1(t)) * g(t)m[i] = beta1**(t+1) * m[i] + (1.0 – beta1**(t+1)) * g[i]# v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2v[i] = (beta2 * v[i]) + (1.0 – beta2) * g[i]**2# vhat(t) = max(vhat(t-1), v(t))vhat[i] = max(vhat[i], v[i])# x(t) = x(t-1) – alpha(t) * m(t) / sqrt(vhat(t)))x[i] = x[i] – alpha * m[i] / (sqrt(vhat[i]) + 1e-8)# evaluate candidate pointscore = objective(x, x)# keep track of all solutionssolutions.append(x.copy())# report progressprint(‘>%d f(%s) = %.5f’ % (t, x, score))return solutions

We can subsequently execute the search as prior, and this time get back the listing of solutions rather than the ideal final solution.

 123456789101112131415 …# seed the pseudo random number generatorseed(1)# define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# define the total iterationsn_iter = 100# steps sizealpha = 0.007# factor for average gradientbeta1 = 0.9# factor for average squared gradientbeta2 = 0.99# perform the gradient descent search with amsgradsolutions = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

We can then develop a contour plot of the objective function, as prior.

 12345678910 …# sample input range uniformly at 0.1 incrementsxaxis = arange(bounds[0,0], bounds[0,1], 0.1)yaxis = arange(bounds[1,0], bounds[1,1], 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a filled contour plot with 50 levels and jet color schemepyplot.contourf(x, y, results, levels=50, cmap=’jet’)

Lastly, we can plot every solution identified during the search as a white dot connected with a line.

 1234 …# plot the sample as black circlessolutions = asarray(solutions)pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’)

Connecting all of this together, the full instance of performing the AMSGrad optimisation on the test issue and plotting the outcomes on a contour plot is detailed below.

Executing the instance performs the search as prior, but in this scenario, the contour plot of the objective function is developed.

In this scenario, we can observe that a white dot is displayed for every solution identified in the course of the search, beginning above the optima and progressively getting nearer to the optima at the centre of the plot. This section furnishes additional resources on the subject if you are seeking to delve deeper.

Papers

• On the Convergence of Adam and Beyond, 2018.
• An Overview of Gradient Descent Optimization Algorithms, 2016

Books

• Algorithms for Optimization, 2019
• Deep Learning, 2016

APIs

• numpy.random.rand API
• numpy.asarray API
• Matplotlib API

Articles