
Gradient Descent Optimisation with AMSGrad from the ground up

Gradient descent is an optimisation algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent, such as the Adaptive Moment Estimation (Adam) algorithm, use a separate step size for every input variable, but may result in a step size that rapidly decreases to very small values.

AMSGrad is an extension to the Adam variant of gradient descent that attempts to improve the convergence properties of the algorithm, avoiding large, sudden changes in the learning rate for every input variable.

In this guide, you will discover how to develop gradient descent optimisation with AMSGrad from scratch.

Upon finishing this guide, you will know:

  • Gradient descent is an optimisation algorithm that uses the gradient of the objective function to navigate the search space.
  • AMSGrad is an extension of the Adam variant of gradient descent developed to accelerate the optimisation process.
  • How to implement the AMSGrad optimisation algorithm from the ground up, apply it to an objective function, and evaluate the outcomes.

Tutorial Overview

This guide is divided into three parts:

1] Gradient Descent

2] AMSGrad Optimization Algorithm

3] Gradient Descent with AMSGrad

  • 2D Test Problem
  • Gradient Descent Optimisation with AMSGrad
  • Visualization of AMSGrad Optimization

Gradient Descent

Gradient Descent is an optimisation algorithm.

It is technically referred to as a first-order optimisation algorithm, as it explicitly makes use of the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum.

The first-order derivative, or simply the ‘derivative’, is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function, and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is generally referred to as the gradient.

Gradient: First-order derivative for a multivariate objective function.
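For example, for the multivariate function f(x, y) = x^2 + y^2 used later in this guide, the gradient is the vector of partial derivatives [x * 2, y * 2].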

The gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimisation algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for that objective function. The target function f() returns a score for a given set of inputs, and the derivative function f’() gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly chosen point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, computed as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, making sure we move against the gradient, or down the target function.

  • x(t) = x(t-1) - step_size * f’(x(t-1))

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

Step Size: Hyperparameter that controls how far to move in the search space against the gradient on each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
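As a minimal sketch of the procedure described above, the snippet below applies the basic gradient descent update to the one-dimensional function f(x) = x^2; the starting point, step size, and iteration count are chosen purely for illustration:

# minimal gradient descent sketch on f(x) = x^2 (illustrative values only)
def objective(x):
    return x**2.0

# derivative of the objective function
def derivative(x):
    return x * 2.0

x = 1.0          # starting point
step_size = 0.1  # learning rate (alpha)
for t in range(20):
    # x(t) = x(t-1) - step_size * f'(x(t-1))
    x = x - step_size * derivative(x)
    print('>%d f(%.5f) = %.5f' % (t, x, objective(x)))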

Now that we are familiar with the gradient descent optimisation algorithm, let’s look at the AMSGrad algorithm.

AMSGrad Optimization Algorithm

The AMSGrad algorithm is an extension to the Adaptive Moment Estimation (Adam) optimisation algorithm. More broadly, it is an extension to the gradient descent optimisation algorithm.

The algorithm was described in the 2018 paper by Sashank J. Reddi, et al. entitled “On the Convergence of Adam and Beyond.”

Generally, Adam automatically adapts a separate step size (learning rate) for every parameter in the optimization problem.

A limitation of Adam is that it can decrease the step size when close to the optima, which is good, but it can also increase the step size in some scenarios, which is bad.

AMSGrad tackles this specifically:

ADAM aggressively increases the learning rate, however, this can be detrimental to the overall performance of the algorithm. In contrast to ADAM, AMSGRAD neither increases nor decreases the learning rate and furthermore, decreases vt which can potentially lead to non-increasing learning rate even if the gradient is large in the future iterations.

AMSGrad is an extension to Adam that maintains the maximum of the second moment vector and uses it to normalize the gradient used to update the parameter, rather than the moment vector itself. This helps to stop the optimization slowing down too quickly (e.g. premature convergence).

The key difference of AMSGRAD with ADAM is that it maintains the maximum of all vt until the present time step and uses this maximum value for normalizing the running average of the gradient instead of vt in ADAM.

Let’s go through every aspect of the algorithm.

First, we must maintain a first and second moment vector as well as a max second moment vector for each parameter being optimized as part of the search, referred to as m, v (Greek letter nu, although we will use v), and vhat respectively.

They are initialized at 0.0 at the beginning of the search.

  • m = 0
  • v = 0
  • vhat = 0

The algorithm is executed iteratively over time t starting at t=1, and each iteration involves calculating a new set of parameter values x, e.g. going from x(t-1) to x(t).

It may be easier to understand the algorithm if we focus on updating a single parameter, which generalizes to updating all parameters via vector operations.

First, the gradients (partial derivatives) are calculated for the current time step.

  • g(t) = f’(x(t-1))

Next, the first moment vector is updated using the gradient and a hyperparameter beta1.

  • m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)

The beta1 hyperparameter can be held constant or decayed exponentially over the course of the search, e.g.:

  • beta1(t) = beta1^t

Or, alternatively,

  • beta1(t) = beta1/t

Next, the second moment vector is updated using the square of the gradient and a hyperparameter beta2.

  • v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2

Then, the maximum for the second moment vector is updated.

  • vhat(t) = max(vhat(t-1), v(t))

Where max() calculates the maximum of the given values.

The parameter value can then be updated using the calculated terms and the step size hyperparameter alpha.

  • x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))

Where sqrt() is the square root function.

The step size may also be held constant or decayed exponentially.
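Pulling the update equations together, a minimal sketch of a single AMSGrad update for one parameter might look like the following; the function name amsgrad_step is hypothetical, t is assumed to start at 1, the beta1(t) = beta1^t decay schedule is one of the options described above, and the small eps term is an assumption added to avoid division by zero:

from math import sqrt

# one AMSGrad update of a single parameter (an illustrative sketch)
def amsgrad_step(x, g, m, v, vhat, t, alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    # beta1(t) = beta1^t, the exponential decay schedule
    b1t = beta1**t
    # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
    m = b1t * m + (1.0 - b1t) * g
    # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
    v = beta2 * v + (1.0 - beta2) * g**2
    # vhat(t) = max(vhat(t-1), v(t))
    vhat = max(vhat, v)
    # x(t) = x(t-1) - alpha * m(t) / sqrt(vhat(t))
    x = x - alpha * m / (sqrt(vhat) + eps)
    return x, m, v, vhat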

In review, there are three hyperparameters for the algorithm, which are:

  • alpha: Initial step size (learning rate), a common value is 0.002.
  • beta1: Decay factor for first momentum, a common value is 0.9.
  • beta2: Decay factor for the second moment, a common value is 0.999.

And there you have it.

For a complete derivation of the AMSGrad algorithm in the context of the Adam algorithm, the paper ‘On the Convergence of Adam and Beyond, 2018’ is highly recommended.

Now, let’s look at how we might implement the algorithm from scratch in Python.

Gradient Descent with AMSGrad

In this part of the article, we will look at how to implement the gradient descent optimisation algorithm with AMSGrad.

2D Test Problem

First, let’s define an objective function.

We will use a simple 2D function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this:

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

We can create a 3D plot of the dataset to obtain a feel for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minimum at f(0, 0) = 0.

We can also create a 2D plot of the function. This will be beneficial later on when we wish to plot the progress of the search.

The example below creates a contour plot of the objective function.

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a 2D contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a colour gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the AMSGrad optimization algorithm.

Gradient Descent Optimization with AMSGrad

We can apply gradient descent with AMSGrad to the test problem.

First, we need a function that calculates the derivative for this function.

The derivative of x^2 is x*2 in every dimension.

  • f(x) = x^2
  • f’(x) = x * 2

The derivative() function below implements this:

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

Next, we can implement gradient descent optimization with AMSGrad.

First, we can select a random point within the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search, with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of that dimension.

# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

Next, we need to initialize the moment vectors.

# initialize moment vectors
m = [0.0 for _ in range(bounds.shape[0])]
v = [0.0 for _ in range(bounds.shape[0])]
vhat = [0.0 for _ in range(bounds.shape[0])]

We then run a fixed number of iterations of the algorithm, as defined by the “n_iter” hyperparameter.

# run iterations of gradient descent
for t in range(n_iter):

The first step is to calculate the derivative for the current set of parameters.

# calculate gradient g(t)
g = derivative(x[0], x[1])

Next, we need to perform the AMSGrad update calculations. We will perform these calculations one variable at a time, using an imperative programming style for readability.

In practice, NumPy vector operations are recommended for efficiency, as in the sketch below.
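As an illustrative sketch only, a vectorized version of the per-variable update developed below might look like this; it assumes x, g, m, v, and vhat are NumPy arrays, and uses NumPy’s element-wise maximum() and sqrt() in place of the scalar versions:

from numpy import maximum
from numpy import sqrt

# vectorized AMSGrad update across all parameters at once (a sketch)
m = beta1**(t+1) * m + (1.0 - beta1**(t+1)) * g
v = beta2 * v + (1.0 - beta2) * g**2
vhat = maximum(vhat, v)
x = x - alpha * m / (sqrt(vhat) + 1e-8)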

# build a solution one variable at a time
for i in range(x.shape[0]):

First, we need to calculate the first moment vector.

# m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]

Next, the second moment vector is updated using the squared gradient, and the maximum of the second moment vector is then taken with respect to the value from the prior iteration.

# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
# vhat(t) = max(vhat(t-1), v(t))
vhat[i] = max(vhat[i], v[i])

Finally, we can calculate the new value for the variable.

# x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))
x[i] = x[i] - alpha * m[i] / sqrt(vhat[i])

We may want to add a small value to the denominator to avoid a divide-by-zero error, for example:

# x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))
x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)

This is then repeated for each parameter being optimized.

At the end of the iteration, we can evaluate the new parameter values and report the performance of the search.

# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))

We can tie all of this together into a function named amsgrad() that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search and its evaluation.

# gradient descent algorithm with amsgrad
def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # initialize moment vectors
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    vhat = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # update variables one at a time
        for i in range(x.shape[0]):
            # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
            m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
            # vhat(t) = max(vhat(t-1), v(t))
            vhat[i] = max(vhat[i], v[i])
            # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))
            x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]

We can then define the bounds of the function and the hyperparameters and call the function to perform the optimization.

In this case, we will run the algorithm for 100 iterations with an initial learning rate of 0.007, a beta1 of 0.9, and a beta2 of 0.99, found after a little trial and error.

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 100
# step size
alpha = 0.007
# factor for average gradient
beta1 = 0.9
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with amsgrad
best, score = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

At the end of the run, we will report the best solution found.

# summarize the result
print('Done!')
print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of AMSGrad gradient descent applied to our test problem is listed below.

# gradient descent optimization with amsgrad for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with amsgrad
def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # initialize moment vectors
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    vhat = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # update variables one at a time
        for i in range(x.shape[0]):
            # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
            m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
            # vhat(t) = max(vhat(t-1), v(t))
            vhat[i] = max(vhat[i], v[i])
            # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))
            x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 100
# step size
alpha = 0.007
# factor for average gradient
beta1 = 0.9
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with amsgrad
best, score = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example applies the optimisation algorithm with AMSGrad to our test problem and reports the performance of the search for each iteration of the algorithm.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 90 iterations of the search, with input values near 0.0 and 0.0, evaluating to near 0.0.

>90 f([-5.74880707e-11  2.16227707e-03]) = 0.00000
>91 f([-4.53359947e-11  2.03974264e-03]) = 0.00000
>92 f([-3.57526928e-11  1.92415218e-03]) = 0.00000
>93 f([-2.81951584e-11  1.81511216e-03]) = 0.00000
>94 f([-2.22351711e-11  1.71225138e-03]) = 0.00000
>95 f([-1.75350316e-11  1.61521966e-03]) = 0.00000
>96 f([-1.38284262e-11  1.52368665e-03]) = 0.00000
>97 f([-1.09053366e-11  1.43734076e-03]) = 0.00000
>98 f([-8.60013947e-12  1.35588802e-03]) = 0.00000
>99 f([-6.78222208e-12  1.27905115e-03]) = 0.00000
Done!
f([-6.78222208e-12  1.27905115e-03]) = 0.000002

Visualization of AMSGrad Optimization

We can plot the progress of the AMSGrad search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the amsgrad() function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

# gradient descent algorithm with amsgrad
def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # initialize moment vectors
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    vhat = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # update variables one at a time
        for i in range(x.shape[0]):
            # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
            m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
            # vhat(t) = max(vhat(t-1), v(t))
            vhat[i] = max(vhat[i], v[i])
            # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))
            x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # keep track of all solutions
        solutions.append(x.copy())
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return solutions

We can then run the search as before, and this time retrieve the list of solutions instead of the best final solution.

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 100
# step size
alpha = 0.007
# factor for average gradient
beta1 = 0.9
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with amsgrad
solutions = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

We can then create a contour plot of the objective function, as before.

# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

# plot the sample as white circles
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying all of this together, the complete example of performing the AMSGrad optimisation on the test problem and plotting the results on a contour plot is listed below.

# example of plotting the amsgrad search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with amsgrad
def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # initialize moment vectors
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    vhat = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # update variables one at a time
        for i in range(x.shape[0]):
            # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
            m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
            # vhat(t) = max(vhat(t-1), v(t))
            vhat[i] = max(vhat[i], v[i])
            # x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))
            x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # keep track of all solutions
        solutions.append(x.copy())
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 100
# step size
alpha = 0.007
# factor for average gradient
beta1 = 0.9
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with amsgrad
solutions = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the sample as white circles
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()

Running the example performs the search as before, except in this case the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the centre of the plot.

Further Reading

This section provides additional resources on the topic if you are looking to go deeper.

Papers

  • On the Convergence of Adam and Beyond, 2018.
  • An Overview of Gradient Descent Optimization Algorithms, 2016.

Books

  • Algorithms for Optimization, 2019
  • Deep Learning, 2016

APIs

  • numpy.random.rand API
  • numpy.asarray API
  • Matplotlib API

Articles

  • Gradient Descent, Wikipedia
  • Stochastic Gradient Descent, Wikipedia
  • An overview of gradient descent optimization algorithms, 2016
  • Experiments with AMSGrad, 2017

Conclusion

In this guide, you discovered how to develop gradient descent optimisation with AMSGrad from scratch.

Particularly, you learned:

  • Gradient descent is an optimisation algorithm that leverages the gradient of the objective function to navigate the search space.
  • AMSGrad is an extension of the Adam version of gradient descent developed to accelerate the optimization procedure.
  • How to implement the AMSGrad optimisation algorithm from the ground up, apply it to an objective function, and evaluate the outcomes.