
Gradient Descent with AdaGrad from the ground up

Gradient descent is an optimisation algorithm that follows the negative gradient of an objective function to locate the minimum of that function.

A limitation of gradient descent is that it uses the same step size (learning rate) for every input variable. This can be a problem on objective functions that have different amounts of curvature in different dimensions, and that in turn may require a different sized step to reach a new point.

Adaptive Gradient, or AdaGrad for short, is an extension of the gradient descent optimisation algorithm that allows the step size in each dimension to be automatically adapted based on the gradients (partial derivatives) observed for that variable over the course of the search.

In this guide, you will find out how to develop the gradient descent with adaptive gradients optimisation algorithm from the ground up.

After going through this guide, you will know:

  • Gradient descent is an optimisation algorithm that leverages the gradient of the objective function to navigate the search space.
  • Gradient descent can be updated to leverage an automatically adaptive step size for every input variable within the objective function, referred to as Adaptive Gradients or AdaGrad.
  • How to implement the AdaGrad optimisation algorithm from the ground up, apply it to an objective function, and evaluate the outcomes.

Tutorial Overview

This tutorial is divided into three parts, which are:

1. Gradient Descent

2. Adaptive Gradient (AdaGrad)

3. Gradient Descent with AdaGrad

  • 2D Test Problem
  • Gradient Descent Optimisation with AdaGrad
  • Visualization of AdaGrad

Gradient Descent

Gradient descent is, in essence, an optimisation algorithm.

It is technically referred to as a first-order optimisation algorithm, as it explicitly leverages the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum.

The first-order derivative, or simply the derivative, is the rate of change or slope of the target function at a particular point, that is, for a particular input.

If the target function takes several input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is typically referred to as the “gradient”.

  • Gradient: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a particular input.

Gradient descent refers to a minimisation optimisation algorithm that follows the negative of the gradient downhill on the target function to locate the minimum of the function.

The gradient descent algorithm needs a target function that is being optimised and the derivative function for that objective function. The target function f() returns a score for a provided set of inputs, and the derivative function f’() provides the derivative of the target function for a provided set of inputs.

The gradient descent algorithm needs a starting point (x) in the problem, such as a randomly chosen point within the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimising the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (referred to as alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, making sure we move against the gradient, or down the target function.

  • x = x - step_size * f'(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

  • Step Size (alpha): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
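
To make this update rule concrete, below is a minimal sketch of plain gradient descent on a one-dimensional function, assuming f(x) = x^2 with derivative f'(x) = 2x; the starting point, learning rate, and iteration count are illustrative values, not ones used later in this tutorial.

# minimal gradient descent sketch for f(x) = x^2 (illustrative values only)
def f_prime(x):
    return 2.0 * x

x = 1.0          # starting point
step_size = 0.1  # fixed learning rate
for i in range(20):
    # move against the gradient
    x = x - step_size * f_prime(x)
print(x)  # approaches the minimum at x = 0.0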

Now that we are acquainted with the gradient descent optimisation algorithm, let’s take a peek at AdaGrad.

Adaptive Gradient (AdaGrad)

The Adaptive Gradient algorithm, or AdaGrad for short, is an extension of the gradient descent optimisation algorithm.

The algorithm was described by John Duchi, et al. in their 2011 paper titled “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.”

It is designed to accelerate the optimisation process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimisation algorithm, e.g. result in a better final outcome.

The parameters with the biggest partial derivative of the loss possess a correspondingly rapid decrease in their learning rate, whereas parameters with small partial derivatives have a comparatively small decrease in their learning rate.

An issue with the gradient descent algorithm is that the step size (learning rate) is the same for every variable or dimension in the search space. It is possible that better performance can be achieved using a step size that is tailored to each variable, allowing larger movements in dimensions with a consistently steep gradient and smaller movements in dimensions with less steep gradients.

AdaGrad is designed specifically to explore the idea of automatically tailoring the step size for each dimension in the search space.

AdaGrad, which is an adaptive subgradient methodology, adapts a learning rate for every component of x.

This is accomplished by first calculating a step size for a given dimension, then using the calculated step size to make a movement in that dimension using the partial derivative. This process is then repeated for each dimension in the search space.

AdaGrad dulls the influence of parameters with constantly high gradients, therefore increasing the impact of parameters with infrequent updates.

AdaGrad is suited to objective functions where the curvature of the search space differs across dimensions, allowing a more effective optimisation given the customisation of the step size in each dimension.

The algorithm requires that you set an initial step size for all input variables as usual, such as 0.1 or 0.001. However, a benefit of the algorithm is that it is not as sensitive to the initial learning rate as the plain gradient descent algorithm.

AdaGrad is a lot less sensitive to the learning rate parameter alpha. The learning rate parameter is usually set to a default value of 0.01.

An internal variable is then maintained for each input variable that is the sum of the squared partial derivatives observed for that variable during the search.

This sum of the squared partial derivatives is then used to calculate the step size for the variable, by dividing the initial step size value (e.g. the hyperparameter value specified at the start of the run) by the square root of the sum of the squared partial derivatives.

  • cust_step_size = step_size / sqrt(s)

It is possible for the square root of the sum of squared partial derivatives to result in a value of 0.0, leading to a division by zero error. Therefore, a tiny value can be added to the denominator to avoid this possibility, such as 1e-8.

  • cust_step_size = step_size / (1e-8 + sqrt(s))

Where cust_step_size is the calculated step size for an input variable for the provided point in the search, step_size is the initial step size, sqrt() is the square root operation, and s is the total of the squared partial derivatives for the input variable observed over the course of the search thus far.

The custom step size is then leveraged to calculate the value for the variable in the next point or solution in the search.

  • x(t+1) = x(t) - cust_step_size * f'(x(t))

This process is then repeated for each input variable until a new point in the search space is created and can be evaluated.

Importantly, the partial derivative for the current solution (iteration of the search) is included in the sum of the squared partial derivatives.

We could maintain an array of partial derivatives or squared partial derivatives for each input variable, but this is not necessary. Instead, we simply maintain the sum of the squared partial derivatives and add new values to this sum along the way.
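
As a quick illustration of how this running sum shrinks the step size over time, here is a small sketch for a single variable, using made-up partial derivative values rather than anything computed in the example that follows:

# sketch of the per-variable AdaGrad update for one variable (made-up gradients)
from math import sqrt

step_size = 0.1  # initial learning rate
s = 0.0          # running sum of squared partial derivatives
x = 1.0          # current value of the variable
# suppose three iterations observe partial derivatives of 2.0, 1.5 and 1.0
for grad in [2.0, 1.5, 1.0]:
    s += grad**2.0                        # accumulate the squared partial derivative
    alpha = step_size / (1e-8 + sqrt(s))  # custom step size shrinks as s grows
    x = x - alpha * grad                  # update this variable
    print('s=%.2f, alpha=%.4f, x=%.4f' % (s, alpha, x))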

Now that we are acquainted with the AdaGrad algorithm, let’s look into how we might go about implementing it and assess its performance.

Gradient Descent with AdaGrad

In this part of the blog post, we will explore how to implement the gradient descent optimisation algorithm with adaptive gradients.

2D Test Problem

First, let's define an objective function.

We will use a simple 2D function that squares the input of each dimension, and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

 

We can create a 3D plot of the function to get a feel for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

 

Running the example creates a 3D surface plot of the objective function.

We can see the familiar bowl shape with the global minimum at f(0, 0) = 0.

We can also create a 2D plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

 

Running the example creates a 2D contour plot of the objective function.

We can see the bowl shape compressed into contours shown with a colour gradient. We will use this plot to plot the specific points explored during the search.

Now that we have a test objective function, let's look at how we might implement the AdaGrad optimisation algorithm.

Gradient Descent Optimisation with AdaGrad

We can apply the gradient descent with adaptive gradients algorithm to the test problem.

First, we need a function that calculates the derivative of this function.

  • f(x) = x^2
  • f’(x) = x*2

The derivative of x^2 is x*2 in every dimension.

The derivative() function implements this below:

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
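
If you want to sanity-check a hand-derived gradient like this one, a central finite-difference approximation is a handy test; the numerical_gradient() helper below is a hypothetical addition for illustration, not part of the original example.

# finite-difference check of the analytical gradient (illustrative helper)
def numerical_gradient(f, x, y, eps=1e-6):
    dx = (f(x + eps, y) - f(x - eps, y)) / (2.0 * eps)
    dy = (f(x, y + eps) - f(x, y - eps)) / (2.0 * eps)
    return [dx, dy]

# should be close to derivative(0.5, -0.25), i.e. [1.0, -0.5]
print(numerical_gradient(objective, 0.5, -0.25))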

 

Then, we can implement gradient descent with adaptive gradients.

First, we can select a random point within the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search, with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of that dimension.

# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

 

Next, we need to initialise the sum of the squared partial derivatives for each dimension to 0.0.

# list of the sum square gradients for each variable
sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]

 

We can then run a fixed number of iterations of the search optimisation algorithm defined by an “n_iter” hyperparameter.

 

# run the gradient descent
for it in range(n_iter):
    ...

 

The first step is to calculate the gradient for the current solution using the derivative() function.

 

# calculate gradient
gradient = derivative(solution[0], solution[1])

 

We then need to calculate the square of the partial derivative for each variable and add it to the running sum of these values.

 

# update the sum of the squared partial derivatives
for i in range(gradient.shape[0]):
    sq_grad_sums[i] += gradient[i]**2.0

 

We can then use the sum of the squared partial derivatives and the gradient to calculate the next point.

We will do this one variable at a time: first calculating the step size for the variable, then the new value for that variable. These values are built up in an array until we have a completely new solution that is in the steepest descent direction from the current point, using the custom step sizes.

 

# build a solution one variable at a time
new_solution = list()
for i in range(solution.shape[0]):
    # calculate the step size for this variable
    alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
    # calculate the new position in this variable
    value = solution[i] - alpha * gradient[i]
    # store this variable
    new_solution.append(value)

 

This new solution can then be evaluated using the objective() function and the performance of the search can be reported.

 

# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

 

And we’re done.

We can tie all of this together into a function named adagrad() that takes the names of the objective function and the derivative function, an array with the bounds of the domain, and hyperparameter values for the total number of algorithm iterations and the initial learning rate, and returns the final solution and its evaluation.

The complete function is listed below.

 

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the sum square gradients for each variable
    sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the sum of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sq_grad_sums[i] += gradient[i]**2.0
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]

 

We have intentionally used lists and an imperative coding style instead of vectorised operations for readability. You can adapt the implementation to a vectorised implementation with NumPy arrays for better performance, as sketched below.
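
To give a sense of what that might look like, here is one possible vectorised sketch of the inner update step using NumPy array operations; the adagrad_step() helper and its example values are hypothetical, not part of the tutorial's code.

# one possible vectorised form of the AdaGrad update step (sketch)
from numpy import asarray, sqrt

def adagrad_step(solution, gradient, sq_grad_sums, step_size):
    # accumulate squared partial derivatives for all variables at once
    sq_grad_sums = sq_grad_sums + gradient**2.0
    # per-variable step sizes computed in a single expression
    alpha = step_size / (1e-8 + sqrt(sq_grad_sums))
    # move every variable against its partial derivative
    return solution - alpha * gradient, sq_grad_sums

# example usage with arbitrary values
solution = asarray([0.5, -0.3])
gradient = asarray([1.0, -0.6])
sq_sums = asarray([0.0, 0.0])
solution, sq_sums = adagrad_step(solution, gradient, sq_sums, 0.1)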

We can then define our hyperparameters and call the adagrad() function to optimise our test objective function.

In this case, we will use 50 iterations of the algorithm and an initial learning rate of 0.1, both chosen after a little trial and error.

 

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search with adagrad
best, score = adagrad(objective, derivative, bounds, n_iter, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))

 

Tying this together, the complete example of gradient descent optimisation with adaptive gradients is listed below.

 

# gradient descent optimization with adagrad for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the sum square gradients for each variable
    sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the sum of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sq_grad_sums[i] += gradient[i]**2.0
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search with adagrad
best, score = adagrad(objective, derivative, bounds, n_iter, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))

 

Running the example applies the AdaGrad optimisation algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 35 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

 


>0 f([-0.06595599  0.34064899]) = 0.12039

>1 f([-0.02902286  0.27948766]) = 0.07896

>2 f([-0.0129815   0.23463749]) = 0.05522

>3 f([-0.00582483  0.1993997 ]) = 0.03979

>4 f([-0.00261527  0.17071256]) = 0.02915

>5 f([-0.00117437  0.14686138]) = 0.02157

>6 f([-0.00052736  0.12676134]) = 0.01607

>7 f([-0.00023681  0.10966762]) = 0.01203

>8 f([-0.00010634  0.09503809]) = 0.00903

>9 f([-4.77542704e-05  8.24607972e-02]) = 0.00680

>10 f([-2.14444463e-05  7.16123835e-02]) = 0.00513

>11 f([-9.62980437e-06  6.22327049e-02]) = 0.00387

>12 f([-4.32434258e-06  5.41085063e-02]) = 0.00293

>13 f([-1.94188148e-06  4.70624414e-02]) = 0.00221

>14 f([-8.72017797e-07  4.09453989e-02]) = 0.00168

>15 f([-3.91586740e-07  3.56309531e-02]) = 0.00127

>16 f([-1.75845235e-07  3.10112252e-02]) = 0.00096

>17 f([-7.89647442e-08  2.69937139e-02]) = 0.00073

>18 f([-3.54597657e-08  2.34988084e-02]) = 0.00055

>19 f([-1.59234984e-08  2.04577993e-02]) = 0.00042

>20 f([-7.15057749e-09  1.78112581e-02]) = 0.00032

>21 f([-3.21102543e-09  1.55077005e-02]) = 0.00024

>22 f([-1.44193729e-09  1.35024688e-02]) = 0.00018

>23 f([-6.47513760e-10  1.17567908e-02]) = 0.00014

>24 f([-2.90771361e-10  1.02369798e-02]) = 0.00010

>25 f([-1.30573263e-10  8.91375193e-03]) = 0.00008

>26 f([-5.86349941e-11  7.76164047e-03]) = 0.00006

>27 f([-2.63305247e-11  6.75849105e-03]) = 0.00005

>28 f([-1.18239380e-11  5.88502652e-03]) = 0.00003

>29 f([-5.30963626e-12  5.12447017e-03]) = 0.00003

>30 f([-2.38433568e-12  4.46221948e-03]) = 0.00002

>31 f([-1.07070548e-12  3.88556303e-03]) = 0.00002

>32 f([-4.80809073e-13  3.38343471e-03]) = 0.00001

>33 f([-2.15911255e-13  2.94620023e-03]) = 0.00001

>34 f([-9.69567190e-14  2.56547145e-03]) = 0.00001

>35 f([-4.35392094e-14  2.23394494e-03]) = 0.00000

>36 f([-1.95516389e-14  1.94526160e-03]) = 0.00000

>37 f([-8.77982370e-15  1.69388439e-03]) = 0.00000

>38 f([-3.94265180e-15  1.47499203e-03]) = 0.00000

>39 f([-1.77048011e-15  1.28438640e-03]) = 0.00000

>40 f([-7.95048604e-16  1.11841198e-03]) = 0.00000

>41 f([-3.57023093e-16  9.73885702e-04]) = 0.00000

>42 f([-1.60324146e-16  8.48035867e-04]) = 0.00000

>43 f([-7.19948720e-17  7.38448972e-04]) = 0.00000

>44 f([-3.23298874e-17  6.43023418e-04]) = 0.00000

>45 f([-1.45180009e-17  5.59929193e-04]) = 0.00000

>46 f([-6.51942732e-18  4.87572776e-04]) = 0.00000

>47 f([-2.92760228e-18  4.24566574e-04]) = 0.00000

>48 f([-1.31466380e-18  3.69702307e-04]) = 0.00000

>49 f([-5.90360555e-19  3.21927835e-04]) = 0.00000

Done!

f([-5.90360555e-19  3.21927835e-04]) = 0.000000

 

Visualization of AdaGrad

We can plot the progress of the search over the iterations of the algorithm. This can provide an intuition for the progression of the search.

We must update the adagrad() function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

 

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
    # track all solutions
    solutions = list()
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the sum square gradients for each variable
    sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the sum of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sq_grad_sums[i] += gradient[i]**2.0
        # build solution
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the learning rate for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions

 

We can then run the search as before, this time retrieving the list of solutions rather than the best final solution.

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search
solutions = adagrad(objective, derivative, bounds, n_iter, step_size)

 

We can then create a contour plot of the objective function, as before.

 

# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

 

Lastly, we can plot each solution found during the search as a white dot connected by a line.

 

# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

 

Tying this together, the complete example of performing the AdaGrad optimisation on the test problem and plotting the results on a contour plot is listed below.

 

# example of plotting the adagrad search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
    # track all solutions
    solutions = list()
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the sum square gradients for each variable
    sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the sum of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sq_grad_sums[i] += gradient[i]**2.0
        # build solution
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the learning rate for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search
solutions = adagrad(objective, derivative, bounds, n_iter, step_size)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()

 

Running the example performs the search as before, except in this case a contour plot of the objective function is created and a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the centre of the plot.

Further Reading

This section provides additional resources on the topic if you're looking to delve deeper.

Papers

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011

Books

Algorithms for Optimization, 2019

Deep Learning, 2016

APIs

numpy.random.rand API

numpy.asarray API

Matplotlib API

Articles

Gradient Descent, Wikipedia

Stochastic gradient descent, Wikipedia

An overview of gradient descent optimization algorithms, 2016

Conclusion

In this guide, you discovered how to develop the gradient descent with adaptive gradients optimisation algorithm from the ground up.

Specifically, you learned:

  • Gradient descent is an optimisation algorithm that leverages the gradient of the objective function to navigate the search space.
  • Gradient descent can be updated to leverage an automatically adaptive step size for every input variable in the objective function, referred to as adaptive gradients or AdaGrad.
  • How to implement the AdaGrad optimisation algorithm from the ground up, apply it to an objective function, and evaluate the outcome.