Gradient descent can be defined as an optimisation algorithm that adheres to the negative gradient of an objective function to situate the minimum of that function.

A restriction of gradient descent is that it leverages the same step size (learning rate) for every input variable. This can be an issue on objective functions that possess differing amounts of curvature in differing dimensions, and in turn, might need a different sized step to a new point.

Adaptive Gradient or AdaGrad for short, is an extension of the gradient descent optimization algorithm that enables the step size in every dimension leveraged by the optimisation algorithm to be automatically adapted on the basis of the gradients observed for the variable (partial derivatives) observed over the duration of the search.

In this guide, you will find out how to develop the gradient descent with adaptive gradients optimisation algorithm from the ground up.

After going through this guide, you will be aware of:

• Gradient descent being an optimisation algorithm that leverages the gradient of the objective function to navigate the search space.
• How to go about implementing the AdaGrad optimisation algorithm from the ground up and apply it to an objective function and assess the outcomes.

Tutorial Summarization

This tutorial is subdivide into three portions, which are:

• 2D Test Problem

Gradient Descent is in essence, an optimisation algorithm.

It is technically referenced to as a first-order optimisation algorithm as it overtly leverages the first order derivative of the targeted objective function.

First-order methods are reliant on gradient data to assist in directing the search for a minimum.

The first-order derivative, or merely the derivative, is the rate of change or slope of the target function at a particular point, for instance, for a particular input.

If the targeted function takes several input variables, it is referenced to as a multivariate function and the input variables can be perceived of as a vector. In turn, the derivative of a multivariate targeted function might also be taken as a vector and is referenced to typically as the “gradient”.

• Gradient: First-order derivative for a multivariate objective function.

The derivative or the gradient indicates in the direction of the steepest ascent of the targeted function for a particular input.

Gradient descent is in reference to a minimization optimisation algorithm that follows the negative of the gradient downhill of the targeted function to situate the minimum of the function.

The gradient descent algorithm needs a targeted function that is being optimised and the derivative function for the objective function. The targeted function f() returns a score for a provided grouping of networks, and the derivative function f’() provides the derivative of the targeted function for a provided grouping of inputs.

The gradient descent algorithm needs a beginning point (x) in the problem, like an arbitrarily chosen point within the input space.

The derivative is then calculated and a step is taken within the input space that is predicted to have the outcome of a downhill movement in the targeted function, assuming we are minimizing the targeted function.

A downhill movement is made by initially calculating how far to shift in the input space, calculated as the step size (referred to as alpha or the learning rate) multiplied by the gradient. This is then removed from the present point, making sure we move against the gradient, or down the targeted function.

• x = x – step_size * f’(x)

The steeper the objective function at a specific point, the bigger the magnitude of the gradient, and in turn, the bigger the step undertaken within the search space. The size of the step taken is scaled leveraging a step size hyperparameter.

• Step Size (alpha): Hyperparameter that manages how far to shift within the search space against the gradient every iteration of the algorithm.

If the step size is too small, the movement within the search space will be minimal and the search will take a protracted amount of time. If the step size is too big, the search might bounce around the search space and skip over the optima.

Now that we are acquainted with the gradient descent optimisation algorithm, let’s take a peek at AdaGrad.

The algorithm was detailed by John Duchi et al. in their 2011 paper entitled “Adaptive Subgradient Methods for Online Learning and Stochastic Optimisation”

It has been developed to accelerate the optimisation procedure, for example, reduce the number of function evaluations needed to attain the optima, or to enhance the capacity of the optimisation algorithm, for example, have the outcome of an improved final result.

The parameters with the biggest partial derivative of the loss possess a correspondingly rapid decrease in their learning rate, whereas parameters with small partial derivatives have a comparatively small decrease in their learning rate.

An issue with the gradient descent algorithm is that the step size (learning rate) is the same for every variable or dimension within the search space. It is possible that improved performance can be accomplished leveraging a step size that is customized to every variable, enabling larger movements in dimensions with a consistently steep gradient and smaller movements in dimensions with less steep gradients.

AdaGrad is developed to particularly explore the concept of automatically tailoring the step size for every dimension within the search space.

This is accomplished through initially calculating a step size for a provided dimension, then leveraging the calculated step size to make a movement in that specific dimension leveraging the partial derivative. This process is then rinsed and repeated for every dimension within the search space.

AdaGrad is apt for objective functions where the curvature of the search space is different in differing dimensions, enabling a more effective optimisation provided the customisation of the step size within every dimension.

The algorithm needs that you set an initial step size for all input variables as per normal, like 0.1 or 0.001, or the like. Even though, the advantage of the algorithm is that it is not as sensitive to the initial learning rate as the gradient descent algorithm.

AdaGrad is a lot less sensitive to the learning rate parameter alpha. The learning rate parameter is usually set to a default value of 0.01.

An internal variable is then maintained for every input that is the total of the squared partial derivatives for the input variable observed in the course of the search.

This total of the squared partial derivatives is then leveraged to calculate the step size for the variable through dividing the initial step size value (for example, hyperparameter value mentioned at the beginning of the run) divided by the square root of the sum of the squared partial derivatives.

• cust_step_size = step_size / sqrt(s)

It is possible for the square root of the sum of squared partial derivatives to have the outcome of a value of 0.0, resulting in a divide by zero error. Thus, a miniscule value can be added to the denominator to prevent this possibility, like 1e-8.

• cust_step_size = step_size / (1e-8 + sqrt(s))

Where cust_step_size is the calculated step size for an input variable for the provided point in the search, step_size is the initial step size, sqrt() is the square root operation, and s is the total of the squared partial derivatives for the input variable observed over the course of the search thus far.

The custom step size is then leveraged to calculate the value for the variable in the next point or solution in the search.

• x(t+1) = x(t) – cust_step_size * f’(x(t))

This process is then rinsed and repeated for every input variable till a new point in the search space is generated and can be assessed.

Critically, the partial derivative for the current solution (iteration of the search) is included in the total of the square root of partial derivatives.

We could maintain an array of partial derivatives or squared partial derivatives for every input variable, but this is not required. Rather, we merely maintain the sum of the squared partial derivatives and include new values to this sum along the way.

Now that we are acquainted with the AdaGrad algorithm, let’s look into how we might go about implementing it and assess its performance.

In this portion of this blog post, we will look into how to implement the gradient descent optimisation algorithm with adaptive gradients.

2D Test Problem

We will leverage a simple 2D function that squares the input of every dimension and define the range of relevant inputs from -1.0 to 1.0.

The objective() function below implements this function.

 123 # objective functiondef objective(x, y):return x**2.0 + y**2.0

We can develop a 3D plot of the dataset to obtain a feel for the curvature of the response surface.

The complete instance of plotting the objective function is enlisted here.

 123456789101112131415161718192021222324 # 3d plot of the test functionfrom numpy import arangefrom numpy import meshgridfrom matplotlib import pyplot # objective functiondef objective(x, y):return x**2.0 + y**2.0 # define range for inputr_min, r_max = -1.0, 1.0# sample input range uniformly at 0.1 incrementsxaxis = arange(r_min, r_max, 0.1)yaxis = arange(r_min, r_max, 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a surface plot with the jet color schemefigure = pyplot.figure()axis = figure.gca(projection=’3d’)axis.plot_surface(x, y, results, cmap=’jet’)# show the plotpyplot.show()

Executing the instance creates a 3D surface plot of the objective function.

We can observe the familiar bowl shape with the global minima at f(0, 0) = 0

We can also develop a 2D plot of the function. This will be beneficial later on when we wish to plot the progression of the search.

The instance below creates a contour plot of the objective function.

 1234567891011121314151617181920212223 # contour plot of the test functionfrom numpy import asarrayfrom numpy import arangefrom numpy import meshgridfrom matplotlib import pyplot # objective functiondef objective(x, y):return x**2.0 + y**2.0 # define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# sample input range uniformly at 0.1 incrementsxaxis = arange(bounds[0,0], bounds[0,1], 0.1)yaxis = arange(bounds[1,0], bounds[1,1], 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a filled contour plot with 50 levels and jet color schemepyplot.contourf(x, y, results, levels=50, cmap=’jet’)# show the plotpyplot.show()

Executing the instance creates a 2D contour plot of the objective function.

We can observe the bowl shape compressed to contours demonstrated with a colour gradient. We will leverage this plot to go about plotting the particular points looked into over the course of the search.

Now that we possess a test objective function, let’s observe how we might go about implementing the AdaGrad optimisation algorithm.

To start with, we require a function that calculates the derivative for this function.

• f(x) = x^2
• f’(x) = x*2

The derivative of x^2 is x*2 in every dimension.

The derivative() function implements this below:

 123 # derivative of objective functiondef derivative(x, y):return asarray([x * 2.0, y * 2.0])

To start with, we can choose an arbitrary point within the bounds of the problem as a beginning point for the search.

This makes the assumption we possess an array that defines the bounds of the search with a single row for every dimension and the first column defines the minimum and the second column gives definition to the maximum of that dimension.

 123 …# generate an initial pointsolution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])

Then, we are required to initialize the sum of the squared partial derivatives for every dimension to 0.0 values.

 123 …# list of the sum square gradients for each variablesq_grad_sums = [0.0 for _ in range(bounds.shape[0])]

We can then enumerate a static number of iterations of the search optimisation algorithm defined by a “n_iter” hyperparameter.

 1234 …# run the gradient descentfor it in range(n_iter):…

The first step is to calculate the gradient for the present solution leveraging the derivative() function.

We are then required to calculate the square of the partial derivative of every variable and add them to the running sum of these values.

We can then leverage the sum squared partial derivatives and gradient to calculate the next point.

We will perform this this a single variable at a time, to start with, calculating the step size for the variable, then the new value for that variable. These values are built up in an array till we possess a totally new solution that is in the steepest descent direction from the current point leveraging the custom step sizes.

 12345678910 …# build a solution one variable at a timenew_solution = list()for i in range(solution.shape[0]):# calculate the step size for this variablealpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))# calculate the new position in this variablevalue = solution[i] – alpha * gradient[i]# store this variablenew_solution.append(value)

This new solution can then be assessed leveraging the objective() function and the performance of the search can then be reported.

 123456 …# evaluate candidate pointsolution = asarray(new_solution)solution_eval = objective(solution[0], solution[1])# report progressprint(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval))

And we’re done.

We can connect all of this together into a function referred to as adagrad() that takes the names of the objective function and the derivative function, and array possessing the bounds of the domain, and hyperparameter values for the total number of algorithm iterations and the initial learning rate, and returns the final solution and its evaluation.

The complete function is detailed here.

 12345678910111213141516171819202122232425262728 # gradient descent algorithm with adagraddef adagrad(objective, derivative, bounds, n_iter, step_size):# generate an initial pointsolution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])# list of the sum square gradients for each variablesq_grad_sums = [0.0 for _ in range(bounds.shape[0])]# run the gradient descentfor it in range(n_iter):# calculate gradientgradient = derivative(solution[0], solution[1])# update the sum of the squared partial derivativesfor i in range(gradient.shape[0]):sq_grad_sums[i] += gradient[i]**2.0# build a solution one variable at a timenew_solution = list()for i in range(solution.shape[0]):# calculate the step size for this variablealpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))# calculate the new position in this variablevalue = solution[i] – alpha * gradient[i]# store this variablenew_solution.append(value)# evaluate candidate pointsolution = asarray(new_solution)solution_eval = objective(solution[0], solution[1])# report progressprint(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval))return [solution, solution_eval]

We have intentionally leveraged lists and imperative coding style rather than vectorized operations with regards to readability. You can adapt the implementation to a vectorized implementation with NumPy arrays for improved performance.

We can then define our hyperparameters and refer to the adagrad() function in optimizing our test objective function.

In this scenario, we will leverage fifty iterations of the algorithm and an initial learning rate of 0.1, both selected following a bit of trial and error.

 12345678910111213 …# seed the pseudo random number generatorseed(1)# define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# define the total iterationsn_iter = 50# define the step sizestep_size = 0.1# perform the gradient descent search with adagradbest, score = adagrad(objective, derivative, bounds, n_iter, step_size)print(‘Done!’)print(‘f(%s) = %f’ % (best, score))

Connecting all of this info together, the complete example of gradient descent optimisation with adaptive gradients is detailed below.

Executing the instance applies the AdaGrad optimisation algorithm to our test problem and reports the performance of the search for every iteration of the algorithm.

Your outcomes may demonstrate variance provided the stochastic nature of the algorithm or evaluation procedure, or differences in numerical accuracy. Take up executing the instance a few times and contrast the average outcome.

In this scenario, we can observe that a near optimal solution was identified following probably 35 iterations of the search, with input values near 0.0 and 0.0 evaluating to 0.0.

 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152 >0 f([-0.06595599  0.34064899]) = 0.12039>1 f([-0.02902286  0.27948766]) = 0.07896>2 f([-0.0129815   0.23463749]) = 0.05522>3 f([-0.00582483  0.1993997 ]) = 0.03979>4 f([-0.00261527  0.17071256]) = 0.02915>5 f([-0.00117437  0.14686138]) = 0.02157>6 f([-0.00052736  0.12676134]) = 0.01607>7 f([-0.00023681  0.10966762]) = 0.01203>8 f([-0.00010634  0.09503809]) = 0.00903>9 f([-4.77542704e-05  8.24607972e-02]) = 0.00680>10 f([-2.14444463e-05  7.16123835e-02]) = 0.00513>11 f([-9.62980437e-06  6.22327049e-02]) = 0.00387>12 f([-4.32434258e-06  5.41085063e-02]) = 0.00293>13 f([-1.94188148e-06  4.70624414e-02]) = 0.00221>14 f([-8.72017797e-07  4.09453989e-02]) = 0.00168>15 f([-3.91586740e-07  3.56309531e-02]) = 0.00127>16 f([-1.75845235e-07  3.10112252e-02]) = 0.00096>17 f([-7.89647442e-08  2.69937139e-02]) = 0.00073>18 f([-3.54597657e-08  2.34988084e-02]) = 0.00055>19 f([-1.59234984e-08  2.04577993e-02]) = 0.00042>20 f([-7.15057749e-09  1.78112581e-02]) = 0.00032>21 f([-3.21102543e-09  1.55077005e-02]) = 0.00024>22 f([-1.44193729e-09  1.35024688e-02]) = 0.00018>23 f([-6.47513760e-10  1.17567908e-02]) = 0.00014>24 f([-2.90771361e-10  1.02369798e-02]) = 0.00010>25 f([-1.30573263e-10  8.91375193e-03]) = 0.00008>26 f([-5.86349941e-11  7.76164047e-03]) = 0.00006>27 f([-2.63305247e-11  6.75849105e-03]) = 0.00005>28 f([-1.18239380e-11  5.88502652e-03]) = 0.00003>29 f([-5.30963626e-12  5.12447017e-03]) = 0.00003>30 f([-2.38433568e-12  4.46221948e-03]) = 0.00002>31 f([-1.07070548e-12  3.88556303e-03]) = 0.00002>32 f([-4.80809073e-13  3.38343471e-03]) = 0.00001>33 f([-2.15911255e-13  2.94620023e-03]) = 0.00001>34 f([-9.69567190e-14  2.56547145e-03]) = 0.00001>35 f([-4.35392094e-14  2.23394494e-03]) = 0.00000>36 f([-1.95516389e-14  1.94526160e-03]) = 0.00000>37 f([-8.77982370e-15  1.69388439e-03]) = 0.00000>38 f([-3.94265180e-15  1.47499203e-03]) = 0.00000>39 f([-1.77048011e-15  1.28438640e-03]) = 0.00000>40 f([-7.95048604e-16  1.11841198e-03]) = 0.00000>41 f([-3.57023093e-16  9.73885702e-04]) = 0.00000>42 f([-1.60324146e-16  8.48035867e-04]) = 0.00000>43 f([-7.19948720e-17  7.38448972e-04]) = 0.00000>44 f([-3.23298874e-17  6.43023418e-04]) = 0.00000>45 f([-1.45180009e-17  5.59929193e-04]) = 0.00000>46 f([-6.51942732e-18  4.87572776e-04]) = 0.00000>47 f([-2.92760228e-18  4.24566574e-04]) = 0.00000>48 f([-1.31466380e-18  3.69702307e-04]) = 0.00000>49 f([-5.90360555e-19  3.21927835e-04]) = 0.00000Done!f([-5.90360555e-19  3.21927835e-04]) = 0.000000

We can plot the progress of the search over the iterations of the algorithm.

This can furnish an intuition for the progression of the search over the iterations of the algorithm.

We must update the adagrad() function to maintain a list of all solutions discovered during the search, then return this list at the conclusion of the search.

The updated variant of the function with these alterations is detailed here.

 12345678910111213141516171819202122232425262728293031 # gradient descent algorithm with adagraddef adagrad(objective, derivative, bounds, n_iter, step_size):# track all solutionssolutions = list()# generate an initial pointsolution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])# list of the sum square gradients for each variablesq_grad_sums = [0.0 for _ in range(bounds.shape[0])]# run the gradient descentfor it in range(n_iter):# calculate gradientgradient = derivative(solution[0], solution[1])# update the sum of the squared partial derivativesfor i in range(gradient.shape[0]):sq_grad_sums[i] += gradient[i]**2.0# build solutionnew_solution = list()for i in range(solution.shape[0]):# calculate the learning rate for this variablealpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))# calculate the new position in this variablevalue = solution[i] – alpha * gradient[i]new_solution.append(value)# store the new solutionsolution = asarray(new_solution)solutions.append(solution)# evaluate candidate pointsolution_eval = objective(solution[0], solution[1])# report progressprint(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval))return solutions

We can then carry out the search as prior, and this time retrieve the listing of solutions rather than the ideal final solution.

 1234567891011 …# seed the pseudo random number generatorseed(1)# define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# define the total iterationsn_iter = 50# define the step sizestep_size = 0.1# perform the gradient descent searchsolutions = adagrad(objective, derivative, bounds, n_iter, step_size)

We can then develop a contour plot of the objective function, as prior.

 12345678910 …# sample input range uniformly at 0.1 incrementsxaxis = arange(bounds[0,0], bounds[0,1], 0.1)yaxis = arange(bounds[1,0], bounds[1,1], 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a filled contour plot with 50 levels and jet color schemepyplot.contourf(x, y, results, levels=50, cmap=’jet’)

Lastly, we can plot every solution during the search as a white dot connected by a line.

 1234 …# plot the sample as black circlessolutions = asarray(solutions)pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’)

Connecting all of this together, the complete instance of performing the AdaGrad optimisation on the test problem and plotting the outcomes on a contour plot is detailed below.

 1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374 # example of plotting the adagrad search on a contour plot of the test functionfrom math import sqrtfrom numpy import asarrayfrom numpy import arangefrom numpy.random import randfrom numpy.random import seedfrom numpy import meshgridfrom matplotlib import pyplotfrom mpl_toolkits.mplot3d import Axes3D # objective functiondef objective(x, y):return x**2.0 + y**2.0 # derivative of objective functiondef derivative(x, y):return asarray([x * 2.0, y * 2.0]) # gradient descent algorithm with adagraddef adagrad(objective, derivative, bounds, n_iter, step_size):# track all solutionssolutions = list()# generate an initial pointsolution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])# list of the sum square gradients for each variablesq_grad_sums = [0.0 for _ in range(bounds.shape[0])]# run the gradient descentfor it in range(n_iter):# calculate gradientgradient = derivative(solution[0], solution[1])# update the sum of the squared partial derivativesfor i in range(gradient.shape[0]):sq_grad_sums[i] += gradient[i]**2.0# build solutionnew_solution = list()for i in range(solution.shape[0]):# calculate the learning rate for this variablealpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))# calculate the new position in this variablevalue = solution[i] – alpha * gradient[i]new_solution.append(value)# store the new solutionsolution = asarray(new_solution)solutions.append(solution)# evaluate candidate pointsolution_eval = objective(solution[0], solution[1])# report progressprint(‘>%d f(%s) = %.5f’ % (it, solution, solution_eval))return solutions # seed the pseudo random number generatorseed(1)# define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# define the total iterationsn_iter = 50# define the step sizestep_size = 0.1# perform the gradient descent searchsolutions = adagrad(objective, derivative, bounds, n_iter, step_size)# sample input range uniformly at 0.1 incrementsxaxis = arange(bounds[0,0], bounds[0,1], 0.1)yaxis = arange(bounds[1,0], bounds[1,1], 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a filled contour plot with 50 levels and jet color schemepyplot.contourf(x, y, results, levels=50, cmap=’jet’)# plot the sample as black circlessolutions = asarray(solutions)pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’)# show the plotpyplot.show()

Executing the instance carries out the search as before, but not in this scenario, a contour plot of the objective function is developed and a white dot is displayed for every solution found during the search, beginning above the optima and progressively getting closer to the optima at the centre of the plot.

This portion of this blog post details additional resources on the subject if your looking to delve deeper.

Papers

Books

Algorithms for Optimisation, 2019

Deep Learning, 2016

APIs

numpy.random.rand API

numpy.asarray API

Matplotlib API

Articles