### Gradient Descent Optimisation with AdaMax From the ground up

Gradient descent is an optimisation algorithm that follows the negative gradient of an objective function in order to situate the minimum of that function.

A restriction of gradient descent is that singular step size (learning rate) is leveraged for all input variables. Extensions to gradient descent, such as the Adaptive Movement Estimation (Adam) algorithm, leverages a independent step size for every input variable but may have the outcome of a step size that swiftly decreases to a very small values.

AdaMax is an extension to the Adam variant of gradient descent that goes about generalizing the approach to the infinite norm (max) and might have the outcome of a more effective optimisation on a few problems.

In this guide, you will find out about how to develop gradient descent optimisation with AdaMax from the ground up.

After going through this guide, you will be aware of:

- Gradient descent is an optimisation algorithm that leverages the gradient of the objective function to go about navigating the search space.
- AdaMax is an extension of the Adam variant of gradient descent developed to accelerate the optimisation procedure.
- How to implement the AdaMax optimisation from the ground up and apply it to an objective function and assess the outcomes.

__Tutorial Summarization__

This guide is subdivided into three portions, which are:

1] Gradient Descent

2] AdaMax Optimisation Algorithm

3] Gradient Descent with AdaMax

- 2D Test Problem
- Gradient Descent Optimisation with AdaMax
- Visualization of AdaMax optimisation

__Gradient Descent__

Gradient descent is an optimisation algorithm.

It is technically referred to as a first-order optimisation algorithm as it overtly leverages the first-order derivative of the target objective function.

First order methods are reliant on gradient data to assist in directing the search for a minimum.

The first-order derivative, or merely the ‘derivative’ is the rate of change or slope of the largest function at a particular point, e.g., for particular input.

If the targeted function takes several input variables, it is referenced to as a multivariate function and the input variables can be viewed of as vector. In turn, the derivative of a multivariate target function might also be taken as a vector and is generally referenced to as the gradient.

Gradient: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a particular input.

Gradient descent is a reference to a minimization optimisation algorithm that follows the negative of the gradient downhill of the target function to situate the minimum of the function.

The gradient descent algorithm needs a target function that is being optimised and the derivative function for the objective function. The targeted function (f) gives back a score for a provided grouping of inputs, and the derivative function f’() provides the derivative of the target function for a provided grouping of inputs.

The gradient descent algorithm needs a beginning point (x) in the issue, like the arbitrarily chosen point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to have the outcome of a downhill movement in the target function, under the assumption we are minimising the target function.

A downhill movement is made by initially calculating how far to shift in the input space, calculated as the step size (referred to as Alpha or the learning rate) multiplied by the gradient. This is then subtracted from the present point, ensuring we shift against the gradient, or down the target function.

x(t) = x(t-1) – step_size * f’(x(t))

The steeper the objective function at a particular juncture, the bigger the magnitude of the gradient, and in turn, the bigger the step taken in the search space. The size of the step taken is scaled leveraging a step size hyperparameter.

Step Size: Hyperparameter that handles how far to shift in the search space against the gradient at every iteration in the algorithm.

If the step size is too small, the shift in the search space will be minimal and the search will take a protracted time. If the step size is too big, the search might bounce around the search space and skip over the optima.

Now that we are acquainted with the gradient descent optimisation algorithm, let’s observe the AdaMax algorithm.

__AdaMax Optimisation Algorithm__

AdaMax algorithm is an extension to the Adaptive Movement Estimation (Adam) optimisation algorithm. More widely, is an extension to the Gradient Descent optimisation algorithm.

The algorithm was detailed in the 2014 research paper by Diederik Kingma and Jimmy Lei Ba titled “Adam: A method for Stochastic Optimisation”

Adam can be comprehended as updating weights inversely proportional to the scaled L2 norm (squared) of historical gradients. AdaMax extends this to the so-called infinite norm (max) of historical gradients.

Within Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) L^2 norm of their individual current and historical gradients.

Typically, AdaMax automatically adapts a separate step size for every parameter within the optimisation issue.

To start with, we must upkeep a moment vector and exponentially weighted infinity norm for every parameter being optimised as portion of the search, referred to as m and u respectively.

They are initialized to 0.0 at the beginning of the search.

- m=0
- u=0

The algorithm is executed iteratively over time t beginning at t=1, and every iteration includes calculating a new group of parameter values x, for example, going from x(t-1) to x(t)

It is probably simple to comprehend the algorithm if we concentrate on updating a single parameter, which generalizes to updating all parameters through vector operations.

To start with, the gradient (partial derivatives) are quantified for the present time step.

- g(t)=f’(x(t-1))

Then, the moment vector is updated leveraging the gradient and a hyperparameter beta1.

- m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)

The exponentially weighted infinity norm is updated leveraging the beta2 hyperparameter.

- u(t) = max (beta2 * u(t-1), abs(g(t)))

Where max () chooses the maximum of the parameters and abs () quantifies the absolute value.

We can then go about updating the parameter value. This can be broken down into three parts, the first quantifies the step size parameter, the second the gradient, and the third leverages the step size and gradient to quantify the new parameter value.

Let’s begin with calculating the step size for the parameter leveraging an initial step size hyperparameter referred to as alpha and a version of beta1 that is decaying with the passage of time with a particular value for this time step beta1(t):

- step_size(t) = alpha / (1 – beta1(t))

The gradient leveraged for updating the parameter is calculated as follows:

- delta(t) = m(t) / u(t)

Lastly, we can calculate the value for the parameter for this iteration.

- x(t) = x(t-1) – step_size(t) * delta(t)

Or the total update equation can be specified as:

- x(t) = x(t-1) – (alpha / (1 – beta1(t))) * m(t) / u(t)

In review, there are three hyperparameters with regards to the algorithm, which are:

- Alpha: Initial step size (learning rate), a common value is 0.002
- Beta1: Decay factor for first momentum, a common value is 0.9.
- Beta2: Decay factor for infinity norm, a usual value is 0.999.

The decay schedule for beta(1) indicated in the paper is quantified leveraging the initial beta1 value raised to the power t, even though other decay schedules could be leveraged like holding the value constant or decaying more aggressively.

- Beta1(t) = beta1^t

And that’s about it.

For the complete derivation of the AdaMax algorithm in the context of the Adam algorithm, we recommend looking into the paper, Adam: A Method for Stochastic Optimization, 2014.

To follow this up, let’s observe at how we might go about implementing the algorithm from the ground up in Python.

__Gradient Descent with AdaMax__

In this portion of this blog post, we will look into how to go about implementing the gradient descent optimisation algorithm with AdaMax momentum.

__2D Test Problem__

To start with, let’s go about defining an optimisation problem.

We will leverage a simplistic 2D function that goes about squaring the input of every dimension and define the range of valid inputs from -1.0 to 1.0

The objective () function below implements this:

1 2 3 | # objective function def objective(x, y): return x**2.0 + y**2.0 |

We can create a 3D plotting of the dataset to obtain a feeling for the curvature of the response surface.

The complete instance of plotting of the objective function is detailed below:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | # 3d plot of the test function from numpy import arange from numpy import meshgrid from matplotlib import pyplot
# objective function def objective(x, y): return x**2.0 + y**2.0
# define range for input r_min, r_max = -1.0, 1.0 # sample input range uniformly at 0.1 increments xaxis = arange(r_min, r_max, 0.1) yaxis = arange(r_min, r_max, 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a surface plot with the jet color scheme figure = pyplot.figure() axis = figure.gca(projection=’3d’) axis.plot_surface(x, y, results, cmap=’jet’) # show the plot pyplot.show() |

Executing the instance creates a 3D surface plot of the objective function.

We can observe the familiar bowl shape with the global minima at f(0,0) = 0.

prog

We can additionally create a 2D plot of the function. This will be good later on when we wish to plot the progression of the search.

The instance below creates a contour plot of the objective function.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | # contour plot of the test function from numpy import asarray from numpy import arange from numpy import meshgrid from matplotlib import pyplot
# objective function def objective(x, y): return x**2.0 + y**2.0
# define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap=’jet’) # show the plot pyplot.show() |

Executing the instance creates a 2D contour plot of the objective function.

We can observe the bowl shape compressed to contours demonstrated by means of a colour gradient. We will leverage this plot in plotting the particular points looked into during the progress of the search.

Now that we possess a test objective function, let’s observe how we may go about implementing the AdaMax optimisation algorithm.

__Gradient Descent Optimization with AdaMax__

We can go about applying the gradient descent with AdaMax to the test issue.

To start with, we require a function that quantifies the derivative for this function.

The derivative of x^2 is x*2 in every dimension.

f(x) = x^2

f’(x) = x*2

The derivative () function implements this below:

1 2 3 | # derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0]) |

Then we can implement gradient descent optimisation using AdaMax.

To start with, we can choose an arbitrary point in the bounds of the problem as beginning point for the search.

This goes by the assumption that we possess an array that defines the bounds of the search with a singular row for every dimension and the first column defines the minimum and the second column goes about defining the maximum of that dimension.

1 2 3 | … # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) |

Then, we require to initialize the moment vector and exponentially weighted infinity norm.

1 2 3 4 | … # initialize moment vector and weighted infinity norm m = [0.0 for _ in range(bounds.shape[0])] u = [0.0 for _ in range(bounds.shape[0])] |

Next, we execute a fixed number of iterations of the algorithm defined by the “n_iter” hyperparameter.

1 2 3 4 | … # run iterations of gradient descent for t in range(n_iter): … |

The first action to be taken is calculating the derivative for the present grouping of parameters,

1 2 3 | … # calculate gradient g(t) g = derivative(x[0], x[1]) |

Then, we require to perform the AdaMax update calculations. We will execute these calculations a single variable at a time leveraging an imperative programming technique for readability.

In practice, it is recommended to leverage NumPy vector operations for efficiency.

1 2 3 4 | … # build a solution one variable at a time for i in range(x.shape[0]): … |

To start with, we require to calculate the moment vector.

1 2 3 | … # m(t) = beta1 * m(t-1) + (1 – beta1) * g(t) m[i] = beta1 * m[i] + (1.0 – beta1) * g[i] |

Then, we require to calculate the exponentially weighted infinity norm.

1 2 3 | … # u(t) = max(beta2 * u(t-1), abs(g(t))) u[i] = max(beta2 * u[i], abs(g[i])) |

Next, the step size leveraged in the update.

1 2 3 | … # step_size(t) = alpha / (1 – beta1(t)) step_size = alpha / (1.0 – beta1**(t+1)) |

And the alteration in variable:

1 2 3 | … # delta(t) = m(t) / u(t) delta = m[i] / u[i] |

Lastly, we can calculate the fresh value for the variable.

1 2 3 | … # x(t) = x(t-1) – step_size(t) * delta(t) x[i] = x[i] – step_size * delta |

This is then repeated for every parameter, we can assess the new parameter values and report the performance of the search.

1 2 3 4 5 | … # evaluate candidate point score = objective(x[0], x[1]) # report progress print(‘>%d f(%s) = %.5f’ % (t, x, score)) |

We can connect all of this together into a function called adamax() that takes the names of the objective and derivative functions in addition to the algorithm hyperparameters and gives back the ideal solution identified at the conclusion of the search and its evaluation.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | # gradient descent algorithm with adamax def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2): # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) # initialize moment vector and weighted infinity norm m = [0.0 for _ in range(bounds.shape[0])] u = [0.0 for _ in range(bounds.shape[0])] # run iterations of gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(x.shape[0]): # m(t) = beta1 * m(t-1) + (1 – beta1) * g(t) m[i] = beta1 * m[i] + (1.0 – beta1) * g[i] # u(t) = max(beta2 * u(t-1), abs(g(t))) u[i] = max(beta2 * u[i], abs(g[i])) # step_size(t) = alpha / (1 – beta1(t)) step_size = alpha / (1.0 – beta1**(t+1)) # delta(t) = m(t) / u(t) delta = m[i] / u[i] # x(t) = x(t-1) – step_size(t) * delta(t) x[i] = x[i] – step_size * delta # evaluate candidate point score = objective(x[0], x[1]) # report progress print(‘>%d f(%s) = %.5f’ % (t, x, score)) return [x, score] |

We can then go about defining the bounds of the function and the hyperparameters and refer to the function to perform the optimisation.

In this scenario, we will execute the algorithm for sixty iterations with a preliminary learning rate of 0.02, beta of 0.8 and a beta2 of 0.99, identified following a bit of trial and error:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | … # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 60 # steps size alpha = 0.02 # factor for average gradient beta1 = 0.8 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with adamax best, score = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2) |

At the conclusion of the run, we will report the ideal solution discovered.

1 2 3 4 | … # summarize the result print(‘Done!’) print(‘f(%s) = %f’ % (best, score)) |

Connecting all of this together, the total instance of AdaMax gradient descent applied to our test issue is detailed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 | # gradient descent optimization with adamax for a two-dimensional test function from numpy import asarray from numpy.random import rand from numpy.random import seed
# objective function def objective(x, y): return x**2.0 + y**2.0
# derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0])
# gradient descent algorithm with adamax def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2): # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) # initialize moment vector and weighted infinity norm m = [0.0 for _ in range(bounds.shape[0])] u = [0.0 for _ in range(bounds.shape[0])] # run iterations of gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(x.shape[0]): # m(t) = beta1 * m(t-1) + (1 – beta1) * g(t) m[i] = beta1 * m[i] + (1.0 – beta1) * g[i] # u(t) = max(beta2 * u(t-1), abs(g(t))) u[i] = max(beta2 * u[i], abs(g[i])) # step_size(t) = alpha / (1 – beta1(t)) step_size = alpha / (1.0 – beta1**(t+1)) # delta(t) = m(t) / u(t) delta = m[i] / u[i] # x(t) = x(t-1) – step_size(t) * delta(t) x[i] = x[i] – step_size * delta # evaluate candidate point score = objective(x[0], x[1]) # report progress print(‘>%d f(%s) = %.5f’ % (t, x, score)) return [x, score]
# seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 60 # steps size alpha = 0.02 # factor for average gradient beta1 = 0.8 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with adamax best, score = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2) # summarize the result print(‘Done!’) print(‘f(%s) = %f’ % (best, score)) |

Executing the instance applies the optimisation algorithm with AdaMax to our test issue and reports the performance of the search for every iteration of the algorithm.

Note: Your outcomes may demonstrate variance provided the stochastic nature of the algorithm or assessment procedure, or differences in numerical accuracy. Take up executing the instance a few times and contrast the mean outcome.

In this scenario, we can observe that a near-optimal solution was identified after probably 35 iterations of the search, with input values near 0.0 and 0.0 assessing to 0.0

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | … >33 f([-0.00122185 0.00427944]) = 0.00002 >34 f([-0.00045147 0.00289913]) = 0.00001 >35 f([0.00022176 0.00165754]) = 0.00000 >36 f([0.00073314 0.00058534]) = 0.00000 >37 f([ 0.00105092 -0.00030082]) = 0.00000 >38 f([ 0.00117382 -0.00099624]) = 0.00000 >39 f([ 0.00112512 -0.00150609]) = 0.00000 >40 f([ 0.00094497 -0.00184321]) = 0.00000 >41 f([ 0.00068206 -0.002026 ]) = 0.00000 >42 f([ 0.00038579 -0.00207647]) = 0.00000 >43 f([ 9.99977780e-05 -2.01849176e-03]) = 0.00000 >44 f([-0.00014145 -0.00187632]) = 0.00000 >45 f([-0.00031698 -0.00167338]) = 0.00000 >46 f([-0.00041753 -0.00143134]) = 0.00000 >47 f([-0.00044531 -0.00116942]) = 0.00000 >48 f([-0.00041125 -0.00090399]) = 0.00000 >49 f([-0.00033193 -0.00064834]) = 0.00000 Done! f([-0.00033193 -0.00064834]) = 0.000001 |

__Visualization of AdaMax Optimisation__

We can go about plotting the progress of the AdaMax search on a contour plot of the domain.

This can furnish an intuition for the progression of the search over the iterations of the algorithm.

We must go about updating the adamax() function to upkeep a listing of all solutions discovered during the search, then return this list at the conclusion of the search.

The updated variant of the function with these modifications is detailed below:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | # gradient descent algorithm with adamax def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2): solutions = list() # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) # initialize moment vector and weighted infinity norm m = [0.0 for _ in range(bounds.shape[0])] u = [0.0 for _ in range(bounds.shape[0])] # run iterations of gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(x.shape[0]): # m(t) = beta1 * m(t-1) + (1 – beta1) * g(t) m[i] = beta1 * m[i] + (1.0 – beta1) * g[i] # u(t) = max(beta2 * u(t-1), abs(g(t))) u[i] = max(beta2 * u[i], abs(g[i])) # step_size(t) = alpha / (1 – beta1(t)) step_size = alpha / (1.0 – beta1**(t+1)) # delta(t) = m(t) / u(t) delta = m[i] / u[i] # x(t) = x(t-1) – step_size(t) * delta(t) x[i] = x[i] – step_size * delta # evaluate candidate point score = objective(x[0], x[1]) solutions.append(x.copy()) # report progress print(‘>%d f(%s) = %.5f’ % (t, x, score)) return solutions |

We can then go about executing the search as prior, and this time recollect the listing of solutions rather than the best final solution.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | … # seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 60 # steps size alpha = 0.02 # factor for average gradient beta1 = 0.8 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with adamax solutions = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2) |

We can subsequently create a contour plot of the objective function, as prior.

1 2 3 4 5 6 7 8 9 10 | … # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap=’jet’) |

Lastly, we can go about plotting every solution during the search as white dot linked by a line:

1 2 3 4 | … # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’) |

Connecting all of this together, the total instance of executing the AdaMax optimisation on the test problem and plotting the outcomes on a contour plot is detailed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 | # example of plotting the adamax search on a contour plot of the test function from numpy import asarray from numpy import arange from numpy.random import rand from numpy.random import seed from numpy import meshgrid from matplotlib import pyplot from mpl_toolkits.mplot3d import Axes3D
# objective function def objective(x, y): return x**2.0 + y**2.0
# derivative of objective function def derivative(x, y): return asarray([x * 2.0, y * 2.0])
# gradient descent algorithm with adamax def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2): solutions = list() # generate an initial point x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0]) # initialize moment vector and weighted infinity norm m = [0.0 for _ in range(bounds.shape[0])] u = [0.0 for _ in range(bounds.shape[0])] # run iterations of gradient descent for t in range(n_iter): # calculate gradient g(t) g = derivative(x[0], x[1]) # build a solution one variable at a time for i in range(x.shape[0]): # m(t) = beta1 * m(t-1) + (1 – beta1) * g(t) m[i] = beta1 * m[i] + (1.0 – beta1) * g[i] # u(t) = max(beta2 * u(t-1), abs(g(t))) u[i] = max(beta2 * u[i], abs(g[i])) # step_size(t) = alpha / (1 – beta1(t)) step_size = alpha / (1.0 – beta1**(t+1)) # delta(t) = m(t) / u(t) delta = m[i] / u[i] # x(t) = x(t-1) – step_size(t) * delta(t) x[i] = x[i] – step_size * delta # evaluate candidate point score = objective(x[0], x[1]) solutions.append(x.copy()) # report progress print(‘>%d f(%s) = %.5f’ % (t, x, score)) return solutions
# seed the pseudo random number generator seed(1) # define range for input bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]]) # define the total iterations n_iter = 60 # steps size alpha = 0.02 # factor for average gradient beta1 = 0.8 # factor for average squared gradient beta2 = 0.99 # perform the gradient descent search with adamax solutions = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2) # sample input range uniformly at 0.1 increments xaxis = arange(bounds[0,0], bounds[0,1], 0.1) yaxis = arange(bounds[1,0], bounds[1,1], 0.1) # create a mesh from the axis x, y = meshgrid(xaxis, yaxis) # compute targets results = objective(x, y) # create a filled contour plot with 50 levels and jet color scheme pyplot.contourf(x, y, results, levels=50, cmap=’jet’) # plot the sample as black circles solutions = asarray(solutions) pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’) # show the plot pyplot.show() |

Executing the instance executes the search as prior, but in this scenario, the contour plot of the objective function is created.

In this scenario, we can observe that a white dot is demonstrated for every solution discovered during the course of the search, beginning above the optima and progressively getting nearer to the optima at the centre of the plot.

__Further Reading__

This section furnishes additional resources on the subject if you’re looking for an in-depth dive.

*Papers*

- Adam: A Method for Stochastic Optimization, 2014
- An overview of Gradient Descent Optimisation algorithms, 2016

*Books*

- Algorithms for Optimization, 2019
- Deep Learning, 2016

*APIs*

- numpy.random.rand API
- numpy.asarray API
- Matplotlib API

*Articles*

- Gradient Descent, Wikipedia
- Stochastic Gradient Descent, Wikipedia
- An overview of gradient descent optimisation algorithm, 2016

__Conclusion__

In this guide, you found out about how to develop the gradient descent optimization with AdaMax from the ground up.

Particularly, you learned:

- Gradient descent being an optimisation algorithm that leverages the gradient of the objective function to navigate the search space.
- AdaMax is an extension of the Adam version of gradient descent developed to accelerate the optimisation procedure.
- How to implement the AdaMax optimization algorithm from the ground up and apply it to an objective function and assess the outcomes.