Gradient descent is an optimisation algorithm that follows the negative gradient of an objective function in order to locate the minimum of that function.

A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent, such as the Adaptive Moment Estimation (Adam) algorithm, use a separate step size for each input variable, but may result in step sizes that rapidly decrease to very small values.

AdaMax is an extension of the Adam variant of gradient descent that generalizes the approach to the infinity norm (max), and may result in more effective optimisation on some problems.

In this guide, you will discover how to develop gradient descent optimisation with AdaMax from scratch.

After completing this guide, you will know:

• Gradient descent is an optimisation algorithm that uses the gradient of the objective function to navigate the search space.
• AdaMax is an extension of the Adam variant of gradient descent, developed to accelerate the optimisation process.
• How to implement the AdaMax optimisation algorithm from scratch, apply it to an objective function, and evaluate the results.

Tutorial Overview

This guide is divided into three parts; they are:

• Gradient Descent
• AdaMax Optimisation Algorithm
• Gradient Descent with AdaMax

Gradient descent is an optimisation algorithm.

It is technically referred to as a first-order optimisation algorithm, as it explicitly uses the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum.

The first-order derivative, or simply the "derivative", is the rate of change or slope of the target function at a specific point, e.g., for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function, and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is generally referred to as the gradient.

Gradient: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a particular input.
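As a quick check of this property, the short sketch below (illustrative, not part of the tutorial's code) evaluates f(x, y) = x^2 + y^2 at an arbitrary point and confirms that a small step along the gradient increases the function, while a step against it decreases it:

```python
# The gradient of f(x, y) = x^2 + y^2 is (2x, 2y); it points uphill.

def f(x, y):
    return x**2.0 + y**2.0

def gradient(x, y):
    return 2.0 * x, 2.0 * y

# pick an arbitrary point and compute the gradient there
x, y = 0.5, -0.3
gx, gy = gradient(x, y)
eps = 0.01
# a small step along the gradient increases f (steepest ascent) ...
uphill = f(x + eps * gx, y + eps * gy)
# ... while a step against the gradient decreases f (descent)
downhill = f(x - eps * gx, y - eps * gy)
print(uphill > f(x, y), downhill < f(x, y))
```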

Gradient descent refers to a minimization optimisation algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimised and the derivative function for that objective function. The target function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimising the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

x(t) = x(t-1) - step_size * f'(x(t-1))

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

Step Size: Hyperparameter that controls how far to move in the search space against the gradient at each iteration of the algorithm.

If the step size is too small, movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
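To make the update rule above concrete, here is a minimal sketch (not part of the tutorial's main example) of gradient descent on the one-dimensional function f(x) = x^2, whose derivative is f'(x) = 2x:

```python
# minimal gradient descent on f(x) = x^2 with derivative f'(x) = 2x

def objective(x):
    return x**2.0

def derivative(x):
    return 2.0 * x

x = 1.0          # starting point
step_size = 0.1  # the learning rate (alpha)
for t in range(50):
    # x(t) = x(t-1) - step_size * f'(x(t-1))
    x = x - step_size * derivative(x)
print(x, objective(x))
```

With this step size, each iteration shrinks x by a factor of 0.8, so the search closes in on the minimum at x = 0; a much larger step size (e.g. 1.1) would instead overshoot and diverge.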

Now that we are familiar with the gradient descent optimisation algorithm, let's look at the AdaMax algorithm.

The AdaMax algorithm is an extension of the Adaptive Moment Estimation (Adam) optimisation algorithm. More broadly, it is an extension of the gradient descent optimisation algorithm.

The algorithm was described in the 2014 paper by Diederik Kingma and Jimmy Lei Ba titled "Adam: A Method for Stochastic Optimization."

Adam can be understood as updating weights inversely proportional to the scaled L2 norm (squared) of past gradients. AdaMax extends this to the so-called infinity norm (max) of past gradients.

In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) L^2 norm of their individual current and past gradients.

Generally, AdaMax automatically adapts a separate step size for each parameter in the optimisation problem.

First, we must maintain a moment vector and an exponentially weighted infinity norm for each parameter being optimised as part of the search, referred to as m and u respectively.

They are initialized to 0.0 at the beginning of the search.

• m=0
• u=0

The algorithm is executed iteratively over time t starting at t=1, and each iteration involves calculating a new set of parameter values x, e.g., going from x(t-1) to x(t).

It is perhaps easier to understand the algorithm if we focus on updating a single parameter, which generalizes to updating all parameters via vector operations.

First, the gradient (partial derivative) is calculated for the current time step.

• g(t)=f’(x(t-1))

Next, the moment vector is updated using the gradient and the beta1 hyperparameter.

• m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)

The exponentially weighted infinity norm is updated leveraging the beta2 hyperparameter.

• u(t) = max (beta2 * u(t-1), abs(g(t)))

Where max() selects the maximum of its arguments and abs() computes the absolute value.

We can then update the parameter value. This can be broken down into three parts: the first calculates the step size, the second the gradient-based update, and the third uses the step size and update to calculate the new parameter value.

Let's begin by calculating the step size for the parameter using an initial step size hyperparameter called alpha and a version of beta1 that decays over time, with a specific value for this time step, beta1(t):

• step_size(t) = alpha / (1 – beta1(t))

The gradient leveraged for updating the parameter is calculated as follows:

• delta(t) = m(t) / u(t)

Lastly, we can calculate the value for the parameter for this iteration.

• x(t) = x(t-1) – step_size(t) * delta(t)

Or the total update equation can be specified as:

• x(t) = x(t-1) – (alpha / (1 – beta1(t))) * m(t) / u(t)
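As a worked example of the equations above (with illustrative values and the common default hyperparameters), a single AdaMax update of one parameter of f(x) = x^2 can be traced by hand:

```python
# one AdaMax update for a single parameter, following the equations above
alpha, beta1, beta2 = 0.002, 0.9, 0.999
x = 1.0          # current parameter value
m, u = 0.0, 0.0  # moment and infinity norm, initialized to zero
t = 1            # first time step

g = 2.0 * x                           # gradient of x^2 at x
m = beta1 * m + (1.0 - beta1) * g     # m(t) = 0.1 * 2.0 = 0.2
u = max(beta2 * u, abs(g))            # u(t) = max(0.0, 2.0) = 2.0
step_size = alpha / (1.0 - beta1**t)  # 0.002 / 0.1 = 0.02
x = x - step_size * (m / u)           # 1.0 - 0.02 * 0.1 = 0.998
print(x)
```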

In review, there are three hyperparameters for the algorithm; they are:

• alpha: Initial step size (learning rate), a common value is 0.002.
• beta1: Decay factor for the first moment, a common value is 0.9.
• beta2: Decay factor for the infinity norm, a common value is 0.999.

The decay schedule for beta1(t) used in the paper is calculated as the initial beta1 value raised to the power t, although other decay schedules could be used, such as holding the value constant or decaying more aggressively.

• beta1(t) = beta1^t
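The effect of this schedule on the step size can be seen in a small sketch (using the common default values noted above): the bias-corrected step size starts at alpha / (1 - beta1) and settles toward alpha as beta1^t decays to zero.

```python
# step size under the beta1(t) = beta1^t decay schedule
alpha, beta1 = 0.002, 0.9
for t in [1, 5, 10, 50]:
    beta1_t = beta1**t                   # decayed beta1 for this time step
    step_size = alpha / (1.0 - beta1_t)  # bias-corrected step size
    print(t, beta1_t, step_size)
```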

For the full derivation of the AdaMax algorithm in the context of the Adam algorithm, we recommend reading the paper Adam: A Method for Stochastic Optimization, 2014.

Next, let's look at how we might implement the algorithm from scratch in Python.

In this section, we will look at how to implement the gradient descent optimisation algorithm with AdaMax.

2D Test Problem

We will use a simple two-dimensional function that squares the input of each dimension, and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this:

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

We can create a three-dimensional plot of the dataset to get a feel for the curvature of the response surface.

The complete example of plotting the objective function is listed below:

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Executing the example creates a 3D surface plot of the objective function.

We can see the familiar bowl shape with the global minimum at f(0, 0) = 0.

We can also create a 2D plot of the function. This will be useful later when we wish to plot the progress of the search.

The example below creates a contour plot of the objective function:

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

Executing the example creates a 2D contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a colour gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let's look at how we might implement the AdaMax optimisation algorithm.

First, we need a function that calculates the derivative for this function.

The derivative of x^2 is x*2 in every dimension.

f(x) = x^2

f’(x) = x*2

The derivative() function below implements this:

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

Next, we can select a random point within the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search, with one row for each dimension, where the first column defines the minimum and the second column defines the maximum of that dimension.

...
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

Next, we need to initialize the moment vector and the exponentially weighted infinity norm.

...
# initialize moment vector and weighted infinity norm
m = [0.0 for _ in range(bounds.shape[0])]
u = [0.0 for _ in range(bounds.shape[0])]

Next, we execute a fixed number of iterations of the algorithm defined by the “n_iter” hyperparameter.

...
# run iterations of gradient descent
for t in range(n_iter):
    ...

The first step is to calculate the derivative for the current set of parameters.

...
# calculate gradient g(t)
g = derivative(x[0], x[1])

Next, we need to perform the AdaMax update calculations. We will perform these calculations one variable at a time, using an imperative programming style for readability.

In practice, it is recommended to use NumPy vector operations for efficiency.
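As a rough sketch of what such a vectorized version might look like (illustrative values; not the tutorial's implementation), the whole inner loop collapses to a handful of NumPy array operations:

```python
# vectorized AdaMax update sketch using NumPy array operations
from numpy import asarray, maximum, absolute

alpha, beta1, beta2 = 0.02, 0.8, 0.99
x = asarray([0.8, -0.5])  # current point (illustrative)
m = asarray([0.0, 0.0])   # moment vector
u = asarray([0.0, 0.0])   # weighted infinity norm
t = 0                     # first iteration

g = 2.0 * x  # gradient of x^2 + y^2
m = beta1 * m + (1.0 - beta1) * g           # update moment vector
u = maximum(beta2 * u, absolute(g))         # update infinity norm, elementwise
step_size = alpha / (1.0 - beta1**(t + 1))  # bias-corrected step size
x = x - step_size * (m / u)                 # update all variables at once
print(x)
```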

...
# build a solution one variable at a time
for i in range(x.shape[0]):
    ...

First, the moment vector is updated.

...
# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]

Next, we need to calculate the exponentially weighted infinity norm.

...
# u(t) = max(beta2 * u(t-1), abs(g(t)))
u[i] = max(beta2 * u[i], abs(g[i]))

Next, the step size used in the update:

...
# step_size(t) = alpha / (1 - beta1(t))
step_size = alpha / (1.0 - beta1**(t+1))

And the change in the variable:

...
# delta(t) = m(t) / u(t)
delta = m[i] / u[i]

Finally, we can calculate the new value for the variable:

...
# x(t) = x(t-1) - step_size(t) * delta(t)
x[i] = x[i] - step_size * delta

This is then repeated for each parameter. We can then evaluate the new parameter values and report the performance of the search.

...
# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))

We can tie all of this together into a function called adamax() that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search along with its evaluation.

# gradient descent algorithm with adamax
def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # initialize moment vector and weighted infinity norm
    m = [0.0 for _ in range(bounds.shape[0])]
    u = [0.0 for _ in range(bounds.shape[0])]
    # run iterations of gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(x.shape[0]):
            # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            # u(t) = max(beta2 * u(t-1), abs(g(t)))
            u[i] = max(beta2 * u[i], abs(g[i]))
            # step_size(t) = alpha / (1 - beta1(t))
            step_size = alpha / (1.0 - beta1**(t+1))
            # delta(t) = m(t) / u(t)
            delta = m[i] / u[i]
            # x(t) = x(t-1) - step_size(t) * delta(t)
            x[i] = x[i] - step_size * delta
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]

We can then define the bounds of the function and the hyperparameters, and call the function to perform the optimisation.

In this case, we will run the algorithm for 60 iterations with an initial learning rate of 0.02, a beta1 of 0.8 and a beta2 of 0.99, found after a little trial and error:

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for the infinity norm
beta2 = 0.99
# perform the gradient descent search with adamax
best, score = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

At the end of the run, we will report the best solution found.

...
# summarize the result
print('Done!')
print('f(%s) = %f' % (best, score))

Tying this together, the complete example of AdaMax gradient descent applied to our test problem is listed below.
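The listing below simply assembles the snippets shown above into one program (a sketch of the complete example; the NumPy imports are the ones the earlier snippets assume):

```python
# gradient descent optimisation with adamax for a two-dimensional test function
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adamax
def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # initialize moment vector and weighted infinity norm
    m = [0.0 for _ in range(bounds.shape[0])]
    u = [0.0 for _ in range(bounds.shape[0])]
    # run iterations of gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(x.shape[0]):
            # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            # u(t) = max(beta2 * u(t-1), abs(g(t)))
            u[i] = max(beta2 * u[i], abs(g[i]))
            # step_size(t) = alpha / (1 - beta1(t))
            step_size = alpha / (1.0 - beta1**(t + 1))
            # delta(t) = m(t) / u(t)
            delta = m[i] / u[i]
            # x(t) = x(t-1) - step_size(t) * delta(t)
            x[i] = x[i] - step_size * delta
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for the infinity norm
beta2 = 0.99
# perform the gradient descent search with adamax
best, score = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))
```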

Executing the example applies the optimisation algorithm with AdaMax to our test problem and reports the performance of the search for each iteration of the algorithm.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 35 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

...
>33 f([-0.00122185 0.00427944]) = 0.00002
>34 f([-0.00045147 0.00289913]) = 0.00001
>35 f([0.00022176 0.00165754]) = 0.00000
>36 f([0.00073314 0.00058534]) = 0.00000
>37 f([ 0.00105092 -0.00030082]) = 0.00000
>38 f([ 0.00117382 -0.00099624]) = 0.00000
>39 f([ 0.00112512 -0.00150609]) = 0.00000
>40 f([ 0.00094497 -0.00184321]) = 0.00000
>41 f([ 0.00068206 -0.002026 ]) = 0.00000
>42 f([ 0.00038579 -0.00207647]) = 0.00000
>43 f([ 9.99977780e-05 -2.01849176e-03]) = 0.00000
>44 f([-0.00014145 -0.00187632]) = 0.00000
>45 f([-0.00031698 -0.00167338]) = 0.00000
>46 f([-0.00041753 -0.00143134]) = 0.00000
>47 f([-0.00044531 -0.00116942]) = 0.00000
>48 f([-0.00041125 -0.00090399]) = 0.00000
>49 f([-0.00033193 -0.00064834]) = 0.00000
Done!
f([-0.00033193 -0.00064834]) = 0.000001

We can plot the progress of the AdaMax search on a contour plot of the domain.

This can provide an intuition for the progression of the search over the iterations of the algorithm.

We must update the adamax() function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these modifications is listed below:

# gradient descent algorithm with adamax
def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
    solutions = list()
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # initialize moment vector and weighted infinity norm
    m = [0.0 for _ in range(bounds.shape[0])]
    u = [0.0 for _ in range(bounds.shape[0])]
    # run iterations of gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # build a solution one variable at a time
        for i in range(x.shape[0]):
            # m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            # u(t) = max(beta2 * u(t-1), abs(g(t)))
            u[i] = max(beta2 * u[i], abs(g[i]))
            # step_size(t) = alpha / (1 - beta1(t))
            step_size = alpha / (1.0 - beta1**(t+1))
            # delta(t) = m(t) / u(t)
            delta = m[i] / u[i]
            # x(t) = x(t-1) - step_size(t) * delta(t)
            x[i] = x[i] - step_size * delta
    # evaluate candidate point
        score = objective(x[0], x[1])
        # keep track of solutions
        solutions.append(x.copy())
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return solutions

We can then run the search as before, and this time retrieve the list of solutions instead of the best final solution.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for the infinity norm
beta2 = 0.99
# perform the gradient descent search with adamax
solutions = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

We can then create a contour plot of the objective function, as before.

...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line:

...
# plot the samples as white dots joined by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this together, the complete example of performing the AdaMax optimisation on the test problem and plotting the results on a contour plot is listed below.

Executing the example performs the search as before, except in this case the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the centre of the plot.

Further Reading

This section provides additional resources on the topic if you are looking to go deeper.

Papers

• Adam: A Method for Stochastic Optimization, 2014
• An overview of Gradient Descent Optimisation algorithms, 2016

Books

• Algorithms for Optimization, 2019
• Deep Learning, 2016

APIs

• numpy.random.rand API
• numpy.asarray API
• Matplotlib API

Articles