### Code Adam Optimization Algorithm from the ground up

Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to situate the minimum of the function.

A restriction of gradient descent is that a singular step (learning rate) is leveraged for all input variables. Extensions to gradient descent such as AdaGrad and RMSProp go about updating the algorithm to leverage an independent step size for every input variable which may have the outcome of a step size that swiftly reduces to very minimal values.

The Adaptive Movement Estimation algorithm, or Adam for short, is an extension to gradient descent and a natural successor to strategies like AdaGrad and RMSProp that automatically adapts a learning rate for every input variable for the objective function and further smoothens the search procedure through leveraging an exponentially reducing moving average of the gradient to make updates to variables.

In this guide, you will find out how to develop gradient descent with Adam optimization algorithm from the ground up.

After going through this guide, you will be aware of:

• Gradient descent is an optimization algorithm that leverages the gradient of the objective function to go about navigating the search space.
• Gradient descent can be updated to leverage an automatically adaptive step size for every input variable leveraging a decaying average of partial derivatives, referred to as Adam.
• How to implement the Adam optimization algorithm from the ground up and apply it to an objective function and assess the outcomes.

Tutorial Summarization

This tutorial is subdivided into three portions, which are:

• Two-Dimensional Test Problem

Gradient descent is an optimization algorithm.

It is technically referenced to as a first-order optimization algorithm as it overtly leverages the first-order derivative of the target objective function.

First-order methods are reliant on gradient data to assist in directing the search for a minimum.

The first-order derivative, or merely the ‘derivative’ is the rate of change or slope of the target function at a particular point, for example, for a particular input.

If the target function takes several input variables, it is referenced to as a multivariate function and the input variables can be perceived of as a vector. In turn, the derivative of a multivariate target function might also be taken as a vector and is referenced to typically as the gradient.

Gradient: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a particular input.

Gradient descent references to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to situate the minimum of the function.

The gradient descent algorithm needs a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a provided grouping of inputs, and the derivative function f’() provides the derivative of the target function for a provided set of inputs.

The gradient descent algorithm needs a beginning point (x) in the problem, like a randomly chosen point in the input space.

The derivative is then calculated and a step is taken within the input space that is expected to have the outcome of a downhill movement within the target function, under the assumption we are minimizing the target function.

A downhill movement is made by initially calculating how to move in the input space, calculated as the step size (referred to as alpha or the learning rate) multiplied by the gradient. This is then removed from the present point, making sure we move against the gradient, or down the target function.

x(t) = x(t-1) – step_size * f’(x(t-1))

The steeper the objective function at a provided point, the magnitude of the gradient, and, in turn, the bigger the step taken within the search space. The size of the step taken is scaled leveraging a step size hyperparameter.

Step size (alpha): Hyperparameter that manages how far to move in the search space against the gradient every iteration of the algorithm.

If the step size is too small, the movement within the search space will be minimal and the search will take a long time. If the step size is too big, the search might bounce around the search space and skip over the optima.

Now that we are acquainted with the gradient descent optimization algorithm, let’s take a look at the Adam algorithm.

Adaptive Movement Estimation algorithm, or Adam for short, is an extension to the gradient descent optimization algorithm.

The algorithm was detailed in the 2014 paper by Diederik Kingma and Jimmy Lei Ba entitled “Adam: A Method for Stochastic Optimization”

Adam is developed to accelerate the optimization procedure, for example, reduce the number of function evaluations needed to reach the optima, or to enhance the capacity of the optimization algorithm, for example, have the outcome of an improved final result.

This is accomplished by calculating a step size for every input parameters that is being optimized. Critically, every step size is automatically adapted throughput the search process on the basis of the gradients (partial derivatives) faced for every variable.

We propose Adam, a strategy for effective stochastic optimization that only needs first-order gradients with minimal memory requirement. The method computes individual adaptive learning rates for differing parameters from estimates of first and second moments of the gradients; the name Adam is obtained from adaptive moment estimation.

This consists of maintaining a first and second moment of the gradient, for example, an exponentially decaying mean gradient (first moment) and variance (second moment) for every input variable.

The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient.

Let’s go through every element of the algorithm.

To start with, we must maintain a moment vector and exponentially weighted infinity norm for every parameter being optimized as portion of the search, referenced to as m and v (really the Greek letter nu) respectively. They are initialized to 0.0 at the beginning of the search.

m = 0

v = 0

The algorithm is carried out iteratively over time t beginning at t=1, and every iteration involves calculating a new grouping of parameter values x, for example, going from x(t-1) to x(t).

It is probably simple to comprehend the algorithm if we concentrate on updating of a single parameter, which generalizes to update all parameters through vector operations.

• g(t) = f’(x(t-1))

Next, the first moment is updated leveraging the squared gradient and a hyperparameter beta1.

• m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)

Then the second moment is updated leveraging the squared gradient and a hyperparameter beta2.

• v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2

The first and second moments are biased as they are initialized with zero values.

These moving averages are initialized as (vectors of) 0’s, leading to moment estimates that are biased towards zero, particularly during the initial timesteps, and particularly when the decay rates are small (that is, the betas are closer to 1). The good news is that this initialization bias can be simply counteracted, having the outcome of bias-corrected estimates.

Next, the first and second moments are bias-corrected, beginning with the starting moment:

• mhat(t) = m(t) / (1 – beta1(t))

And then, the second moment:

• vhat(t) = v(t) / (1 – beta2(t))

Note, beta1(t) and beta2(t) reference to the beta1 and beta2 hyperparameters that are decayed on a schedule across the iterations of the algorithm. A static decay schedule can be leveraged, even though the paper recommends the following:

• beta1(t) = beta1^t
• beta2(t) = beta2^t

Lastly, we can calculate the value for the parameter for this iteration.

x(t) = x(t-1) – alpha * mhat(t) / (sqrt(vhat(t)) + eps)

Where alpha is the step size hyperparameter, eps is a small value (epsilon) like 1e-8 that makes sure we do not face a divide by zero error, and sqrt() is the square root function.

Observe, a more efficient reordering of the update rule detailed in the paper can be leveraged:

• alpha(t) = alpha * sqrt(1 – beta2(t)) / (1 – beta1(t))
• x(t) = x(t-1) – alpha(t) * m(t) / (sqrt(v(t)) + eps)

To review, there are a trio of hyperparameters for the algorithm, which are:

• alpha: Initial step size (learning rate), a typical value is 0.001.
• beta1: Decay factor for first momentum, a common value is 0.9.
• beta2: Decay factor for infinity norm, a common value is 0.999.

And that’s it.

Next, let’s observe how we might implement the algorithm from the ground up in Python.

In this portion, we will look into how to implement the gradient descent optimization algorithm with Adam.

2D Test Problem

We will leverage a simplistic 2D function that squares the input of every dimension and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.

 123 # objective functiondef objective(x, y):return x**2.0 + y**2.0

We can develop a three-dimensional plot of the dataset to obtain a feeling for the curvature of the response surface.

The complete example of plotting the objective function is detailed below.

 123456789101112131415161718192021222324 # 3d plot of the test functionfrom numpy import arangefrom numpy import meshgridfrom matplotlib import pyplot # objective functiondef objective(x, y):return x**2.0 + y**2.0 # define range for inputr_min, r_max = -1.0, 1.0# sample input range uniformly at 0.1 incrementsxaxis = arange(r_min, r_max, 0.1)yaxis = arange(r_min, r_max, 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a surface plot with the jet color schemefigure = pyplot.figure()axis = figure.gca(projection=’3d’)axis.plot_surface(x, y, results, cmap=’jet’)# show the plotpyplot.show()

Running the instance develops a three-dimensional surface plot of the objective function.

We can observe the familiar bowl shape with the global minima at f(0, 0) = 0. We can additionally develop a two-dimensional plot of the function. This will be beneficial later when we wish to plot the progress of the search.

The instance below develops a contour plot of the objective function.

 1234567891011121314151617181920212223 # contour plot of the test functionfrom numpy import asarrayfrom numpy import arangefrom numpy import meshgridfrom matplotlib import pyplot # objective functiondef objective(x, y):return x**2.0 + y**2.0 # define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# sample input range uniformly at 0.1 incrementsxaxis = arange(bounds[0,0], bounds[0,1], 0.1)yaxis = arange(bounds[1,0], bounds[1,1], 0.1)# create a mesh from the axisx, y = meshgrid(xaxis, yaxis)# compute targetsresults = objective(x, y)# create a filled contour plot with 50 levels and jet color schemepyplot.contourf(x, y, results, levels=50, cmap=’jet’)# show the plotpyplot.show()

Running the instance develops a two-dimensional contour plot of the objective function.

We can observe the bowl shape compressed to contours displayed with a colour gradient. We will leverage this plot to plot the particular points explored in the progression of the search. Now that we have a test objective function, let’s observe how we might implement the Adam optimization algorithm.

We can apply the gradient descent with Adam to the test problem.

To start with, we require a function that calculates the derivative for this function.

• f(x) = x^2
• f’(x) = x * 2

The derivative of x^2 is x * 2 in every dimension. The derivative() function implements this below.

 123 # derivative of objective functiondef derivative(x, y):return asarray([x * 2.0, y * 2.0])

Then, we can implement gradient descent optimization.

To start with, we can choose an arbitrary points in the bounds of the problem as a beginning point for the search.

This operates by the assumption we have an array that defines the bounds of the search with a single row for every dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

 1234 …# generate an initial pointx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])score = objective(x, x)

Then, we require to initialize the first and second moments to zero.

 1234 …# initialize first and second momentsm = [0.0 for _ in range(bounds.shape)]v = [0.0 for _ in range(bounds.shape)]

We then run a static number of iterations of the algorithm defined by the “n_iter” hyperparameter.

 1234 …# run iterations of gradient descentfor t in range(n_iter):…

The first phase is to calculate the gradient for the present solution leveraging the derivative() function.

Next, we calculate the derivative for the present set of parameters.

 123 …# calculate gradient g(t)g = derivative(x, x)

Then, we require to perform the Adam update calculations. We will carry out these calculations a single variable at a time leveraging an imperative programming style for readability.

Practically, it is recommended leveraging NumPy vector operations for efficiency.

 1234 …# build a solution one variable at a timefor i in range(x.shape):…

 123 …# m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)m[i] = beta1 * m[i] + (1.0 – beta1) * g[i]

Then, the second moment.

 123 …# v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2v[i] = beta2 * v[i] + (1.0 – beta2) * g[i]**2

Then the bias correction for the first and second moments.

 12345 …# mhat(t) = m(t) / (1 – beta1(t))mhat = m[i] / (1.0 – beta1**(t+1))# vhat(t) = v(t) / (1 – beta2(t))vhat = v[i] / (1.0 – beta2**(t+1))

Then lastly, the updated variable value.

 123 …# x(t) = x(t-1) – alpha * mhat(t) / (sqrt(vhat(t)) + eps)x[i] = x[i] – alpha * mhat / (sqrt(vhat) + eps)

This is then repeated for every parameter that is being optimized.

At the conclusion of the iteration we can assess the new parameter values and report the performance of the search.

 12345 …# evaluate candidate pointscore = objective(x, x)# report progressprint(‘>%d f(%s) = %.5f’ % (t, x, score))

We can connect all of this together into a function referred to as adam() that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the ideal solution identified at the conclusion of the search and its evaluation.

The complete function is detailed here.

def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):

# generate an initial point

x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])

score = objective(x, x)

# initialize first and second moments

m = [0.0 for _ in range(bounds.shape)]

v = [0.0 for _ in range(bounds.shape)]

for t in range(n_iter):

g = derivative(x, x)

# build a solution one variable at a time

for i in range(x.shape):

# m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)

m[i] = beta1 * m[i] + (1.0 – beta1) * g[i]

# v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2

v[i] = beta2 * v[i] + (1.0 – beta2) * g[i]**2

# mhat(t) = m(t) / (1 – beta1(t))

mhat = m[i] / (1.0 – beta1**(t+1))

# vhat(t) = v(t) / (1 – beta2(t))

vhat = v[i] / (1.0 – beta2**(t+1))

# x(t) = x(t-1) – alpha * mhat(t) / (sqrt(vhat(t)) + eps)

x[i] = x[i] – alpha * mhat / (sqrt(vhat) + eps)

# evaluate candidate point

score = objective(x, x)

# report progress

print(‘>%d f(%s) = %.5f’ % (t, x, score))

return [x, score]

We have intentionally leveraged listings and imperative coding style rather than vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for improved performance.

We can then go about defining our hyperparameters and refer to the adam() function to optimize our test objective function.

In this scenario, we will leverage 60 iterations of the algorithm with an initial step size of 0.02 and beta1 and beta2 values of 0.8 and 0.999, respectively. These hyperparameter values were identified after a bit of trial and error.

 1234567891011121314151617 …# seed the pseudo random number generatorseed(1)# define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# define the total iterationsn_iter = 60# steps sizealpha = 0.02# factor for average gradientbeta1 = 0.8# factor for average squared gradientbeta2 = 0.999# perform the gradient descent search with adambest, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)print(‘Done!’)print(‘f(%s) = %f’ % (best, score))

Connecting all of this together, the complete example of gradient descent optimization with Adam is detailed below.

from math import sqrt

from numpy import asarray

from numpy.random import rand

from numpy.random import seed

# objective function

def objective(x, y):

return x**2.0 + y**2.0

# derivative of objective function

def derivative(x, y):

return asarray([x * 2.0, y * 2.0])

def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):

# generate an initial point

x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])

score = objective(x, x)

# initialize first and second moments

m = [0.0 for _ in range(bounds.shape)]

v = [0.0 for _ in range(bounds.shape)]

for t in range(n_iter):

g = derivative(x, x)

# build a solution one variable at a time

for i in range(x.shape):

# m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)

m[i] = beta1 * m[i] + (1.0 – beta1) * g[i]

# v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2

v[i] = beta2 * v[i] + (1.0 – beta2) * g[i]**2

# mhat(t) = m(t) / (1 – beta1(t))

mhat = m[i] / (1.0 – beta1**(t+1))

# vhat(t) = v(t) / (1 – beta2(t))

vhat = v[i] / (1.0 – beta2**(t+1))

# x(t) = x(t-1) – alpha * mhat(t) / (sqrt(vhat(t)) + eps)

x[i] = x[i] – alpha * mhat / (sqrt(vhat) + eps)

# evaluate candidate point

score = objective(x, x)

# report progress

print(‘>%d f(%s) = %.5f’ % (t, x, score))

return [x, score]

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 60

# steps size

alpha = 0.02

beta1 = 0.8

# factor for average squared gradient

beta2 = 0.999

best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

print(‘Done!’)

print(‘f(%s) = %f’ % (best, score))

Running the instance applies the Adam optimization algorithm to our test problem and reports the performance of the search for every iteration of the algorithm.

Your outcomes may demonstrate variance provided the stochastic nature of the algorithm or evaluation procedure, or variations in numerical accuracy. Consider carrying out the instance a few times and contrast the average outcome.

In this scenario, we can observe that a near-optimal solution was identified after probably 53 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

 12345678910111213 …>50 f([-0.00056912 -0.00321961]) = 0.00001>51 f([-0.00052452 -0.00286514]) = 0.00001>52 f([-0.00043908 -0.00251304]) = 0.00001>53 f([-0.0003283  -0.00217044]) = 0.00000>54 f([-0.00020731 -0.00184302]) = 0.00000>55 f([-8.95352320e-05 -1.53514076e-03]) = 0.00000>56 f([ 1.43050285e-05 -1.25002847e-03]) = 0.00000>57 f([ 9.67123406e-05 -9.89850279e-04]) = 0.00000>58 f([ 0.00015359 -0.00075587]) = 0.00000>59 f([ 0.00018407 -0.00054858]) = 0.00000Done!f([ 0.00018407 -0.00054858]) = 0.000000

We can go about plotting the progress of the Adam search on a contour plot of the domain.

This can furnish an intuition for the progress of the search over the iterations of the algorithm.

We must go about updating the adam() function to maintain a list of all solutions identified over the course of the search, then return this list at the conclusion of the search.

The updated version of the function with these modifications is detailed below.

 1234567891011121314151617181920212223242526272829303132 # gradient descent algorithm with adamdef adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):solutions = list()# generate an initial pointx = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])score = objective(x, x)# initialize first and second momentsm = [0.0 for _ in range(bounds.shape)]v = [0.0 for _ in range(bounds.shape)]# run the gradient descent updatesfor t in range(n_iter):# calculate gradient g(t)g = derivative(x, x)# build a solution one variable at a timefor i in range(bounds.shape):# m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)m[i] = beta1 * m[i] + (1.0 – beta1) * g[i]# v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2v[i] = beta2 * v[i] + (1.0 – beta2) * g[i]**2# mhat(t) = m(t) / (1 – beta1(t))mhat = m[i] / (1.0 – beta1**(t+1))# vhat(t) = v(t) / (1 – beta2(t))vhat = v[i] / (1.0 – beta2**(t+1))# x(t) = x(t-1) – alpha * mhat(t) / (sqrt(vhat(t)) + ep)x[i] = x[i] – alpha * mhat / (sqrt(vhat) + eps)# evaluate candidate pointscore = objective(x, x)# keep track of solutionssolutions.append(x.copy())# report progressprint(‘>%d f(%s) = %.5f’ % (t, x, score))return solutions

We can then carry out the search as prior, and this time recover the list of solutions rather than the ideal final solution.

 123456789101112131415 …# seed the pseudo random number generatorseed(1)# define range for inputbounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])# define the total iterationsn_iter = 60# steps sizealpha = 0.02# factor for average gradientbeta1 = 0.8# factor for average squared gradientbeta2 = 0.999# perform the gradient descent search with adamsolutions = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

We can then develop a contour plot of the objective function, as prior.

# sample input range uniformly at 0.1 increments

xaxis = arange(bounds[0,0], bounds[0,1], 0.1)

yaxis = arange(bounds[1,0], bounds[1,1], 0.1)

# create a mesh from the axis

x, y = meshgrid(xaxis, yaxis)

# compute targets

results = objective(x, y)

# create a filled contour plot with 50 levels and jet color scheme

pyplot.contourf(x, y, results, levels=50, cmap=’jet’)

Lastly, we can plot every solution identified during the search as a white dot connected by a line.

 1234 …# plot the sample as black circlessolutions = asarray(solutions)pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’)

Connecting all of this together, the complete instance of carrying out the Adam optimization on the test problem and plotting the outcomes on a contour plot is detailed below.

# example of plotting the adam search on a contour plot of the test function

from math import sqrt

from numpy import asarray

from numpy import arange

from numpy.random import rand

from numpy.random import seed

from numpy import meshgrid

from matplotlib import pyplot

from mpl_toolkits.mplot3d import Axes3D

# objective function

def objective(x, y):

return x**2.0 + y**2.0

# derivative of objective function

def derivative(x, y):

return asarray([x * 2.0, y * 2.0])

def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):

solutions = list()

# generate an initial point

x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] – bounds[:, 0])

score = objective(x, x)

# initialize first and second moments

m = [0.0 for _ in range(bounds.shape)]

v = [0.0 for _ in range(bounds.shape)]

for t in range(n_iter):

g = derivative(x, x)

# build a solution one variable at a time

for i in range(bounds.shape):

# m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)

m[i] = beta1 * m[i] + (1.0 – beta1) * g[i]

# v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2

v[i] = beta2 * v[i] + (1.0 – beta2) * g[i]**2

# mhat(t) = m(t) / (1 – beta1(t))

mhat = m[i] / (1.0 – beta1**(t+1))

# vhat(t) = v(t) / (1 – beta2(t))

vhat = v[i] / (1.0 – beta2**(t+1))

# x(t) = x(t-1) – alpha * mhat(t) / (sqrt(vhat(t)) + ep)

x[i] = x[i] – alpha * mhat / (sqrt(vhat) + eps)

# evaluate candidate point

score = objective(x, x)

# keep track of solutions

solutions.append(x.copy())

# report progress

print(‘>%d f(%s) = %.5f’ % (t, x, score))

return solutions

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 60

# steps size

alpha = 0.02

beta1 = 0.8

# factor for average squared gradient

beta2 = 0.999

solutions = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

# sample input range uniformly at 0.1 increments

xaxis = arange(bounds[0,0], bounds[0,1], 0.1)

yaxis = arange(bounds[1,0], bounds[1,1], 0.1)

# create a mesh from the axis

x, y = meshgrid(xaxis, yaxis)

# compute targets

results = objective(x, y)

# create a filled contour plot with 50 levels and jet color scheme

pyplot.contourf(x, y, results, levels=50, cmap=’jet’)

# plot the sample as black circles

solutions = asarray(solutions)

pyplot.plot(solutions[:, 0], solutions[:, 1], ‘.-‘, color=’w’)

# show the plot

pyplot.show()

Running the instance carries out the search as prior, except in this scenario, a contour plot of the objective function is developed.

In this scenario, we observe that a white dot is displayed for every solution identified during the search, beginning above the optima and progressively getting nearer to the optima at the centre of the plot. This section furnishes additional resources on the subject if you are seeking to delve deeper.

Papers

Adam: A Method for Stochastic Optimization, 2014.

Books

Algorithms for Operation, 2019.

Deep Learning, 2016.

APIs

numpy.random.rand API

numpy.asarray API

Matplotlib API

Articles