### Gradient Descent with RMSProp from the ground up

Gradient descent is an optimisation algorithm that follows the negative gradient of an objective function in order to locate the minimum of that function.

A limitation of gradient descent is that it uses the same step size (learning rate) for every input variable. Adaptive Gradients, or AdaGrad for short, is an extension of the gradient descent optimisation algorithm that allows the step size in each dimension to be automatically adapted based on the gradients (partial derivatives) observed for that variable over the course of the search.

A limitation of AdaGrad, in turn, is that it can result in a very small step size for each parameter by the end of the search, which can slow the search down too much and may mean the optima is never located.

Root Mean Squared Propagation, or RMSProp, is an extension of gradient descent and of the AdaGrad variant of gradient descent that uses a decaying average of partial gradients when adapting the step size for each parameter. The use of a decaying moving average allows the algorithm to forget early gradients and focus on the most recently observed partial gradients, overcoming the limitation of AdaGrad.

In this guide, you will find out how to develop the gradient descent with RMSProp optimisation algorithm from scratch.

After going through this guide, you will know:

• Gradient descent is an optimisation algorithm that uses the gradient of the objective function to navigate the search space.
• Gradient descent can be updated to use an automatically adapted step size for each input variable via a decaying moving average of partial derivatives, a variant referred to as RMSProp.
• How to implement the RMSProp optimisation algorithm from the ground up, apply it to an objective function, and evaluate the results.

Tutorial Overview

This guide is subdivided into three parts; they are:

1. Gradient Descent

2. Root Mean Squared Propagation (RMSProp)

3. Gradient Descent with RMSProp

i. 2D Test Problem

ii. Gradient Descent Optimisation with RMSProp

iii. Visualization of RMSProp

Gradient Descent

Gradient descent is an optimisation algorithm.

It is technically referred to as a first-order optimisation algorithm, as it explicitly makes use of the first-order derivative of the target objective function.

First-order methods are reliant on gradient data to assist in directing the search for a minimum.

The first-order derivative, or merely the ‘derivative’, is the rate of change or slope of the target function at a particular point, for example, for a particular input.

If the target function takes several input variables, it is referred to as a multivariate function, and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is typically referred to as the gradient.

• Gradient: First order derivative for a multivariate objective function.

The gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimisation optimisation algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimised and the derivative function for that target function. The target function f() returns a score for a given set of inputs, and the derivative function f’() gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm needs a beginning point (x) in the problem, like an arbitrarily chosen point within the input space.

The derivative is then calculated and a step is taken within the input space that is expected to have the outcome of a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move within the input space, calculated as the step size (referred to as alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, making sure we move against the gradient, or down the target function.

• x=x-step_size * f’(x)

The steeper the objective function at a provided point, the bigger the magnitude of the gradient, and in turn, the bigger the step taken within the search space. The size of the step taken is scaled leveraging a step size hyperparameter.

• Step Size (alpha): Hyperparameter that controls how far to move within the search space against the gradient on each iteration of the algorithm.

If the step size is too small, movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
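The update rule above can be sketched in a few lines of Python; the 1D objective f(x) = x^2 and the hyperparameter values here are illustrative assumptions, not part of the tutorial's later examples.

```python
# minimal gradient descent sketch on f(x) = x^2, whose derivative is 2x
def gradient_descent(derivative, x0, step_size, n_iter):
	x = x0
	for _ in range(n_iter):
		# step against the gradient, scaled by the step size
		x = x - step_size * derivative(x)
	return x

# starting from x=1.0, the search moves towards the minimum at x=0
result = gradient_descent(lambda x: 2.0 * x, x0=1.0, step_size=0.1, n_iter=50)
print(result)
```

With a step size of 0.1, each iteration shrinks x by a factor of 0.8, so the result quickly approaches zero.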

Now that we are acquainted with the gradient descent optimisation algorithm, let’s take a peek at RMSProp.

Root Mean Squared Propagation (RMSProp)

Root Mean Squared Propagation, or RMSProp for short, is an extension to the gradient descent optimisation algorithm.

It is an unpublished extension, first described in Geoffrey Hinton’s lecture notes for his Coursera course on neural networks, specifically Lecture 6e, entitled “rmsprop: Divide the gradient by a running average of its recent magnitude”.

RMSProp is designed to accelerate the optimisation process, e.g. decrease the number of function evaluations needed to reach the optima, or to improve the capability of the optimisation algorithm, e.g. produce a better final result.

AdaGrad is designed specifically to explore the idea of automatically tailoring the step size (learning rate) for each parameter within the search space. This is achieved by first calculating a step size for a given dimension, then using the calculated step size to make a movement in that dimension using the partial derivative. This procedure is then repeated for each dimension within the search space.

Adagrad calculates the step size for every parameter by first totalling the partial derivatives for the parameter observed so far in the course of the search, then dividing the initial step size hyperparameter by the square root of the sum of the squared partial derivatives.

The calculation of the custom step size for one parameter is as follows:

• cust_step_size = step_size / (1e-8 + sqrt(s))

Where cust_step_size is the calculated step size for an input variable for a provided point in the course of the search, step_size is the initial step size, sqrt() is the square root operation and s is the total of the squared partial derivatives for the input variable observed during the search so far.

This has the effect of smoothing out oscillations in the search for optimisation problems that have a lot of curvature within the search space.
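As a small illustration of the shrinking effect, the AdaGrad step-size calculation above can be sketched for a single parameter; the stream of partial derivatives below is made up for the example.

```python
from math import sqrt

# AdaGrad-style custom step size for one parameter over a hypothetical
# stream of partial derivatives
step_size = 0.01
s = 0.0  # running sum of squared partial derivatives
sizes = []
for grad in [0.8, 0.5, 0.3, 0.2]:
	s += grad**2.0
	# the custom step size shrinks as squared gradients accumulate
	sizes.append(step_size / (1e-8 + sqrt(s)))
print(sizes)
```

Because s only ever grows, the custom step size decreases monotonically, which is exactly the behaviour RMSProp sets out to fix.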

AdaGrad shrinks the learning rate according to the cumulative history of the squared gradient and may have made the learning rate too small prior to coming to such a convex structure.

An issue with AdaGrad is that it can slow the search down too much, resulting in very small learning rates for each parameter or dimension of the search by the end of the run. This has the effect of stopping the search too soon, before the minima can be located.

RMSProp extends AdaGrad to prevent the effect of a monotonically decreasing learning rate.

RMSProp can be thought of as an extension of AdaGrad in that it uses a decaying or moving average of the partial derivatives, rather than the sum, when calculating the learning rate for each parameter.

This is accomplished by including a new hyperparameter we will refer to as rho that functions like momentum for the partial derivatives.

RMSProp maintains a decaying average of squared gradients.

Leveraging a decaying moving average of the partial derivative enables the search to forget early partial derivative values and concentrate on the most recently observed shape of the search space.
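The difference between AdaGrad's cumulative sum and RMSProp's decaying average can be demonstrated on a made-up stream of gradients that starts large and becomes small; rho = 0.9 here is an arbitrary choice for the illustration.

```python
# contrast the cumulative sum (AdaGrad) with a decaying average (RMSProp)
rho = 0.9
s_sum, s_avg = 0.0, 0.0
for grad in [1.0] * 10 + [0.01] * 10:  # large gradients early, small later
	s_sum += grad**2.0  # the sum never forgets the early gradients
	s_avg = (s_avg * rho) + (grad**2.0 * (1.0 - rho))  # the average decays
print(s_sum, s_avg)
```

The sum stays above 10.0 forever, so AdaGrad's step size remains tiny; the decaying average falls back towards the recent small gradients, letting the step size recover.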

RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.

The calculation of the mean squared partial derivative for one parameter is as follows:

• s(t+1) = (s(t) * rho) + (f’(x(t))^2 * (1.0-rho))

Where s(t+1) is the decaying moving average of the squared partial derivative for the parameter. Calculating the square root of this average gives the technique its name, i.e. the square root of the mean squared partial derivatives, or root mean square (RMS). For example, the custom step size for a parameter may be written as:

• cust_step_size(t+1) = step_size / (1e-8 + RMS(s(t+1)))

After we have the custom step size for the parameter, we can update the parameter using the custom step size and the partial derivative f’(x(t)).

• x(t+1) = x(t) - cust_step_size(t+1) * f’(x(t))

This procedure is then repeated for each input variable until a new point in the search space is created, which can then be evaluated.
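Putting the three equations together, one full RMSProp trajectory for a single parameter might look as follows; the 1D objective f(x) = x^2 and the hyperparameter values are assumptions for the illustration.

```python
from math import sqrt

# RMSProp updates for a single parameter of f(x) = x^2 (derivative 2x)
step_size, rho = 0.01, 0.99
x = 1.0
s = 0.0  # decaying average of the squared partial derivative
for _ in range(100):
	grad = 2.0 * x
	# s(t+1) = (s(t) * rho) + (f'(x(t))^2 * (1.0-rho))
	s = (s * rho) + (grad**2.0 * (1.0 - rho))
	# cust_step_size(t+1) = step_size / (1e-8 + sqrt(s(t+1)))
	alpha = step_size / (1e-8 + sqrt(s))
	# x(t+1) = x(t) - cust_step_size(t+1) * f'(x(t))
	x = x - alpha * grad
print(x)
```

After 100 iterations the parameter has moved from 1.0 to near the minimum at 0.0.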

RMSProp is a really effective extension of gradient descent and is one of the preferred strategies typically leveraged to fit deep learning neural networks.

Empirically, RMSProp has been demonstrated to be an efficient and practical optimisation algorithm for deep neural networks. It is presently one of the go-to optimisation methods being deployed frequently by deep learning practitioners.

Now that we are acquainted with the RMSProp algorithm, let’s look into how we might implement it and assess its performance.

In this part of the guide, we will explore how to implement the gradient descent optimisation algorithm with adaptive gradients using the RMSProp algorithm.

2D Test Problem

First, let’s define an optimisation function.

We will leverage a simple 2D function that squares the input of every dimension and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function:

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

We can then develop a 3D plot of the dataset to obtain a feel for the curvature of the response surface.

The complete instance of plotting the objective function is detailed below:

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Executing the instance creates a 3D surface plot of the objective function.

We can observe the familiar bowl shape with the global minima at f(0,0) = 0. We can also develop a 2D plot of the function. This will be beneficial later on when we wish to plot the progression of the search.

The instance below develops a contour plot of the objective function.

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

Executing the instance develops a 2D contour plot of the objective function.

We can observe the bowl shape compressed to contours demonstrated with a colour gradient. We will leverage this plot to plot the particular points looked into during the progression of the search. Now that we possess a test objective function, let’s observe how we might go about implementing the RMSProp optimisation algorithm.

We can go about applying the gradient descent with RMSProp to the test problem.

To start with, we require a function that calculates the derivative for this function.

• f(x) = x^2
• f’(x) = x*2

The derivative of x^2 is x*2 in each dimension. The derivative() function below implements this.

# derivative of objective function

def derivative(x, y):

return asarray([x * 2.0, y * 2.0])

Then, we can implement gradient descent optimisation.

To start with, we can choose an arbitrary point in the bounds of the problem as a beginning point for the search.

This makes the assumption that we possess an array that defines the bounds of the search with a single row for every dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

# generate an initial point

solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

Then, we need to initialize the decaying average of the squared partial derivatives for each dimension to 0.0 values.

# list of the average square gradients for each variable

sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]

We can then enumerate a fixed number of iterations of the search optimisation algorithm, defined by an “n_iter” hyperparameter.

for it in range(n_iter):

The first step is to calculate the gradient for the current solution leveraging the derivative() function.

We are then required to calculate the square of the partial derivatives and update the decaying average of the squared partial derivatives with the “rho” hyperparameter.

# calculate gradient
gradient = derivative(solution[0], solution[1])
# update the average of the squared partial derivatives
for i in range(gradient.shape[0]):
	# calculate the squared gradient
	sg = gradient[i]**2.0
	# update the moving average of the squared gradient
	sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))

We can then leverage the moving average of the squared partial derivatives and gradient to calculate the step size for the next point.

We will do this one variable at a time, first calculating the step size for the variable, then the new value for the variable. These values are built up in an array until we have a completely new solution that is in the steepest descent direction from the current point, using the custom step sizes.

# build a solution one variable at a time

new_solution = list()

for i in range(solution.shape[0]):

# calculate the step size for this variable

alpha = step_size / (1e-8 + sqrt(sq_grad_avg[i]))

# calculate the new position in this variable

value = solution[i] - alpha * gradient[i]

# store this variable

new_solution.append(value)

This new solution can then be assessed leveraging the objective() function and the performance of the search can be reported.

# evaluate candidate point

solution = asarray(new_solution)

solution_eval = objective(solution[0], solution[1])

# report progress

print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

And we’re done.

We can tie all of this together into a function called rmsprop() that takes the names of the objective function and the derivative function, an array with the bounds of the domain, and hyperparameter values for the total number of algorithm iterations and the initial learning rate, and returns the final solution and its evaluation.

The complete function is detailed here.

# gradient descent algorithm with rmsprop

def rmsprop(objective, derivative, bounds, n_iter, step_size, rho):

# generate an initial point

solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

# list of the average square gradients for each variable

sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]

for it in range(n_iter):

# calculate gradient
gradient = derivative(solution[0], solution[1])
# update the average of the squared partial derivatives
for i in range(gradient.shape[0]):
	# calculate the squared gradient
	sg = gradient[i]**2.0
	# update the moving average of the squared gradient
	sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))

# build a solution one variable at a time

new_solution = list()

for i in range(solution.shape[0]):

# calculate the step size for this variable

alpha = step_size / (1e-8 + sqrt(sq_grad_avg[i]))

# calculate the new position in this variable

value = solution[i] - alpha * gradient[i]

# store this variable

new_solution.append(value)

# evaluate candidate point

solution = asarray(new_solution)

solution_eval = objective(solution[0], solution[1])

# report progress

print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

return [solution, solution_eval]

We have intentionally used lists and an imperative coding style rather than vectorized operations for readability. You can adapt this implementation to a vectorized implementation with NumPy arrays for improved performance.
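As a rough sketch of what that vectorization might look like, the inner per-variable loop can be replaced with elementwise NumPy array operations; this is an assumed rewrite for illustration, not the tutorial's own listing.

```python
from numpy import asarray, sqrt, zeros
from numpy.random import rand, seed

# hypothetical vectorized variant of rmsprop(): the per-variable Python
# loops are replaced by elementwise NumPy array operations
def rmsprop_vec(objective, derivative, bounds, n_iter, step_size, rho):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# decaying average of squared gradients, one entry per variable
	sq_grad_avg = zeros(bounds.shape[0])
	for it in range(n_iter):
		gradient = derivative(solution[0], solution[1])
		# update the decaying average for all variables at once
		sq_grad_avg = (sq_grad_avg * rho) + (gradient**2.0 * (1.0 - rho))
		# per-variable step sizes and the position update as vector operations
		solution = solution - (step_size / (1e-8 + sqrt(sq_grad_avg))) * gradient
	return solution, objective(solution[0], solution[1])

seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
best, score = rmsprop_vec(lambda x, y: x**2.0 + y**2.0,
                          lambda x, y: asarray([x * 2.0, y * 2.0]),
                          bounds, 50, 0.01, 0.99)
print(best, score)
```

The update mathematics is identical; only the bookkeeping changes, which matters more as the number of dimensions grows.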

We can then go about defining the hyperparameters and call the rmsprop() function to optimise our test objective function.

In this case, we will use 50 iterations of the algorithm, an initial learning rate of 0.01, and a value of 0.99 for the rho hyperparameter, all selected after a little trial and error.

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 50

# define the step size

step_size = 0.01

# momentum for rmsprop

rho = 0.99

# perform the gradient descent search with rmsprop

best, score = rmsprop(objective, derivative, bounds, n_iter, step_size, rho)

print('Done!')

print('f(%s) = %f' % (best, score))

Connecting all of this together, the complete instance of gradient descent optimisation with RMSProp is detailed here.

# gradient descent optimization with rmsprop for a two-dimensional test function

from math import sqrt

from numpy import asarray

from numpy.random import rand

from numpy.random import seed

# objective function

def objective(x, y):

return x**2.0 + y**2.0

# derivative of objective function

def derivative(x, y):

return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with rmsprop

def rmsprop(objective, derivative, bounds, n_iter, step_size, rho):

# generate an initial point

solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

# list of the average square gradients for each variable

sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]

for it in range(n_iter):

# calculate gradient
gradient = derivative(solution[0], solution[1])
# update the average of the squared partial derivatives
for i in range(gradient.shape[0]):
	# calculate the squared gradient
	sg = gradient[i]**2.0
	# update the moving average of the squared gradient
	sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))

# build a solution one variable at a time

new_solution = list()

for i in range(solution.shape[0]):

# calculate the step size for this variable

alpha = step_size / (1e-8 + sqrt(sq_grad_avg[i]))

# calculate the new position in this variable

value = solution[i] - alpha * gradient[i]

# store this variable

new_solution.append(value)

# evaluate candidate point

solution = asarray(new_solution)

solution_eval = objective(solution[0], solution[1])

# report progress

print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

return [solution, solution_eval]

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 50

# define the step size

step_size = 0.01

# momentum for rmsprop

rho = 0.99

# perform the gradient descent search with rmsprop

best, score = rmsprop(objective, derivative, bounds, n_iter, step_size, rho)

print('Done!')

print('f(%s) = %f' % (best, score))

Executing the instance applies the RMSProp optimisation algorithm to our test problem and reports the performance of the search for every iteration of the algorithm.

Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 33 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

>30 f([-9.61030898e-14 3.19352553e-03]) = 0.00001

>31 f([-3.42767893e-14 2.71513758e-03]) = 0.00001

>32 f([-1.21143047e-14 2.30636623e-03]) = 0.00001

>33 f([-4.24204875e-15 1.95738936e-03]) = 0.00000

>34 f([-1.47154482e-15 1.65972553e-03]) = 0.00000

>35 f([-5.05629595e-16 1.40605727e-03]) = 0.00000

>36 f([-1.72064649e-16 1.19007691e-03]) = 0.00000

>37 f([-5.79813754e-17 1.00635204e-03]) = 0.00000

>38 f([-1.93445677e-17 8.50208253e-04]) = 0.00000

>39 f([-6.38906842e-18 7.17626999e-04]) = 0.00000

>40 f([-2.08860690e-18 6.05156738e-04]) = 0.00000

>41 f([-6.75689941e-19 5.09835645e-04]) = 0.00000

>42 f([-2.16291217e-19 4.29124484e-04]) = 0.00000

>43 f([-6.84948980e-20 3.60848338e-04]) = 0.00000

>44 f([-2.14551097e-20 3.03146089e-04]) = 0.00000

>45 f([-6.64629576e-21 2.54426642e-04]) = 0.00000

>46 f([-2.03575780e-21 2.13331041e-04]) = 0.00000

>47 f([-6.16437387e-22 1.78699710e-04]) = 0.00000

>48 f([-1.84495110e-22 1.49544152e-04]) = 0.00000

>49 f([-5.45667355e-23 1.25022522e-04]) = 0.00000

Done!

f([-5.45667355e-23 1.25022522e-04]) = 0.000000

Visualization of RMSProp

We can plot the progression of the search on a contour plot of the domain.

This can furnish an intuition for the progress of the search over the iterations of the algorithm.

We must update the rmsprop() function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated variant of the function with these modifications is provided here.

# gradient descent algorithm with rmsprop

def rmsprop(objective, derivative, bounds, n_iter, step_size, rho):

# track all solutions

solutions = list()

# generate an initial point

solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

# list of the average square gradients for each variable

sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]

for it in range(n_iter):

# calculate gradient
gradient = derivative(solution[0], solution[1])
# update the average of the squared partial derivatives
for i in range(gradient.shape[0]):
	# calculate the squared gradient
	sg = gradient[i]**2.0
	# update the moving average of the squared gradient
	sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))

# build solution

new_solution = list()

for i in range(solution.shape[0]):

# calculate the learning rate for this variable

alpha = step_size / (1e-8 + sqrt(sq_grad_avg[i]))

# calculate the new position in this variable

value = solution[i] - alpha * gradient[i]

new_solution.append(value)

# store the new solution

solution = asarray(new_solution)

solutions.append(solution)

# evaluate candidate point

solution_eval = objective(solution[0], solution[1])

# report progress

print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

return solutions

We can then run the search as before, this time retrieving the list of solutions rather than just the best final solution.

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 50

# define the step size

step_size = 0.01

# momentum for rmsprop

rho = 0.99

# perform the gradient descent search with rmsprop

solutions = rmsprop(objective, derivative, bounds, n_iter, step_size, rho)

We can then develop a contour plot of the objective function as prior.

# sample input range uniformly at 0.1 increments

xaxis = arange(bounds[0,0], bounds[0,1], 0.1)

yaxis = arange(bounds[1,0], bounds[1,1], 0.1)

# create a mesh from the axis

x, y = meshgrid(xaxis, yaxis)

# compute targets

results = objective(x, y)

# create a filled contour plot with 50 levels and jet color scheme

pyplot.contourf(x, y, results, levels=50, cmap='jet')

Lastly, we can plot each solution found during the search as a white dot connected by a line.

# plot the path of solutions as white dots joined by a line

solutions = asarray(solutions)

pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Connecting all of this together, the complete instance of performing the RMSProp optimisation on the test problem and plotting the outcomes on a contour plot is detailed below.

# example of plotting the rmsprop search on a contour plot of the test function

from math import sqrt

from numpy import asarray

from numpy import arange

from numpy.random import rand

from numpy.random import seed

from numpy import meshgrid

from matplotlib import pyplot

from mpl_toolkits.mplot3d import Axes3D

# objective function

def objective(x, y):

return x**2.0 + y**2.0

# derivative of objective function

def derivative(x, y):

return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with rmsprop

def rmsprop(objective, derivative, bounds, n_iter, step_size, rho):

# track all solutions

solutions = list()

# generate an initial point

solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

# list of the average square gradients for each variable

sq_grad_avg = [0.0 for _ in range(bounds.shape[0])]

for it in range(n_iter):

# calculate gradient
gradient = derivative(solution[0], solution[1])
# update the average of the squared partial derivatives
for i in range(gradient.shape[0]):
	# calculate the squared gradient
	sg = gradient[i]**2.0
	# update the moving average of the squared gradient
	sq_grad_avg[i] = (sq_grad_avg[i] * rho) + (sg * (1.0-rho))

# build solution

new_solution = list()

for i in range(solution.shape[0]):

# calculate the learning rate for this variable

alpha = step_size / (1e-8 + sqrt(sq_grad_avg[i]))

# calculate the new position in this variable

value = solution[i] - alpha * gradient[i]

new_solution.append(value)

# store the new solution

solution = asarray(new_solution)

solutions.append(solution)

# evaluate candidate point

solution_eval = objective(solution[0], solution[1])

# report progress

print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

return solutions

# seed the pseudo random number generator

seed(1)

# define range for input

bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])

# define the total iterations

n_iter = 50

# define the step size

step_size = 0.01

# momentum for rmsprop

rho = 0.99

# perform the gradient descent search with rmsprop

solutions = rmsprop(objective, derivative, bounds, n_iter, step_size, rho)

# sample input range uniformly at 0.1 increments

xaxis = arange(bounds[0,0], bounds[0,1], 0.1)

yaxis = arange(bounds[1,0], bounds[1,1], 0.1)

# create a mesh from the axis

x, y = meshgrid(xaxis, yaxis)

# compute targets

results = objective(x, y)

# create a filled contour plot with 50 levels and jet color scheme

pyplot.contourf(x, y, results, levels=50, cmap='jet')

# plot the path of solutions as white dots joined by a line

solutions = asarray(solutions)

pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

# show the plot

pyplot.show()

Executing the instance performs the search as prior, except in this scenario, the contour plot of the objective function is developed.

In this scenario, we can observe that a white dot is shown for every solution discovered during the search, beginning above the optima and progressively getting nearer to the optima at the centre of the plot.

Papers

Lecture 6e, rmsprop: Divide the gradient by a running average of its recent magnitude, Neural Networks for Machine Learning, Geoffrey Hinton

Books

Algorithms for Optimization, 2019

Deep Learning, 2016

APIs

numpy.random.rand API

numpy.asarray API

Matplotlib API

Articles