### Gradient descent with Adadelta from the ground up

Gradient descent is an optimization algorithm that adheres to the negative gradient of an objective function in order to situate the minimum of the function.

A restriction of gradient descent is that it leverages the same step size (learning rate) for every input variable. AdaGradn and RMSProp are extensions to gradient descent that include a self-adaptive learning rate for every parameter for the objective function.

Adaelta can be viewed as a further extension of gradient descent that adds upon AdaGrad and RMSProp and alters the calculation of the customized step size so that the units are consistent and in turn no more needs an initial learning rate hyperparameter.

In this guide, you will find out how to develop the gradient descent with Adadelta optimization algorithm from the ground up.

After going through this guide, you will be aware of:

- Gradient descent is an optimization algorithm that leverages the gradient of the objective function to navigate the search space.
- Gradient descent can be updated to leverage an automatically adaptive step size for every input variable leveraging a decaying average of partial derivatives, referred to as Adadelta.
- How to implement the Adadelta optimization algorithm from the ground up and apply it to an objective function and assess the outcome.

**Tutorial Summarization**

This tutorial is subdivided into three portions, which are:

1] Gradient Descent

2] Adadelta Algorithm

3] Gradient Descent With Adadelta

- 2D Test Problem
- Gradient Descent Optimization withAdadelta
- Visualization of adadelta

**Gradient Descent**

Gradient descent is an optimization algorithm.

It is technically referenced to as a first-order optimization algorithm as it overtly makes use of the first-order derivative of the target objective function.

First-order methods are reliant on gradient data to assist in directing the search for a minimum…

The first order derivative, or merely the ‘derivative’ is the rate of change or slope of the target function at a particular point, for example, for a particular input.

If the targeted function takes several input variables, it is referenced to as a multivariate function and the input variables can be perceived of as a vector. In turn, the derivative of a multivariate target function might also be taken as a vector and is referenced to typically as the gradient.

Gradient: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a particular input.

Gradient descent references to a minimization optimization algorithm that adheres to the negative of the gradient downhill of the target function to situate the minimum of the function.

The gradient descent algorithm needs a target function that is being optimized and the derivative function for the objective function. The target function f() gives back a score for a provided grouping of inputs, and the derivative function f’() provides the derivative of the target function for a provided set of inputs.

The gradient descent algorithm needs a target function that is being optimized and the derivative function for the objective function. The target function f() gives back a score for a provided grouping of inputs, and the derivative function f’() provides the derivative of the target function for a provided set of inputs.

The gradient descent algorithm needs a beginning point (x) in the problem, like an arbitrarily chosen point within the input space.

The derivative is then calculated and a step is taken within the input space that is expected to have the outcome of a downhill movement in the target function, under the assumption that we are minimizing the target function.

A downhill movement is made by initially calculating how far to shift in the input space, calculated as the steps size (referred to as Alpha or the learning rate) multiplied by the gradient. This is then subtracted from the present point, making sure we shift against the gradient, or down the target function.

x=x-step_size*f’(x)

The steeper the objective function at a provided point, the bigger the magnitude of the gradient, and in turn, the bigger the step taken within the search space. The size of the step taken is scaled leveraging a step size hyperparameter.

Step Size (alpha): Hyperparameter that manages how far to shift in the search space against the gradient every iteration of the algorithm.

If the step size is too minimal, the movement within the search space will be small and the search will take a protracted time. If the step size is too big, the search might bounce around the search space and skip over the optima.

Now that we are acquainted with the gradient descent optimization algorithm, let’s take a look at Adadelta.

**Adadelta Algorithm**

Adadelta (or “ADADETLTA”) is an extension to the gradient descent optimization algorithm.

The algorithm was detailed in the 2012 paper by Matthew Zeiler titled “ADADELTA: An Adaptive Learning Rate Method.”

Adadelta is developed to accelerate the optimization procedure, e.g. decrease the number of function assessments needed to reach the optima, or to enhance the capability of the optimization algorithm, for example, have the outcome of an improved final outcome.

It is better comprehended as an extension of the AdaGrad and RMSProp algorithms.

AdaGrad is an extension of gradient descent that quantifies a step size (learning rate) for every parameter for the objective function every time an update is made. The step size is calculated by first totalling the partial derivatives for the parameter seen so far during the search, then dividing the initial step size hyperparameter through the square root of the sum of the squared partial derivatives.

The calculation of the custom step size for a single parameter with AdaGrad is as follows:

- cust_step_size(t+1) = step_size / (1e-8 + sqrt(s(t)))

Where cust_step_size(t+1) is the quantified step size for an input variable for a provided point during the search, step_size is the initial step size, sqrt() is the square root operation, and s(t) is the total of the squared partial derivatives for the input variable observed during the search so far (this includes the present iteration)

RMSProp can be perceived of as an extension of AdaGrad in that it leverages a decaying average or moving average of the partial derivatives rather than the sum in the calculation of the step size for every parameter. This is accomplished by including a new hyperparameter “rho” that functions like a momentum for the partial derivatives.

The calculation of the decaying moving average squared partial derivative for a single parameter is as follows:

- s(t+1) = (s(t) * rho) + (f’(x(t))^2 * (1.0-rho))

Where s(t+1) is the mean squared partial derivative for a single parameter for the present iteration of the algorithm s(t) is the decaying shifting average squared partial derivatives for the prior iteration, f’(x(t))^2 is the squared partial derivative for the present parameter, and rho is a hyperparameter, usually with the value of 0.9 like momentum.

Adadelta is a further extension of RMSProp developed to enhance the convergence of the algorithm and to remove the requirement for a manually specified preliminary learning rate.

The idea put forth in this paper was derived from ADAGRAD in order to enhance upon the two primary restrictions of the method: 1) the continual decay of learning rates through training, and 2) the requirement for a manually chosen global learning rate.

The decaying moving average of the squared partial derivatives is calculated for every parameter, as with RMSProp. The critical difference is in the calculation of the step size for a parameter that leverages the decaying average of the delta or modification in parameter.

This choice of numerator was to ensure that both portions of the calculation have the same units.

After independently obtaining the RMSProp update, the authors noticed that the units in the update equations for gradient descent, momentum and Adagrad do not match. To rectify this, they leverage an exponentially decaying average of the square updates.

First, the customized step size is quantified as the square root of the decaying shifting average of the modification in the delta divided by the square root of the decaying moving average of the squared partial derivatives.

- cust_step_size(t+1) = (ep + sqrt(delta(t))) / (ep + sqrt(s(t)))

Where cust_step_size(t+1) is the customized step size for a parameter for a provided update, ep is a hyperparameter that is added to the numerator and denominator to avoid a divide by zero error, delta(t) is the decaying shifting average of the squared change to the parameter (quantified in the last iteration) and s(t) is the decaying moving average of the squared partial derivative (calculated in the present iteration)