Calculus in practice – Neural Networks
An artificial neural network is a computational model that approximates a mapping between inputs and outputs.
It is inspired by the structure of the human brain, in that it is similarly made up of a network of interconnected neurons that propagate information upon receiving sets of stimuli from neighbouring neurons.
Training a neural network involves a process that uses the backpropagation and gradient descent algorithms in tandem. As we will see, both of these algorithms make extensive use of calculus.
In this tutorial, you will discover how aspects of calculus are applied in neural networks.
After completing this tutorial, you will know that:
- An artificial neural network is organized into layers of neurons and connections, where each connection is assigned a weight value.
- Every neuron implements a nonlinear function that maps a set of inputs to an output activation.
- When training a neural network, calculus is used extensively by the backpropagation and gradient descent algorithms.
This tutorial is divided into three parts, which are:
- An introduction to the neural network
- The mathematics of a neuron
- Training the network
This tutorial assumes that you are already familiar with:
- Function approximation
- Rate of change
- Partial derivatives
- The chain rule
- The chain rule on additional functions
- Gradient descent
An introduction to the neural network
Artificial neural networks can be considered as function approximation algorithms.
In a supervised learning setting, when presented with many input observations representing the problem of interest, together with their corresponding target outputs, the artificial neural network will seek to approximate the mapping that exists between the two.
A neural network can be defined as a computational model that draws inspiration from the structure of the human brain.
The human brain is made up of a massive network of interconnected neurons (approximately one hundred billion of them), each consisting of a cell body, a set of fibres called dendrites, and an axon.
The dendrites act as the input channels to a neuron, whereas the axon acts as the output channel. A neuron therefore receives input signals through its dendrites, which in turn are connected to the (output) axons of neighbouring neurons. In this manner, a sufficiently strong electrical pulse (also called an action potential) can be transmitted along the axon of one neuron to all the other neurons that are connected to it. This allows signals to be propagated through the structure of the human brain.
A neuron therefore acts as an all-or-nothing switch that takes in a set of inputs and either outputs an action potential or produces no output.
An artificial neural network is analogous to the structure of the human brain, as it is similarly composed of a large number of connected neurons that seek to propagate information across the network by receiving sets of stimuli from neighbouring neurons and mapping these to outputs, which are fed to the next layer of neurons.
An artificial neural network is usually organized into layers of neurons. For instance, the diagram below shows a fully connected neural network, where every neuron in one layer is connected to every neuron in the next layer.
The inputs are presented on the left-hand side of the network, and the information propagates (or flows) rightward towards the outputs at the opposite end. Since the information propagates in the forward direction through the network, we also refer to such a network as a feedforward neural network.
The layers of neurons in between the input and output layers are referred to as hidden layers, as they are not directly accessible.
Each connection (represented by an arrow in the diagram) between two neurons is assigned a weight, which acts on the data flowing through the network, as we will see shortly.
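As a minimal sketch of data flowing forward through a fully connected network, the following assumes that every neuron applies a logistic function to a weighted sum of its inputs plus a bias, as the next section describes. The layer sizes, the random weights, and the names `forward` and `layers` are illustrative assumptions, not part of the original text.

```python
import numpy as np

def logistic(z):
    """Logistic activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Propagate an input vector through a list of (W, b) layers.

    Each W has shape (n_out, n_in): row j holds the weights on the
    connections arriving at neuron j from every neuron in the previous layer.
    """
    activation = x
    for W, b in layers:
        activation = logistic(W @ activation + b)
    return activation

rng = np.random.default_rng(0)
# Hypothetical sizes: 3 inputs -> 4 hidden neurons -> 2 outputs.
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),
    (rng.normal(size=(2, 4)), np.zeros(2)),
]
output = forward(np.array([0.5, -1.0, 2.0]), layers)
print(output.shape)  # (2,)
```

Each weight matrix has one entry per arrow between two consecutive layers, which is what "fully connected" means here.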
The mathematics of a neuron
More specifically, let’s say that a particular artificial neuron (or a perceptron, as Frank Rosenblatt initially named it) receives n inputs, [x1, …, xn], where each connection is assigned a corresponding weight, [w1, …, wn].
The first operation that is carried out multiplies the input values by their corresponding weights and adds a bias term, b, to their sum, producing an output, z:
z = ((x1 × w1) + (x2 × w2) + … + (xn × wn)) + b
We can alternatively represent this operation in a more compact form as follows:
z = Σ (xi × wi) + b, where the sum runs over i = 1, …, n
This weighted sum calculation that we have performed so far is a linear operation. If every neuron implemented this calculation alone, then the neural network would be restricted to learning only linear input-output mappings.
However, many of the relationships in the world that we might want to model are nonlinear, and if we attempt to model these relationships using a linear model, then the model will not be very accurate.
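To illustrate why weighted sums alone are limiting, the sketch below chains two neurons that have no activation function and checks that the result is identical to a single linear neuron. The weights, biases, and function names are hypothetical values chosen for illustration.

```python
def linear_neuron(x, w, b):
    """Weighted sum of a single input plus a bias, with no activation."""
    return w * x + b

# Hypothetical weights and biases for two neurons connected in series.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def stacked(x):
    """Feed the output of the first linear neuron into the second."""
    return linear_neuron(linear_neuron(x, w1, b1), w2, b2)

# The same mapping, written as one linear neuron:
# w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2

for x in (-1.0, 0.0, 2.5):
    assert abs(stacked(x) - linear_neuron(x, w_combined, b_combined)) < 1e-12
```

However many such neurons are chained, the composition can still be written with one weight and one bias, so adding depth adds no expressive power without a nonlinearity.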
Therefore, a second operation is performed by each neuron, which transforms the weighted sum by applying a nonlinear activation function, a(.).
We can represent the operations performed by each neuron even more compactly if we integrate the bias term into the sum as another weight, w0, attached to a constant input x0 = 1 (notice that the sum now starts from 0):
a(z) = a(Σ (xi × wi)), where the sum runs over i = 0, …, n
The operations performed by each neuron can be illustrated as follows:
Thus, each neuron can be considered to implement a nonlinear function that maps a set of inputs to an output activation.
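The two operations of a neuron can be sketched in a few lines of Python. This sketch assumes the logistic function as the activation a(.), and the input, weight, and bias values are hypothetical. The second function folds the bias into the weights as w0, paired with a constant input x0 = 1, and should give the same activation.

```python
import math

def logistic(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Weighted sum plus bias, followed by the activation function."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return logistic(z)

def neuron_bias_as_weight(x, w, b):
    """Same neuron, with the bias stored as w0 and a constant input x0 = 1."""
    x_ext = [1.0] + list(x)
    w_ext = [b] + list(w)
    z = sum(xi * wi for xi, wi in zip(x_ext, w_ext))
    return logistic(z)

x = [0.5, -1.0, 2.0]   # hypothetical inputs
w = [0.4, 0.3, -0.2]   # hypothetical weights
b = 0.1                # hypothetical bias

activation = neuron(x, w, b)
activation_w0 = neuron_bias_as_weight(x, w, b)
```

Treating the bias as just another weight means that a training procedure can update it with the same rule as every other weight.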
Training the network
Training an artificial neural network involves searching for the set of weights that best models the patterns in the data. It is a process that uses the backpropagation and gradient descent algorithms in tandem. Both of these algorithms make extensive use of calculus.
Each time that the network is traversed in the forward (or rightward) direction, the error of the network can be calculated as the difference between the output produced by the network and the expected ground truth, by means of a loss function (such as the sum of squared errors (SSE)). The backpropagation algorithm then calculates the gradient (or the rate of change) of this error with respect to changes in the weights. To do so, it relies on the chain rule and partial derivatives.
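As a small illustration of the loss computation, the following sketch evaluates a sum of squared errors for one forward pass. The factor of one half is an assumed convention (it simplifies the derivative), and the target and output values are made up.

```python
def sum_of_squared_errors(targets, outputs):
    """Half the sum of squared differences between targets and outputs."""
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))

targets = [1.0, 0.0]   # hypothetical ground truth
outputs = [0.8, 0.3]   # hypothetical network outputs

error = sum_of_squared_errors(targets, outputs)
```

The error is zero only when every output matches its target, and it grows quadratically with each discrepancy.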
For simplicity, consider a network made up of two neurons connected by a single path of activation. If we were to break them open, we would find that the neurons perform the following operations in cascade:
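One way to write out this cascade is sketched below. It assumes a single input x, no bias terms, the same activation function a(.) in both neurons, and a squared-error loss E with a factor of one half. These choices are assumptions made for illustration.

```latex
\begin{aligned}
z_1 &= w_1 x,   & a_1 &= a(z_1), \\
z_2 &= w_2 a_1, & a_2 &= a(z_2), \\
E   &= \tfrac{1}{2}\,(t_2 - a_2)^2 .
\end{aligned}
```

Each quantity depends only on the one before it, which is what allows the chain rule to be applied one link at a time.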
The first application of the chain rule connects the overall error of the network to the input, z2, of the activation function a2 of the second neuron, and subsequently to the weight, w2, as follows:
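A sketch of this chain in LaTeX, writing the network error as E and assuming that the second neuron computes z2 = w2 × a1 with no bias term. Defining 𝛿2 as the derivative of the error with respect to z2 is a common convention and is assumed here.

```latex
\frac{\partial E}{\partial w_2}
  = \frac{\partial E}{\partial a_2}\,
    \frac{\partial a_2}{\partial z_2}\,
    \frac{\partial z_2}{\partial w_2}
  = \delta_2\,\frac{\partial z_2}{\partial w_2},
\qquad
\delta_2 \equiv \frac{\partial E}{\partial z_2}
  = \frac{\partial E}{\partial a_2}\,\frac{\partial a_2}{\partial z_2}
```

Under the assumption z2 = w2 × a1, the last factor is simply the activation a1 arriving from the first neuron.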
You may notice that the application of the chain rule involves, among other terms, a multiplication by the partial derivative of the neuron’s activation function with respect to its input, z2. There are different activation functions to choose from, such as the family of sigmoid functions, of which the logistic function is one. If we take the logistic function as an example, then its partial derivative would be computed as follows:
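For reference, the logistic function and its derivative can be written as follows. This is the standard result, expressed in terms of the activation itself.

```latex
a(z) = \frac{1}{1 + e^{-z}},
\qquad
\frac{\partial a}{\partial z}
  = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}}
  = a(z)\,\bigl(1 - a(z)\bigr)
```

Because the derivative depends only on a(z), it can be evaluated during backpropagation from the activation already computed in the forward pass.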
Therefore, we can compute 𝛿2 as follows:
Here, t2 is the expected activation, and by finding the difference between t2 and a2 we are computing the error between the activation produced by the network and the expected ground truth.
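As a sketch, assuming the squared-error loss E = ½ (t2 − a2)² and a logistic activation, and taking 𝛿2 to be the derivative of the error with respect to z2, the computation works out as below. Some texts fold the minus sign into the weight update instead and define 𝛿2 with (t2 − a2); the sign convention here is an assumption.

```latex
\delta_2
  = \frac{\partial E}{\partial a_2}\,\frac{\partial a_2}{\partial z_2}
  = -\,(t_2 - a_2)\; a_2\,(1 - a_2)
```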
Since we are computing the derivative of the activation function, it should be continuous and differentiable over the entire space of real numbers. In the case of deep neural networks, the error gradient is propagated backwards over a large number of hidden layers. This can cause the error signal to rapidly diminish to zero, especially if the maximum value of the derivative function is already small to begin with (for instance, the derivative of the logistic function has a maximum value of 0.25). This is known as the vanishing gradient problem. The ReLU function has been widely used in deep learning to alleviate this problem, because its derivative in the positive portion of its domain is equal to 1.
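The effect can be illustrated with a rough sketch that multiplies together one activation derivative per layer. It deliberately ignores the weights (equivalently, it assumes every weight is 1) and evaluates the logistic derivative at its peak, so it shows a best case for the logistic function. The depth of 20 layers is an arbitrary choice.

```python
import math

def logistic_derivative(z):
    """Derivative of the logistic function, a(z) * (1 - a(z))."""
    a = 1.0 / (1.0 + math.exp(-z))
    return a * (1.0 - a)

def relu_derivative(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

n_layers = 20  # hypothetical network depth

# The logistic derivative peaks at z = 0, where it equals 0.25.
logistic_factor = logistic_derivative(0.0) ** n_layers
# In the positive part of its domain, the ReLU derivative is 1.
relu_factor = relu_derivative(1.0) ** n_layers

print(logistic_factor)  # 0.25 ** 20, roughly 9.1e-13
print(relu_factor)      # 1.0
```

Even in this best case, twenty logistic layers shrink the error signal by about twelve orders of magnitude, whereas the ReLU factor leaves it unchanged.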
The next weight backwards is deeper into the network, and therefore, the application of the chain rule can likewise be extended to connect the overall error to the weight w1 as follows:
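A sketch of the extended chain, writing the network error as E and assuming z2 = w2 × a1 and z1 = w1 × x with no bias terms. As before, taking 𝛿1 to be the derivative of the error with respect to z1 is an assumed convention.

```latex
\frac{\partial E}{\partial w_1}
  = \frac{\partial E}{\partial a_2}\,
    \frac{\partial a_2}{\partial z_2}\,
    \frac{\partial z_2}{\partial a_1}\,
    \frac{\partial a_1}{\partial z_1}\,
    \frac{\partial z_1}{\partial w_1}
  = \delta_1\,\frac{\partial z_1}{\partial w_1},
\qquad
\delta_1 \equiv \frac{\partial E}{\partial z_1}
```

The first two factors are the same ones that appear in the gradient for w2, so they only need to be computed once and can be reused as the error is propagated backwards.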
If we take the logistic function again as the activation function of choice, then we would compute 𝛿1 as follows:
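Under the same assumptions (no bias terms, z2 = w2 × a1, z1 = w1 × x, and 𝛿 defined as the derivative of the error with respect to a neuron's input z), a sketch of the computation is:

```latex
\delta_1
  = \delta_2\;\frac{\partial z_2}{\partial a_1}\;\frac{\partial a_1}{\partial z_1}
  = \delta_2\; w_2\; a_1\,(1 - a_1),
\qquad
\frac{\partial E}{\partial w_1} = \delta_1\, x
```

Note how 𝛿1 is obtained from 𝛿2 by multiplying by the connecting weight and the local activation derivative, which is the backward step that gives backpropagation its name.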
Once we have computed the gradient of the network error with respect to each weight, the gradient descent algorithm can be applied to update each weight for the next forward propagation at time t+1. For the weight w1, the weight update rule using gradient descent would be specified as follows:
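In its standard form, gradient descent moves each weight against its gradient: w1(t+1) = w1(t) − η × ∂E/∂w1, where η is a learning rate (the symbol and its value are assumptions here). The sketch below puts the pieces together for the two-neuron network, assuming logistic activations, no bias terms, and a squared-error loss. All numeric values are hypothetical.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """Forward pass through the two neurons in cascade."""
    a1 = logistic(w1 * x)
    a2 = logistic(w2 * a1)
    return a1, a2

def error(x, t, w1, w2):
    """Squared error (with an assumed factor of one half)."""
    _, a2 = forward(x, w1, w2)
    return 0.5 * (t - a2) ** 2

def gradients(x, t, w1, w2):
    """Backpropagation: return (dE/dw1, dE/dw2) using the chain rule."""
    a1, a2 = forward(x, w1, w2)
    delta2 = (a2 - t) * a2 * (1.0 - a2)      # dE/dz2
    delta1 = delta2 * w2 * a1 * (1.0 - a1)   # dE/dz1
    return delta1 * x, delta2 * a1

x, t = 1.0, 0.9       # hypothetical training example
w1, w2 = 0.5, -0.3    # hypothetical initial weights
eta = 0.5             # hypothetical learning rate

initial_error = error(x, t, w1, w2)
for _ in range(200):
    g1, g2 = gradients(x, t, w1, w2)
    w1 -= eta * g1    # gradient descent update for w1
    w2 -= eta * g2    # gradient descent update for w2
final_error = error(x, t, w1, w2)
```

Comparing `gradients` against a finite-difference estimate of the error is a useful sanity check whenever backpropagation is written by hand.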
Even though we have considered a simple network here, the process that we have gone through can be extended to evaluate more complex and deeper ones, such as convolutional neural networks (CNNs).
If the network under consideration is characterized by multiple branches coming from multiple inputs (and possibly flowing towards multiple outputs), then its evaluation would involve the summation of different derivative chains for each path, similarly to how we have previously derived the generalized chain rule.
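As a small illustration of summing derivative chains over paths, the sketch below uses a hypothetical network in which one input x feeds two logistic hidden neurons, and both feed a single linear output y. The derivative of y with respect to x is then the sum of one chain per path. The weights and names are assumptions chosen for illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: x feeds two hidden neurons, which both feed y.
wa, wb = 0.7, -1.2   # input-to-hidden weights
va, vb = 1.5, 0.4    # hidden-to-output weights

def output(x):
    ha = logistic(wa * x)
    hb = logistic(wb * x)
    return va * ha + vb * hb

def d_output_dx(x):
    """Generalized chain rule: add the derivative chain of each path."""
    ha = logistic(wa * x)
    hb = logistic(wb * x)
    path_a = va * ha * (1.0 - ha) * wa   # chain along x -> ha -> y
    path_b = vb * hb * (1.0 - hb) * wb   # chain along x -> hb -> y
    return path_a + path_b

x0 = 0.3
h = 1e-6
numerical = (output(x0 + h) - output(x0 - h)) / (2 * h)
analytic = d_output_dx(x0)
```

The same summation applies when the quantity being differentiated is the network error and the variable is a weight that influences the error through several downstream neurons.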
This section provides more resources on the topic if you are looking to go deeper.
Deep Learning, 2019.
Pattern Recognition and Machine Learning, 2016.
In this tutorial, you discovered how aspects of calculus are applied in neural networks.
Specifically, you learned:
- An artificial neural network is organized into layers of neurons and connections, where each connection is assigned a weight value.
- Every neuron implements a nonlinear function that maps a set of inputs to an output activation.
- When training a neural network, calculus is leveraged extensively by the backpropagation and gradient descent algorithms.