An introduction to recurrent neural networks and the math that drives them
Conventional feedforward networks are poorly suited to learning from, and forecasting, sequential or time series data, because they have no mechanism for retaining past inputs when predicting future values. Recurrent neural networks (RNNs for short) are a variant of the conventional feedforward artificial neural network that can handle sequential data and can be trained to retain knowledge of the past.
After completing this guide, you will know:
- Recurrent neural networks
- What is meant by unfolding an RNN
- How weights are updated in an RNN
- Several RNN architectures
This tutorial is divided into two parts:
- The working of an RNN
  - Unfolding in time
  - Backpropagation through time algorithm
- Different RNN architectures and variants
This guide assumes that you are already familiar with artificial neural networks and the backpropagation algorithm. It also describes how the gradient-based backpropagation algorithm is used to train such a network.
What is a Recurrent Neural Network
A recurrent neural network (RNN) is a special type of artificial neural network adapted to work with time series data or other data that involves sequences. Ordinary feedforward neural networks are meant only for data points that are independent of each other. If the data come in a sequence in which each data point depends on the previous one, the network must be modified to incorporate the dependencies between these data points. RNNs have a notion of ‘memory’ that lets them store the states or information of previous inputs to generate the next output of the sequence.
Unfolding A Recurrent Neural Network
A simple RNN has a feedback loop, as shown in the first diagram of the figure above. The feedback loop, drawn in the gray rectangle, can be unrolled over 3 time steps to produce the second network in the figure. The architecture can of course be varied so that the network unrolls over k time steps. The figure uses the following notation:
- $x_t$: the input at time step t
- $y_t$: the output at time step t
- $h_t$: the hidden state at time step t
- $w_x$, $w_h$, $w_y$: the weights associated with the input, the hidden state, and the output
- $b_h$, $b_y$: the biases of the hidden units and of the output
We can unfold the network for k time steps to obtain the output at time step k + 1. The unfolded network closely resembles a feedforward neural network. Each rectangle in the unfolded network represents an operation. For example, with an activation function f:
$$h_{t+1} = f(x_t, h_t, w_x, w_h, b_h) = f(w_x x_t + w_h h_t + b_h)$$
The output y at time t is computed as:
$$y_t = f(h_t, w_y) = f(w_y \cdot h_t + b_y)$$
Here, $\cdot$ is the dot product.
Therefore, in the forward pass of an RNN, the network computes the values of the hidden units and the output after k time steps. The weights of the network are shared across time. Every recurrent layer has two sets of weights, one for the input and one for the hidden unit. The final feedforward layer, which computes the final output for the kth time step, works just like an ordinary layer of a conventional feedforward network.
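The forward pass described above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `rnn_forward`, the choice of tanh for f, and the sizes (1 input feature, 2 hidden units, 3 time steps) are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, h0, w_x, w_h, b_h, w_y, b_y, f=np.tanh):
    """Unrolled forward pass; returns the hidden state and output after every step."""
    h = h0
    hs, ys = [], []
    for x_t in xs:
        # h_{t+1} = f(w_x x_t + w_h h_t + b_h): the same weights are reused at every step
        h = f(w_x @ x_t + w_h @ h + b_h)
        # output read from the updated hidden state: f(w_y . h + b_y)
        ys.append(f(w_y @ h + b_y))
        hs.append(h)
    return np.array(hs), np.array(ys)

# Hypothetical sizes: 1 input feature, 2 hidden units, k = 3 time steps
rng = np.random.default_rng(0)
w_x, w_h, b_h = rng.normal(size=(2, 1)), rng.normal(size=(2, 2)), np.zeros(2)
w_y, b_y = rng.normal(size=2), 0.0
xs = np.array([[0.1], [0.2], [0.3]])
hs, ys = rnn_forward(xs, np.zeros(2), w_x, w_h, b_h, w_y, b_y)
print(hs.shape, ys.shape)  # (3, 2) (3,)
```

Note that the loop body never changes between time steps, which is what sharing the weights across time means in practice.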
The Activation Function
We can use any activation function we like in a recurrent neural network. Common options are the sigmoid, tanh, and ReLU functions.
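As an illustration, three widely used activation functions (sigmoid, tanh, and ReLU) can be written in a few lines of NumPy. The function names are chosen here for the example.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes any real value into the interval (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Keeps positive values and replaces negative values with zero."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```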
Training a Recurrent Neural Network
The backpropagation algorithm of an artificial neural network is modified to include the unfolding in time when training the weights of the network. The algorithm is based on computing the gradient vector and is called backpropagation through time, or BPTT for short. The pseudo-code for training is given below. The value of k can be chosen by the user. In the pseudo-code, $p_t$ is the target value at time step t.
- Repeat until the stopping criterion is met:
  - Set all h to zero.
  - Repeat for t = 0 to n-k:
    - Forward propagate over the unfolded network for k time steps to compute all h and y.
    - Compute the error between the network output and the target at the last unfolded step, e.g. $e = y_{t+k} - p_{t+k}$.
    - Backpropagate the error across the unfolded network and update the weights.
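The training loop above can be sketched in NumPy as follows. This is one possible reading of the pseudo-code under several assumptions that are not in the text: a scalar input series, tanh hidden units, a linear output, a squared-error loss, plain gradient descent, next-value prediction as the task, and a fixed number of epochs as the stopping criterion. The names `bptt_window` and `train` are hypothetical.

```python
import numpy as np

def bptt_window(xs, target, h0, params):
    """Forward k steps, then backpropagate the last-step error through the unrolled steps."""
    w_x, w_h, b_h, w_y, b_y = params
    hs = [h0]
    for x_t in xs:                                # forward pass over k unfolded steps
        hs.append(np.tanh(w_x * x_t + w_h @ hs[-1] + b_h))
    y = w_y @ hs[-1] + b_y                        # linear output (assumption)
    e = y - target                                # error at the last unfolded step
    g_wy, g_by = e * hs[-1], e                    # gradients of the loss 0.5 * e**2
    g_wx, g_wh, g_bh = np.zeros_like(w_x), np.zeros_like(w_h), np.zeros_like(b_h)
    dh = e * w_y                                  # gradient w.r.t. the last hidden state
    for t in reversed(range(len(xs))):
        dz = dh * (1.0 - hs[t + 1] ** 2)          # back through tanh
        g_wx += dz * xs[t]                        # shared weights: gradients add up over time
        g_wh += np.outer(dz, hs[t])
        g_bh += dz
        dh = w_h.T @ dz                           # pass the gradient to the previous step
    return 0.5 * e ** 2, (g_wx, g_wh, g_bh, g_wy, g_by)

def train(series, k=3, m=4, lr=0.05, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    params = [rng.normal(0, 0.3, m), rng.normal(0, 0.3, (m, m)), np.zeros(m),
              rng.normal(0, 0.3, m), 0.0]
    losses = []
    for _ in range(epochs):                       # stopping criterion: fixed epoch count
        h = np.zeros(m)                           # set all h to zero
        total = 0.0
        for t in range(len(series) - k):
            loss, grads = bptt_window(series[t:t + k], series[t + k], h, params)
            params = [p - lr * g for p, g in zip(params, grads)]
            total += loss
            # carry the context forward by one time step for the next window
            h = np.tanh(params[0] * series[t] + params[1] @ h + params[2])
        losses.append(total)
    return params, losses

series = np.sin(np.linspace(0, 4 * np.pi, 60))
params, losses = train(series)
print(losses[0], losses[-1])
```

The gradient does not flow back past the start of each window, so k bounds how far into the past the error signal reaches.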
Variants of RNNs
There are several variants of recurrent neural networks with different architectures. A few examples are:
One to One
Here there is a single ($x_t$, $y_t$) pair. Conventional neural networks use a one-to-one architecture.
One to Many
In one-to-many networks, a single input $x_t$ can produce several outputs, e.g. ($y_{t0}$, $y_{t1}$, $y_{t2}$).
Music generation is an example area where one-to-many networks are used.
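A one-to-many network can be sketched as a single input followed by several recurrent steps. How the later steps are fed is a design choice. This sketch assumes that each output is fed back in as the next input, which is only one common option, and the name `one_to_many` is hypothetical.

```python
import numpy as np

def one_to_many(x, steps, w_x, w_h, b_h, w_y, b_y):
    """Produce `steps` outputs from a single scalar input x."""
    h = np.zeros_like(b_h)
    inp, ys = x, []
    for _ in range(steps):
        h = np.tanh(w_x * inp + w_h @ h + b_h)
        y = np.tanh(w_y @ h + b_y)
        ys.append(y)
        inp = y  # assumption: the previous output becomes the next input
    return np.array(ys)

rng = np.random.default_rng(0)
m = 3  # hypothetical number of hidden units
ys = one_to_many(0.5, 3, rng.normal(size=m), rng.normal(size=(m, m)),
                 np.zeros(m), rng.normal(size=m), 0.0)
print(ys.shape)  # (3,)
```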
Many to One
In this case, several inputs from different time steps produce a single output. For example, ($x_t$, $x_{t+1}$, $x_{t+2}$) can produce a single output $y_t$. Such networks are used in sentiment analysis or emotion detection, where the class label depends on a sequence of words.
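A many-to-one network reads the whole sequence and emits one value at the end. The sketch below assumes scalar inputs and a sigmoid on the final output so that the result can be read as a class score between 0 and 1. Both choices, and the name `many_to_one`, are illustrative.

```python
import numpy as np

def many_to_one(xs, w_x, w_h, b_h, w_y, b_y):
    """Consume every input in xs, then return a single score in (0, 1)."""
    h = np.zeros_like(b_h)
    for x_t in xs:
        h = np.tanh(w_x * x_t + w_h @ h + b_h)   # only the hidden state is carried along
    return 1.0 / (1.0 + np.exp(-(w_y @ h + b_y)))  # output computed once, at the end

rng = np.random.default_rng(0)
m = 3  # hypothetical number of hidden units
weights = (rng.normal(size=m), rng.normal(size=(m, m)), np.zeros(m), rng.normal(size=m), 0.0)
score_short = many_to_one(np.array([0.2, -0.1, 0.4]), *weights)
score_long = many_to_one(np.array([0.2, -0.1, 0.4, 0.3, -0.5]), *weights)
print(score_short, score_long)
```

The same weights handle both sequences, which is why inputs of different lengths need no change to the network.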
Many to Many
There are several possibilities for many-to-many networks. An example is shown above, where two inputs produce three outputs. Many-to-many networks are applied in machine translation, for example in English-to-French translation systems or vice versa.
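One way to get an output sequence whose length differs from the input length is an encoder-decoder arrangement: one recurrent pass reads the inputs, and a second pass generates the outputs from the final hidden state. The sketch below assumes that arrangement with separate, hypothetical encoder and decoder weights. The text does not specify how the two-input, three-output example is wired.

```python
import numpy as np

def encode_decode(xs, n_out, enc, dec):
    """Read all inputs into a hidden state, then generate n_out outputs from it."""
    w_x, w_h, b_h = enc
    h = np.zeros_like(b_h)
    for x_t in xs:                       # encoder: consume the input sequence
        h = np.tanh(w_x * x_t + w_h @ h + b_h)
    v_h, v_b, w_y, b_y = dec
    ys = []
    for _ in range(n_out):               # decoder: unroll for the desired output length
        h = np.tanh(v_h @ h + v_b)
        ys.append(w_y @ h + b_y)
    return np.array(ys)

rng = np.random.default_rng(0)
m = 3  # hypothetical number of hidden units
enc = (rng.normal(size=m), rng.normal(size=(m, m)), np.zeros(m))
dec = (rng.normal(size=(m, m)), np.zeros(m), rng.normal(size=m), 0.0)
ys = encode_decode(np.array([0.3, -0.7]), 3, enc, dec)
print(ys.shape)  # (3,)
```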
Advantages and Disadvantages of RNNs
RNNs have several advantages, such as:
- Ability to handle sequence data
- Ability to handle inputs of varying lengths
- Ability to store or ‘memorize’ historical information
The disadvantages are:
- The computation can be very slow.
- The network does not take future inputs into account when making decisions.
- Vanishing gradient problem, where the gradients used to compute the weight update can get very close to zero, preventing the network from learning new weights. The deeper the network, the more pronounced this problem is.
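The vanishing gradient effect can be seen numerically in a scalar RNN with tanh activation. By the chain rule, the derivative of the hidden state after T steps with respect to the initial hidden state is a product of T factors of the form $w_h (1 - h_t^2)$, and when each factor is smaller than 1 in magnitude the product shrinks geometrically. The weights and inputs below are arbitrary values chosen for illustration.

```python
import numpy as np

def gradient_through_time(w_h, w_x, xs):
    """Magnitude of d h_T / d h_0 for a scalar tanh RNN, recorded after each step."""
    h, grad, mags = 0.0, 1.0, []
    for x_t in xs:
        h = np.tanh(w_x * x_t + w_h * h)
        grad *= w_h * (1.0 - h ** 2)   # chain rule: one extra factor per time step
        mags.append(abs(grad))
    return mags

mags = gradient_through_time(w_h=0.5, w_x=1.0, xs=np.ones(50))
print(mags[0], mags[9], mags[-1])
```

With $w_h = 0.5$ every factor is at most 0.5 in magnitude, so after 50 steps the gradient is bounded by $0.5^{50}$, far too small to drive a meaningful weight update.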
Different RNN Architectures
Several variations of RNNs are used in practice in machine learning problems:
Bidirectional recurrent neural networks (BRNN)
In a BRNN, inputs from future time steps are used to improve the accuracy of the network. It is like knowing the first and last words of a sentence when predicting the middle words.
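A bidirectional layer can be sketched as two independent recurrent passes, one reading the sequence left to right and one right to left, whose hidden states are joined at each position. Concatenation is assumed here as the way to combine them, and the helper names are hypothetical.

```python
import numpy as np

def run_rnn(xs, w_x, w_h, b_h):
    """Plain recurrent pass over scalar inputs; returns one hidden state per step."""
    h, hs = np.zeros_like(b_h), []
    for x_t in xs:
        h = np.tanh(w_x * x_t + w_h @ h + b_h)
        hs.append(h)
    return np.array(hs)

def bidirectional(xs, fwd, bwd):
    h_f = run_rnn(xs, *fwd)                # summarizes inputs up to each position
    h_b = run_rnn(xs[::-1], *bwd)[::-1]    # summarizes inputs from each position onward
    return np.concatenate([h_f, h_b], axis=1)

rng = np.random.default_rng(0)
m = 3  # hypothetical number of hidden units per direction
fwd = (rng.normal(size=m), rng.normal(size=(m, m)), np.zeros(m))
bwd = (rng.normal(size=m), rng.normal(size=(m, m)), np.zeros(m))
xs = np.array([0.1, 0.4, -0.3, 0.8])
out = bidirectional(xs, fwd, bwd)
print(out.shape)  # (4, 6)
```

Each row of `out` therefore carries information from both earlier and later inputs, which a one-directional RNN cannot provide.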
Gated Recurrent Units (GRU)
These networks are designed to handle the vanishing gradient problem. They have a reset gate and an update gate, which determine what information is retained for future predictions.
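A single GRU step can be sketched as below, following the commonly published formulation with an update gate z and a reset gate r. The parameter names and sizes are assumptions for the example, and the convention for z varies between references (some swap the roles of z and 1 - z).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, p):
    """One GRU time step: returns the new hidden state."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h + p["b_z"])             # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h + p["b_r"])             # reset gate
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h) + p["b_h"])  # candidate state
    return (1.0 - z) * h + z * h_cand   # z near 0 keeps the old state almost unchanged

rng = np.random.default_rng(0)
d, m = 2, 3  # hypothetical input and hidden sizes
p = {}
for gate in ("z", "r", "h"):
    p["W_" + gate] = rng.normal(size=(m, d))
    p["U_" + gate] = rng.normal(size=(m, m))
    p["b_" + gate] = np.zeros(m)
h = gru_step(rng.normal(size=d), np.zeros(m), p)
print(h.shape)  # (3,)
```

Because the new state is an interpolation between the old state and the candidate, information can pass through many steps largely unchanged, which is how the gates help with vanishing gradients.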
Long Short Term Memory (LSTM)
LSTMs were also designed to address the vanishing gradient problem in RNNs. An LSTM uses three gates, called the input, output, and forget gates. As in a GRU, these gates determine which information to retain.
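A single LSTM step can be sketched in the same style, using the commonly published formulation with a separate cell state c alongside the hidden state h. Stacking the four weight blocks into one matrix is an implementation choice made here for brevity, and all names and sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step: returns the new hidden state and the new cell state."""
    m = h.shape[0]
    a = W @ x + U @ h + b            # pre-activations of all four blocks, stacked
    i = sigmoid(a[:m])               # input gate: how much new information to write
    f = sigmoid(a[m:2 * m])          # forget gate: how much of the old cell state to keep
    o = sigmoid(a[2 * m:3 * m])      # output gate: how much of the cell state to expose
    g = np.tanh(a[3 * m:])           # candidate values to write
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
d, m = 2, 3  # hypothetical input and hidden sizes
W, U, b = rng.normal(size=(4 * m, d)), rng.normal(size=(4 * m, m)), np.zeros(4 * m)
h, c = lstm_step(rng.normal(size=d), np.zeros(m), np.zeros(m), W, U, b)
print(h.shape, c.shape)  # (3,) (3,)
```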
This section provides additional resources on the topic if you want to go deeper.
- Deep Learning Essentials, by Wei Di, Anurag Bhardwaj and Jianing Wei
- Deep Learning, by Ian Goodfellow, Yoshua Bengio and Aaron Courville
In this guide, you learned about recurrent neural networks and their various architectures.
Specifically, you learned:
- How a recurrent neural network handles sequential data
- Unfolding in time in a recurrent neural network
- What backpropagation through time is
- Advantages and disadvantages of RNNs
- Various architectures and variants of RNNs