The attention mechanism was put forth to enhance the performance of the encoder-decoder model for machine translation. The concept underlying it is to enable the decoder to make flexible use of the most relevant parts of the input sequence, by taking a weighted combination of all of the encoded input vectors, with the most relevant vectors being assigned the highest weights.
In this guide, you will find out about the attention mechanism and its implementation.
After going through this guide, you will be aware of:
- How the attention mechanism leverages a weighted sum of all of the encoder hidden states to flexibly focus the decoder on the most relevant parts of the input sequence.
- How the attention mechanism can be generalized for activities where the data might not necessarily be connected in a sequential fashion.
- How to implement the general attention mechanism in Python with NumPy and SciPy.
Tutorial Overview
This tutorial is divided into three portions, which are:
- The Attention Mechanism
- The General Attention Mechanism
- The General Attention Mechanism with NumPy and SciPy
The Attention Mechanism
The attention mechanism was put forth by Bahdanau et al. (2014) to tackle the bottleneck problem that crops up with the leveraging of a fixed-length encoding vector, where the decoder would have restricted access to the information furnished by the input. This is thought to turn particularly problematic for long and/or complicated sequences, whose representation would be forced to have the same dimensionality as that of shorter or simpler sequences.
We had observed that Bahdanau et al.’s attention mechanism is divided into the step-by-step computations of the alignment scores, the weights, and the context vector:
- Alignment scores: an alignment model takes the encoded hidden states, hi, and the prior decoder output, st-1, and scores how well each element of the input matches the output at the present position, et,i = a(st-1, hi).
- Weights: the weights, αt,i, are produced by applying a softmax operation to the alignment scores.
- Context vector: a unique context vector, ct, is fed into the decoder at every time step, computed as the weighted sum of all the encoded hidden states, ct = Σi αt,i hi.
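As a rough illustration of these three steps, here is a minimal NumPy sketch of Bahdanau-style attention for a single decoding step. The encoder states, the prior decoder state, and the parameters of the alignment model below are random placeholder values of our own, not taken from the original paper.

```python
from numpy import array
from numpy import random
from numpy import tanh
from scipy.special import softmax

random.seed(0)
h = random.rand(4, 3)     # encoder hidden states h_i for a four-step input
s_prev = random.rand(3)   # prior decoder output s_{t-1}

# a small feedforward alignment model a(s_{t-1}, h_i) with placeholder parameters
W_a = random.rand(3, 3)
U_a = random.rand(3, 3)
v_a = random.rand(3)

# step 1: alignment scores e_{t,i} = a(s_{t-1}, h_i)
e = array([v_a @ tanh(W_a @ s_prev + U_a @ h_i) for h_i in h])

# step 2: attention weights by a softmax over the alignment scores
alpha = softmax(e)

# step 3: context vector as the weighted sum of the encoder hidden states
c = alpha @ h

print(c)
```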
But the attention mechanism can be re-formulated into a generic form that can be applied to any sequence-to-sequence (abbreviated to seq2seq) activity, where the data may not necessarily be connected in a sequential fashion.
In other words, the database being attended over doesn’t have to consist of the RNN hidden states at differing steps, but can consist of any kind of data instead.
The General Attention Mechanism
The general attention mechanism leverages three primary components, specifically the queries, Q, the keys, K, and the values, V.
If we had to contrast this trio of components to the attention mechanism as put forth by Bahdanau et al., then the query would be analogous to the prior decoder output, st-1, while the values would be analogous to the encoded inputs, hi. In the Bahdanau attention mechanism, the keys and values are the same vector.
In this scenario, we can think of the vector st-1 as a query carried out against a database of key-value pairs, where the keys are vectors and the hidden states, hi, are the values.
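To make the database analogy concrete, the toy sketch below (with made-up keys, values, and a made-up query of our own) contrasts a hard key-value lookup, which returns the value of the single best-matching key, with the soft lookup performed by attention, which returns a weighted blend of all the values.

```python
from numpy import array
from numpy import dot
from scipy.special import softmax

keys = array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])       # database keys
values = array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])   # associated values
query = array([1.0, 0.2])                                # plays the role of s_{t-1}

# hard lookup: pick the value whose key best matches the query
hard_result = values[dot(keys, query).argmax()]

# soft lookup: blend all the values, weighted by how well each key matches the query
weights = softmax(dot(keys, query))
soft_result = weights @ values

print(hard_result)
print(soft_result)
```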
The general attention mechanism then performs the following computations:
- Every query vector is matched against the database of keys to compute a score value, typically by taking the dot product of the query with every key vector.
- The score values are passed through a softmax operation to generate the weights.
- The attention output is then computed as a weighted sum of the value vectors, where every value vector is paired with a corresponding key.
Within the context of machine translation, every word in an input sequence would be attributed its own query, key, and value vectors. These vectors are produced by multiplying the encoder’s representation of the particular word under consideration with three differing weight matrices that would have been learned during training.
Basically, when the generalized attention mechanism is presented with a sequence of words, it takes the query vector attributed to some particular word in the sequence and scores it against every key in the database. In doing so, it generates an attention output for the word being considered.
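As a compact sketch of these computations, the scoring, weighting, and weighted-sum steps can be wrapped into a single reusable function. The function name general_attention and the random toy inputs below are our own; the same steps are built up explicitly in the next portion of the tutorial.

```python
from numpy import random
from scipy.special import softmax

def general_attention(Q, K, V):
    # score every query vector against every key vector
    scores = Q @ K.transpose()
    # scale by the square root of the key dimensionality and normalize with a softmax
    weights = softmax(scores / K.shape[1] ** 0.5, axis=1)
    # return the weighted sum of the value vectors
    return weights @ V

# toy usage with random queries, keys and values of dimensionality three
random.seed(0)
Q, K, V = random.rand(4, 3), random.rand(4, 3), random.rand(4, 3)
print(general_attention(Q, K, V).shape)   # (4, 3)
```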
The General Attention Mechanism with NumPy and SciPy
In this portion of the blog, we will look into how to implement the general attention mechanism leveraging the NumPy and SciPy libraries in Python.
For the sake of simplicity, we will, to start with, calculate the attention for the first word in a sequence of four. We will then generalize the code to calculate an attention output for all four words in matrix form.
Therefore, let’s begin by initially defining the word embeddings of the four differing words for which we will be calculating the attention. Practically speaking, these word embeddings would have been produced by an encoder, but for this particular instance we shall be defining them manually.
```python
# imports used throughout this step-by-step example
from numpy import array
from numpy import random
from numpy import dot
from scipy.special import softmax

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])
```
The next step produces the weight matrices, which we will eventually be multiplying with the word embeddings to produce the queries, keys, and values. Here, we shall be producing these weight matrices randomly, although in actual practice these would have been learned during the course of training.
```python
…
# generating the weight matrices
random.seed(42)  # to allow us to reproduce the same attention values
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))
```
Observe how the number of rows of each of these matrices is equivalent to the dimensionality of the word embeddings (which in this scenario is three) to enable us to carry out the matrix multiplication.
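If in doubt, a quick shape check (our own addition, continuing from the code above) confirms that a three-dimensional embedding multiplied by a 3 x 3 weight matrix yields another three-dimensional vector:

```python
# checking that the embedding and weight matrix shapes are compatible
print(word_1.shape, W_Q.shape)    # (3,) (3, 3)
print((word_1 @ W_Q).shape)       # (3,)
```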
Subsequently, the query, key, and value vectors for every word are produced by multiplying every word embedding by each of the weight matrices.
```python
…
# generating the queries, keys and values
query_1 = word_1 @ W_Q
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V

query_2 = word_2 @ W_Q
key_2 = word_2 @ W_K
value_2 = word_2 @ W_V

query_3 = word_3 @ W_Q
key_3 = word_3 @ W_K
value_3 = word_3 @ W_V

query_4 = word_4 @ W_Q
key_4 = word_4 @ W_K
value_4 = word_4 @ W_V
```
Considering only the starting word for the time being, the next step scores its query vector against all of the key vectors leveraging a dot product operation.
```python
…
# scoring the first query vector against all key vectors
scores = array([dot(query_1, key_1), dot(query_1, key_2), dot(query_1, key_3), dot(query_1, key_4)])
```
The score values are subsequently passed via a softmax operation to produce the weights. Prior to doing so, it is usual practice to divide the score values by the square root of the dimensionality of the key vectors (in this scenario, three), to retain the stability of the gradients.
```python
…
# computing the weights by a softmax operation
weights = softmax(scores / key_1.shape[0] ** 0.5)
```
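As a quick illustration of why this matters (our own addition, continuing from the code above), comparing the scaled and unscaled weights shows that dividing by the square root of the key dimensionality yields a less sharply peaked distribution, which is the behavior the scaling is meant to preserve as the key dimensionality grows:

```python
# comparing the softmax weights with and without the square-root scaling
print(softmax(scores))                           # unscaled: more sharply peaked
print(softmax(scores / key_1.shape[0] ** 0.5))   # scaled: smoother distribution
```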
Lastly, the attention output is calculated by a weighted sum of all four value vectors.
```python
…
# computing the attention by a weighted sum of the value vectors
attention = (weights[0] * value_1) + (weights[1] * value_2) + (weights[2] * value_3) + (weights[3] * value_4)

print(attention)
```
```
[0.98522025 1.74174051 0.75652026]
```
For quicker processing, the same calculations can be implemented in matrix form to produce an attention output for all words in one go.
```python
from numpy import array
from numpy import random
from numpy import dot
from scipy.special import softmax

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])

# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])

# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V

# scoring the query vectors against all key vectors
scores = Q @ K.transpose()

# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

# computing the attention by a weighted sum of the value vectors
attention = weights @ V

print(attention)
```
```
[[0.98522025 1.74174051 0.75652026]
 [0.90965265 1.40965265 0.5       ]
 [0.99851226 1.75849334 0.75998108]
 [0.99560386 1.90407309 0.90846923]]
```
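As a quick consistency check (our own addition), the first row of the matrix result can be compared against the attention output computed earlier for the first word alone:

```python
# verifying that the first row matches the single-word attention output from earlier
from numpy import allclose
print(allclose(attention[0], [0.98522025, 1.74174051, 0.75652026]))   # True
```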
Further Reading
This section furnishes additional resources on the subject, if you are seeking to delve deeper.
Books
- Advanced Deep Learning with Python, 2019
- Deep Learning Essentials, 2018
Papers
- Neural Machine Translation by Jointly Learning to Align and Translate, 2014
Conclusion
In this guide, you found out about the attention mechanism and its implementation.
Particularly, you learned about:
- How the attention mechanism leverages a weighted sum of all of the encoder hidden states to flexibly focus the attention of the decoder on the most relevant parts of the input sequence.
- How the attention mechanism can be generalized for activities where the data might not necessarily be connected in a sequential fashion.
- How to implement the general attention mechanism with NumPy and SciPy.