
### The Attention Mechanism from the ground up

The attention mechanism was introduced to improve the performance of the encoder-decoder model for machine translation. The idea behind it is to allow the decoder to use the most relevant parts of the input sequence in a flexible manner, through a weighted combination of all of the encoded input vectors, with the most relevant vectors receiving the highest weights.

In this guide, you will learn about the attention mechanism and its implementation.

After going through this guide, you will know:

• How the attention mechanism uses a weighted sum of all of the encoder hidden states to flexibly direct the decoder's focus towards the most relevant parts of the input sequence.
• How the attention mechanism can be generalized to tasks where the data may not necessarily be related in a sequential fashion.
• How to implement the general attention mechanism in Python with NumPy and SciPy.

Tutorial Overview

This tutorial is divided into three parts:

• The Attention Mechanism
• The General Attention Mechanism
• The General Attention Mechanism with NumPy and SciPy

The Attention Mechanism

The attention mechanism was introduced by Bahdanau et al. (2014) to address the bottleneck problem that arises when a fixed-length encoding vector is used, leaving the decoder with restricted access to the information provided by the input. This becomes particularly problematic for long and/or complex sequences, whose representation is forced to have the same dimensionality as that of shorter or simpler sequences.

Bahdanau et al.'s attention mechanism is divided into the step-by-step computations of the alignment scores, the weights, and the context vector:
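For reference, these three computations can be written out as follows, using s_{t-1} for the previous decoder hidden state, h_i for the i-th encoder hidden state, and a(·) for the learned alignment model:

```latex
% alignment scores between the previous decoder state and each encoder state
e_{t,i} = a(s_{t-1}, h_i)

% weights: a softmax applied over the alignment scores
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}

% context vector: a weighted sum of the encoder hidden states
c_t = \sum_{i} \alpha_{t,i} h_i
```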

However, the attention mechanism can be re-formulated into a general form that can be applied to any sequence-to-sequence (abbreviated seq2seq) task, where the data may not necessarily be related in a sequential fashion.

In other words, the inputs do not have to be the hidden RNN states at different time steps; they can be any kind of data.

The General Attention Mechanism

The general attention mechanism makes use of three main components, namely the queries, Q, the keys, K, and the values, V.

If we compare these three components to the attention mechanism proposed by Bahdanau et al., then the query is analogous to the previous decoder output, s_{t-1}, while the keys and values are both analogous to the encoded inputs, h_i. In the Bahdanau attention mechanism, the keys and values are the same vectors.

In this scenario, we can think of the vector s_{t-1} as a query executed against a database of key-value pairs, where the keys and values are the encoder hidden states, h_i.

The general attention mechanism then performs the following computations:
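A sketch of these computations, writing q for a query, k_i for the keys, and v_i for the values, and taking the dot product as the scoring function:

```latex
% alignment scores: the query is scored against every key
e_{q,k_i} = q \cdot k_i

% weights: a softmax applied over the scores
\alpha_{q,k_i} = \mathrm{softmax}(e_{q,k_i})

% attention output: a weighted sum of the value vectors
\mathrm{attention}(q, \mathbf{K}, \mathbf{V}) = \sum_i \alpha_{q,k_i} v_i
```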

Within the context of machine translation, every word in an input sequence is assigned its own query, key, and value vectors. These vectors are produced by multiplying the encoder's representation of the word under consideration by three different weight matrices that are learned during training.

In essence, when the generalized attention mechanism is presented with a sequence of words, it takes the query vector of a particular word in the sequence and scores it against every key in the database. In doing so, it generates an attention output for the word under consideration.

The General Attention Mechanism with NumPy and SciPy

In this section, we will explore how to implement the general attention mechanism using the NumPy and SciPy libraries in Python.

For simplicity, we will first calculate the attention for the first word in a sequence of four. We will then generalize the code to calculate an attention output for all four words in matrix form.

Let's begin by defining the word embeddings of the four different words for which we will calculate the attention. In practice, these word embeddings would have been produced by an encoder; for this example, we will define them manually.

```python
from numpy import array

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])
```

The next step generates the weight matrices, which we will eventually multiply by the word embeddings to produce the queries, keys, and values. Here, we generate these weight matrices randomly, although in practice they would have been learned during training.

```python
from numpy import random

# generating the weight matrices
random.seed(42)  # to allow us to reproduce the same attention values
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))
```

Observe that the number of rows of each of these matrices is equal to the dimensionality of the word embeddings (which in this case is three), allowing us to carry out the matrix multiplication.

Subsequently, the query, key, and value vectors for every word are produced by multiplying every word embedding by each of the weight matrices.


```python
# generating the queries, keys and values
query_1 = word_1 @ W_Q
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V

query_2 = word_2 @ W_Q
key_2 = word_2 @ W_K
value_2 = word_2 @ W_V

query_3 = word_3 @ W_Q
key_3 = word_3 @ W_K
value_3 = word_3 @ W_V

query_4 = word_4 @ W_Q
key_4 = word_4 @ W_K
value_4 = word_4 @ W_V
```

Considering only the first word for the time being, the next step scores its query vector against all of the key vectors using a dot product operation.

```python
from numpy import array, dot

# scoring the first query vector against all key vectors
scores = array([dot(query_1, key_1), dot(query_1, key_2), dot(query_1, key_3), dot(query_1, key_4)])
```

The score values are then passed through a softmax operation to generate the weights. Before doing so, it is common practice to divide the score values by the square root of the dimensionality of the key vectors (in this case, three) to keep the gradients stable.
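To make the softmax step concrete, here is a minimal sketch of the operation itself; the helper `softmax_1d` and the example scores are illustrative and not part of the tutorial's own code:

```python
from numpy import array, exp

def softmax_1d(x):
    # subtract the maximum for numerical stability before exponentiating
    e = exp(x - x.max())
    return e / e.sum()

# illustrative scores for one query against four keys
scores = array([8.0, 2.0, 10.0, 2.0])
d_k = 3  # dimensionality of the key vectors
weights = softmax_1d(scores / d_k ** 0.5)
# the weights sum to one, with the largest weight on the highest-scoring key
```

This is the same computation performed by `scipy.special.softmax`, which the tutorial uses below.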

```python
from scipy.special import softmax

# computing the weights by a softmax operation
weights = softmax(scores / key_1.shape[0] ** 0.5)
```

Lastly, the attention output is calculated as a weighted sum of all four value vectors.


```python
# computing the attention by a weighted sum of the value vectors
attention = (weights[0] * value_1) + (weights[1] * value_2) + (weights[2] * value_3) + (weights[3] * value_4)

print(attention)
```

```
[0.98522025 1.74174051 0.75652026]
```

For faster processing, the same calculations can be implemented in matrix form to generate an attention output for all four words in one go.

```python
from numpy import array
from numpy import random
from scipy.special import softmax

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])

# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])

# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V

# scoring the query vectors against all key vectors
scores = Q @ K.transpose()

# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

# computing the attention by a weighted sum of the value vectors
attention = weights @ V

print(attention)
```


```
[[0.98522025 1.74174051 0.75652026]
 [0.90965265 1.40965265 0.5       ]
 [0.99851226 1.75849334 0.75998108]
 [0.99560386 1.90407309 0.90846923]]
```
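As a sanity check, the first row of the matrix-form output should match the attention output computed for the first word on its own. A short sketch, assuming the same seed and embeddings as above:

```python
from numpy import array, random, allclose
from scipy.special import softmax

# encoder representations, stacked into a single array
words = array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])

# the same randomly generated weight matrices as before
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

Q = words @ W_Q
K = words @ W_K
V = words @ W_V

# matrix form: attention outputs for all four words at once
attention_all = softmax(Q @ K.transpose() / K.shape[1] ** 0.5, axis=1) @ V

# vector form: attention output for the first word only
weights_1 = softmax(Q[0] @ K.transpose() / K.shape[1] ** 0.5)
attention_1 = weights_1 @ V

print(allclose(attention_all[0], attention_1))  # True
```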

Further Reading

This section provides further resources on the topic if you are looking to go deeper.

Books

• Advanced Deep Learning with Python, 2019
• Deep Learning Essentials, 2018

Papers

• Neural Machine Translation by Jointly Learning to Align and Translate, 2014

Conclusion

In this guide, you learned about the attention mechanism and its implementation.