
The attention mechanism was introduced to improve the performance of the encoder-decoder model for machine translation. The idea behind it was to allow the decoder to use the most relevant parts of the input sequence in a flexible manner, through a weighted combination of all of the encoded input vectors, with the most relevant vectors being assigned the highest weights.

In this tutorial, you will discover the attention mechanism and its implementation.

After completing this tutorial, you will know:

  • How the attention mechanism uses a weighted sum of all of the encoder hidden states to flexibly focus the attention of the decoder on the most relevant parts of the input sequence. 
  • How the attention mechanism can be generalized for tasks where the information may not necessarily be related in a sequential fashion. 
  • How to implement the general attention mechanism in Python with NumPy and SciPy. 

Tutorial Overview 

This tutorial is divided into three parts, which are: 

  • The Attention Mechanism 
  • The General Attention Mechanism 
  • The General Attention Mechanism with NumPy and SciPy 

The Attention Mechanism 

The attention mechanism was introduced by Bahdanau et al. (2014) to address the bottleneck problem that arises with the use of a fixed-length encoding vector, where the decoder would have limited access to the information provided by the input. This is thought to become particularly problematic for long and/or complex sequences, where the dimensionality of their representation would be forced to be the same as for shorter or simpler sequences.

Bahdanau et al.'s attention mechanism is divided into the step-by-step computations of the alignment scores, the weights, and the context vector:
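
In summary, and using the notation used later in this tutorial (st-1 for the previous decoder hidden state and hi for the encoder hidden states), these computations from Bahdanau et al. (2014) are: 

  • Alignment scores: et,i = a(st-1, hi), produced by a learned alignment model a(.). 
  • Weights: αt,i = softmax(et,i), normalized over all input positions i. 
  • Context vector: ct = Σi αt,i hi, a weighted sum of all the encoder hidden states. 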

However, the attention mechanism can be re-formulated into a general form that can be applied to any sequence-to-sequence (abbreviated to seq2seq) task, where the information may not necessarily be related in a sequential fashion.

In other words, the dataset does not have to consist of the hidden RNN states at different steps; it can contain any kind of information instead.

The General Attention Mechanism 

The general attention mechanism makes use of three main components, namely the queries, Q, the keys, K, and the values, V.

If we compare these three components to the attention mechanism proposed by Bahdanau et al., then the query is analogous to the previous decoder output, st-1, while the values are analogous to the encoded inputs, hi. In the Bahdanau attention mechanism, the keys and values are the same vector.

In this case, we can think of the vector st-1 as a query executed against a database of key-value pairs, where both the keys and the values are given by the encoder hidden states, hi.

The general attention mechanism then performs the following computations: 

  • Each query vector, q = st-1, is matched against a database of keys to compute a score value. This matching operation is computed as the dot product of the query with each key vector, eq,ki = q · ki. 
  • The scores are passed through a softmax operation to generate the weights, αq,ki = softmax(eq,ki). 
  • The generalized attention is then computed as a weighted sum of the value vectors, where each value vector is paired with a corresponding key: attention(q, K, V) = Σi αq,ki vki. 

Within the context of machine translation, each word in an input sequence would be assigned its own query, key, and value vectors. These vectors are generated by multiplying the encoder's representation of the word under consideration with three different weight matrices that would have been learned during training.

Essentially, when the generalized attention mechanism is presented with a sequence of words, it takes the query vector attributed to a particular word in the sequence and scores it against every key in the database. In doing so, it generates an attention output for the word being considered.

The General Attention Mechanism with NumPy and SciPy 

In this section, we will explore how to implement the general attention mechanism using the NumPy and SciPy libraries in Python.

For simplicity, we will first calculate the attention for the first word in a sequence of four. We will then generalize the code to calculate an attention output for all four words in matrix form.

Let's begin by first defining the word embeddings of the four different words for which we will be calculating the attention. In practice, these word embeddings would have been generated by an encoder; however, for this example we will define them manually.

from numpy import array

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])

 

The next step generates the weight matrices, which we will eventually multiply with the word embeddings to generate the queries, keys, and values. Here, we generate these weight matrices randomly, although in practice they would have been learned during training.

from numpy import random

# generating the weight matrices
random.seed(42)  # to allow us to reproduce the same attention values
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

 

Note how the number of rows of each of these matrices is equal to the dimensionality of the word embeddings (which in this case is three), which allows us to carry out the matrix multiplication.
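
As a quick check (the print statement below is an illustrative addition, not part of the original listing), the shapes confirm that each (3,) embedding can be multiplied with a (3, 3) weight matrix: 

# shapes: a (3,) embedding multiplied with a (3, 3) matrix yields a (3,) vector
print(word_1.shape, W_Q.shape, (word_1 @ W_Q).shape)  # (3,) (3, 3) (3,)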

Subsequently, the query, key, and value vectors for every word are produced by multiplying every word embedding by each of the weight matrices. 

 

# generating the queries, keys and values
query_1 = word_1 @ W_Q
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V

query_2 = word_2 @ W_Q
key_2 = word_2 @ W_K
value_2 = word_2 @ W_V

query_3 = word_3 @ W_Q
key_3 = word_3 @ W_K
value_3 = word_3 @ W_V

query_4 = word_4 @ W_Q
key_4 = word_4 @ W_K
value_4 = word_4 @ W_V

 

Considering only the first word for the time being, the next step scores its query vector against all of the key vectors using a dot product operation.

from numpy import dot

# scoring the first query vector against all key vectors
scores = array([dot(query_1, key_1), dot(query_1, key_2), dot(query_1, key_3), dot(query_1, key_4)])

 

The score values are then passed through a softmax operation to generate the weights. Before doing so, it is common practice to divide the score values by the square root of the dimensionality of the key vectors (in this case, three) to keep the gradients stable.

from scipy.special import softmax

# computing the weights by a softmax operation
weights = softmax(scores / key_1.shape[0] ** 0.5)

 

Lastly, the attention output is calculated by a weighted sum of all four value vectors. 

 

# computing the attention by a weighted sum of the value vectors
attention = (weights[0] * value_1) + (weights[1] * value_2) + (weights[2] * value_3) + (weights[3] * value_4)

print(attention)

 

[0.98522025 1.74174051 0.75652026] 

For quicker processing, the same calculations can be implemented in matrix form to produce an attention output for all words in one go. 

from numpy import array
from numpy import random
from numpy import dot
from scipy.special import softmax

# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])

# stacking the word embeddings into a single array
words = array([word_1, word_2, word_3, word_4])

# generating the weight matrices
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

# generating the queries, keys and values
Q = words @ W_Q
K = words @ W_K
V = words @ W_V

# scoring the query vectors against all key vectors
scores = Q @ K.transpose()

# computing the weights by a softmax operation
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

# computing the attention by a weighted sum of the value vectors
attention = weights @ V

print(attention)

[[0.98522025 1.74174051 0.75652026]
 [0.90965265 1.40965265 0.5       ]
 [0.99851226 1.75849334 0.75998108]
 [0.99560386 1.90407309 0.90846923]]
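
For reuse, the matrix-form computation can be wrapped into a small helper function. The sketch below is a minimal refactor of the complete listing above; the function name general_attention and the way the example inputs are rebuilt are illustrative choices made here, not part of the original code. 

from numpy import array, random
from scipy.special import softmax

def general_attention(Q, K, V):
    # score each query against every key, scaled by the square root of the key dimensionality
    scores = Q @ K.transpose() / K.shape[1] ** 0.5
    # softmax over the keys to obtain the attention weights
    weights = softmax(scores, axis=1)
    # weighted sum of the value vectors
    return weights @ V

# example usage with the same words and weight matrices as above
words = array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
random.seed(42)
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))
print(general_attention(words @ W_Q, words @ W_K, words @ W_V))  # matches the output above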

 

Further Reading 

This section provides additional resources on the topic if you are looking to go deeper. 

Books 

  • Advanced Deep Learning with Python, 2019 
  • Deep Learning Essentials, 2018 

Papers 

  • Neural Machine Translation by Jointly Learning to Align and Translate, 2014 

Conclusion 

In this tutorial, you discovered the attention mechanism and its implementation.

Specifically, you learned: 

  • How the attention mechanism uses a weighted sum of all of the encoder hidden states to flexibly focus the attention of the decoder on the most relevant parts of the input sequence. 
  • How the attention mechanism can be generalized for tasks where the information may not necessarily be related in a sequential fashion. 
  • How to implement the general attention mechanism with NumPy and SciPy. 
