What Is Attention?
Attention is becoming increasingly popular within machine learning, but what makes it such an enticing concept? What is the relationship between attention as applied in artificial neural networks (ANNs) and its biological counterpart? What components would one expect to form an attention-based system in machine learning?
In this guide, you will discover an overview of attention and how it is applied within machine learning.
After going through this guide, you will know:
- A brief overview of how attention can manifest itself within the human brain.
- The components that make up an attention-based system, and how these draw inspiration from biological attention.
Tutorial Overview
This tutorial is divided into two parts, which are:
- Attention
- Attention within machine learning
Attention
Attention is a widely studied concept that has often been examined in conjunction with arousal, alertness, and engagement with one’s surroundings.
In its most generic form, attention could be described as merely an overall level of alertness or ability to engage with one’s surroundings.
Visual attention is one of the areas most commonly studied from both the neuroscientific and psychological perspectives.
When a subject is presented with different images, the eye movements that the subject performs can reveal the salient image parts to which the subject’s attention is most attracted. In their review of computational models of visual attention, Itti and Koch (2001) mention that such salient image parts are often characterized by visual attributes that include intensity contrast, oriented edges, corners and junctions, and motion. The human brain attends to these salient visual features at different neuronal stages.
Neurons at the earliest stages are tuned to simple visual attributes such as intensity contrast, colour opponency, orientation, direction and velocity of motion, or stereo disparity at several spatial scales. Neuronal tuning becomes increasingly more specialized with the progression from low-level to high-level visual areas, such that higher-level visual areas include neurons that respond only to corners or junctions, shape-from-shading cues or views of specific real-world objects.
Interestingly, research has also observed that different subjects tend to be attracted to the same salient visual cues.
Research has also found various forms of interaction between memory and attention. Since the human brain has a limited memory capacity, choosing which information to store becomes crucial in making the best use of its limited resources. The human brain does so by relying on attention, dynamically highlighting and exploiting the salient parts of the information at hand. It is this capability of dynamically highlighting and exploiting the salient parts of the information at hand, mirroring how attention works within the human brain, that makes attention such an attractive concept in machine learning.
Attention within machine learning
An attention-based system is thought to consist of three components (a toy sketch in code follows the list):
1. A process that “reads” raw data (such as source words in a source sentence) and converts them into distributed representations, with one feature vector associated with each word position.
2. A list of feature vectors storing the output of the reader. This can be understood as a “memory” containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them.
3. A process that “exploits” the content of the memory to sequentially perform a task, at each time step having the ability to focus attention on the content of one memory element (or a few, with a different weight).
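To make the first two components concrete, here is a minimal sketch in Python. The toy vocabulary, embedding size, and random embeddings are illustrative assumptions standing in for a trained encoder, not part of any particular library; the third component, the attention process itself, is sketched in the encoder-decoder example further below.

```python
import numpy as np

# Toy vocabulary and randomly initialized embeddings: both are
# illustrative assumptions, standing in for a trained encoder.
rng = np.random.default_rng(42)
vocab = {"le": 0, "chat": 1, "dort": 2}  # hypothetical source words
embed_dim = 4
embeddings = rng.normal(size=(len(vocab), embed_dim))

def read(source_words):
    """Component 1: 'read' raw words into distributed representations."""
    return np.stack([embeddings[vocab[w]] for w in source_words])

# Component 2: the list of feature vectors acts as a memory, with one
# vector per word position, retrievable later in any order.
memory = read(["le", "chat", "dort"])
print(memory.shape)  # (3, 4): one feature vector per source position
```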
Let’s take the encoder-decoder framework as an example, since it was within such a framework that the attention mechanism was first introduced.
If we are processing an input sequence of words, this will first be fed into an encoder, which will output a vector for every element in the sequence. This corresponds to the first component of the attention-based system described above.
A list of these vectors (the second component of the attention-based system above), together with the decoder’s previous hidden states, will be exploited by the attention mechanism to dynamically highlight which of the input information will be used to generate the output.
At each time step, the attention mechanism takes the previous hidden state of the decoder and the list of encoded vectors, and uses them to generate unnormalized score values that indicate how well the elements of the input sequence align with the current output. Since the generated score values need to make relative sense in terms of their importance, they are normalized by passing them through a softmax function to generate the weights. Following the softmax normalization, all the weight values will lie in the interval [0, 1] and will add up to 1, which means they can be interpreted as probabilities. Finally, the encoded vectors are scaled by the computed weights to generate a context vector.
This attention procedure forms the third component of the attention-based system above. It is this context vector that is then fed into the decoder to generate a translated output.
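As a rough sketch of this procedure, the snippet below computes the alignment scores, softmax weights, and context vector for a single decoding step. Dot-product scoring is an assumption made here for brevity; the mechanism as first proposed used a small neural network to score the alignment.

```python
import numpy as np

rng = np.random.default_rng(0)
encoded = rng.normal(size=(3, 4))      # the memory of encoded vectors
decoder_state = rng.normal(size=(4,))  # the decoder's previous hidden state

# Unnormalized scores: how well each input position aligns with the
# current output (dot-product scoring is an assumption for brevity).
scores = encoded @ decoder_state

# Softmax normalization: the weights lie in [0, 1] and sum to 1,
# so they can be read as probabilities.
weights = np.exp(scores) / np.exp(scores).sum()

# The context vector is the weighted sum of the encoded vectors.
context = weights @ encoded

print(weights.sum())  # ~1.0: behaves like a probability distribution
print(context.shape)  # (4,): fed to the decoder at this time step
```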
This form of artificial attention is, thus, a form of iterative re-weighting. Specifically, it dynamically highlights different components of a pre-processed input as they are needed for output generation. This makes it flexible and context dependent, like biological attention.
The processing implemented by a model that incorporates an attention mechanism contrasts with one that does not. In the latter, the encoder would generate a fixed-length vector regardless of the input’s length or complexity. Without a mechanism that highlights the salient information across the entirety of the input, the decoder would only have access to the limited information encoded within the fixed-length vector. This could result in the decoder missing important information.
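Continuing the toy example, the snippet below contrasts the two cases: without attention, the decoder reuses one fixed-length summary at every step (here, illustratively, the encoder’s final vector), whereas with attention the context is recomputed from fresh weights at each step.

```python
import numpy as np

rng = np.random.default_rng(1)
encoded = rng.normal(size=(3, 4))         # per-position encoder outputs
decoder_states = rng.normal(size=(2, 4))  # two successive decoder states

# Attention-free: one static vector, reused at every decoding step
# (using the final encoder output here is an illustrative choice).
fixed_context = encoded[-1]

for state in decoder_states:
    scores = encoded @ state
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ encoded  # with attention: varies with the decoder state
    print(weights)               # a different emphasis at each step
```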
The attention mechanism was initially proposed to process sequences of words in machine translation, which have an implicit temporal aspect to them. However, it can be generalized to process information that can be static, and not necessarily related in a sequential fashion, such as in the context of image processing. We will see how this generalization can be achieved in another tutorial.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- Deep Learning Essentials, 2018
- Deep Learning, 2017
Papers
- Attention in Psychology, Neuroscience, and Machine Learning, 2020
- Computational modelling of visual attention, 2001
Conclusion
In this guide, you discovered an overview of attention and its application within machine learning.
Specifically, you learned:
- A brief overview of how attention can manifest itself in the human brain.
- The components that make up an attention-based system, and how these draw inspiration from biological attention.