
A Bird’s Eye View of Research on Attention
Attention is a concept studied across several scientific disciplines, including psychology, neuroscience and, more recently, machine learning. While each discipline may have produced its own definition of attention, there is one core quality they can all agree on: attention is a mechanism for making both biological and artificial neural systems more flexible.
In this guide, you will discover an overview of the research advances on attention.
After completing this guide, you will know:
- The concept of attention that is of significance to different scientific disciplines.
- How attention is revolutionizing machine learning, specifically in the domains of natural language processing and computer vision.
Tutorial Overview
This tutorial is divided into two parts:
- The Concept of Attention
- Attention in Machine Learning
  - Attention in Natural Language Processing
  - Attention in Computer Vision
The Concept of Attention
Research on attention finds its origin in the field of psychology.
The scientific study of attention began in psychology, where careful behavioural experimentation can give rise to precise demonstrations of the tendencies and abilities of attention in different circumstances.
Observations from such studies can help researchers infer the mental processes underlying these behavioural patterns.
While the different fields of psychology, neuroscience and, more recently, machine learning have all produced their own definitions of attention, there is one core quality that is of great significance to all.
Attention can be referred to as the flexible control of limited computational resources.
With this in mind, the following sections review the role of attention in revolutionizing the field of machine learning.
Attention in Machine Learning
The concept of attention in machine learning is very loosely inspired by the psychological mechanisms of attention in the human brain.
The use of attention mechanisms in artificial neural networks came about, much like the apparent need for attention in the brain, as a means of making neural systems more flexible.
The idea is to have an artificial neural network that performs well on tasks where the input may be of variable length, size or structure, or that can even handle several different tasks. It is in this spirit that attention mechanisms in machine learning are said to draw their inspiration from psychology, rather than because they replicate the biology of the human brain.
In the form of attention originally developed for artificial neural networks (ANNs), attention mechanisms worked within an encoder-decoder framework and in the context of sequence models.
The task of the encoder is to generate a vector representation of the input, while the task of the decoder is to transform this vector representation into an output. The attention mechanism connects the two.
Several neural network architectures that implement attention mechanisms have been proposed, and these are tied to the specific applications in which they are used. Natural language processing (NLP) and computer vision are among the most popular applications.
Attention in Natural Language Processing
An early application of attention in NLP was machine translation, where the goal was to translate an input sentence in a source language into an output sentence in a target language. In this context, the encoder would generate a set of context vectors, one for each word in the source sentence. The decoder, on the other hand, would read the context vectors to generate an output sentence in the target language, one word at a time.
In the traditional encoder-decoder framework without attention, the encoder produced a fixed-length vector that was independent of the length or features of the input and remained static during the course of decoding.
Representing the input by a fixed-length vector was especially problematic for long sequences or sequences that were complex in structure, since the dimensionality of their representation was forced to be the same as for shorter or simpler sequences.
For example, in some languages, such as Japanese, the final word might be very important for predicting the first word, while translating English to French might be easier because the order of the sentences (how the sentence is organized) is more similar between the two.
This created a bottleneck, whereby the decoder has limited access to the information provided by the input: only what is available within the fixed-length encoding vector. On the other hand, preserving the length of the input sequence during the encoding process could make it possible for the decoder to use its most relevant parts in a flexible manner.
The latter is how the attention mechanism functions.
Attention helps determine which of these vectors should be used to generate the output. Because the output sequence is produced dynamically, one element at a time, attention can dynamically highlight different encoded vectors at each time step. This allows the decoder to flexibly use the most relevant parts of the input sequence.
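To make the idea concrete, here is a minimal sketch of a single decoding step, written in plain Python to stay dependency-free. The dot-product scoring, the helper names softmax and attend, and the toy vectors are all assumptions made for illustration; they are not taken from any of the papers discussed here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_vectors):
    """Return (weights, context) for one decoding step."""
    # Assumption: relevance is scored with a dot product; other choices exist.
    scores = [sum(d * e for d, e in zip(decoder_state, vec))
              for vec in encoder_vectors]
    weights = softmax(scores)
    dim = len(encoder_vectors[0])
    # The context vector is the weighted sum of the encoded vectors.
    context = [sum(w * vec[i] for w, vec in zip(weights, encoder_vectors))
               for i in range(dim)]
    return weights, context

# Three encoded source words, each a 2-dimensional vector (toy values).
encoder_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two different decoder states place their weight on different source words.
weights_a, context_a = attend([4.0, 0.0], encoder_vectors)
weights_b, context_b = attend([0.0, 4.0], encoder_vectors)
print([round(w, 3) for w in weights_a])
print([round(w, 3) for w in weights_b])
```

The first decoder state favours the first and third encoded vectors, while the second favours the second and third, which is the dynamic re-weighting described above.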
One of the earliest works in machine translation that sought to address the bottleneck created by fixed-length vectors was by Bahdanau et al. (2014). In their work, Bahdanau et al. employed Recurrent Neural Networks (RNNs) for both encoding and decoding. The encoder uses a bi-directional RNN to generate a sequence of annotations, each of which summarizes both the preceding and succeeding words, and which can be mapped into a context vector through a weighted sum. The decoder then generates an output based on these annotations and the hidden states of another RNN. Because the context vector is computed as a weighted sum of the annotations, Bahdanau et al.’s attention mechanism is an example of soft attention.
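The weighted sum can be written compactly. In the notation commonly used for this model, h_j is the annotation for source position j, s_{i-1} is the previous decoder hidden state, T_x is the source length, and a is a small feedforward alignment model trained jointly with the rest of the network:

```latex
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij} \, h_j
```

The weights \alpha_{ij} are positive and sum to one over the source positions, so the context vector c_i is a soft, differentiable blend of all the annotations.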
Another early work was by Sutskever et al. (2014), who instead used a multilayered long short-term memory (LSTM) network to encode the input sequence into a vector, and another LSTM to decode the vector into a target sequence.
Luong et al. (2015) introduced the idea of global versus local attention. In their work, they describe a global attention model as one that considers all the hidden states of the encoder when deriving the context vector. The global context vector is therefore computed as a weighted average of all the words in the source sequence. Luong et al. mention that this is computationally expensive and could make global attention difficult to apply to long sequences. Local attention is proposed to address this problem by focusing on a smaller subset of the words in the source sequence for each target word. Luong et al. explain that local attention is a trade-off between the soft and hard attentional models of Xu et al. (2016) (we will refer to this paper again in the next section): it is less computationally expensive than soft attention, but easier to train than hard attention.
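The difference between the two can be sketched in a few lines. This is a simplified illustration rather than Luong et al.’s implementation: the alignment scores are toy values, the function names are hypothetical, the window centre is passed in by hand, and the Gaussian weighting that one of their local variants applies around the centre is left out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_weights(scores):
    """Global attention: every source position receives a weight."""
    return softmax(scores)

def local_weights(scores, center, half_width):
    """Local attention: only positions in [center - D, center + D] are used."""
    lo = max(0, center - half_width)
    hi = min(len(scores), center + half_width + 1)
    window = softmax(scores[lo:hi])
    # Positions outside the window are ignored, so their weight is zero.
    return [0.0] * lo + window + [0.0] * (len(scores) - hi)

scores = [0.1, 2.0, 0.3, 1.5, 0.2, 0.9, 0.4]  # toy alignment scores
global_w = global_weights(scores)
local_w = local_weights(scores, center=3, half_width=1)
print([round(w, 3) for w in global_w])
print([round(w, 3) for w in local_w])
```

With a window of fixed width, the cost of each decoding step no longer grows with the length of the source sentence, which is the saving that motivates local attention.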
More recently, Vaswani et al. (2017) proposed an entirely different architecture that has steered the field of machine translation in a new direction. Called the Transformer, their architecture dispenses with recurrence and convolutions altogether and instead implements a self-attention mechanism. Words in the source sequence are first encoded in parallel to generate key, query and value representations. The keys and queries are combined to generate attention weightings that capture how each word relates to the others in the sequence. These attention weightings are then used to scale the values, in order to retain focus on the important words and drown out the irrelevant ones.
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
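The compatibility function used in the Transformer is the scaled dot product, softmax(QK^T / sqrt(d_k))V. The following is a minimal sketch in plain Python. One simplification is assumed: the queries, keys and values are all set equal to the toy input vectors, whereas the Transformer obtains each of them from the input through a separate learned linear projection.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V one query at a time."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output for this query: weighted sum of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Self-attention over three toy word vectors (identity projections assumed).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights describes how strongly one word attends to every word in the sequence, itself included. Because no step depends on the result for a previous word, all rows can be computed in parallel.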
At the time, the proposed Transformer architecture established a new state of the art on English-to-German and English-to-French translation tasks, and was reportedly also faster to train than architectures based on recurrent or convolutional layers. Subsequently, the method called BERT by Devlin et al. (2019) built on Vaswani et al.’s work by proposing a multi-layer bi-directional architecture.
As we shall see shortly, the uptake of the Transformer architecture was rapid not only in NLP, but in computer vision as well.
Attention in Computer Vision
In computer vision, attention has found its way into several applications, such as image classification, image segmentation, and image captioning.
If we were to reframe the encoder-decoder model for the task of image captioning, for example, the encoder can be a convolutional neural network (CNN) that captures the salient visual cues in the images into a vector representation, while the decoder can be an RNN or LSTM that transforms the vector representation into an output.
As in the neuroscience literature, these attentional processes can be divided into spatial and feature-based attention.
In spatial attention, different spatial locations are attributed different weights, but these same weights are retained across all feature channels at the different spatial locations.
One of the fundamental image captioning approaches that works with spatial attention was proposed by Xu et al. (2016). Their model uses a CNN as an encoder that extracts a set of feature vectors (or annotation vectors), each corresponding to a different part of the image, so that the decoder can focus selectively on specific image regions. The decoder is an LSTM that generates a caption based on a context vector, the previous hidden state, and the previously generated words. Xu et al. investigate the use of hard attention as an alternative to soft attention for computing their context vector. Here, soft attention places weights softly on all patches of the source image, while hard attention attends to a single patch alone and disregards the rest. They report that, in their work, hard attention performs better.
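The contrast between the two ways of forming the context vector can be sketched as follows. The annotation vectors and weights are toy values, the function names are hypothetical, and the hard variant is shown here as sampling a single region with the attention weights as probabilities, which is one common reading of hard attention rather than a reproduction of Xu et al.’s training procedure.

```python
import random

def soft_context(weights, annotations):
    """Soft attention: weighted sum over all annotation vectors."""
    dim = len(annotations[0])
    return [sum(w * a[i] for w, a in zip(weights, annotations))
            for i in range(dim)]

def hard_context(weights, annotations, rng):
    """Hard attention: pick one annotation vector, sampled by its weight."""
    index = rng.choices(range(len(annotations)), weights=weights, k=1)[0]
    return annotations[index]

# One toy annotation vector per image region, with attention weights.
annotations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
weights = [0.1, 0.6, 0.2, 0.1]

rng = random.Random(0)  # seeded so that the sketch is repeatable
soft = soft_context(weights, annotations)
hard = hard_context(weights, annotations, rng)
print(soft)
print(hard)
```

The soft context is a blend that need not match any single region, and it is differentiable with respect to the weights. The hard context is always exactly one region, and the sampling step is what makes hard attention more difficult to train.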
Feature attention, in contrast, allows individual feature maps to be attributed their own weight values. One such example, also applied to image captioning, is the encoder-decoder framework of Chen et al. (2018), which incorporates spatial and channel-wise attention in the same CNN.
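The two kinds of weighting differ only in which axis of the feature tensor the weights are shared across. Here is a minimal sketch on a toy tensor of 2 channels, each a 2x2 feature map, stored as nested lists. The weights are fixed by hand for illustration; in a model such as that of Chen et al. they would be computed by the network.

```python
def apply_spatial_attention(features, spatial_weights):
    """Scale each location by one weight shared across all channels."""
    return [[[value * spatial_weights[r][c] for c, value in enumerate(row)]
             for r, row in enumerate(channel)]
            for channel in features]

def apply_channel_attention(features, channel_weights):
    """Scale each feature map by one weight shared across all locations."""
    return [[[value * weight for value in row] for row in channel]
            for channel, weight in zip(features, channel_weights)]

# Toy feature tensor: 2 channels, each a 2x2 feature map of ones.
features = [[[1.0, 1.0], [1.0, 1.0]],
            [[1.0, 1.0], [1.0, 1.0]]]
spatial_weights = [[0.7, 0.1],
                   [0.1, 0.1]]   # one weight per location
channel_weights = [0.9, 0.1]     # one weight per feature map

spatial_out = apply_spatial_attention(features, spatial_weights)
channel_out = apply_channel_attention(features, channel_weights)
print(spatial_out)
print(channel_out)
```

After spatial attention both channels carry the same pattern, emphasising the top-left location. After channel-wise attention every location within a feature map is scaled equally, and the first feature map dominates.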
Just as the Transformer has quickly become the standard architecture for NLP tasks, it has also recently been taken up and adapted by the computer vision community.
The earliest work to do so was proposed by Dosovitskiy et al. (2020), who applied their Vision Transformer (ViT) to an image classification task. They argued that the long-standing reliance on CNNs for image classification was not necessary, and that the same task could be accomplished by a pure transformer. Dosovitskiy et al. reshape an input image into a sequence of flattened 2D image patches, which they then embed with a trainable linear projection to generate the patch embeddings. These patch embeddings, together with position embeddings that retain positional information, are fed into the encoder part of the Transformer architecture, whose output is then fed into a multilayer perceptron (MLP).
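The reshaping step can be sketched as follows. This covers only the splitting of an image into flattened, non-overlapping patches; the trainable linear projection and the position embeddings are left out. The function name and the tiny 4x4 image are assumptions made for illustration.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row, keeping each pixel's channels together.
            patch = [channel_value
                     for row in image[top:top + patch_size]
                     for pixel in row[left:left + patch_size]
                     for channel_value in pixel]
            patches.append(patch)
    return patches

# Toy 4x4 image with 3 channels; the pixel at (r, c) stores [r, c, 0].
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch_size=2)
print(len(patches), len(patches[0]))
```

An image of height H and width W with C channels yields HW / P^2 patches of length P^2 * C for patch size P, so the sequence length seen by the Transformer encoder depends on the image resolution and the patch size.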
Arnab et al. (2021) subsequently extended the ViT model to ViViT, which exploits the spatiotemporal information contained in videos for the task of video classification. They describe their motivation as twofold: the success of ViT, and the fact that attention-based architectures are an intuitive choice for modelling long-range contextual relationships in video, which led them to develop several transformer-based models for video classification.
Their method explores different approaches to extracting the spatiotemporal data, such as sampling and embedding each frame independently, or extracting non-overlapping tubelets (an image patch that spans several image frames, creating a tube) and embedding each one in turn. They also investigate different methods of factorising the spatial and temporal dimensions of the input video, for increased efficiency and scalability.
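Tubelet extraction is the same idea as image patching with an added time axis. The following is a rough sketch on a toy single-channel video stored as nested lists. The function name and the tubelet dimensions are hypothetical, and the embedding step that follows extraction is omitted.

```python
def video_to_tubelets(video, t, h, w):
    """Split a T x H x W video into flattened, non-overlapping t x h x w tubelets."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % t or height % h or width % w:
        raise ValueError("video dimensions must be divisible by the tubelet size")
    tubelets = []
    for f0 in range(0, frames, t):
        for r0 in range(0, height, h):
            for c0 in range(0, width, w):
                # Gather the same spatial patch from t consecutive frames.
                tubelet = [value
                           for frame in video[f0:f0 + t]
                           for row in frame[r0:r0 + h]
                           for value in row[c0:c0 + w]]
                tubelets.append(tubelet)
    return tubelets

# Toy video: 4 frames of 4x4 pixels, each pixel storing its frame index.
video = [[[f for _ in range(4)] for _ in range(4)] for f in range(4)]
tubelets = video_to_tubelets(video, t=2, h=2, w=2)
print(len(tubelets), len(tubelets[0]))
```

Each tubelet mixes pixels from consecutive frames, so a single token already carries some temporal information before any attention is applied.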
Since its first application to image classification, the Vision Transformer has already been applied to several other computer vision domains, such as action localization, gaze estimation, and image generation. This surge of interest among computer vision practitioners suggests an exciting near future, where we will see more adaptations and applications of the Transformer architecture.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
Deep Learning Essentials, 2018
Papers
Attention in Psychology, Neuroscience, and Machine Learning, 2020
Neural Machine Translation by Jointly Learning to Align and Translate, 2014
Sequence to Sequence Learning with Neural Networks, 2014
Effective Approaches to Attention-based Neural Machine Translation, 2015
Attention Is All You Need, 2017
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2016
SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning, 2018
An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, 2020
ViViT: A Video Vision Transformer, 2021
Example Applications
Relation modelling in Spatio-Temporal Action Localization, 2021
Gaze Estimation using Transformer, 2021
ViTGAN: Training GANs with Vision Transformers, 2021
Conclusion
In this guide, you discovered an overview of the research advances on attention.
Specifically, you learned:
- The concept of attention that is of significance to different scientific disciplines.
- How attention is revolutionizing machine learning, specifically in the domains of natural language processing and computer vision.