Caption Generation with the Inject and Merge Encoder-Decoder Models
Caption generation is a challenging artificial intelligence problem that draws on both computer vision and natural language processing.
The encoder-decoder recurrent neural network architecture has been demonstrated to be effective at this problem. The implementation of this architecture can be distilled into inject and merge based models, and each makes different assumptions about the role of the recurrent neural network in tackling the problem.
In this blog article by AICorespot, you will discover the inject and merge architectures for encoder-decoder recurrent neural network models on caption generation.
After going through this post, you will know:
- The challenge of caption generation and the use of the encoder-decoder architecture.
- The inject model that combines the encoded image with every word to generate the next word in the caption.
- The merge model that separately encodes the image and the description, which are then decoded in order to generate the next word in the caption.
Image Caption Generation
The problem of image caption generation involves outputting a readable and concise description of the contents of a photograph.
It is a challenging artificial intelligence problem as it requires both techniques from computer vision to interpret the contents of the photograph and techniques from natural language processing to generate the textual description.
Lately, deep learning methods have achieved state-of-the-art results on this challenging problem. The results are so impressive that the problem has become a standard demonstration of the capabilities of deep learning.
Encoder-Decoder Architecture
A conventional encoder-decoder recurrent neural network architecture is used to tackle the image caption generation problem.
This consists of two elements:
1. Encoder: A network model that reads the input photograph and encodes its content into a fixed-length vector leveraging an internal representation.
2. Decoder: A network model that reads the encoded photograph and produces the textual description output.
Typically, a convolutional neural network is leveraged to encode the image, and a recurrent neural network, like a long short-term memory network, is leveraged to encode the text sequence produced so far, to generate the next word in the sequence, or both.
There are several ways to realize this architecture for the problem of caption generation.
It is typical to leverage a pre-trained convolutional neural network model, trained on a challenging photograph classification problem, to encode the photograph. The pre-trained model can be loaded, the output layer of the model removed, and the internal representation of the photograph leveraged as the encoding of the input image.
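As a concrete illustration, below is a minimal sketch of this encoding step using a pre-trained VGG16 model in Keras. The choice of VGG16, the file name, and the use of the second-to-last layer as the feature vector are illustrative assumptions, not requirements of the approach.

```python
# A minimal sketch of encoding a photo with a pre-trained CNN (VGG16 here,
# though any pre-trained image classification model could be used).
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Load VGG16 and drop the final classification layer, keeping the
# 4096-element internal representation as the image encoding.
base = VGG16()
encoder = Model(inputs=base.inputs, outputs=base.layers[-2].output)

# Load and prepare a photograph (the file name is illustrative).
image = load_img('example.jpg', target_size=(224, 224))
image = img_to_array(image)
image = preprocess_input(image.reshape((1, 224, 224, 3)))

# The fixed-length feature vector used as the photo encoding.
features = encoder.predict(image)  # shape: (1, 4096)
```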
It is also typical to frame the problem such that the model produces a single word of the output textual description, provided both the photograph and the description generated thus far as input. In this framing, the model is called recursively until the entire output sequence is produced.
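The following sketch shows what this recursive framing can look like at inference time. It assumes a trained `model` that maps a photo encoding and a padded caption prefix to a next-word distribution, a fitted Keras `tokenizer`, and hypothetical 'startseq'/'endseq' boundary tokens; all of these names are assumptions for illustration.

```python
# A sketch of the recursive framing: the model predicts one word at a time,
# given the photo features and the caption generated so far.
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

def generate_caption(model, tokenizer, photo_features, max_length):
    caption = 'startseq'
    for _ in range(max_length):
        # Encode the caption generated so far as a padded integer sequence.
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        # Predict the next word from the photo and the partial caption.
        yhat = model.predict([photo_features, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(yhat)))
        if word is None or word == 'endseq':
            break
        caption += ' ' + word
    return caption
```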
This framing can be implemented leveraging one of two architectures, referred to by Marc Tanti, et al. as the inject and merge models.
Inject Model
The inject model combines the encoded form of the image with each word from the text description produced thus far.
This strategy leverages the recurrent neural network as a text generation model that takes a sequence of both image and word information as input in order to produce the next word in the sequence.
In these ‘inject’ architectures, the image vector (typically derived from the activation values of a hidden layer in a convolutional neural network) is injected into the RNN, for instance, by treating the image vector on par with a ‘word’ and including it as part of the caption prefix.
This model ties the concerns of the image to each input word, requiring the RNN to develop an encoding that integrates visual and linguistic data together.
In an inject model, the RNN is trained to predict sequences on the basis of histories consisting of both linguistic and perceptual features. Therefore, in this model, the RNN is primarily responsible for image-conditioned language generation.
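Below is a minimal sketch of one inject variant in Keras, in which the image vector is projected into the word embedding space and prepended to the caption prefix as if it were a word. The vocabulary size, caption length, and layer sizes are illustrative assumptions, not values from Tanti, et al.

```python
# A sketch of an inject architecture: the RNN reads the image as the first
# 'word' of the sequence, followed by the caption prefix.
from tensorflow.keras.layers import (Input, Dense, Embedding, LSTM,
                                     RepeatVector, concatenate)
from tensorflow.keras.models import Model

vocab_size, max_length = 7579, 34  # illustrative values

# Image branch: project the photo features into the embedding space.
image_input = Input(shape=(4096,))
image_embed = Dense(256, activation='relu')(image_input)
image_step = RepeatVector(1)(image_embed)  # a single 'word' timestep

# Text branch: embed the caption prefix.
text_input = Input(shape=(max_length,))
text_embed = Embedding(vocab_size, 256, mask_zero=False)(text_input)

# Inject: the RNN consumes the image followed by the words.
sequence = concatenate([image_step, text_embed], axis=1)
hidden = LSTM(256)(sequence)
output = Dense(vocab_size, activation='softmax')(hidden)

model = Model(inputs=[image_input, text_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```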
Merge Model
The merge model combines both the encoded form of the image input and the encoded form of the text description produced thus far.
The combination of these two encoded inputs is then leveraged by a very simple decoder model to produce the next word in the sequence.
The approach leverages the recurrent neural network only to encode the text produced thus far.
In the case of ‘merge’ architectures, the image is left out of the RNN subnetwork, such that the RNN handles only the caption prefix, that is, handles only purely linguistic data. After the prefix has been vectorized, the image vector is then merged with the prefix vector in a separate “multimodal layer” which comes after the RNN subnetwork.
This separates the concerns of modelling the image input, the text input, and the combining and interpretation of the encoded inputs. As mentioned, it is typical to leverage a pre-trained model for encoding the image, but likewise, this architecture also permits a pre-trained language model to be leveraged to encode the caption text input.
In the merge architecture, the RNN essentially encodes linguistic representations, which themselves constitute the input to a later prediction stage that comes after a multimodal layer. It is only at this later stage that image features are leveraged to condition predictions.
There are several ways to combine the two encoded inputs, such as concatenation, multiplication, and addition, although experiments by Marc Tanti, et al. have demonstrated addition to work better.
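The following sketch shows a merge architecture in Keras under the same illustrative assumptions as the inject sketch above, combining the two encodings by element-wise addition.

```python
# A sketch of the merge architecture: the RNN encodes only the caption
# prefix; image features are merged afterwards in a multimodal layer.
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM, Dropout, add
from tensorflow.keras.models import Model

vocab_size, max_length = 7579, 34  # illustrative values

# Image branch: compress the photo features to the decoder size.
image_input = Input(shape=(4096,))
image_encoding = Dense(256, activation='relu')(Dropout(0.5)(image_input))

# Text branch: the RNN sees only linguistic data.
text_input = Input(shape=(max_length,))
text_embed = Embedding(vocab_size, 256, mask_zero=True)(text_input)
text_encoding = LSTM(256)(Dropout(0.5)(text_embed))

# Multimodal layer: merge by addition, then decode the next word.
merged = add([image_encoding, text_encoding])
decoder = Dense(256, activation='relu')(merged)
output = Dense(vocab_size, activation='softmax')(decoder)

model = Model(inputs=[image_input, text_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```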
Generally, Marc Tanti, et al. identified the merge architecture to be more effective than the inject strategy.
Cumulatively, the evidence indicates that delaying the merging of image features with linguistic encodings to a later stage in the architecture may be advantageous; the outcomes indicate that a merge architecture has a higher capacity than an inject architecture and can produce better quality captions with fewer layers.
More on the Merge Model
The success of the merge model for the encoder-decoder architecture indicates that the role of the recurrent neural network is to encode input rather than to produce output.
This is a departure from the typical understanding, where it is believed that the contribution of the recurrent neural network is that of a generative model.
If the RNN had the main role of producing captions, then it would need to have access to the image in order to know what to produce. This does not appear to be the case, as including the image in the RNN is not typically advantageous to its performance as a caption generator.
The explicit comparison of the inject and merge models, and the success of merge over inject for caption generation, raises the question of whether this strategy translates to related sequence-to-sequence generation problems.
Rather than leveraging pre-trained models to encode images, pre-trained language models could be leveraged to encode source text in problems like text summarization, question answering, and machine translation.
We would like to investigate whether similar alterations in architecture would function in sequence-to-sequence tasks like machine translation, where rather than conditioning a language model on an image, we are conditioning a target language model on sentences in a source language.
Further Reading
This section provides additional resources on the subject if you are looking to delve deeper.
- Marc Tanti’s Blog
- Encoder-Decoder Long Short-Term Memory Networks
- Where to put the image in an image caption generator, 2017
- What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?, 2017
Conclusion
In this blog article, you discovered the inject and merge architectures for the encoder-decoder recurrent neural network model on caption generation.
Particularly, you learned:
- The challenge of caption generation and the use of the encoder-decoder architecture.
- The inject model that combines the encoded image with every word to produce the next word in the caption.
- The merge model that separately encodes the image and the description, which are then decoded in order to produce the next word in the caption.