Image Captioning through Image Transformer
Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe, and in which order. Inspired by successes in text analysis and translation, previous work has proposed the transformer architecture for image captioning. However, the structure of the semantic units in images (usually the regions detected by an object detection model) differs from that in sentences (single words). Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationships between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as input, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks.
Image captioning is the task of describing the content of an image in words. The problem of automatic image captioning by AI systems has received a lot of attention in recent years, due to the success of deep learning models for both language and image processing. Most image captioning approaches in the literature are based on a translational approach, with a visual encoder and a linguistic decoder. One challenge in automatic translation is that it cannot be done word by word: other words influence the meaning, and therefore the translation, of a word. This is even more true when translating across modalities, from images to text, where the system must decide what in the image should be described. A common solution to this challenge relies on attention mechanisms. For example, previous image captioning models try to solve where to look in the image [35, 4, 2, 24] (now partly solved by the Faster-RCNN object detection model) in the encoding stage, and use a recurrent neural network with an attention mechanism in the decoding stage to generate the caption. But beyond deciding what to describe in the image, recent image captioning models propose to use attention to learn how regions of the image relate to each other, effectively encoding their context in the image.
in the image. Graph convolutional neural networks were first introduced to relate regions in the image; however, those approaches [37, 36, 9, 38]
usually require auxiliary models (e.g. visual relationship detection and/or attribute detection models) to build the visual scene graph in the image in the first place. In contrast, in the natural language processing field, the transformer architecture was developed to relate embedded words in sentences, and can be trained end to end without auxiliary models explicitly detecting such relations. Recent image captioning models [14, 19, 12] adopted the transformer architectures to implicitly relate informative regions in the image through dot-product attention achieving state-of-the-art performance.
However, the transformer architecture was designed for machine translation of text. In a text, a word is either to the left or to the right of another word, at varying distances. In contrast, images are two-dimensional (and indeed represent three-dimensional scenes), so that a region may not only be to the left or right of another region, it may also contain or be contained in another region. The relative spatial relationships between the semantic units in images thus have more degrees of freedom than those in sentences. Furthermore, in the decoding stage of machine translation, a word is usually translated into another word in the target language (one-to-one decoding), whereas for an image region, we may describe its context, its attributes and/or its relationships with other regions (one-to-many decoding). One limitation of previous transformer-based image captioning models [14, 19, 12] is that they adopt the transformer's internal architecture designed for machine translation, where each transformer layer contains a single (multi-head) dot-product attention module. In this paper, we introduce the image transformer for image captioning, where each transformer layer implements multiple sub-transformers, to encode spatial relationships between image regions and decode the diverse information in image regions.
The difference between our method and previous transformer-based models [14, 12, 19] is that our method focuses on the inner architecture of the transformer layer, in which we widen the transformer module. Yao et al. also used a hierarchical concept in the encoding part of their model, but our model uses a graph hierarchy whereas their method uses a tree hierarchy. Furthermore, our model does not require auxiliary models (i.e., for visual relation detection and instance segmentation) to build the visual scene graph. Our encoding method can be viewed as the combination of a visual semantic graph and a spatial graph, using a transformer layer to implicitly combine them without auxiliary relationship and attribute detectors.
The contributions of this paper can be summarised as follows:
We propose a novel internal architecture for the transformer layer adapted to the image captioning task, with a modified attention module suited to the complex natural structure of image regions.
We report thorough experiments and an ablation study to validate our proposed architecture; state-of-the-art performance is achieved on the MSCOCO image captioning offline and online testing datasets with only region features as input.
The rest of the paper is organized as follows: Sec. 2 reviews related attention-based image captioning models; Sec. 3 introduces the standard transformer model and our proposed image transformer; experimental results and analysis follow in Sec. 4; finally, we conclude in Sec. 5.
We categorize current attention-based image captioning models into single-stage attention models, two-stage attention models, visual scene graph based models, and transformer-based models, and review each in turn in this section.
Single-stage attention-based image captioning models are models where attention is applied only at the decoding stage: the decoder attends to the most informative region in the image when generating the corresponding word.
proposed the first deep model for image captioning. Their model uses a CNN pre-trained on ImageNet to encode the image; an LSTM-based language model then decodes the image features into a sequence of words. Xu et al. introduced an attention mechanism into image captioning during the generation of each word, based on the hidden state of their language model and the previously generated word. Their attention module generates a matrix to weight each receptive field in the encoded feature map, and then feeds the weighted feature map and the previously generated word to the language model to generate the next word. Instead of only attending to the receptive fields in the encoded feature map, Chen et al. added a feature channel attention module, which re-weights each feature channel during the generation of each word. Not all the words in a sentence have a correspondence in the image, so Lu et al. proposed an adaptive attention approach, in which a visual sentinel adaptively decides when and where to rely on the visual information.
The single-stage attention model is computationally efficient, but lacks accurate positioning of informative regions in the original image.
Two-stage attention models consist of bottom-up and top-down attention: bottom-up attention first uses object detection models to detect multiple informative regions in the image, then top-down attention attends to the most relevant detected regions when generating a word.
Instead of relying on coarse receptive fields as informative regions in the image, as single-stage attention models do, Anderson et al. train detection models on the Visual Genome dataset. The trained detection models can detect informative regions in the image. They then use a two-layer LSTM network as decoder, where the first layer generates a state vector based on the embedded word vector and the mean feature of the detected regions, and the second layer uses the state vector from the previous layer to generate a weight for each detected region. The weighted sum of the detected regions' features is used as a context vector for predicting the next word. Lu et al. developed a similar network, but with a detection model trained on MSCOCO, a smaller dataset than Visual Genome, so fewer informative regions are detected.
The performance of two-stage attention based image captioning models is much improved over single-stage attention based models. However, each detected region is isolated from the others, lacking relationships with other regions.
Visual scene graph based image captioning models extend two-stage attention models by injecting a graph convolutional neural network to relate the detected informative regions, thereby refining their features before feeding them into the decoder.
Yao et al. developed a model which consists of a semantic scene graph and a spatial scene graph. In the semantic scene graph, each region is connected with other semantically related regions; those relationships are usually determined by a visual relationship detector over the union box of two regions. In the spatial scene graph, the relationship between two regions is defined by their relative positions. The feature of each node in the scene graph is then refined with its related nodes through graph neural networks. Yang et al. use an auto-encoder, where they first encode the graph structure in the sentence based on the SPICE evaluation metric to learn a dictionary, then encode the semantic scene graph using the learnt dictionary. The previous two works treat semantic relationships as edges in the scene graph, while Guo et al. treat them as nodes in the scene graph; their decoder also focuses on different aspects of a region. Yao et al. further introduce a tree hierarchy and instance-level features into the scene graph.
Introducing the graph neural network to relate informative regions yields a sizeable performance improvement for image captioning models, compared to two-stage attention models. However, it requires auxiliary models to detect and build the scene graph beforehand. Also, those models usually have two parallel streams, one responsible for the semantic scene graph and another for the spatial scene graph, which is computationally inefficient.
Transformer based image captioning models use the dot-product attention mechanism to relate informative regions implicitly.
Since the introduction of the original transformer model, more advanced architectures have been proposed for machine translation based on the structure or the natural characteristics of sentences [10, 33, 34]. In image captioning, AoANet uses the original internal transformer layer architecture, with the addition of a gated linear layer on top of the multi-head attention. The object relation network injects relative spatial attention into the dot-product attention. Another interesting result, described by Herdade et al., is that the simple position encoding (as proposed in the original transformer) did not improve image captioning performance. The entangled transformer model features a dual parallel transformer to encode and refine visual and semantic information in the image, fused through a gated bilateral controller.
Compared to scene graph based image captioning models, transformer based models do not require auxiliary models to detect and build the scene graph beforehand, which is more computationally efficient. However, current transformer based models still use the inner architecture of the original transformer, designed for text, where each transformer layer has a single multi-head dot-product attention refining module. This structure does not allow modeling the full complexity of relations between image regions; we therefore propose to change the inner architecture of the transformer layer to adapt it to image data. We widen the transformer layer, such that each layer has multiple refining modules for different aspects of regions, in both the encoding and decoding stages.
In this section, we first review the original transformer layer, then detail the encoding and decoding parts of the proposed image transformer architecture.
A transformer consists of a stack of multi-head dot-product attention based refining layers.
In each layer, the input $X \in \mathbb{R}^{N \times d}$ consists of $N$ entries of $d$ dimensions. In natural language processing, an input entry can be the embedded feature of a word in a sentence; in computer vision or image captioning, an input entry can be the feature describing a region in an image. The key function of the transformer is to refine each entry with the other entries through multi-head dot-product attention. Each layer of a transformer first transforms the input into queries ($Q = XW^Q$), keys ($K = XW^K$) and values ($V = XW^V$) through linear transformations; the scaled dot-product attention is then defined by:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (1)$$
where $d_k$ is the dimension of the key vectors and $d_v$ the dimension of the value vectors ($d_k = d_v$ in the implementation). To improve the performance of the attention layer, multi-head attention is applied:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \qquad (2)$$
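Eqs. (1) and (2) can be sketched in NumPy as follows; the weight matrices and head count are illustrative parameters, not values from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  -- Eq. (1)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (n_q, d_v)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Eq. (2): project, split into heads, attend per head, concat, project."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_head = Q.shape[-1] // n_heads
    heads = [
        scaled_dot_product_attention(
            Q[:, i * d_head:(i + 1) * d_head],
            K[:, i * d_head:(i + 1) * d_head],
            V[:, i * d_head:(i + 1) * d_head],
        )
        for i in range(n_heads)
    ]
    return np.concatenate(heads, axis=-1) @ W_o
```

When all queries and keys are identical (e.g. zero), the softmax weights are uniform and each output row is simply the mean of the value rows, which is a quick sanity check on the implementation.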
The output of the multi-head attention is then added to the input and normalised:

$$X' = \mathrm{LayerNorm}(X + \mathrm{MultiHead}(Q, K, V))$$

where $\mathrm{LayerNorm}$ denotes layer normalisation.
The transformer implements residual connections in each module, such that the final output of a transformer layer is:

$$\mathrm{Output} = \mathrm{LayerNorm}(X' + \mathrm{FFN}(X'))$$

where $\mathrm{FFN}$ is a feed-forward network with a non-linearity.
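Putting the pieces of a refining layer together (attention, residual connections, layer normalisation, feed-forward network), a single-head NumPy sketch might look as follows; all weight shapes are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_refining_layer(X, W_q, W_k, W_v, W1, b1, W2, b2):
    """One refining layer: self-attention -> add & norm -> FFN -> add & norm.
    Single-head attention is used for brevity (multi-head in the real model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    X1 = layer_norm(X + A @ V)                          # first residual + norm
    ffn = np.maximum(0.0, X1 @ W1 + b1) @ W2 + b2       # ReLU feed-forward
    return layer_norm(X1 + ffn)                         # second residual + norm
```

Because the final operation is a layer normalisation, every output row has (near-)zero mean, which is easy to verify numerically.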
Each refining layer takes the output of the previous layer as input (the first layer takes the original input). The decoding part is also a stack of transformer refining layers, which take the output of the encoding part as well as the embedded features of the previously predicted words.
In contrast to the original transformer, which treats all query and key pairs as a flat neighborhood, we propose a hierarchical graph transformer in the encoding part, where we consider three categories of relationship in a hierarchical graph structure: parent, neighbor, and child, as shown in Fig. 2(a). We thus widen each transformer layer by adding three sub-transformer layers in parallel, each responsible for one category of relationship, all sharing the same query. In the encoding stage, we define the relative spatial relationship between two regions based on their overlap (Fig. 2(b)). We first compute the hierarchical graph adjacency matrices $\Omega_p$ (parent adjacency matrix), $\Omega_n$ (neighbor adjacency matrix), and $\Omega_c$ (child adjacency matrix) for all regions in the image:
where the overlap threshold is set empirically in our experiments. The hierarchical graph adjacency matrices are used as spatial hard attention embedded into each sub-transformer, combining the outputs of the sub-transformers in the encoder. More specifically, the original encoding transformer defined in Eqs. (1) and (2) is reformulated as:
$$\mathrm{Attention}_m(Q, K_m, V_m) = \left(\Omega_m \odot \mathrm{softmax}\!\left(\frac{QK_m^{\top}}{\sqrt{d_k}}\right)\right)V_m, \qquad m \in \{p, n, c\}$$

where $\odot$ is the Hadamard product, and the outputs of the three sub-transformers are combined to form the refined region features.
As we widen the transformer, we halve the number of stacked layers in the encoder to keep a complexity similar to the original (3 layers, versus 6 in the original transformer). Note that the original transformer architecture is a special case of the proposed architecture, arising when no region in the image either contains or is contained by another.
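For illustration, the hierarchy matrices and the masked sub-transformer attention can be sketched as follows. The box format (x1, y1, x2, y2), the containment threshold `EPS`, and the renormalisation of the masked attention weights are illustrative choices, not values taken from the text:

```python
import numpy as np

EPS = 0.9  # containment threshold -- an assumed value for illustration

def overlap_area(a, b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def hierarchy_matrices(boxes):
    """Parent / neighbor / child adjacency matrices from pairwise overlap ratios."""
    n = len(boxes)
    area = [max((b[2] - b[0]) * (b[3] - b[1]), 1e-8) for b in boxes]
    Wp, Wn, Wc = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ov = overlap_area(boxes[i], boxes[j])
            if i != j and ov / area[i] > EPS and area[j] > area[i]:
                Wp[i, j] = 1.0      # j mostly contains i -> j is a parent of i
            elif i != j and ov / area[j] > EPS and area[i] > area[j]:
                Wc[i, j] = 1.0      # i mostly contains j -> j is a child of i
            else:
                Wn[i, j] = 1.0      # everything else (incl. self) is a neighbor
    return Wp, Wn, Wc

def masked_attention(Q, K, V, W_adj):
    """Sub-transformer attention with the adjacency matrix as hard mask
    (Hadamard product with the softmax weights, then renormalised)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    A = A * W_adj                                        # hard spatial mask
    A /= np.maximum(A.sum(axis=-1, keepdims=True), 1e-8) # renormalise survivors
    return A @ V
```

With an identity mask, each region can attend only to itself, so the output equals the value matrix, giving a quick correctness check of the masking logic.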
Our decoder consists of an LSTM layer and an implicit transformer decoding layer, which we propose in order to decode the diverse information carried by each image region.
At each step, the LSTM layer receives the mean $\bar{a}$ of the encoding transformer's output, the context vector $c_{t-1}$ from the previous time step, and the embedded feature vector of the current word in the ground-truth sentence:

$$x_t = [W_e \Pi_t;\; \bar{a} + c_{t-1}], \qquad h_t = \mathrm{LSTM}(x_t, h_{t-1})$$

where $W_e$ is the word embedding matrix and $\Pi_t$ is the one-hot encoding of the $t$-th word in the ground truth. The output state $h_t$ is then transformed linearly and treated as the query for the input of the implicit decoding transformer layer. The difference between the original transformer layer and our implicit decoding transformer layer is that we also widen the decoding transformer layer, adding several sub-transformers in parallel in one layer, such that each sub-transformer can implicitly decode a different aspect of a region. It is formalised as follows:
Then, the mean of the sub-transformers' outputs is passed through a gated linear unit (GLU) to extract the new context vector $c_t$ at the current step, channel-wise:
The context vector $c_t$ is then used to predict the probability of the word at time step $t$:
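The GLU gating and the word prediction step can be sketched as follows; the weight shapes and the linear word-prediction head are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def glu_context(sub_outputs, W_g, b_g):
    """Mean the sub-transformers' outputs, then gate channel-wise:
    GLU(x) = a * sigmoid(b), where [a; b] = x W_g + b_g."""
    x = np.mean(sub_outputs, axis=0)            # mean over sub-transformers, (d,)
    z = x @ W_g + b_g                           # (2d,)
    a, b = np.split(z, 2)
    return a * (1.0 / (1.0 + np.exp(-b)))       # context vector c_t, (d,)

def word_probabilities(c_t, W_p):
    """p(y_t | y_<t) = softmax of a linear projection of the context vector."""
    return softmax(c_t @ W_p)
```

The gate halves the projected vector into content and gate channels, so `W_g` must map the `d`-dimensional context to `2d` dimensions.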
Given a target ground-truth sequence of words $y^*_{1:T}$, to train the model parameters $\theta$ we follow previous methods: we first train the model with the cross-entropy loss

$$L_{XE}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y^*_t \mid y^*_{1:t-1}),$$

and then directly optimise the non-differentiable evaluation metric through reinforcement learning, minimising the negative expected reward

$$L_{RL}(\theta) = -\mathbb{E}_{y_{1:T} \sim p_\theta}\left[r(y_{1:T})\right]$$

where $r$ is the score function, and the gradient is approximated by:

$$\nabla_\theta L_{RL}(\theta) \approx -\left(r(y^s_{1:T}) - r(\hat{y}_{1:T})\right) \nabla_\theta \log p_\theta(y^s_{1:T})$$

where $y^s$ is a sampled caption and $\hat{y}$ the greedily decoded baseline caption.
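The two training objectives can be sketched as follows; the per-step probabilities, log-probabilities, and reward values are placeholders standing in for the full captioning model and score function:

```python
import numpy as np

def xent_loss(step_probs, target_ids):
    """Cross-entropy loss: -sum_t log p_theta(y*_t | y*_<t).
    `step_probs` is one probability vector per time step,
    `target_ids` the ground-truth word index at each step."""
    return -sum(np.log(p[t]) for p, t in zip(step_probs, target_ids))

def scst_loss(sampled_logprobs, r_sampled, r_baseline):
    """Self-critical REINFORCE surrogate loss whose gradient matches
    -(r(y^s) - r(y_hat)) * grad sum_t log p_theta(y^s_t);
    the baseline reward comes from the greedily decoded caption."""
    advantage = r_sampled - r_baseline
    return -advantage * np.sum(sampled_logprobs)
```

When the sampled caption scores above the greedy baseline (positive advantage), minimising this surrogate increases the sampled caption's log-likelihood, and vice versa.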
[Table 1: performance comparison on the MSCOCO Karpathy offline test set, with scene graph based models and transformer based models grouped separately]
Our model is trained on the MSCOCO image captioning dataset. We follow Karpathy's splits, with 113,287 images in the training set, 5,000 images in the validation set and 5,000 images in the test set. Each image has 5 captions as ground truth. We discard the words which occur fewer than 4 times, giving a final vocabulary size of 10,369. We test our model on both Karpathy's offline test set (5,000 images) and the MSCOCO online testing dataset (40,775 images). We use BLEU, METEOR, ROUGE-L, CIDEr, and SPICE as evaluation metrics.
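The vocabulary construction described above can be sketched as follows; the special tokens are an illustrative assumption, not taken from the text:

```python
from collections import Counter

def build_vocab(captions, min_count=4):
    """Keep words occurring at least `min_count` times; rarer words are
    dropped from the vocabulary (mapped to <unk> at encoding time)."""
    counts = Counter(w for cap in captions for w in cap.lower().split())
    vocab = ["<pad>", "<start>", "<end>", "<unk>"]   # assumed special tokens
    vocab += sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(vocab)}
```

Applied to the MSCOCO training captions with `min_count=4`, this kind of filtering produces the roughly 10k-word vocabulary used in the paper.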
Following previous work, we first train Faster R-CNN on Visual Genome, with a ResNet-101 backbone pretrained on ImageNet. For each image, we detect informative regions; the boundaries of each region are first normalised and then used to compute the hierarchical graph matrices. We then train our proposed captioning model using the computed hierarchical graph matrices and the extracted features of each image region. We first train the model with cross-entropy loss for 25 epochs, decaying the learning rate every 3 epochs, and optimise with Adam with a batch size of 10. We then further optimise the model with reinforcement learning for another 35 epochs. The size of the decoder's LSTM layer is set to 1024, and beam search of size 3 is used at inference.
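The beam search used at inference can be sketched generically as follows; the step function, which maps a partial sequence to log-probabilities over the vocabulary, stands in for the full decoder:

```python
import numpy as np

def beam_search(step_fn, start_id, end_id, beam_size=3, max_len=20):
    """Generic beam search: step_fn(sequence) -> log-probability vector over
    the vocabulary. Keeps the `beam_size` highest-scoring partial captions
    at every step; a beam reaching `end_id` is moved to the finished set."""
    beams = [([start_id], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            logp = step_fn(seq)
            top = np.argsort(logp)[-beam_size:]          # best next words
            for w in top:
                candidates.append((seq + [int(w)], score + float(logp[w])))
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end_id else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)                               # unfinished beams too
    return max(finished, key=lambda x: x[1])[0]
```

With `beam_size=1` this reduces to greedy decoding; the paper's setting of 3 trades a small amount of computation for better caption scores.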
We compare our model's performance with published image captioning models. The compared models include the top-performing single-stage attention model, Att2all; two-stage attention based models, n-babytalk and up-down; visual scene graph based models, GCN-LSTM, AUTO-ENC, ALV, and GCN-LSTM-HIP; and transformer based models, Entangle-T, AoA, and VORN. The comparison on the MSCOCO Karpathy offline test set is shown in Table 1. Our model achieves a new state of the art on the CIDEr and SPICE scores, while the other evaluation scores are comparable to the previous top-performing models. Note that because most visual scene graph based models fuse semantic and spatial scene graphs and require auxiliary models to build the scene graph beforehand, our model is more computationally efficient. VORN also integrated spatial attention in their model, and our model performs better than theirs across all evaluation metrics, which shows the superiority of our hierarchical graph. The MSCOCO online testing results are listed in Tab. 2; our model outperforms previous transformer based models on several evaluation metrics.
[Table 2: MSCOCO online testing results, with scene graph based models and transformer based models grouped separately]
In the ablation study, we use AoA as a strong baseline (with a single multi-head dot-product attention module per layer), which adds a gated linear layer on top of the multi-head attention.¹ In the encoder, we study the effect of the hierarchy, ablating it by simply taking the mean output of the three sub-transformers in each layer (i.e., reformulating Eqs. 6 and 7 as an unweighted mean). We also study where to use our proposed hierarchical graph encoding transformer layer in the encoder: in the first layer, the second layer, the third layer, or all three. In the decoder, we study the effect of the number of sub-transformers (Eq. 10) in the implicit decoding transformer layer.

¹Our experiments are based on the code released at: https://github.com/husthuaan/AoANet
As we can see from Tab. 3, widening the encoding transformer layer significantly improves the model's performance. Not all layers in the encoding transformer are equal: when we use our proposed transformer layer only at the top layer of the encoding part, the improvement is reduced, perhaps because spatial relationships are less informative at the top layer; we therefore use our hierarchical transformer layer at all layers of the encoder. When we remove the hierarchy from our proposed wider transformer layer, performance also drops, which shows the importance of the hierarchy in our design. Widening the decoding transformer brings a further improvement (the CIDEr score increases from 118.2 to 119.1 with 3 sub-transformers in the decoding transformer layer), but wider is not always better: with 4 sub-transformers, performance decreases somewhat, so the final design of our decoding transformer layer has 3 sub-transformers in parallel. Qualitative examples of our model's results are illustrated in Fig. 5. As we can see, the baseline model without spatial relationships wrongly describes the police officers as on a red bus (top right), and people as on a train (bottom left).
The transformer layer can be seen as an implicit graph, relating the informative regions through dot-product attention. In Fig. 6, we visualise how our proposed hierarchical graph transformer layer learns to connect the informative regions through attention. In the top example, the original transformer layer strongly relates the train with the people on the mountain, yielding a wrong description, while our proposed transformer layer relates the train with the tracks and the mountain; in the bottom example, the original transformer relates the bear with its reflection in the water and treats them as 'two bears', while our transformer distinguishes the bear from its reflection and relates it to the snow area.
We also visualise the output of our decoding transformer layer (Fig. 7). Compared to the original decoding transformer layer, which has only one sub-transformer, the output of our proposed implicit decoding transformer layer covers a larger area in the reduced feature space, which means that our decoding transformer layer decodes more information from the image regions. In the original feature space (1,024 dimensions) of the decoding transformer layer's output, we compute the trace of the feature maps' covariance matrix over 1,000 examples; the trace for the original transformer layer is smaller than that for our wider decoding transformer layer, which indicates that our design enables the decoder's output to cover a larger area of the feature space. However, individual sub-transformers in the decoding transformer layer still do not appear to disentangle different factors in the feature space (there is no distinct cluster in the output of each sub-transformer); we speculate this is because there is no direct supervision on their outputs, without which the model may not learn disentangled features automatically.
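The covariance-trace measure of feature-space coverage can be computed as follows; it is simply the total variance summed over all feature dimensions:

```python
import numpy as np

def coverage_trace(features):
    """Trace of the covariance matrix of decoder outputs (rows = examples,
    columns = feature dimensions): the variance summed over all dimensions,
    used as a proxy for how much of the feature space the outputs cover."""
    return float(np.trace(np.cov(features, rowvar=False)))
```

Scaling the outputs by a factor k multiplies this trace by k², and constant outputs give a trace of zero, so a larger trace does indicate a wider spread of the decoder's outputs.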
In this work, we introduced the image transformer architecture. The core idea behind the proposed architecture is to widen the original transformer layer, designed for machine translation, to adapt it to the structure of images. In the encoder, we widen the transformer layer by exploiting the hierarchical spatial relationships between image regions; in the decoder, the wider transformer layer can decode more information from the image regions. Extensive experiments show the superiority of the proposed model, and qualitative and quantitative analyses validate the proposed encoding and decoding transformer layers. Compared to the previous top models in image captioning, our model achieves a new state-of-the-art SPICE score, while on the other evaluation metrics our model is either comparable to or outperforms the previous best models, with better computational efficiency.
We hope our work can inspire the community to develop more advanced transformer-based architectures that benefit not only image captioning but also other computer vision tasks that require relational attention. Our code will be shared with the community to support future research.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086.
Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 933–941.
Multi-granularity self-attention for neural machine translation. arXiv preprint arXiv:1909.02222.
Proc. ACL Workshop on Text Summarization Branches Out, p. 10.
Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 4114–4124.