Image Captioning through Image Transformer

by Sen He, et al.

Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous works have proposed the transformer architecture for image captioning. However, the structure between the semantic units in images (usually the regions detected by an object detection model) and in sentences (each single word) is different. Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationships between image regions. Our design widens the original transformer layer's inner architecture to adapt it to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks.



1 Introduction

Image captioning is the task of describing the content of an image in words. The problem of automatic image captioning by AI systems has received a lot of attention in recent years, due to the success of deep learning models for both language and image processing. Most image captioning approaches in the literature are based on a translational approach, with a visual encoder and a linguistic decoder. One challenge in automatic translation is that it cannot be done word by word: other words influence the meaning, and therefore the translation, of a word. This is even more true when translating across modalities, from images to text, where the system must decide what must be described in the image. A common solution to this challenge relies on attention mechanisms. For example, previous image captioning models try to solve where to look in the image [35, 4, 2, 24] (now partly solved by the Faster R-CNN object detection model [28]) in the encoding stage, and use a recurrent neural network with an attention mechanism in the decoding stage to generate the caption. But beyond deciding what to describe in the image, recent image captioning models propose to use attention to learn how regions of the image relate to each other, effectively encoding their context in the image. Graph convolutional neural networks [17] were first introduced to relate regions in the image; however, those approaches [37, 36, 9, 38] usually require auxiliary models (e.g. visual relationship detection and/or attribute detection models) to build the visual scene graph in the image in the first place. In contrast, in the natural language processing field, the transformer architecture [30] was developed to relate embedded words in sentences, and can be trained end to end without auxiliary models explicitly detecting such relations. Recent image captioning models [14, 19, 12] adopted the transformer architecture to implicitly relate informative regions in the image through dot-product attention, achieving state-of-the-art performance.

However, the transformer architecture was designed for machine translation of text. In a text, a word is either to the left or to the right of another word, at different distances. In contrast, images are two-dimensional (indeed, they represent three-dimensional scenes), so that a region may not only be on the left or right of another region, it may also contain or be contained in another region. The relative spatial relationship between the semantic units in images has a larger degree of freedom than that in sentences. Furthermore, in the decoding stage of machine translation, a word is usually translated into another word in the other language (one-to-one decoding), whereas for an image region, we may describe its context, its attributes and/or its relationships with other regions (one-to-many decoding). One limitation of previous transformer-based image captioning models [14, 19, 12] is that they adopt the transformer's internal architecture designed for machine translation, where each transformer layer contains a single (multi-head) dot-product attention module. In this paper, we introduce the image transformer for image captioning, where each transformer layer implements multiple sub-transformers, to encode spatial relationships between image regions and to decode the diverse information in image regions.

The difference between our method and previous transformer based models [14, 12, 19] is that our method focuses on the inner architecture of the transformer layer, in which we widen the transformer module. Yao et al. [38] also used a hierarchical concept in the encoding part of their model, but our model uses a graph hierarchy whereas their method uses a tree hierarchy. Furthermore, our model does not require auxiliary models (i.e., for visual relation detection and instance segmentation) to build the visual scene graph. Our encoding method can be viewed as the combination of a visual semantic graph and a spatial graph, which uses a transformer layer to implicitly combine them without auxiliary relationship and attribute detectors.

Figure 1: Image captioning vs machine translation.

The contributions of this paper can be summarised as follows:

  • We propose a novel internal architecture for the transformer layer adapted to the image captioning task, with a modified attention module suited to the complex natural structure of image regions.

  • We report thorough experiments and an ablation study that validate our proposed architecture; state-of-the-art performance is achieved on the MSCOCO image captioning offline and online testing datasets with only region features as input.

The rest of the paper is organized as follows: Sec. 2 reviews the related attention-based image captioning models; Sec. 3 introduces the standard transformer model and our proposed image transformer; followed by the experiment results and analysis in Sec. 4; finally, we will conclude this paper in Sec. 5.

2 Related Work

We categorize current attention-based image captioning models into single-stage attention models, two-stage attention models, visual scene graph based models, and transformer-based models. We review them one by one in this section.

2.1 Single-Stage Attention Based Image Captioning

Single-stage attention-based image captioning models are the models where attention is applied at the decoding stage, where the decoder attends to the most informative region [25] in the image when generating a corresponding word.

The availability of large-scale annotated datasets [7, 5] enabled the training of deep models for image captioning. Vinyals et al. [32] proposed the first deep model for image captioning. Their model uses a CNN pre-trained on ImageNet [7] to encode the image; an LSTM [8] based language model is then used to decode the image features into a sequence of words. Xu et al. [35] introduced an attention mechanism into image captioning during the generation of each word, based on the hidden state of their language model and the previously generated word. Their attention module generates a matrix to weight each receptive field in the encoded feature map, and then feeds the weighted feature map and the previously generated word to the language model to generate the next word. Instead of only attending to the receptive fields in the encoded feature map, Chen et al. [4] added a feature channel attention module, which re-weights each feature channel during the generation of each word. Not all the words in a sentence have a correspondence in the image, so Lu et al. [23] proposed an adaptive attention approach, where the model has a visual sentinel which adaptively decides when and where to rely on the visual information.

The single-stage attention model is computationally efficient, but lacks accurate positioning of informative regions in the original image.

2.2 Two-Stage Attention Based Image Captioning

Two-stage attention models consist of bottom-up attention and top-down attention: bottom-up attention first uses object detection models to detect multiple informative regions in the image, then top-down attention attends to the most relevant detected regions when generating a word.

Instead of relying on the coarse receptive fields as informative regions in the image, as single-stage attention models do, Anderson et al. [2] train detection models on the Visual Genome dataset [18]. The trained detection models can detect informative regions in the image. They then use a two-layer LSTM network as the decoder, where the first layer generates a state vector based on the embedded word vector and the mean feature of the detected regions, and the second layer uses the state vector from the previous layer to generate a weight for each detected region. The weighted sum of the detected regions' features is used as a context vector for predicting the next word. Lu et al. [24] developed a similar network, but with a detection model trained on MSCOCO [21], which is a smaller dataset than Visual Genome, and therefore fewer informative regions are detected.

Two-stage attention based image captioning models substantially outperform single-stage attention based models. However, each detected region is isolated from the others, lacking any relationship with other regions.

2.3 Visual Scene Graph Based Image Captioning

Visual scene graph based image captioning models extend two-stage attention models by injecting a graph convolutional neural network to relate detected informative regions, and therefore refine their features before feeding into the decoder.

Yao et al. [37] developed a model which consists of a semantic scene graph and a spatial scene graph. In the semantic scene graph, each region is connected with other semantically related regions; those relationships are usually determined by a visual relationship detector applied to a union box. In the spatial scene graph, the relationship between two regions is defined by their relative positions. The feature of each node in the scene graph is then refined with its related nodes through graph neural networks [17]. Yang et al. [36] use an auto-encoder, where they first encode the graph structure in the sentence based on the SPICE [1] evaluation metric to learn a dictionary; the semantic scene graph is then encoded using the learnt dictionary. The previous two works treat the semantic relationships as edges in the scene graph, while Guo et al. [9] treat them as nodes in the scene graph. Also, their decoder focuses on different aspects of a region. Yao et al. [38] further introduce a tree hierarchy and instance-level features into the scene graph.

Introducing the graph neural network to relate informative regions yields a sizeable performance improvement for image captioning models, compared to two-stage attention models. However, it requires auxiliary models to first detect and build the scene graph. Also, those models usually have two parallel streams, one responsible for the semantic scene graph and another for the spatial scene graph, which is computationally inefficient.

2.4 Transformer Based Image Captioning

Transformer based image captioning models use the dot-product attention mechanism to relate informative regions implicitly.

Since the introduction of the original transformer model [30], more advanced architectures have been proposed for machine translation based on the structure or the natural characteristics of sentences [10, 33, 34]. In image captioning, AoANet [14] uses the original internal transformer layer architecture, with the addition of a gated linear layer [6] on top of the multi-head attention. The object relation network [12] injects relative spatial attention into the dot-product attention. Another interesting result described by Herdade et al. [12] is that the simple position encoding (as proposed in the original transformer) did not improve image captioning performance. The entangled transformer model [19] features a dual parallel transformer to encode and refine visual and semantic information in the image, which is fused through a gated bilateral controller.

Compared to scene graph based image captioning models, transformer based models do not require auxiliary models to first detect and build the scene graph, which makes them more computationally efficient. However, current transformer based models still use the inner architecture of the original transformer, designed for text, where each transformer layer has a single multi-head dot-product attention refining module. This structure cannot model the full complexity of the relations between image regions; we therefore propose to change the inner architecture of the transformer layer to adapt it to image data. We widen the transformer layer, such that each transformer layer has multiple refining modules for different aspects of regions, both in the encoding and decoding stages.

3 Image Transformer

In this section, we first review the original transformer layer [30]; we then elaborate on the encoding and decoding parts of the proposed image transformer architecture.

Figure 2: The overall architecture of our model. The refinement part consists of 3 stacks of hierarchical graph transformer layers, and the decoding part has an LSTM layer with an implicit decoding transformer layer.

3.1 Transformer Layer

A transformer consists of a stack of multi-head dot-product attention based transformer refining layers.

Each layer takes a given input $A \in \mathbb{R}^{N \times D}$, consisting of $N$ entries of $D$ dimensions. In natural language processing, an input entry can be the embedded feature of a word in a sentence, and in computer vision or image captioning, an input entry can be the feature describing a region in an image. The key function of the transformer is to refine each entry with the other entries through multi-head dot-product attention. Each layer of a transformer first transforms the input into queries ($Q = AW_Q$, $W_Q \in \mathbb{R}^{D \times D_k}$), keys ($K = AW_K$, $W_K \in \mathbb{R}^{D \times D_k}$) and values ($V = AW_V$, $W_V \in \mathbb{R}^{D \times D_v}$) through linear transformations; the scaled dot-product attention is then defined by:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{D_k}}\right)V, \tag{1}$$

where $D_k$ is the dimension of the key vector and $D_v$ the dimension of the value vector ($D = D_k = D_v$ in the implementation). To improve the performance of the attention layer, multi-head attention is applied:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W_O, \qquad \mathrm{head}_i = \mathrm{Attention}(AW_{Q_i}, AW_{K_i}, AW_{V_i}). \tag{2}$$

The output from the multi-head attention is then added to the input and normalised:

$$A_m = \mathrm{Norm}(A + \mathrm{MultiHead}(Q, K, V)), \tag{3}$$

where $\mathrm{Norm}(\cdot)$ denotes layer normalisation.

The transformer implements residual connections in each module, such that the final output of a transformer layer is:

$$A' = \mathrm{Norm}(A_m + \phi(A_m)), \tag{4}$$

where $\phi$ is a feed-forward network with non-linearity.

Each refining layer takes the output of its previous layer as input (the first layer takes the original input). The decoding part is also a stack of transformer refining layers, which take the output of the encoding part as well as the embedded features of the previously predicted word.
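As an illustration of the refining layer reviewed above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes toy sizes, a per-head dimension of D/h (the text states D = D_k = D_v for the actual implementation), layer normalisation without learned gain and bias, and a two-layer ReLU network standing in for the feed-forward network; all function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Layer normalisation over the feature dimension (no learned gain/bias).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def attention(Q, K, V):
    # Scaled dot-product attention: Softmax(Q K^T / sqrt(D_k)) V.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(A, heads, W_O):
    # heads is a list of (W_Q, W_K, W_V) projection triples, one per head.
    outs = [attention(A @ W_Q, A @ W_K, A @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outs, axis=-1) @ W_O

def transformer_layer(A, heads, W_O, W_1, W_2):
    # Residual connection and normalisation around the attention module ...
    A_m = layer_norm(A + multi_head(A, heads, W_O))
    # ... and around an (assumed) two-layer ReLU feed-forward network.
    ffn = np.maximum(A_m @ W_1, 0.0) @ W_2
    return layer_norm(A_m + ffn)

rng = np.random.default_rng(0)
N, D, h = 5, 8, 2  # toy sizes: 5 entries, 8 dimensions, 2 heads
heads = [tuple(rng.normal(size=(D, D // h)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(D, D))
W_1 = rng.normal(size=(D, 4 * D))
W_2 = rng.normal(size=(4 * D, D))
A = rng.normal(size=(N, D))
out = transformer_layer(A, heads, W_O, W_1, W_2)
print(out.shape)  # (5, 8)
```

Each row of the output is a refined version of the corresponding input entry, so several such layers can be stacked by feeding `out` back in as `A`.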

3.2 Hierarchical Graph Encoding Transformer Layer

(a) Hierarchical graph example
(b) Region overlap
Figure 3: (a) Example for the hierarchical graph: For region C, region A is its parent, B its neighbor and D its child; (b) Region overlap to determine the relative spatial relationships.

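In contrast to the original transformer, which only considers the spatial relationship between query and key pairs as neighborhood, we propose to use a hierarchical graph transformer in the encoding part, where we consider three categories of relationship in a hierarchical graph structure: parent, neighbor, and child, as shown in Fig. 3(a). We thus widen each transformer layer by adding three sub-transformer layers in parallel, each sub-transformer being responsible for one category of relationship and all sharing the same query. In the encoding stage, we define the relative spatial relationship between two regions based on their overlap (Fig. 3(b)). We first compute the hierarchical graph adjacency matrices $\Omega_p$ (parent node adjacency matrix), $\Omega_n$ (neighbor node adjacency matrix), and $\Omega_c$ (child node adjacency matrix) for all regions in the image:

$$\Omega_p[l, m] = \begin{cases} 1, & \text{if } \dfrac{\mathrm{Area}(l \cap m)}{\mathrm{Area}(l)} \geq \epsilon \text{ and } \dfrac{\mathrm{Area}(l \cap m)}{\mathrm{Area}(l)} > \dfrac{\mathrm{Area}(l \cap m)}{\mathrm{Area}(m)}, \\ 0, & \text{otherwise,} \end{cases} \qquad \Omega_c[l, m] = \Omega_p[m, l], \qquad \sum_{i \in \{p, n, c\}} \Omega_i[l, m] = 1, \tag{5}$$

where $\epsilon$ is the overlap threshold, fixed in our experiment. The hierarchical graph adjacency matrices are used as spatial hard attention embedded into each sub-transformer, to combine the outputs of the sub-transformers in the encoder. More specifically, the original encoding transformer defined in Eqs. (1) and (2) is reformulated as:

$$\mathrm{Attention}(Q, K_i, V_i) = \Omega_i \circ \mathrm{Softmax}\left(\frac{QK_i^T}{\sqrt{D_k}}\right)V_i, \qquad i \in \{p, n, c\}, \tag{6}$$

where $\circ$ is the Hadamard product and $K_i$, $V_i$ are the keys and values of sub-transformer $i$, and

$$A_m = \mathrm{Norm}\Big(A + \sum_{i \in \{p, n, c\}} \mathrm{MultiHead}(Q, K_i, V_i)\Big). \tag{7}$$

As we widen the transformer, we halve the number of stacks in the encoder to achieve a complexity similar to that of the original one (3 stacks, while the original transformer features 6 stacks). Note that the original transformer architecture is a special case of the proposed architecture, when no region in the image either contains or is contained by another.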
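The overlap-based hierarchy and the hard-masked attention of this section can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions rather than the authors' code: boxes are `(x1, y1, x2, y2)` tuples with positive area, the containment rule (a region's parent must cover at least a fraction `eps` of it and be the larger of the two) is one reading of the overlap criterion, `eps=0.9` is only an illustrative value, attention is single-head, and all names are hypothetical.

```python
import numpy as np

def area(b):
    # b = (x1, y1, x2, y2); boxes are assumed to have positive area.
    return (b[2] - b[0]) * (b[3] - b[1])

def intersection(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def hierarchical_adjacency(boxes, eps):
    # parent[l, m] = 1 when region m is treated as a parent of region l.
    n = len(boxes)
    parent = np.zeros((n, n))
    for l in range(n):
        for m in range(n):
            if l == m:
                continue
            inter = intersection(boxes[l], boxes[m])
            in_l = inter / area(boxes[l])  # fraction of l covered by m
            in_m = inter / area(boxes[m])  # fraction of m covered by l
            if in_l >= eps and in_l > in_m:
                parent[l, m] = 1.0
    child = parent.T.copy()             # m is a child of l iff l is a parent of m
    neighbor = 1.0 - parent - child     # everything else, including the region itself
    return parent, neighbor, child

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def hierarchical_attention(A, W_Q, sub_weights, omegas):
    # Three sub-transformers share one query; each has its own keys and values,
    # and its attention map is masked (Hadamard product) by its adjacency matrix.
    Q = A @ W_Q
    out = 0.0
    for (W_K, W_V), omega in zip(sub_weights, omegas):
        weights = softmax(Q @ (A @ W_K).T / np.sqrt(W_K.shape[1]))
        out = out + (omega * weights) @ (A @ W_V)
    return out

# Toy layout mirroring Fig. 3(a): A contains everything, C contains D, B overlaps C slightly.
boxes = [(0, 0, 10, 10), (4, 4, 8, 8), (1, 1, 5, 5), (2, 2, 3, 3)]  # A, B, C, D
parent, neighbor, child = hierarchical_adjacency(boxes, eps=0.9)
# Row 2 is region C: its parent is A, its neighbors are B and itself, its child is D.
print(parent[2], neighbor[2], child[2])
```

When no box contains another, `parent` and `child` are all zeros and `neighbor` is all ones, so `hierarchical_attention` reduces to ordinary attention with the neighbor sub-transformer's weights, which matches the special case noted above.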

Figure 4: The difference between the original transformer layer and the proposed encoding and decoding transformer layers.

3.3 Implicit Decoding Transformer Layer

Our decoder consists of an LSTM [13] layer and an implicit transformer decoding layer, which we propose in order to decode the diverse information in a region of the image.

At first, the LSTM layer receives the mean of the output ($\bar{A}$) from the encoding transformer, the context vector ($c_{t-1}$) from the last time step, and the embedded feature vector of the current word in the ground truth sentence:

$$x_t = [W_e \pi_t, \bar{A} + c_{t-1}], \qquad h_t, m_t = \mathrm{LSTM}(x_t, h_{t-1}, m_{t-1}), \tag{8}$$

where $W_e$ is the word embedding matrix and $\pi_t$ is the $t$-th word in the ground truth. The output state $h_t$ is then transformed linearly and treated as the query for the input of the implicit decoding transformer layer. The difference between the original transformer layer and our implicit decoding transformer layer is that we also widen the decoding transformer layer by adding several sub-transformers in parallel in one layer, such that each sub-transformer can implicitly decode different aspects of a region. It is formalised as follows:

$$A^D_{t,i} = \mathrm{MultiHead}(W_{DQ} h_t, W_{DK_i} A, W_{DV_i} A), \qquad i \in \{1, \dots, M\}, \tag{9}$$

where $A$ is the output of the encoding transformer. Then, the mean of the sub-transformers' outputs is passed through a gated linear layer (GLU) [6] to extract the new context vector ($c_t$) at the current step by channel:

$$c_t = \mathrm{GLU}\Big(h_t, \frac{1}{M} \sum_{i=1}^{M} A^D_{t,i}\Big). \tag{10}$$

The context vector is then used to predict the probability of the word at time step $t$:

$$p(y_t \mid y_{1:t-1}) = \mathrm{Softmax}(w_p c_t), \tag{11}$$

where $w_p$ is a linear projection onto the vocabulary.

The overall architecture of our model is illustrated in Fig. 2, and the difference between the original transformer layer and our proposed encoding and decoding transformer layers is shown in Fig. 4.
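One decoding step of the implicit decoding transformer layer can be sketched as below. This is a hedged NumPy illustration, not the authors' implementation: each sub-transformer is single-head, the LSTM state `h_t` is taken as given, and the gated linear layer is assumed to act on a linear projection of the concatenation of `h_t` and the mean sub-transformer output, split into a value half and a gate half. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def implicit_decoding_step(h_t, A, W_DQ, sub_weights, W_glu):
    # The linearly transformed LSTM state is the query shared by all sub-transformers.
    q = W_DQ @ h_t
    outs = []
    for W_DK, W_DV in sub_weights:  # one (key, value) projection pair per sub-transformer
        K, V = A @ W_DK, A @ W_DV
        attn = softmax(K @ q / np.sqrt(q.shape[0]))  # attention over the N regions
        outs.append(attn @ V)
    a_mean = np.mean(outs, axis=0)  # mean of the M sub-transformer outputs
    # Assumed GLU form: project [h_t, a_mean] to 2*D, then value * sigmoid(gate).
    z = W_glu @ np.concatenate([h_t, a_mean])
    d = a_mean.shape[0]
    return z[:d] * sigmoid(z[d:])

rng = np.random.default_rng(0)
N, D, H, M = 4, 6, 5, 3  # toy sizes: regions, feature dim, LSTM state dim, sub-transformers
A = rng.normal(size=(N, D))   # stands in for the encoder output
h_t = rng.normal(size=H)      # stands in for the LSTM output state
W_DQ = rng.normal(size=(D, H))
sub_weights = [(rng.normal(size=(D, D)), rng.normal(size=(D, D))) for _ in range(M)]
W_glu = rng.normal(size=(2 * D, H + D))
c_t = implicit_decoding_step(h_t, A, W_DQ, sub_weights, W_glu)
print(c_t.shape)  # (6,)
```

Setting `M = 1` recovers a decoder with a single attention module, which is the configuration the wider layer is compared against.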

3.4 Training Objectives

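Given a target ground truth as a sequence of words $y^*_{1:T}$, for training the model parameters $\theta$, we follow the previous method: we first train the model with the cross-entropy loss:

$$L_{XE}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y^*_t \mid y^*_{1:t-1}), \tag{12}$$

followed by self-critical reinforced training [29] optimizing the CIDEr score [31]:

$$L_R(\theta) = -\mathbb{E}_{y_{1:T} \sim p_\theta}[r(y_{1:T})], \tag{13}$$

where $r$ is the score function and the gradient is approximated by:

$$\nabla_\theta L_R(\theta) \approx -\big(r(y^s_{1:T}) - r(\hat{y}_{1:T})\big) \nabla_\theta \log p_\theta(y^s_{1:T}), \tag{14}$$

where $y^s_{1:T}$ is a sampled caption and $\hat{y}_{1:T}$ is the greedily decoded caption.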

model               Bleu1  Bleu4  METEOR  ROUGE-L  CIDEr  SPICE
single-stage model
Att2all [29]        -      34.2   26.7    55.7     114.0  -
two-stage model
n-babytalk [24]     75.5   34.7   27.1    -        107.2  20.1
up-down [2]         79.8   36.3   27.7    56.9     120.1  21.4
scene graph based model
GCN-LSTM [37]       80.9   38.3   28.6    58.5     128.7  22.1
AUTO-ENC [36]       80.8   38.4   28.4    58.6     127.8  22.1
ALV [9]             -      38.4   28.5    58.4     128.6  22.0
GCN-LSTM-HIP [38]   -      39.1   28.9    59.2     130.6  22.3
transformer based model
Entangle-T [19]     81.5   39.9   28.9    59.0     127.6  22.6
AoA [14]            80.2   38.9   29.2    58.8     129.8  22.4
VORN [12]           80.5   38.6   28.7    58.4     128.3  22.6
Ours                80.8   39.5   29.1    59.0     130.8  22.8
Table 1: Comparison on the MSCOCO Karpathy offline test split. Some compared entries are fusions of two models.
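The two training objectives of Sec. 3.4 can be illustrated with a small NumPy sketch. It assumes the model's per-step log-probabilities are already available as arrays and that the rewards (CIDEr scores in the paper) are plain numbers; the function names and toy values are hypothetical. The self-critical function returns a surrogate loss whose gradient with respect to the log-probabilities is the advantage-weighted term used in self-critical training.

```python
import numpy as np

def cross_entropy_loss(log_probs, target):
    # log_probs: (T, V) array of log p(word | previous ground-truth words).
    # target: length-T array with the ground-truth word index at each step.
    return -log_probs[np.arange(len(target)), target].sum()

def self_critical_loss(sample_log_probs, r_sample, r_greedy):
    # sample_log_probs: log-probabilities of the words of a sampled caption.
    # The reward of the greedily decoded caption acts as the baseline.
    return -(r_sample - r_greedy) * sample_log_probs.sum()

# Toy check: a uniform model over 4 words and a 3-word target.
T, V = 3, 4
log_probs = np.full((T, V), np.log(1.0 / V))
xe = cross_entropy_loss(log_probs, np.array([0, 2, 1]))
print(np.isclose(xe, 3 * np.log(4)))  # True

# A sampled caption that scores higher than the greedy one gets a positive
# advantage, so minimising the loss raises its log-probability.
rl = self_critical_loss(np.log(np.array([0.5, 0.5])), r_sample=1.2, r_greedy=1.0)
print(rl > 0)  # True
```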

4 Experiment

4.1 Datasets and Evaluation Metrics

Our model is trained on the MSCOCO image captioning dataset [5]. We follow Karpathy's splits [15], with 113,287 images in the training set, 5,000 images in the validation set and 5,000 images in the test set. Each image has 5 captions as ground truth. We discard the words which occur fewer than 4 times, and the final vocabulary size is 10,369. We test our model on both Karpathy's offline test set (5,000 images) and the MSCOCO online testing dataset (40,775 images). We use Bleu [27], METEOR [3], ROUGE-L [20], CIDEr [31], and SPICE [1] as evaluation metrics.
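The vocabulary filtering described above (dropping words that occur fewer than 4 times) can be sketched as follows. Lower-casing, whitespace tokenisation and the `<unk>` placeholder token are assumptions made for the illustration, not details stated in the paper.

```python
from collections import Counter

def build_vocab(captions, min_count=4, unk="<unk>"):
    # Count every whitespace-separated, lower-cased word across all captions.
    counts = Counter(word for caption in captions for word in caption.lower().split())
    # Keep only words seen at least min_count times; rarer words map to unk.
    kept = sorted(word for word, n in counts.items() if n >= min_count)
    return [unk] + kept

def encode(caption, vocab, unk="<unk>"):
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, index[unk]) for word in caption.lower().split()]

captions = ["a dog runs"] * 4 + ["a cat sleeps"] * 3
vocab = build_vocab(captions)
print(vocab)                        # ['<unk>', 'a', 'dog', 'runs']
print(encode("a cat runs", vocab))  # [1, 0, 3]
```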

4.2 Implementation Details

Following previous work, we first train Faster R-CNN on Visual Genome [18], using ResNet-101 [11] pretrained on ImageNet [7] as the backbone. For each image, we detect informative regions; the boundaries of each are first normalised and then used to compute the hierarchical graph matrices. We then train our proposed model for image captioning using the computed hierarchical graph matrices and the extracted features for each image region. We first train our model with the cross-entropy loss for 25 epochs, decaying the learning rate from its initial value every 3 epochs. Our model is optimized with Adam [16] with a batch size of 10. We then further optimize our model by reinforcement learning for another 35 epochs. The size of the decoder's LSTM layer is set to 1024, and beam search of size 3 is used in the inference stage.
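Beam search with a beam of size 3, as used at inference, can be sketched generically as follows. This is a textbook version and not the authors' decoder: `step_fn`, the toy next-word table and the token names are hypothetical stand-ins for the captioning model's next-word distribution.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=3, max_len=20):
    # step_fn(prefix) returns {next_token: log_probability} for that prefix.
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in step_fn(seq).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:  # keep the best beam_size hypotheses
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses still open when max_len is reached
    return max(finished, key=lambda c: c[1])[0]

# Toy next-word probabilities, conditioned only on the previous word.
TOY = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "</s>": 0.5},
    "the": {"dog": 0.9, "</s>": 0.1},
    "dog": {"</s>": 1.0},
}

def toy_step(seq):
    return {tok: math.log(p) for tok, p in TOY[seq[-1]].items()}

best = beam_search(toy_step, "<s>", "</s>", beam_size=3)
print(best)  # ['<s>', 'the', 'dog', '</s>']
```

In this toy table a greedy decoder (beam size 1) starts with "a" because it is the most probable first word, while the wider beam finds the sequence with the highest overall probability.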

4.3 Experiment Results

We compare our model's performance with published image captioning models. The compared models include the top performing single-stage attention model, Att2all [29]; the two-stage attention based models n-babytalk [24] and up-down [2]; the visual scene graph based models GCN-LSTM [37], AUTO-ENC [36], ALV [9] and GCN-LSTM-HIP [38]; and the transformer based models Entangle-T [19], AoA [14] and VORN [12]. The comparison on the MSCOCO Karpathy offline test set is shown in Table 1. Our model achieves a new state of the art on the CIDEr and SPICE scores, while its other evaluation scores are comparable to those of the previous top performing models. Note that because most visual scene graph based models fuse semantic and spatial scene graphs, and require auxiliary models to first build the scene graph, our model is more computationally efficient. VORN [12] also integrates spatial attention in its model, and our model outperforms it on all evaluation metrics, which shows the superiority of our hierarchical graph. The MSCOCO online testing results are listed in Tab. 2; our model outperforms previous transformer based models on several evaluation metrics.

model               B1          B4          M           R           C
                    c5    c40   c5    c40   c5    c40   c5    c40   c5     c40
scene graph based model
GCN-LSTM [37]       80.8  95.9  38.7  69.7  28.5  37.6  58.5  73.4  125.3  126.5
AUTO-ENC [36]       -     -     38.5  69.7  28.2  37.2  58.6  73.6  123.8  126.5
ALV [9]             79.9  94.7  37.4  68.3  28.2  37.1  57.9  72.8  123.1  125.5
GCN-LSTM-HIP [38]   81.6  95.9  39.3  71.0  28.8  38.1  59.0  74.1  127.9  130.2
transformer based model
Entangle-T [19]     81.2  95.0  38.9  70.2  28.6  38.0  58.6  73.9  122.1  124.4
AoA [14]            81.0  95.0  39.4  71.2  29.1  38.5  58.9  74.5  126.9  129.6
Ours                81.2  95.4  39.6  71.5  29.1  38.4  59.2  74.5  127.4  129.6
Table 2: Leaderboard of recently published models on the MSCOCO online testing server. Some entries are fusions of two models.

4.4 Ablation Study and Analysis

In the ablation study, we use AoA [14] as a strong baseline (with a single multi-head dot-product attention module per layer), which adds a gated linear layer [6] on top of the multi-head attention. Our experiments are based on publicly released code. In the encoder part, we study the effect of the hierarchy, which we ablate by simply taking the mean output of the three sub-transformers in each layer, reformulating Eqs. 6 and 7 as: $\mathrm{Attention}(Q, K_i, V_i) = \mathrm{Softmax}\left(QK_i^T / \sqrt{D_k}\right)V_i$ and $A_m = \mathrm{Norm}\big(A + \frac{1}{3}\sum_{i \in \{p, n, c\}} \mathrm{MultiHead}(Q, K_i, V_i)\big)$. We also study where to use our proposed hierarchical graph encoding transformer layer in the encoding part: in the first layer, the second layer, the third layer, or all three. In the decoding part, we study the effect of the number of sub-transformers ($M$ in Eq. 10) in the implicit decoding transformer layer.

model                                 Bleu1  Bleu4  METEOR  ROUGE-L  CIDEr  SPICE
baseline (AoA)                        77.0   36.5   28.1    57.1     116.6  21.3
positions to embed our hierarchical graph encoding transformer layer
baseline+layer1                       77.8   36.8   28.3    57.3     118.1  21.3
baseline+layer2                       77.2   36.8   28.3    57.3     118.2  21.3
baseline+layer3                       77.0   37.0   28.2    57.1     117.3  21.2
baseline+layer1,2,3                   77.5   37.0   28.3    57.2     118.2  21.4
effect of hierarchy in the encoder
baseline+layer1,2,3 w/o hierarchy     77.5   36.8   28.2    57.1     117.8  21.4
number of sub-transformers in the implicit decoding transformer layer
baseline+layer1,2,3 (M=2)             77.5   37.6   28.4    57.4     118.8  21.3
baseline+layer1,2,3 (M=3)             78.0   37.4   28.4    57.6     119.1  21.6
baseline+layer1,2,3 (M=4)             77.5   37.8   28.4    57.5     118.6  21.4
Table 3: Ablation study; results are reported without RL training. baseline+layer1 means that only the first layer of the encoding transformer uses our proposed hierarchical transformer layer, while the other layers use the original one. M is the number of sub-transformers in the decoding transformer layer.

As we can see from Tab. 3, widening the encoding transformer layer clearly improves the model's performance (the CIDEr score rises from 116.6 for the baseline to 118.2 when all three layers are widened). However, not all layers in the encoding transformer contribute equally: when we use our proposed transformer layer only at the top layer of the encoding part, the improvement is reduced. This may be because spatial relationships at the top layer of the transformer are not as informative; we use our hierarchical transformer layer at all layers in the encoding part. When we remove the hierarchy from our proposed wider transformer layer, there is also some performance reduction, which shows the importance of the hierarchy in our design. Widening the decoding transformer increases the improvement further (the CIDEr score increases from 118.2 to 119.1 after widening the decoding transformer layer with 3 sub-transformers). Wider is not always better, however: with 4 sub-transformers in the decoding transformer layer, there is some performance decrease, so the final design of our decoding transformer layer has 3 sub-transformers in parallel. Qualitative examples of our model's results are shown in Fig. 5. As we can see, the baseline model without spatial relationships wrongly described the police officers on a red bus (top right), and people on a train (bottom left).

Figure 5: Qualitative examples from our method on the MSCOCO image captioning dataset [5], compared against the ground truth annotation and a strong baseline method (AoA [14]).

Encoding implicit graph visualisation:

The transformer layer can be seen as an implicit graph, which relates the informative regions through dot-product attention. In Fig. 6 we visualise how our proposed hierarchical graph transformer layer learns to connect the informative regions through attention. In the top example, the original transformer layer strongly relates the train with the people on the mountain, yielding a wrong description, while our proposed transformer layer relates the train with the tracks and the mountain; in the bottom example, the original transformer relates the bear with its reflection in the water and treats them as ‘two bears’, while our transformer can distinguish the bear from its reflection and relate it to the snow area.

Decoding feature space visualisation:

We also visualised the output of our decoding transformer layer (Fig. 7). Compared to the original decoding transformer layer, which only has one sub-transformer inside it, the output of our proposed implicit decoding transformer layer covers a larger area in the reduced feature space, which means that our decoding transformer layer decodes more information from the image regions. In the original feature space (1,024 dimensions) of the decoding transformer layer's output, we compute the trace of the feature maps' covariance matrix from 1,000 examples; the trace is larger for our wider decoding transformer layer than for the original transformer layer, which indicates that our design enables the decoder's output to cover a larger area in the feature space. However, it appears that individual sub-transformers in the decoding transformer layer still do not learn to disentangle different factors in the feature space (as there is no distinct cluster in the output of each sub-transformer). We speculate that this is because we have no direct supervision on their outputs, so they may not be able to learn disentangled features automatically [22].
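The covariance-trace comparison described above can be reproduced in spirit with a few lines of NumPy. The sketch uses synthetic Gaussian features in place of the real decoder outputs, so the arrays, their sizes and the function name are assumptions for illustration only.

```python
import numpy as np

def covariance_trace(features):
    # features: (num_examples, dim). The trace of the covariance matrix equals
    # the sum of the per-dimension variances, i.e. the total spread of the features.
    cov = np.cov(features, rowvar=False)
    return float(np.trace(cov))

rng = np.random.default_rng(0)
narrow = rng.normal(scale=1.0, size=(1000, 16))  # stands in for a less spread-out output
wide = rng.normal(scale=2.0, size=(1000, 16))    # stands in for a more spread-out output
print(covariance_trace(narrow) < covariance_trace(wide))  # True
```

A larger trace means the features vary more in total across examples, which is the sense in which the decoder output is said to cover a larger area of the feature space.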

Figure 6: A visualization of how the query region relates to its key regions through attention. The region in the red bounding box is the query region and the other regions are key regions. The transparency of each key region shows its dot-product attention weight with the query region. Higher transparency means a larger dot-product attention weight, and vice versa.
(a) original
(b) ours
Figure 7: t-SNE [26] visualisation of the output from the decoding transformer layer (1,000 examples). Different colors represent the outputs of different sub-transformers in the decoder of our model.

5 Discussion and Conclusion

In this work, we introduced the image transformer architecture. The core idea behind the proposed architecture is to widen the original transformer layer, designed for machine translation, to adapt it to the structure of images. In the encoder, we widen the transformer layer by exploiting the hierarchical spatial relationships between image regions, and in the decoder, the wider transformer layer can decode more information from the image regions. Extensive experiments show the superiority of the proposed model, and qualitative and quantitative analyses validate the proposed encoding and decoding transformer layers. Compared to the previous top models in image captioning, our model achieves a new state-of-the-art SPICE score, while on the other evaluation metrics it is either comparable to or better than the previous best models, with better computational efficiency.

We hope our work can inspire the community to develop more advanced transformer based architectures that benefit not only image captioning but also other computer vision tasks that need relational attention. Our code will be shared with the community to support future research.


  • [1] P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382–398.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086.
  • [3] S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72.
  • [4] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua (2017) SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5659–5667.
  • [5] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft COCO captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  • [6] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 933–941.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • [8] F. A. Gers, J. Schmidhuber, and F. Cummins (2000) Learning to forget: continual prediction with LSTM. Neural Computation 12 (10), pp. 2451–2471.
  • [9] L. Guo, J. Liu, J. Tang, J. Li, W. Luo, and H. Lu (2019) Aligning linguistic words and visual semantic units for image captioning. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 765–773.
  • [10] J. Hao, X. Wang, S. Shi, J. Zhang, and Z. Tu (2019) Multi-granularity self-attention for neural machine translation. arXiv preprint arXiv:1909.02222.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [12] S. Herdade, A. Kappeler, K. Boakye, and J. Soares (2019) Image captioning: transforming objects into words. In Advances in Neural Information Processing Systems, pp. 11135–11145.
  • [13] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [14] L. Huang, W. Wang, J. Chen, and X. Wei (2019) Attention on attention for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4634–4643.
  • [15] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §4.1.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [17] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §2.3.
  • [18] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §2.2, §4.2.
  • [19] G. Li, L. Zhu, P. Liu, and Y. Yang (2019) Entangled transformer for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8928–8937. Cited by: §1, §1, §1, §2.4, Table 1, §4.3, Table 2.
  • [20] C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In

    Proc. ACL workshop on Text Summarization Branches Out

    pp. 10. Cited by: §4.1.
  • [21] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §2.2.
  • [22] F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem (2019)

    Challenging common assumptions in the unsupervised learning of disentangled representations

    In Proceedings of the 36th International Conference on Machine Learning-Volume 97, pp. 4114–4124. Cited by: §4.4.
  • [23] J. Lu, C. Xiong, D. Parikh, and R. Socher (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 375–383. Cited by: §2.1.
  • [24] J. Lu, J. Yang, D. Batra, and D. Parikh (2018) Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7219–7228. Cited by: §1, §2.2, Table 1, §4.3.
  • [25] W. Luo, Y. Li, R. Urtasun, and R. Zemel (2016) Understanding the effective receptive field in deep convolutional neural networks. In Advances in neural information processing systems, pp. 4898–4906. Cited by: §2.1.
  • [26] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: Figure 7.
  • [27] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §4.1.
  • [28] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1.
  • [29] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: §3.4, Table 1, §4.3.
  • [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.4, §3.
  • [31] R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: §3.4, §4.1.
  • [32] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §2.1.
  • [33] X. Wang, Z. Tu, L. Wang, and S. Shi (2019) Self-attention with structural position representations. arXiv preprint arXiv:1909.00383. Cited by: §2.4.
  • [34] Y. Wang, H. Lee, and Y. Chen (2019) Tree transformer: integrating tree structures into self-attention. arXiv preprint arXiv:1909.06639. Cited by: §2.4.
  • [35] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §1, §2.1.
  • [36] X. Yang, K. Tang, H. Zhang, and J. Cai (2019) Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10685–10694. Cited by: §1, §2.3, Table 1, §4.3, Table 2.
  • [37] T. Yao, Y. Pan, Y. Li, and T. Mei (2018) Exploring visual relationship for image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 684–699. Cited by: §1, §2.3, Table 1, §4.3, Table 2.
  • [38] T. Yao, Y. Pan, Y. Li, and T. Mei (2019) Hierarchy parsing for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2621–2629. Cited by: §1, §1, §2.3, Table 1, §4.3, Table 2.