Fusion Models for Improved Visual Captioning

10/28/2020 ∙ by Marimuthu Kalimuthu, et al. ∙ Universität Saarland 1

Visual captioning aims to generate textual descriptions given images. Traditionally, captioning models are trained on human-annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models and makes them prone to mistakes. Language models can, however, be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders and coherent text generators. Meanwhile, several unimodal and multimodal fusion techniques have been shown to work well for natural language generation and automatic speech recognition. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of our work is two-fold. First, we propose a generic multimodal model fusion framework for caption generation as well as emendation, in which we use different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within traditional encoder-decoder visual captioning frameworks. Second, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. Show, Attend, and Tell, for emending both syntactic and semantic errors in captions. Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MS-COCO, show improvements over the baseline, indicating the usefulness of our proposed multimodal fusion strategies. Further, we perform a preliminary qualitative analysis of the emended captions and identify error categories based on the type of corrections.




1 Introduction

The field of deep learning has seen tremendous progress ever since the breakthrough results of AlexNet [16] on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Significant algorithmic improvements have since been achieved, both for language [30, 10] and visual scene understanding [13, 6]. Further, there have been numerous efforts toward the joint understanding of language and visual modalities in terms of defining novel tasks, creating benchmark datasets, and proposing new methods to tackle the challenges arising out of the multimodal nature of data [21]. One specific problem that has piqued the interest of researchers and practitioners in Computer Vision (CV) and Natural Language Processing (NLP) is Visual Captioning. Considerable progress has been made since the introduction of Show and Tell [32], which is an end-to-end model for image captioning. Despite these encouraging developments, visual captioning models are still brittle, unreliable, and prone to making unexplainable mistakes [26], and, most importantly, their performance is nowhere close to human-level understanding. However, although language models remain black boxes, the recent unprecedented progress in language modeling has demonstrated their potential to generate semantically fluent and coherent text [4, 25] and to encode powerful representations of language using bidirectional context [10]. Building on these developments, we propose to incorporate external language models into visual captioning frameworks to aid and improve their capabilities both for description generation and emendation. Although the proposed architecture (see Figure 1) could be used for both caption generation and correction, in this paper we focus only on the task of emending captions. However, we describe the changes to the architecture that are needed to achieve caption generation. Broadly, our architecture consists of four major components, viz. a CNN encoder, an LSTM decoder, a pretrained AuxLM, and a Fusion module. To the best of our knowledge, our architecture is novel since the current caption editing approaches [27, 28] do not leverage AuxLMs.

The rest of the paper is organized as follows. In Section 2, we briefly review relevant works on caption editing. In Section 3, we provide background on fusion strategies that have been empirically shown to work well for neural machine translation, automatic speech recognition, and story generation, and we introduce our fusion architecture. Following this, we describe our experimental setup, including the implementation details of our caption editing model, in Section 4. We then present our quantitative results along with a short qualitative analysis of the emended captions in Section 5. Finally, we conclude our paper in Section 6 by discussing some future research directions.

2 Related Work

Recently, a few approaches have been proposed for editing image captions [27, 28]. Although these methods have been shown to produce improved quantitative results, they have some limitations, such as relying entirely on labelled training data, which is limited in size and diversity. In addition, caption annotations are costly to obtain and, in some cases, cannot be obtained at all due to privacy concerns, as is the case, for instance, in the medical domain. Furthermore, these approaches implicitly learn a language model on the decoder side with limited textual data. Moreover, they do not make use of external language models, i.e., AuxLMs, which can be trained on freely available unlabelled data, are more powerful since they learn rich language representations [10, 25], and are trained on enormous amounts of data with billions of parameters [25, 4]. To the best of our knowledge, there have been no previous works that incorporate AuxLMs into the caption model, either for description generation or for correcting errors. To address the above limitations and leverage the advantages of AuxLMs as powerful language learners, we propose a generic framework that can accomplish both caption generation and editing tasks depending on the type of AuxLM used.

3 Fusion Techniques and Variations

Unimodal and multimodal model fusion has been explored extensively in the context of ASR [29, 7], Neural Machine Translation (NMT) [12], and hierarchical story generation [11]. However, to the best of our knowledge, there have been no similar works for visual captioning. In this section, we review these fusion methods and relate them to our goal of achieving image caption generation and emendation by using additional language models, i.e., AuxLMs.

3.0.1 Deep Fusion.

Gulcehre et al. [12] explored Deep Fusion for NMT, in which a translation model trained on parallel corpora and an AuxLM trained on monolingual target-language data are combined using a trainable single-layer neural network. Typically, the AuxLM is an autoregressive LSTM-based recurrent neural network [20], while the translation model follows a typical sequence-to-sequence (Seq2Seq) [8] encoder-decoder architecture. To achieve Deep Fusion, a gate is learned from the hidden state of the AuxLM, and the gated AuxLM hidden state is then concatenated with the hidden state of the translation model.

The drawback of this approach is that both AuxLM and the translation model are trained separately and kept frozen while the fusion layer is trained. Hence, they never get a chance to communicate and adapt their parameters during training. This forces these models to learn redundant representations.
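The gating step of Deep Fusion described above can be written compactly. This is a sketch only, with assumed symbol names: $s_t^{\mathrm{LM}}$ and $s_t^{\mathrm{TM}}$ stand for the hidden states of the AuxLM and the translation model at step $t$, $v_g$ and $b_g$ are the trainable gate parameters, and $\mathrm{sigmoid}$ is the logistic function; see Gulcehre et al. [12] for the full output layer.

```latex
\begin{align}
g_t &= \mathrm{sigmoid}\left(v_g^{\top} s_t^{\mathrm{LM}} + b_g\right), \\
s_t^{\mathrm{DF}} &= \left[\, s_t^{\mathrm{TM}};\; g_t\, s_t^{\mathrm{LM}} \,\right].
\end{align}
```

Only the gate and the output layer that consumes $s_t^{\mathrm{DF}}$ are trained; both underlying models stay frozen, which is the limitation discussed above.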

3.0.2 Cold Fusion.

To address the limitation of Deep Fusion, Sriram et al. [29] introduced the Cold Fusion strategy for ASR. The mechanism uses a pretrained AuxLM while training the Seq2Seq model. The AuxLM is still kept frozen, as in Deep Fusion; however, the parameters of the fusion layer and the Seq2Seq decoder are trained together, so that the decoder can leverage the language representations already learned by the AuxLM. For more details, see Section 3.2.2.

3.0.3 Hierarchical Fusion.

In a similar vein, Fan et al. [11] explored a sophisticated fusion strategy for textual story generation in a hierarchical fashion, in which they combine a pretrained convolutional Seq2Seq model with a trainable convolutional Seq2Seq model. More specifically, the hidden states of the pretrained Seq2Seq model are fused with the hidden states of the trainable Seq2Seq model using a slightly modified version of the Cold Fusion [29] approach. We refer to this scheme as Hierarchical Fusion hereafter.

Considering the advantages of the fusion techniques of Sriram et al. [29] and Fan et al. [11] over Deep Fusion, we adapt them in our fusion model framework.

3.1 Auxiliary Language Model

Depending on the task, i.e., caption generation or editing, two classes of language models can be used as AuxLMs. For caption generation, since we do not have access to the right-side context at test time, an autoregressive (i.e., causal) language model [3, 19, 25, 4] that predicts the next token using only the left-side context is suitable as the AuxLM. For the caption editing task, however, a bidirectionally contextualized language model such as BERT [10] is better suited as the AuxLM, since such models encode sequences better than their unidirectional counterparts. Intuitively, since our task is to edit captions, we already have a noisy version of each caption, either from the baseline or from another off-the-shelf caption model, which we want to correct if it contains errors and leave unmodified otherwise. In our experiments, we use a pretrained BERT model (uncased) that has been trained on English Wikipedia and the BooksCorpus using a combination of masked language modeling and next-sentence prediction objectives. Although it is possible to fine-tune the pretrained AuxLM on the target domain (the image captions, in our case) before integrating it into the fusion model, we did not perform this step in our experiments. However, fine-tuning would likely prove beneficial when adapting AuxLMs to more focused domains such as medical [24] or remote sensing [18] imagery, and we leave this pursuit for future work.

3.2 Fusion Strategies and Architecture

Inspired by the success of multimodal fusion strategies for ASR [29, 7] and unimodal fusion for NLG [12, 11], we slightly modify these fusion schemes and apply them to emending image captions. By doing so, we leverage the rich language representations of AuxLMs to achieve sentence-level fluency and grammatical correctness in the emended captions.

Figure 1: Architecture of our proposed fusion model. The encoded image is fed to the LSTM decoder only at the first time step. BERT MLM [10] has been pretrained to predict the masked token at current time step whereas the LSTM decoder is trained to predict the next token given all previous tokens. The Fusion Module can be any instance of the fusion schemes discussed in Section 3.2.

Figure 1 depicts the general architecture of the proposed fusion model. It contains four major components, namely a ConvNet encoder, an LSTM decoder, a BERT encoder (pretrained MLM), and a Fusion module. The architecture is flexible and can be used for two tasks, i.e., visual captioning and caption emendation, depending on the type of AuxLM chosen. For caption emendation, a text encoder such as BERT is used as an AuxLM. The LSTM decoder processes the sequence from left-to-right whereas the BERT model utilizes the entire sequence due to its inherent nature of encoding contextualized representation.

For the visual captioning task, the AuxLM must be an autoregressive model since we do not have access to the whole sequence at inference time. One possibility is to replace the BERT MLM component with an LSTM-based AuxLM or with the recently proposed, more powerful Transformer-based [30] autoregressive language model GPT-3 [4].

Further, for both captioning and emendation tasks, the Fusion Module component is flexible enough to support any sophisticated fusion method, which provides a framework and an opportunity for future work to develop improved fusion schemes. In addition, the architecture could be employed in a domain-adapted way for visual captioning by integrating an AuxLM that has been trained in the domain of interest. There has been growing interest in recent years in automatically generating descriptions for medical images (see https://www.imageclef.org/2020/medical/caption/) such as radiology outputs [24, 14]. A domain-adapted AuxLM can be particularly useful in settings where labelled image caption data is scarce but unlabelled textual data is abundant. In such scenarios, an AuxLM can be trained on the target domain data and integrated into our fusion model framework for generating target-domain-specific image descriptions.

Our fusion model operates on two hidden states. Following Sriram et al. [29], we use the final-layer hidden states of the pretrained BERT model and of the trainable LSTM decoder.

3.2.1 Simple Fusion (SF).

One of the simplest possible fusion mechanisms is the concatenation of the hidden states of the pretrained AuxLM and the trainable visual captioning model, followed by a single projection layer with a non-linear activation function, such as ReLU.

The output of this non-linear transformation can then be passed through a single linear layer with dropout to obtain prediction scores over the vocabulary.
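The simple fusion step described above can be sketched as a pair of equations. The symbol names here are assumptions made for illustration: $h_t^{\mathrm{LSTM}}$ and $h_t^{\mathrm{BERT}}$ denote the final-layer hidden states of the LSTM decoder and the pretrained BERT model at step $t$, $\sigma$ is the non-linear activation (ReLU in the experiments), $W, b$ are the projection parameters, and $W_o, b_o$ are the parameters of the output layer.

```latex
\begin{align}
r_t^{\mathrm{SF}} &= \sigma\left(W\,[\,h_t^{\mathrm{LSTM}};\; h_t^{\mathrm{BERT}}\,] + b\right), \\
s_t &= W_o\,\mathrm{Dropout}\left(r_t^{\mathrm{SF}}\right) + b_o .
\end{align}
```

Here $s_t$ holds the prediction scores over the vocabulary, and $[\,\cdot\,;\,\cdot\,]$ is concatenation.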

3.2.2 Cold Fusion (CF).

A more sophisticated fusion can be achieved by introducing gates, thereby allowing the captioning model and the AuxLM to moderate the information flow between them during the training phase. We use a slightly modified version of the cold fusion approach of Sriram et al. [29] in our fusion model.

As with simple fusion, the resulting representation is followed by a single linear layer with dropout to obtain prediction scores over the vocabulary.
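A gated fusion in the style of Sriram et al. [29] can be sketched as follows. This should be read as an illustration of the mechanism and not as the exact variant used in the experiments; all symbol names ($h_t^{\mathrm{LSTM}}$, $h_t^{\mathrm{BERT}}$, $h_t^{\mathrm{LM}}$, $g_t$, $r_t^{\mathrm{CF}}$, and the indexed parameters $W_i, b_i$) are assumptions, and feeding the AuxLM hidden state instead of its logits follows the description in this section.

```latex
\begin{align}
h_t^{\mathrm{LM}} &= \sigma\left(W_1\, h_t^{\mathrm{BERT}} + b_1\right), \\
g_t &= \sigma\left(W_2\,[\,h_t^{\mathrm{LSTM}};\; h_t^{\mathrm{LM}}\,] + b_2\right), \\
h_t^{\mathrm{CF}} &= [\,h_t^{\mathrm{LSTM}};\; g_t \circ h_t^{\mathrm{LM}}\,], \\
r_t^{\mathrm{CF}} &= \sigma\left(W_3\, h_t^{\mathrm{CF}} + b_3\right).
\end{align}
```

The gate $g_t$ decides, per dimension, how much of the projected AuxLM state enters the fused representation; $\circ$ is the Hadamard product.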

3.2.3 Hierarchical Fusion (HF).

In the context of text generation, an advanced fusion mechanism based on Cold Fusion has been introduced by Fan et al. [11] for the open-ended and creative task of story generation. We adopt their way of model fusion with minor modifications, in the spirit of keeping the model simple. More specifically, after learning two separate gates followed by a concatenation, we use only a single linear layer with GLU activations [9] instead of five. Further, to capture the rich sequence representation for caption editing, we use an MLM as the AuxLM instead of a convolutional Seq2Seq model. We refer to Fan et al. [11] for full details of their fusion mechanism.

Again, the result of the final GLU is passed through a single linear layer with dropout to obtain prediction scores over the image caption vocabulary.
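The two-gate scheme described above can be sketched as follows. This is one plausible reading of the prose and not a reproduction of the original equations; the symbols $g_t^{\mathrm{LSTM}}$, $g_t^{\mathrm{BERT}}$, $c_t$, $r_t^{\mathrm{HF}}$ and the indexed parameters are assumed names, and the GLU definition follows Dauphin et al. [9].

```latex
\begin{align}
g_t^{\mathrm{LSTM}} &= \sigma\left(W_1\,[\,h_t^{\mathrm{LSTM}};\; h_t^{\mathrm{BERT}}\,] + b_1\right) \circ h_t^{\mathrm{LSTM}}, \\
g_t^{\mathrm{BERT}} &= \sigma\left(W_2\,[\,h_t^{\mathrm{LSTM}};\; h_t^{\mathrm{BERT}}\,] + b_2\right) \circ h_t^{\mathrm{BERT}}, \\
c_t &= [\,g_t^{\mathrm{LSTM}};\; g_t^{\mathrm{BERT}}\,], \\
r_t^{\mathrm{HF}} &= \mathrm{GLU}\left(W_3\, c_t + b_3\right),
\qquad \mathrm{GLU}([\,u;\, v\,]) = u \circ \mathrm{sigmoid}(v).
\end{align}
```

Each gate is computed from both hidden states but scales only one of them, so the decoder and the AuxLM contributions are weighted separately before the single GLU layer mixes them.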

In all our fusion methods (i.e., SF, CF, and HF), ";" represents concatenation, gates are applied through the Hadamard (element-wise) product, and the non-linear activation function is ReLU. The gating parameters W and b, which are part of the Fusion Module (see Figure 1), are learned while training the LSTM decoder of the caption model, whereas all the parameters of the BERT MLM are kept frozen.

4 Experiments

We train one baseline and three fusion models on the three commonly used image captioning datasets: Flickr8k, Flickr30k, and MS-COCO. Descriptions and implementation details are given below.

4.1 Baseline

The baseline is a model without any AuxLM component in its architecture. Any off-the-shelf visual captioning model satisfying this condition can be used as a baseline. The only requirement is that it should be possible to generate captions for the test set images of the dataset in question. In our experiments, we use Show, Attend, and Tell [34], where we replace the original VGGNet with an ImageNet-pretrained ResNet-101 [13] that encodes the images to a feature map of size 14 x 14 x 2048. For the decoder, we use a standard LSTM with two hidden layers. After training, we use a beam size of 5 for generating captions on the test sets of the respective datasets.
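For readers unfamiliar with beam search decoding, the following is a minimal sketch of the procedure with a beam size of 5. It is not the implementation used in the experiments: `step_fn` is a hypothetical stand-in for the trained decoder that maps a partial token sequence to log-probabilities of possible next tokens, and the toy distribution and the `<s>` / `</s>` markers are invented for illustration.

```python
import math


def beam_search(step_fn, start, end, beam_size=5, max_len=20):
    """Return the highest-scoring (sequence, log-probability) pair."""
    beams = [([start], 0.0)]   # unfinished hypotheses
    finished = []              # hypotheses that produced the end token
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)     # keep hypotheses cut off by max_len
    return max(finished, key=lambda c: c[1])


def toy_step(seq):
    """Hypothetical next-token log-probabilities (illustration only)."""
    if len(seq) == 1:
        return {"a": math.log(0.9), "</s>": math.log(0.1)}
    if len(seq) == 2:
        return {"cat": math.log(0.6), "dog": math.log(0.4)}
    return {"</s>": 0.0}


best_seq, best_score = beam_search(toy_step, "<s>", "</s>", beam_size=5)
```

This sketch ranks hypotheses by raw summed log-probability; practical decoders often add length normalization, which is omitted here.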

4.2 Fusion Model Training

For each of the datasets, we train three caption models with different initializations for each of the fusion techniques (i.e., SF, CF, and HF) proposed in Section 3.2.

4.2.1 Implementation Details.

First, we lowercase the captions and tokenize them using WordPiece tokenization [33] (https://github.com/google-research/bert), in the same way as was done for training the BERT model. This consistency in tokenization is important for successful training since the captioning model relies on the AuxLM for the hidden state representations at all time steps throughout the training and testing phases. Tokens appearing fewer than 5 times are replaced with a special unk token, yielding a vocabulary size of 25k. We implement our fusion models in PyTorch [23].
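The rare-token replacement described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: captions are taken to be already lowercased and tokenized (the WordPiece step is not reproduced here), the spelling `[UNK]` for the special token is assumed, and the function names are hypothetical.

```python
from collections import Counter

UNK = "[UNK]"  # assumed spelling of the special unknown token


def build_vocab(tokenized_captions, min_count=5):
    """Keep tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(tok for cap in tokenized_captions for tok in cap)
    return {tok for tok, c in counts.items() if c >= min_count}


def replace_rare(tokens, vocab):
    """Map every out-of-vocabulary token to the unknown token."""
    return [tok if tok in vocab else UNK for tok in tokens]


# Toy corpus: "a" and "dog" occur 5+ times, "zebra" only once.
captions = [["a", "dog"]] * 5 + [["a", "zebra"]]
vocab = build_vocab(captions)
cleaned = replace_rare(["a", "zebra"], vocab)
```

With the threshold of 5 stated above, "zebra" falls out of the vocabulary and is mapped to the unknown token.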


4.2.2 Decoder and Fusion Module Training.

The images are rescaled to a fixed size of 256 x 256 and encoded using ResNet-101 [13], which is pretrained on ImageNet and kept frozen throughout training. As with the baseline model, we use an LSTM with 2 hidden layers and set the embedding and decoder dimensions to 1024. The LSTM decoder takes the token at the current time step along with the previous history and predicts the next token. The BERT model, however, consumes the entire sequence, with the token at the next time step masked using a special [MASK] token, and predicts this masked token (see Figure 1). The hidden state representations of both the LSTM decoder and BERT are then passed to the Fusion Module, which can be any one of the fusion mechanisms discussed in Section 3.2, to predict the next token (seen from the perspective of the LSTM decoder).
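The different views of the caption seen by the two models at a given time step can be illustrated with a small sketch. The helper names are hypothetical and the caption is assumed to be already tokenized; the sketch only shows which tokens each model receives, not the models themselves.

```python
MASK = "[MASK]"


def bert_input_at_step(tokens, t):
    """Whole sequence, with the token at position t + 1 masked."""
    if not 0 <= t < len(tokens) - 1:
        raise IndexError("t must leave room for a next token")
    masked = list(tokens)
    masked[t + 1] = MASK
    return masked


def lstm_input_at_step(tokens, t):
    """Left context only: tokens up to and including position t."""
    return list(tokens[: t + 1])


caption = ["a", "woman", "riding", "a", "wave"]
bert_view = bert_input_at_step(caption, 0)
lstm_view = lstm_input_at_step(caption, 0)
```

At step 0 the decoder sees only the first token, while BERT sees the full noisy caption with the second token masked, so both models are asked about the same position.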

We minimize the Cross Entropy loss using the Adam optimizer [15] with a learning rate of

and a batch size of 128. The model is initially scheduled to train for 7 epochs. However, the learning rate is halved if the validation BLEU does not improve for 2 consecutive epochs, and early stopping is triggered if it does not improve for 4 consecutive epochs.
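The learning rate and early stopping rule can be sketched as a small state machine over per-epoch validation scores. This is a sketch under assumptions: the function name and the default `initial_lr` are placeholders (the actual learning rate is not given here), and the rate is assumed to be halved exactly once, when the count of non-improving epochs reaches 2.

```python
def run_schedule(val_bleu_per_epoch, initial_lr=1e-3, max_epochs=7,
                 halve_after=2, stop_after=4):
    """Return a list of (epoch, learning_rate, epochs_without_improvement)."""
    lr = initial_lr          # placeholder value, not taken from the paper
    best = float("-inf")
    stale = 0                # consecutive epochs without improvement
    history = []
    for epoch, bleu in enumerate(val_bleu_per_epoch[:max_epochs], start=1):
        if bleu > best:
            best, stale = bleu, 0
        else:
            stale += 1
            if stale == halve_after:
                lr /= 2      # halve the learning rate
        history.append((epoch, lr, stale))
        if stale >= stop_after:
            break            # early stopping
    return history


# Hypothetical validation BLEU values, for illustration only.
history = run_schedule([20.0, 21.0, 20.5, 20.8, 20.9, 20.7, 20.6],
                       initial_lr=1.0)
```

In this toy run the score peaks at epoch 2, the rate is halved at epoch 4, and training stops after epoch 6.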

4.2.3 Caption Emendation.

After the fusion model is trained, it can be used in inference mode to correct errors in the captions. As with the baseline model, we use a beam size of 5 during our evaluations.

5 Results

We evaluate our models using both quantitative and qualitative approaches. In the following, we present each of them separately.

5.1 Quantitative Analysis

To evaluate our models, i.e., the baseline and the fusion approaches, we use the standard metrics for image captioning: BLEU-{1-4} [22], METEOR [2], ROUGE-L [17], CIDEr [31], and SPICE [1]. Table 1 presents the average scores over three runs on the test sets of the "Karpathy split" (https://cs.stanford.edu/people/karpathy/deepimagesent) of the respective datasets.

Automatic Evaluation Measures

Dataset    Model  B-1   B-2   B-3   B-4   M     R-L   C     S
Flickr8k   BL     62.8  44.9  31.4  21.3  20.6  47.1  55.1  14.3
Flickr8k   SF     64.6  46.6  32.8  22.8  21.2  47.8  56.9  14.8
Flickr8k   CF     64.5  46.7  32.7  22.8  21.3  47.6  56.5  14.6
Flickr8k   HF     64.1  45.8  32    21.8  20.9  47    55.5  14.4
Flickr30k  BL     63.3  44.4  30.9  21.6  19.2  44.5  45.1  13.2
Flickr30k  SF     64.7  45.6  32.0  22.4  19.7  44.9  46.7  13.6
Flickr30k  CF     64.5  45.7  31.8  22.1  19.8  45    46.3  13.7
Flickr30k  HF     64.6  45.4  31.7  22    19.4  45    46.2  13.3
MSCOCO     BL     70.1  52.8  38.4  28    24.5  51.8  91.5  17.5
MSCOCO     SF     70.8  53.7  40.4  30.2  25.1  52.6  94.6  17.8
MSCOCO     CF     71    53.9  40.7  30.5  25.3  52.9  95    17.9
MSCOCO     HF     70.9  53.8  40.6  30.5  25    52.7  94.8  17.8

Table 1: Results of the proposed fusion methods on three benchmark image captioning datasets. BL - Baseline, SF - Simple fusion, CF - Cold fusion, HF - Hierarchical fusion, B-n - BLEU-n, M - METEOR, R-L - ROUGE-L, C - CIDEr, and S - SPICE.

It can be observed from Table 1 that all our fusion models outperform the baseline model on nearly every metric; the only exception is Hierarchical Fusion on Flickr8k, whose ROUGE-L (47) is marginally below the baseline (47.1). However, when we compare the fusion models among themselves, we find no considerable difference. To be specific, on the MS-COCO dataset the Cold Fusion strategy achieves the best score on all metrics (tied with Hierarchical Fusion on BLEU-4), while there is no clear winner for Flickr8k and Flickr30k. Nevertheless, for the Flickr8k and Flickr30k datasets the Simple Fusion model is a preferable option, as it gives the highest BLEU-4 score (tied with Cold Fusion on Flickr8k). This can be attributed to our optimization criterion, since all our models are optimized for BLEU-4 during training. This leads to an increase in the BLEU-4 score, especially for the Simple Fusion model trained on Flickr8k and Flickr30k, which are much smaller datasets than MS-COCO.

5.2 Qualitative Analysis

In Figure 2, we present the token emendation distributions of the fusion techniques on all three datasets. The distributions of edits made by the different fusion techniques are similar. To understand the edit distributions across datasets, we define the token edit range as the range from the smallest possible number of token edits, which is 1, to the largest possible number of token edits, which is the maximum caption length. We observe that the token edit range for MS-COCO (1-3) is smaller than for Flickr8k (1-5) and Flickr30k (1-4), even though MS-COCO is about 14x and 4x larger than Flickr8k and Flickr30k, respectively. This indicates the challenging nature of the Flickr caption datasets, on which the baseline model makes more mistakes and our fusion-based editing has therefore been more helpful.

(a) Flickr8k
(b) Flickr30k
Figure 2: Distribution of (token) corrections made by fusion models on Flickr8k, Flickr30k, and MS-COCO. X-axis represents how many tokens have been changed by the fusion model while the Y-axis shows the frequencies.
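One way to count how many tokens were changed between a baseline caption and its emended version is a token-level edit distance. The sketch below is an assumption about how such counts could be obtained, not a description of the procedure behind Figure 2; it splits on whitespace instead of using WordPiece, and the function name is hypothetical.

```python
def token_edit_distance(baseline, emended):
    """Levenshtein distance computed over whitespace-separated tokens."""
    a, b = baseline.split(), emended.split()
    prev = list(range(len(b) + 1))      # distances for the empty prefix of a
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


one_edit = token_edit_distance(
    "a woman in a yellow shirt is holding a large stick",
    "a man in a yellow shirt is holding a large stick",
)
three_edits = token_edit_distance(
    "a person in a helmet is riding a wave",
    "a man wearing a harness is riding a wave",
)
```

The two caption pairs are taken from the examples in the appendix; the first differs in one token and the second in three.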

Because the BLEU metric has been criticized for correlating poorly with human judgements [5], we perform a preliminary study of the emendations made by our fusion models to better understand the quality of the emended captions. We identify several types of emendations and group them broadly into the following five categories, based on whether language-related or image-related attributes have been changed in the caption.

  1. Gender: Modification of gender to correctly describe the image.

  2. Color: Modification of color to correctly describe the image.

  3. Specificity: Emendations to achieve specific captions instead of generic ones.

  4. Syntactic: Emendation to achieve syntactic correctness.

  5. Semantic: Emendations to correctly describe the scene.

It should, however, be noted that this classification is based on a preliminary study and that a comprehensive human evaluation is needed to arrive at a more fine-grained classification. For illustrations, see Appendix 0.A.

6 Conclusion

In this paper, we have proposed a generic multimodal model fusion framework that can be used for both caption generation and editing tasks, depending on the type of AuxLM that is integrated in the fusion model. We have implemented a caption editing model by integrating a pretrained BERT model and showed improved results over the baseline model on three image captioning benchmark datasets. Further, we conducted a preliminary qualitative analysis of the emended captions and identified several categories based on the image-related or language-related attributes modified in the captions. For future work, we plan to focus on three aspects. First, we will apply the proposed fusion model to the caption generation task using a state-of-the-art autoregressive language model. Second, we aim to employ our fusion model for automatic description generation of medical images while training a domain-adapted AuxLM. Third, we plan to conduct a human evaluation of the emended captions and develop a fine-grained classification of the errors corrected by our fusion model.


References

  • [1] P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In ECCV, pp. 382–398.
  • [2] S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72.
  • [3] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin (2003) A neural probabilistic language model. Journal of Machine Learning Research 3, pp. 1137–1155.
  • [4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  • [5] C. Callison-Burch, M. Osborne, and P. Koehn (2006) Re-evaluating the role of Bleu in machine translation research. In 11th Conference EACL, Trento, Italy.
  • [6] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. CoRR abs/2005.12872.
  • [7] J. Cho, S. Watanabe, T. Hori, M. K. Baskar, H. Inaguma, J. Villalba, and N. Dehak (2019) Language model integration based on memory control for sequence to sequence speech recognition. In ICASSP, Brighton, United Kingdom, pp. 6191–6195.
  • [8] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pp. 1724–1734.
  • [9] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. In Proceedings of the 34th ICML, pp. 933–941.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 NAACL, pp. 4171–4186.
  • [11] A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of ACL, pp. 889–898.
  • [12] Ç. Gülçehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. Lin, F. Bougares, H. Schwenk, and Y. Bengio (2015) On using monolingual corpora in neural machine translation. CoRR abs/1503.03535.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on CVPR, pp. 770–778.
  • [14] M. Kalimuthu, F. Nunnari, and D. Sonntag (2020) A competitive deep neural network approach for the imageclefmed caption 2020 task. CoRR abs/2007.14226.
  • [15] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR, San Diego.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114.
  • [17] C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81.
  • [18] X. Lu, B. Wang, X. Zheng, and X. Li (2017) Exploring models and data for remote sensing image caption generation. Trans. on Geoscience and Remote Sensing.
  • [19] S. Merity, N. S. Keskar, and R. Socher (2018) Regularizing and optimizing LSTM language models. In 6th ICLR, Vancouver, Conference Track Proceedings.
  • [20] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur (2010) Recurrent neural network based language model. In INTERSPEECH, pp. 1045–1048.
  • [21] A. Mogadala, M. Kalimuthu, and D. Klakow (2019) Trends in integration of vision and language research: a survey of tasks, datasets, and methods. arXiv.
  • [22] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318.
  • [23] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In NIPS, pp. 8026–8037.
  • [24] O. Pelka, C. M. Friedrich, A. García Seco de Herrera, and H. Müller (2020) Overview of the imageclefmed 2020 concept prediction task: medical image understanding. In CLEF2020 Working Notes, CEUR Workshop Proceedings.
  • [25] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9.
  • [26] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko (2018) Object hallucination in image captioning. In Proc. of EMNLP, Brussels, pp. 4035–4045.
  • [27] F. Sammani and M. Elsayed (2019) Look and modify: modification networks for image captioning. In 30th BMVC, Cardiff, pp. 75.
  • [28] F. Sammani and L. Melas-Kyriazi (2020) Show, edit and tell: a framework for editing image captions. In Proc. of CVPR, pp. 4808–4816.
  • [29] A. Sriram, H. Jun, S. Satheesh, and A. Coates (2018) Cold fusion: training seq2seq models together with language models. In Proc. Interspeech 2018, pp. 387–391.
  • [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008.
  • [31] R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. In Proc. of CVPR, pp. 4566–4575.
  • [32] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In CVPR, pp. 3156–3164.
  • [33] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • [34] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057.

Appendix 0.A Appendix

Here we provide examples of (token) corrections made by the fusion models by categorizing the edits into one of the following categories.

  1. Gender

  2. Color

  3. Specificity

  4. Syntactic

  5. Semantic

This classification is provided only for the purpose of preliminary illustration. For a thorough understanding of the trends in corrections and to draw conclusions, a detailed study using human evaluation must be performed on all three datasets. We leave this for future work.

Gender alteration

This section presents examples where the fusion models corrected the wrong gender of captions from the baseline model. We color the incorrect tokens in red, the correct replacements in green, and the equally valid tokens in brown.

baseline         :   a woman in a yellow shirt is holding a large stick
simple fusion :   a man in a yellow shirt is holding a large stick

baseline      :   a woman riding a wave on top of a surfboard
cold fusion :   a man riding a wave on top of a surfboard

Color correction

In this part, we show some examples where the fusion models emended color attributes in the captions of baseline model. We color the incorrect tokens in red, the correct replacements in green, and the equally valid tokens in brown.

baseline      :   a white teddy bear sitting next to a red teddy bear
simple fusion :   a white teddy bear sitting next to a brown teddy bear

baseline      :   a vase filled with pink flowers on a table
cold fusion  :   a vase filled with purple flowers on a table


Specificity

This section provides examples of emendations where the corrected captions describe the images more precisely than the baseline captions. We color incorrect tokens in red, correct replacements in green, and the equally valid tokens in brown.

baseline         :   a person sitting on a bench looking at the water
simple fusion :   a man sitting on a bench looking at the water

baseline                 :    a person in a helmet is riding a wave
hierarchical fusion :    a man wearing a harness is riding a wave

baseline      :   a group of people standing around a fruit stand
cold fusion  :   a group of people standing around a fruit market

Syntactic correction

In this section, we show examples where syntactic errors such as repetitions in the baseline captions are correctly emended by the fusion models. We color incorrect tokens in red, correct replacements in green, and the equally valid tokens in brown.

baseline         :   a white bowl filled with bananas and bananas
simple fusion :   a white bowl filled with bananas and nuts

baseline       :   a girl wearing a red and red striped shirt is walking on
                      a grassy hill
cold fusion  :   a girl in a red and black striped shirt is walking up a
                      grassy hill

Semantic correction

This section presents examples where the fusion models have corrected a few tokens in the baseline captions so as to make them semantically valid. Edits to achieve semantic correctness may include emendation of attributes such as colors, object size, etc.

baseline      :   a man standing next to a sheep in a field
cold fusion :   a man standing next to cows in a field

baseline         :   a man wearing a black hat and red hat stands in front
                         of a brick wall
simple fusion :   a man in a black jacket and black hat stands in front
                         of a brick wall