A Semi-supervised Framework for Image Captioning

by   Wenhu Chen, et al.
ETH Zurich

State-of-the-art approaches for image captioning require supervised training data consisting of captions with paired image data. These methods are typically unable to use unsupervised data such as textual data with no corresponding images, which is a much more abundant commodity. We propose a novel way of using such textual data by artificially generating the missing visual information. We evaluate this learning approach on a newly designed model that detects visual concepts present in an image and feeds them to a reviewer-decoder architecture with an attention mechanism. Unlike previous approaches that encode visual concepts using word embeddings, we instead suggest using regional image features, which capture more intrinsic information. The main benefit of this architecture is that it synthesizes meaningful thought vectors that capture salient image properties and then applies a soft attentive decoder to decode the thought vectors and generate image captions. We evaluate our model on both the Microsoft COCO and Flickr30K datasets and demonstrate that this model, combined with our semi-supervised learning method, can substantially improve performance and help the model generate more accurate and diverse captions.



1 Introduction

Giving a machine the ability to describe an image has been a long-standing goal in computer vision. It has proved to be an extremely challenging problem, for which interest has been renewed in recent years thanks to developments brought by deep learning techniques. Among these are two techniques originating from computer vision and natural language processing, namely convolutional and recurrent neural network architectures, especially Long Short-Term Memory networks.

Among the most popular benchmark datasets for image captioning are MS-COCO and Flickr30K [4, 27], whose recent release helped accelerate new developments in the field. In 2015, a few approaches [12, 22, 34, 33, 6] set very high standards on both datasets by reaching a BLEU-4 score of over 27% on MS-COCO and over 19% on Flickr30K. Most of the recent improvements have since built on these systems and tried to find new ways to adapt the network architecture or improve the visual representation, e.g. using complex attention mechanisms [36, 35, 11].

While all these developments are significant, the performance of existing state-of-the-art approaches is still largely hindered by the lack of image-caption groundtruth data. The acquisition of this training data is a painstaking process that requires long hours of manual labor [18] and there is thus a strong interest in developing methods that require less groundtruth data. In this paper, we depart from the standard training paradigm and instead propose a novel training method that exploits large amounts of unsupervised text data without requiring the corresponding image content.

Our model is inspired by the recent success of sequence-to-sequence models in machine translation [1, 19], and multimodal recurrent structure models [34, 12, 36]. These methods encode images into a common vector space from which they can be decoded to a target caption space. Among these methods, ours is closely related to [36], which used a fully convolutional network (FCN [28]) and multi-instance learning (MIL [24]) to detect visual concepts from the images. These concepts are then mapped to a word vector space and serve as input to an attention mechanism of an LSTM-based decoder for caption generation. We follow their idea but use image region features instead of semantic word embeddings, as they typically carry more salient information. In addition, we add an input reviewer, as suggested in [35], which performs a given number of review steps and then outputs thought vectors. We feed these thought vectors to the attention mechanism of the attentive decoder. The resulting model is depicted in Figure 1 and relies on four building blocks: (i) a convolutional layer, (ii) a response map localizer, (iii) an attentive LSTM reviewer and (iv) an attentive decoder. All these steps are detailed in Section 3.

Besides a novel architecture based on the work of [34, 35, 36], our main contribution lies in the use of out-of-domain textual data, without visual data, to pre-train our model. We propose a novel way to artificially generate the missing visual information in order to pre-train our model. Once the model is pre-trained, we feed it pairwise in-domain visual-textual training data to fine-tune the model parameters. This semi-supervised approach yields significant improvements in terms of CIDEr [32], BLEU-4 [25] and METEOR [2]. Since unpaired textual data is largely available and easy to retrieve in various forms, our approach provides a novel paradigm to boost the performance of existing image captioning systems. We also made our code and pre-training dataset available on GitHub (https://github.com/wenhuchen/ETHZ-Bootstrapped-Captioning) to support further progress based on our approach.

Figure 1: Overview of our image captioning system. First, an image is fed to a CNN architecture to rank the top visual concepts appearing in this image. The feature map localizer then traces back the regions with the strongest correlation to the detected visual concepts and extracts regional visual features from them. Finally, the input reviewer aggregates these regional features and produces thought vectors, which are then fed to an attentive decoder to generate the caption.

2 Related Work

Visual Concept Detector.

The problem of visual concept detection has been studied in the vision community for decades. Many challenges [29, 5, 18] are related to detecting objects from a given set of detectable concepts which are mainly restricted to visible entities such as ”cars” or ”pedestrians”. Other concepts such as ”sitting”, ”looking” or colors are typically ignored. These concepts are however important when describing the content of an image and ignoring them can thus severely hurt the performance of an image captioning system. This problem was partially addressed by the work of [6] who proposed a weakly supervised MIL algorithm, which is able to detect broader and more abstract concepts out of the images. A similar approach was proposed in [37] to learn weakly labeled concepts out of a set of images.

Image Description Generation.

Traditional methods for image captioning can be divided into two categories: (1) template-based methods such as [14] and [17], and (2) retrieval-based methods such as [15] and [16]. Template-based systems lack flexibility since the structure of the caption is fixed, the main task being to fill in the blanks of the predefined sentence. On the other hand, retrieval-based models heavily rely on the training data, as new sentences can only be composed out of sentences coming from the training set. A recent breakthrough in image captioning came from the renewal of deep neural networks. Since then, a common theme has been to use both convolutional neural networks and recurrent neural networks for generating image descriptions. One of the early examples of this new paradigm utilizes a deep CNN to construct an image representation, which is then fed to a bidirectional recurrent neural network. This architecture has since become a de facto standard in image captioning [23, 12, 21].

Another recent advance in the field of image captioning has been the use of attention models, initially proposed in [34] and quickly adopted by [11, 35] and others. These methods typically use spatially localized features computed from low layers in a CNN in order to represent fine-grained image context information while also relying on an attention mechanism that allows for salient features to dynamically become more dominant when needed. Another related approach is [36] that extracts visual concepts (as in [6]) and uses an attentive LSTM to generate captions based on the embeddings of the detected visual concepts. Two attention models are then used to first synthesize the visual concepts and then to generate captions.

Among all these approaches, [36, 34, 35] are the closest to ours in spirit. Our model borrows features from these existing systems. We for example make use of an input review module as suggested in [35] to encode semantic embedding into richer representation of factors. We then use a soft-attention mechanism [34] to generate attention weights for each factor, and we finally use beam search to generate a caption out of the decoder.

Leveraging External Training Data.

Most image captioning approaches are trained using paired image-caption data. Given that producing such data is an expensive process, there has been some interest in the community in training models with less data. The approach developed in [22] allows the model to enlarge its word dictionary to describe novel concepts using a few examples and without extensive retraining. However, this approach still relies on paired data. The approach closest to ours is [8], which focuses on transferring knowledge from a model trained on external unpaired data through weight sharing or weight composition. Due to the architecture of our model, we can simply “fake” visual concepts from an out-of-domain textual corpus and pre-train our model on the faked concept-caption pairwise data.

3 Model

Our captioning system is illustrated in Fig. 1 and consists of the following steps. Given an input image we first use a Convolutional Neural Network to detect salient concepts which are then fed to a reviewer to output thought vectors. These vectors along with the groundtruth captions are then used to train a soft attentive decoder similar to the one proposed in [34]. We detail each step in the following sections.

3.1 Visual Concept Detector

The first step in our approach is to detect salient visual concepts in an image. We follow the approach proposed in [6] and formulate this task as a multi-label classification problem. The set of output classes is defined by a dictionary consisting of the 1000 most frequent words in the training captions, from which the most frequent 15 stop words were discarded. The set covers around 92% of the word occurrences in the training data.
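The dictionary construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the whitespace tokenization, the function names `build_concept_vocabulary` and `coverage`, and the caller-supplied `stop_words` set are all hypothetical choices made for the example.

```python
from collections import Counter


def build_concept_vocabulary(captions, stop_words, vocab_size=1000, num_stop_words=15):
    """Keep the `vocab_size` most frequent caption words, then discard the
    `num_stop_words` most frequent ones that appear in `stop_words`.

    Assumption: captions are lowercased and split on whitespace.
    """
    counts = Counter(word for caption in captions for word in caption.lower().split())
    ranked = [word for word, _ in counts.most_common(vocab_size)]
    # `ranked` is ordered by frequency, so the first matches are the most frequent stop words.
    dropped = [word for word in ranked if word in stop_words][:num_stop_words]
    return [word for word in ranked if word not in dropped]


def coverage(captions, vocabulary):
    """Fraction of word occurrences in `captions` that belong to `vocabulary`."""
    vocab = set(vocabulary)
    words = [word for caption in captions for word in caption.lower().split()]
    return sum(word in vocab for word in words) / len(words)
```

Applied to the real training captions, `coverage` would compute the quantity that the text reports as roughly 92%.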

As pointed out in [6], a naive approach to image captioning is to encode full images as feature vectors and use a multilayer perceptron to perform classification. However, most concepts are region-specific and [6] demonstrated superior performance by applying a CNN to image regions and integrating the resulting information using a weakly-supervised variant of the Multiple Instance Learning (MIL) framework originally introduced in [24]. We use this approach as the first step in our framework: we model the probability $p_i^w$ of a visual word $w$ appearing in the image $i$ as a product of probabilities defined over a set of regions $j \in b_i$. Formally, we define this probability as

$$p_i^w = 1 - \prod_{j \in b_i} \Big(1 - \sigma\big(a_w^\top \phi(b_{ij}) + u_w\big)\Big), \qquad (1)$$

where $j$ indexes the image region from the response map $b_i$, $\phi(b_{ij})$ denotes the CNN features extracted over the region $j$, $\sigma$ is the sigmoid function, and $a_w$ and $u_w$ are the parameters of the CNN learned from data. Our concept detector architecture is taken from [6] and is trained with maximum likelihood on image-concept pairwise data extracted from MS-COCO. Note that the visual concept detector is trained only on MS-COCO and we use the same model on Flickr30K.
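The MIL aggregation used by [6] is commonly written as a noisy-OR over regions: the concept is present in the image unless it is absent from every region. The sketch below illustrates that aggregation in plain Python. The linear-plus-sigmoid region classifier and all function and argument names are assumptions made for illustration, not the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def region_probability(features, weights, bias):
    """Probability that a single region shows the concept (assumed linear classifier)."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)


def image_probability(region_features, weights, bias):
    """Noisy-OR over regions: 1 - prod_j (1 - p_j)."""
    prob_absent_everywhere = 1.0
    for features in region_features:
        prob_absent_everywhere *= 1.0 - region_probability(features, weights, bias)
    return 1.0 - prob_absent_everywhere
```

With two regions that each have a probability of 0.5, the image-level probability is 0.75, which shows how evidence from several regions accumulates.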

Figure 2: Attention of visual concepts. We select the top 12 words and visualize their attention weights in different regions of the image.

3.2 Salient Regional Feature

A standard approach (see e.g. [36]) to encode information about the image is to use the semantic word embeddings [26] corresponding to the detected visual concepts (we here consider the top $K$ concepts). The resulting word vectors are more compact than one-hot encoded vectors and capture many useful semantic concepts such as gender, location and comparative degrees.

In this work, we also experiment with an approach that uses visual features extracted from the image sub-regions that have the strongest connections with the set of detected visual concepts $W$. For each of the top $K$ concepts $w_k \in W$, we compute the image sub-region as

$$j_k = \arg\max_{j \in b_i} \; p_{ij}^{w_k},$$

where $p_{ij}^{w_k}$ is the probability assigned by the concept detector to concept $w_k$ in region $j$ of image $i$.

In summary, we extract salient features from an image in one of two ways:

$$v_k = E\,\mathbf{1}_{w_k} \ \ \text{(semantic feature)} \qquad \text{or} \qquad v_k = \phi(b_{i j_k}) \ \ \text{(visual feature)},$$

where $E$ is an embedding matrix that maps a one-hot encoded vector $\mathbf{1}_{w_k}$ to a more compact embedding space, and $\phi(b_{i j_k})$ corresponds to the CNN features from image region $j_k$. Note that $v_k$ has a different dimension in the two cases: as a semantic feature it has the dimension of the word embedding, while as a visual feature it has the dimension of the CNN feature map.

As demonstrated in Section 4, we found that using CNN regional features can be advantageous over semantic word features. Our conjecture is that visual features are often more expressive since one image region can relate to multiple word choices.
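The selection of salient regional features can be sketched as follows. This is a hedged illustration: ranking concepts by their strongest region response, the dictionary-based data layout and the names `strongest_region` and `regional_features` are assumptions for the example rather than details given in the text.

```python
def strongest_region(region_scores):
    """Index of the region with the highest score for one concept."""
    return max(range(len(region_scores)), key=region_scores.__getitem__)


def regional_features(concept_scores, region_features, top_k):
    """Map each of the `top_k` concepts to the features of its strongest region.

    concept_scores: {concept: [score for each region]}
    region_features: [feature vector for each region]
    Assumption: concepts are ranked by their highest region score.
    """
    ranked = sorted(concept_scores, key=lambda c: max(concept_scores[c]), reverse=True)
    return {c: region_features[strongest_region(concept_scores[c])] for c in ranked[:top_k]}
```

Because several concepts can point to the same region, one regional feature vector may stand in for multiple word choices, which matches the conjecture stated above.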

3.3 Input Reviewer

As depicted in Figure 2, most of the detected visual concepts tend to capture very salient image concepts. However, some concepts are duplicated and others are incorrect, such as “intersection” in the example provided in Figure 2. We address this problem by applying an input reviewer [35] to synthesize “thought vectors” that capture globally consistent concepts and eliminate mistakes introduced by the visual concept detector. Note that unlike the approach described in [35], which takes serialized CNN features for the whole image as input, we instead use the features described in Section 3.2. Since these features already have a strong semantic meaning, we did not apply the “weight tying” and “discriminative supervision” strategies suggested in [35].

Our input reviewer is composed of an attentive LSTM, which at each review step $t$ estimates an attention weight $\alpha_{t,k}$ for a given input feature $v_k$ (Section 3.2) and outputs its hidden state as the thought vector $f_t$. Formally,

$$\alpha_{t,k} = f_{att}(v_k, \hat{v}_{t-1}),$$

where $\hat{v}_{t-1}$ is the overview context vector from the last step and $f_{att}$ is an attention function whose parameters are learned from data.

In order to select which visual concepts to focus on, we could sample with regard to the attention weights $\alpha_{t,k}$, but a simpler approach described in [1] is to take a weighted sum over all inputs, i.e.

$$\hat{v}_t = \sum_{k=1}^{K} \alpha_{t,k}\, v_k.$$

As shown in Figure 3, our LSTM reviewer uses $\hat{v}_t$ and $f_{t-1}$ as inputs to produce the next thought vector $f_t$. Unlike the LSTM decoder presented in the next section, it does not rely on the input symbols $y_{t-1}$. The reviewer LSTM basically functions as a text synthesizer without any reliance on visual contexts, which explains why we can pre-train this part using only textual data (see Section 3.6).
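The weighted-sum form of soft attention can be illustrated with a few lines of plain Python. This is only a sketch: it assumes that raw attention scores are normalized with a softmax, which is the usual choice in [1] and [34] but is not spelled out in the text, and the names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to one."""
    highest = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, vectors):
    """Return the attention weights and the weighted sum of the input vectors."""
    weights = softmax(scores)
    dim = len(vectors[0])
    context = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return weights, context
```

The weighted sum is differentiable with respect to the scores, which is the practical reason for preferring it over sampling one input per step.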

Figure 3: LSTM model for the input reviewer (top) and the decoder (bottom).

3.4 Soft Attentive Decoder

Our decoder is based on [34] and is formulated as an LSTM network with an attention mechanism on the thought vectors. The decoder LSTM depicted in Figure 3 takes as input both the set of thought vectors $F = \{f_1, \dots, f_T\}$ and input symbols $y_{t-1}$ from groundtruth captions (or word predictions that approximate the word distribution). The decoder estimates an attention weight $\beta_{t,i}$ based on the past context overview vector $\hat{f}_{t-1}$, the past hidden state $h_{t-1}$ and the thought vectors $f_i$. Formally, we write it as

$$\beta_{t,i} = g_{att}(f_i, \hat{f}_{t-1}, h_{t-1}),$$

where $g_{att}$ is an attention function whose parameters are learned from data.

Similarly to the input reviewer, we use a weighted sum over all thought vectors to approximate sampling:

$$\hat{f}_t = \sum_{i=1}^{T} \beta_{t,i}\, f_i.$$

Unlike the input reviewer, whose initial state is set to zero, we introduce visual information in the decoder by initializing the LSTM memory $c_0$ and state $h_0$ with a linear transformation of CNN features, i.e.

$$c_0 = W_c\, \phi(I), \qquad h_0 = W_h\, \phi(I),$$

where $\phi(I)$ denotes the CNN features of the image $I$, and $W_c$ and $W_h$ are parameters learned from data. Note that $\phi(I)$ is different from $\phi(b_{ij})$ in Equation 1 in that $\phi(I)$ extracts full image features from the upper layer, while $\phi(b_{ij})$ extracts sub-region features from the response map.

3.5 Model Learning

The output state $h_t$ of the decoder LSTM contains all the useful information for predicting the next word $y_t$. We follow the implementation of [34] and add a two-layer perceptron with dropout on top of the decoder LSTM to predict the distribution $p(y_t)$ over all words in the vocabulary. We calculate the cross-entropy loss based on the predicted distribution and the groundtruth word $y_t$.

We train our model using maximum likelihood with a regularization term on the attention weights $\alpha$ and $\beta$ of the input reviewer and attentive decoder. Formally, we write

$$\mathcal{L}(\theta) = -\sum_{t} \log p(y_t \mid y_{1:t-1}; \theta) + \lambda \big( g(\alpha) + g(\beta) \big),$$

where $y_t$ is the groundtruth word, $\theta$ refers to all model parameters and $\lambda$ is a balancing factor between the cross-entropy loss and a penalty on the attention weights. For the penalty $g$ we use the function described in [34] to ensure that every concept and thought vector receives enough attention.
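A minimal sketch of this objective is given below. It assumes the penalty of [34] in its usual form, which pushes the attention that each item receives, summed over all steps, towards one. The function names, the argument layout and the `balance` parameter (standing for the balancing factor) are hypothetical.

```python
import math


def attention_penalty(weights):
    """Penalty sum_i (1 - sum_t weights[t][i])^2 over the attended items.

    weights[t][i] is the attention given to item i at step t.
    """
    num_items = len(weights[0])
    return sum((1.0 - sum(step[i] for step in weights)) ** 2 for i in range(num_items))


def caption_loss(word_probs, reviewer_attention, decoder_attention, balance):
    """Negative log-likelihood of the groundtruth words plus the attention penalties.

    word_probs[t] is the probability the model assigns to the groundtruth word at step t.
    """
    nll = -sum(math.log(p) for p in word_probs)
    penalty = attention_penalty(reviewer_attention) + attention_penalty(decoder_attention)
    return nll + balance * penalty
```

An item that is never attended to contributes a penalty of one, so the regularizer discourages the model from ignoring any concept or thought vector.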

3.6 Semi-supervised Learning

Figure 4: Visualization of the faked visual regional features. We here show a projection of the features obtained by t-SNE for three different regions of the feature space. Words that are semantically or morphologically similar are clearly clustered together.

Most existing approaches to image captioning rely on pairs of images and corresponding captions for learning their model parameters. Such training data is typically expensive to produce and usually requires crowd-sourcing. The MS-COCO dataset was for instance annotated using Amazon’s Mechanical Turk, a process that required a substantial number of worker hours [18]. In contrast, unpaired text and image data is abundant and cheap to obtain but cannot be used as is with current image captioning systems. We here suggest a novel approach to exploit text data without corresponding images to train our model. Since images are used as inputs to the visual concept detector to generate visual concepts, we need to “fake” such information during the unsupervised training phase. We propose a separate method for each of the two possible ways to encode visual concepts.

Fake Semantic Embeddings

In the case where the salient features described in Section 3.2 are based on semantic embeddings, we can directly fake these concepts based on the groundtruth sentences. This process is illustrated in Figure 5. We experiment with two methods named “Truth Generator” and “Noisy Generator”. The “Truth Generator” approach samples words from sentences longer than 15 words and zero-pads shorter sentences to generate 15 concepts. The “Noisy Generator” mixes words sampled from the groundtruth sentences with randomly sampled words to form 15 concepts. Further details are provided in the appendix. We also experimented with out-of-domain text data of different sizes, i.e. 600K and 1.1M captions, which roughly corresponds to the number of training captions in MS-COCO.

Fake Regional Visual Features

The case of using salient regional features is more difficult to handle since our additional training data only consists of textual data without corresponding images. We propose to address this problem by relying on the strong correlation between visual concepts and regional features. Specifically, we construct a mapping from the concept space to the regional feature space. For simplicity, we assume the regional feature corresponding to each concept $w$ follows a Gaussian distribution. Thus, we can estimate its mean value $\mu_w$ by averaging all the regional features associated with the given concept in the training data, i.e.

$$\mu_w = \frac{1}{|R_w|} \sum_{r \in R_w} r,$$

where $R_w$ is the set of regional features associated with concept $w$ in the training data. We visualized these “faked” regional features using t-SNE [20] and the results shown in Figure 4 demonstrate that the aggregated regional features capture similar properties to the ones of the semantic embeddings. Once we have established such a mapping, we can artificially encode a given text as regional features. The “faked” regional features can thus be used as inputs for the unsupervised learning phase.
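The mapping from concepts to faked regional features amounts to a per-concept average followed by a lookup. The sketch below illustrates this under stated assumptions: features are plain Python lists, words without a stored mean are skipped, and the names `concept_means` and `fake_regional_features` are hypothetical.

```python
def concept_means(pairs):
    """Average the regional features observed for each concept.

    pairs: iterable of (concept, feature_vector) collected from paired training data.
    """
    sums, counts = {}, {}
    for concept, feature in pairs:
        if concept not in sums:
            sums[concept] = [0.0] * len(feature)
            counts[concept] = 0
        sums[concept] = [s + f for s, f in zip(sums[concept], feature)]
        counts[concept] += 1
    return {c: [s / counts[c] for s in total] for c, total in sums.items()}


def fake_regional_features(concepts, means):
    """Encode text-only concepts as the mean regional feature of each concept.

    Assumption: concepts without a stored mean are skipped.
    """
    return [means[c] for c in concepts if c in means]
```

Only the mean of the assumed Gaussian is used here, so every occurrence of a concept in the text corpus maps to the same feature vector.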

Unsupervised training

Training is thus a two-step procedure. The first step is to pre-train our model on unpaired textual data, which teaches the model to produce captions based on out-of-domain language samples. Note that more than 60% of all the parameters can be pre-trained with our unsupervised learning framework; the exception is the transformation matrix used to initialize the decoder LSTM memory and state with CNN features (further details are given in the appendix). In the second step, we optimize our model on in-domain supervised data (i.e. the pairwise MS-COCO image-text dataset). As can be seen from Figure 7, after only one epoch of in-domain adaptation, the performance already reaches a promising level.

Figure 5: Faking Semantic Embeddings. We here illustrate how we train our model using out-of-domain text data. Starting from a sentence without corresponding annotations (blue boxes), we sample a given number of concepts shown in the green boxes. The red box shown in the example above is a noisy concept artificially introduced to make the model more robust.

4 Experiments

4.1 Data

We evaluate the performance of our model on the MS-COCO [18] and Flickr30K [27] datasets. The MS-COCO dataset contains 123,287 images for training and validation and 40,775 images for testing, while Flickr30K provides 31,783 images for training and testing. For MS-COCO, we use the standard split described by Karpathy (https://github.com/karpathy/neuraltalk2), for which 5,000 images were used for both validation and testing and the rest for training. For Flickr30K, we follow the split of [11], using 1K images for both validation and test and the rest for training. During the pre-training phase, we use both the 2008-2010 News-CommonCrawl and Europarl corpora (http://www.statmt.org/wmt11/translation-task.html#download) as out-of-domain training data. From the combination of these two datasets, we removed sentences shorter than 7 words or longer than 30 words. We also filter out sentences containing words that do not appear in the MS-COCO dataset. After filtering, we create two separate datasets of size 600K and 1.1M, which are then both tokenized and lowercased, and used for the pre-training phase. We train the model with a batch size of 256 and validate on an out-of-domain held-out set. Training ends when the validation score converges or the maximum number of epochs is reached. After pre-training, we use the trained parameters to initialize the in-domain training stage.
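The corpus filtering described above can be sketched as follows. This is an illustration only: whitespace tokenization, lowercasing before the vocabulary check, and the names `keep_sentence` and `filter_corpus` are assumptions, and `vocabulary` stands for the set of words seen in MS-COCO.

```python
def keep_sentence(sentence, vocabulary, min_words=7, max_words=30):
    """True if the sentence length is within bounds and every word is in `vocabulary`."""
    words = sentence.lower().split()
    return min_words <= len(words) <= max_words and all(w in vocabulary for w in words)


def filter_corpus(sentences, vocabulary):
    """Return the lowercased sentences that pass the length and vocabulary filters."""
    return [s.lower() for s in sentences if keep_sentence(s, vocabulary)]
```

The vocabulary check is the stricter of the two filters on news-style text, since a single unseen word removes the whole sentence.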

4.2 Experimental Setting

We use GloVe [26] (https://github.com/stanfordnlp/GloVe) to train 300-dimensional semantic word embeddings on Wiki+Gigaword. We use full image features extracted from the CNN architecture, as in [30, 7], to initialize the decoder LSTM. In our experiments, we set the batch size to 256, the vocabulary size to 9.6K, the reviewer LSTM hidden size to 300 and the decoder LSTM hidden layer size to 1000. We use rmsprop [31] to optimize the model parameters. Training on MS-COCO takes about a day to reach the best performance. We do model selection by evaluating the model on the validation set after every epoch, with the maximum number of training epochs set to 20. We keep the model with the best BLEU-4 score on the validation set and report only its performance on the test set. At test time, we do beam search with a beam size of 4 to decode words until the end-of-sentence symbol is reached.
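Beam search decoding can be sketched as below. This is a generic illustration rather than the authors' decoder: `next_log_probs` is a hypothetical callback standing in for the trained model, `toy_model` is an invented three-token distribution, and in this variant finished hypotheses take up beam slots.

```python
import math


def beam_search(next_log_probs, start, end, beam_size=4, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    next_log_probs(tokens) -> {token: log-probability of appending it to `tokens`}
    """
    beams = [(0.0, [start])]  # unfinished hypotheses as (log-probability, tokens)
    finished = []
    for _ in range(max_len):
        candidates = [
            (score + log_prob, tokens + [token])
            for score, tokens in beams
            for token, log_prob in next_log_probs(tokens).items()
        ]
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for hypothesis in candidates[:beam_size]:
            if hypothesis[1][-1] == end:
                finished.append(hypothesis)
            else:
                beams.append(hypothesis)
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[0])[1]


def toy_model(tokens):
    """Hypothetical next-token distribution used only to exercise the search."""
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a": {"</s>": 0.6, "a": 0.4},
        "b": {"</s>": 1.0},
    }
    return {tok: math.log(p) for tok, p in table[tokens[-1]].items()}
```

On the toy model, a beam of size 4 finds the sequence through "b" (probability 0.4), whereas a beam of size 1 commits to "a" first and ends with probability 0.36, which illustrates why a wider beam can beat greedy decoding.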

4.3 Evaluation Results

| Model | MS-COCO B-1 | MS-COCO B-2 | MS-COCO B-3 | MS-COCO B-4 | MS-COCO METEOR | MS-COCO CIDEr | Flickr30K B-1 | Flickr30K B-2 | Flickr30K B-3 | Flickr30K B-4 |
|---|---|---|---|---|---|---|---|---|---|---|
| Neuraltalk2 [12] | 62.5 | 45.0 | 32.1 | 23.0 | 19.5 | 66.0 | 57.3 | 36.9 | 24.0 | 15.7 |
| Soft Attention [34] | 70.7 | 49.2 | 34.4 | 24.3 | 23.9 | - | 66.7 | 43.4 | 28.8 | 19.1 |
| Hard Attention [34] | 71.8 | 50.4 | 35.7 | 25.0 | 23.0 | - | 66.9 | 43.9 | 29.6 | 19.9 |
| MSR [6] | - | - | - | 25.7 | 23.6 | - | - | - | - | - |
| Google NIC [33] | 66.6 | 46.1 | 32.9 | 24.6 | - | - | 66.3 | 42.3 | 27.7 | 18.3 |
| TWS [22] | 68.5 | 51.2 | 37.6 | 27.9 | 22.9 | 81.9 | - | - | - | - |
| ATT-FCN(Ens) [36] | 70.9 | 53.7 | 40.3 | 30.4 | 24.3 | - | 64.7 | 46.0 | 32.4 | 23.0 |
| ATT-FCN(Sin) [36] | 70.4 | 53.1 | 39.4 | 29.3 | 23.9 | - | 62.8 | 43.7 | 30.1 | 20.7 |
| ERD+VGG [35] | - | - | - | 29.0 | 24.0 | 89.5 | - | - | - | - |
| Ours-Baseline | 68.2 | 50.7 | 37.1 | 26.7 | 23.4 | 84.2 | 61.8 | 41.9 | 28.2 | 18.8 |
| Ours-Fc7-Sm | 68.6 | 50.7 | 37.3 | 27.7 | 23.6 | 85.5 | 61.9 | 43.0 | 29.4 | 19.6 |
| Ours-Fc7-Rev-Sm | 70.2 | 53.3 | 39.3 | 28.8 | 23.4 | 87.8 | 62.5 | 43.0 | 29.1 | 19.7 |
| Ours-Fc7-Rev-Sm-Bsl(small) | 70.1 | 53.9 | 39.9 | 29.5 | 23.8 | 90.4 | 63.5 | 44.3 | 30.5 | 20.8 |
| Ours-Pool5-Rev-Sm-Bsl(small) | 72.2 | 54.6 | 40.4 | 29.8 | 24.3 | 92.7 | 66.1 | 47.2 | 33.1 | 23.0 |
| Ours-Pool5-Rev-Sm-Bsl(large) | 72.3 | 54.7 | 40.5 | 30.0 | 24.5 | 93.4 | 66.5 | 47.3 | 33.1 | 22.7 |
| Ours-Pool5-Rev-Sm-Bsl(noisy) | 72.6 | 55.0 | 40.8 | 30.2 | 24.7 | 94.0 | 66.6 | 47.3 | 33.2 | 22.9 |
| Ours-Fc7-Rev-Rf | 70.6 | 53.6 | 39.5 | 29.0 | 23.6 | 87.4 | 61.8 | 42.9 | 29.4 | 20.0 |
| Ours-Fc7-Rev-Rf-Bsl | 71.4 | 54.6 | 40.6 | 30.1 | 24.3 | 91.3 | 64.2 | 45.5 | 31.7 | 21.9 |
| Ours-Pool5-Rev-Rf-Bsl | 72.5 | 55.1 | 41.0 | 30.6 | 24.8 | 95.0 | 66.4 | 47.3 | 33.3 | 23.0 |
| Ours-Pool5-Rev-Rf-Sm-Bsl-Ens | 72.9 | 55.8 | 41.6 | 30.9 | 24.8 | 95.8 | 66.9 | 47.8 | 33.7 | 23.3 |
| Ours-Pool5-Rev-Rf-Bsl-Ens | 73.4 | 56.5 | 42.5 | 32.0 | 25.2 | 98.2 | 67.2 | 48.2 | 34.0 | 23.8 |

Table 1: Performance in terms of BLEU-1,2,3,4, METEOR and CIDEr compared to other state-of-the-art methods on the MS-COCO and Flickr30K datasets. For the competing methods, we report the performance results cited in the corresponding references. The numbers in red denote the best known results, the numbers in blue denote the second best known results, and (-) indicates unknown scores. Note that all the scores are reported as percentages.

We use different standard evaluation metrics described in [4], including BLEU [25], a precision-based machine translation evaluation metric, METEOR [2], as well as CIDEr [32], which is a measure of human consensus. The results are shown in Table 1, where “Ours-x” indicates the performance of different variants of our model. The “Baseline” model takes visual concept embeddings as inputs to the attentive decoder without using any pre-training or visual features for initialization. The “Rev” variant adds an input reviewer in front of the attentive decoder to synthesize salient features from the images. The models with “Fc7” and “Pool5” respectively use the fc7 layer from VGG [30] and the Pool5 layer from ResNet152 [7] for the decoder initialization. The latter brings significant improvements across all metrics. Models with “Sm” use semantic embeddings as input to the reviewer, while models with “Rf” use regional features. Models with “Bsl” use our pre-training method, where “large” corresponds to using the 1.1M corpus, “small” to the 600K corpus, and “noisy” means applying the Noisy Generator. Finally, “Ens” means using an ensemble strategy to combine the results of 5 identical models “Ours-Pool5-Rev-Rf-Bsl” which were trained independently with different initial parameters.

Our model without the unsupervised learning phase (Ours-Fc7-Rev-Sm) gets similar performance to ERD+VGG [35]. When pre-training with out-of-domain data, our model outperforms its rival consistently across different metrics. We have also observed that the improvements on Flickr30K are more significant than on MS-COCO, which might partly be due to the smaller amount of training data for Flickr30K. When pre-trained with out-of-domain data and combined with the reviewer module and ResNet152 features, our ensemble model improves significantly across several metrics and achieves state-of-the-art performance on both datasets. This demonstrates that the unsupervised learning phase not only increases n-gram precision but also adapts to human consensus by generating captions that are more diverse.

Semantic Embedding vs. Regional Features

The results in Table 1 show that regional features yield higher scores for most metrics. We also report results for an ensemble combining both features (see Ours-Pool5-Rev-Rf-Sm-Bsl-Ens), which is shown to be inferior to the ensemble based on regional features alone (see Ours-Pool5-Rev-Rf-Bsl-Ens).

| Model | B@1 c5 | B@1 c40 | B@2 c5 | B@2 c40 | B@3 c5 | B@3 c40 | B@4 c5 | B@4 c40 | CIDEr c5 | CIDEr c40 | METEOR c5 | METEOR c40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ATT-LSTM-EXT (Ours) | 73.4 | 91.0 | 56.3 | 82.3 | 42.3 | 71.4 | 31.7 | 60.2 | 96.4 | 97.4 | 25.4 | 34.1 |
| ATT [36] | 73.1 | 90.0 | 56.5 | 81.5 | 42.4 | 70.9 | 31.6 | 59.9 | 94.3 | 95.8 | 25.0 | 33.5 |
| Google [33] | 71.3 | 89.5 | 54.2 | 80.2 | 40.7 | 69.4 | 30.9 | 58.7 | 94.3 | 94.6 | 25.4 | 34.6 |
| kimiyoung [35] | 72.0 | 90.0 | 55.0 | 81.2 | 41.4 | 70.5 | 31.3 | 59.7 | 96.5 | 96.9 | 25.6 | 37.7 |

Table 2: Leaderboard of the published state-of-the-art image captioning models on the online COCO testing server (http://mscoco.org/dataset/#captions-leaderboard), where B@N is short for BLEU@N, and c5 and c40 denote evaluation against 5 and 40 reference captions. All values are reported as percentages (%).

Semi-supervised Learning.

We show the evolution of the BLEU-4 score on the validation set in Figure 7. We can see that when pre-trained on unsupervised data, the model starts with a very good accuracy and keeps increasing afterwards. In the end it outperforms the non pre-trained model by a large margin. We also experimented with the ”Truth Generator” and ”Noisy Generator” variants described in Section 3.6 with varying size of the corpus. The results are shown in Table 1. We observe that adding noise improves the performance in terms of most metrics, which indicates that a model trained with additional noise is more robust, thus producing more accurate captions. Besides, we see that simply enlarging the size of the training corpus (model with ”large” in the title) does not help achieve significantly better scores, which might be due to the fact that the additional data is taken from the same source as the smaller one.

Figure 6: Qualitative analysis of the impact of the pre-training procedure as well as the use of visual regional features.
Figure 7: Analysis of the impact of the unsupervised learning phase on training time. The y axis represents the BLEU-4 score on the validation set and the x axis denotes the number of epochs.

Sample Results.

We show examples of the captions produced by our model in Figure 6. We would like to make two observations from these examples: (1) using a pre-training phase on additional out-of-domain text data yields a model that can produce a wider variety of captions and (2) the regional features capture more adequate visual concepts, which in turn yield more accurate textual descriptions.

Results on MS-COCO testing server

We also submitted our results to the MS-COCO testing server to evaluate the performance of our model on the official test set. Table 2 shows the performance Leaderboard for 5 reference captions (c5) and 40 reference captions (c40). Note that we applied the same setting as the best model reported in Table 1. Our model ranks among the top 10 in the Leaderboard.

5 Conclusion

We proposed a novel training method that exploits external text data without requiring corresponding images. This yields significant improvements in the ability of the language model to generate more accurate captions. We also introduced a new model that brings further improvements, such as using salient regional features instead of traditional semantic word embeddings. Our new model together with the suggested pre-training method achieves state-of-the-art performance. Given the wide availability of text data, our pre-training method has the potential to largely improve the generality of most existing image captioning systems, especially for domains with little paired training data.


6 Acknowledgments

We thank Prof. Juergen Rossmann and Dr. Christian Schlette from RWTH Aachen University for their support as well as the Aachen Super Computing Center for providing GPU computing.

Appendix A Appendix - Implementation details

a.1 Input Reviewer

The input reviewer uses an LSTM to generate thought vectors. Formally, a thought vector is computed as

$$f_t = \mathrm{LSTM}\big(\hat{v}_t, f_{t-1}\big),$$

where the forget, output and input gates and the cell input, hidden and output states of the LSTM are controlled by the last thought vector $f_{t-1}$ and the overview feature vector $\hat{v}_t$. The LSTM weights and biases are parameters learned from data. We use the LSTM cell output states directly as thought vectors. We set the LSTM state size to 300 in our experiments (see Section 4.2).

Semantic Features

When semantic features are used as input to the reviewer, we set the input dimension to 300, which is the same as the GloVe embedding size.

Regional Features

When regional features are used as input to the reviewer, we set the input dimension to the dimension of the convolutional feature map of the visual concept detector.

a.2 Decoder

Our decoder is also based on an LSTM architecture, but unlike the reviewer described previously, it also involves the embedding of the previous word as input. Formally, the word distribution is computed as

$$h_t = \mathrm{LSTM}\big(E\,y_{t-1}, \hat{f}_t, h_{t-1}\big), \qquad p(y_t) = \mathrm{softmax}\big(\mathrm{MLP}(h_t)\big),$$

where the forget, output and input gates and the cell input, hidden and output states of the LSTM are controlled by the past cell output state $h_{t-1}$, the overview thought vector $\hat{f}_t$ as well as the input symbol $y_{t-1}$. The LSTM weights and biases are the decoder parameters. $E$ is the word embedding matrix, which transforms the one-hot vector $y_{t-1}$ into an embedding representation $E\,y_{t-1}$. The multi-layer perceptron parameters are used to estimate a word distribution. In our experiments, the decoder hidden size is 1000 and the vocabulary size is 9.6K (see Section 4.2).

a.3 Details concerning the generation of visual concepts from pure text sentences

For a given caption we generate 10 visual concepts (out of the set of 1000 visual concepts in our dictionary). We achieve this by first going through the sentence to extract all the words belonging to the dictionary. In the case of the “truth generator”, we sample 10 words if we have more than 10 extracted concepts, or we append zeros if we have fewer. In the case of the “noisy generator”, we first sample two noisy words randomly and then follow the previous procedure to get the additional 8 visual concepts. During data processing, we filtered out the sentences containing fewer than 4 concepts to make sure the number of “truth words” is at least twice the number of added noise words.
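The two generators can be sketched as follows. This is a hedged illustration, not the released code: the `PAD` symbol stands in for the zero padding, the number of concepts is left as a parameter, noise words are drawn uniformly from the dictionary, and all names are hypothetical.

```python
import random

PAD = "<pad>"  # hypothetical symbol standing in for the zero padding


def truth_generator(sentence, dictionary, num_concepts=10, rng=random):
    """Sample `num_concepts` dictionary words from the sentence, padding if too few."""
    found = [w for w in sentence.lower().split() if w in dictionary]
    if len(found) > num_concepts:
        return rng.sample(found, num_concepts)
    return found + [PAD] * (num_concepts - len(found))


def noisy_generator(sentence, dictionary, num_concepts=10, num_noise=2, rng=random):
    """Draw `num_noise` random dictionary words, then fill the rest from the sentence."""
    noise = rng.sample(sorted(dictionary), num_noise)
    return noise + truth_generator(sentence, dictionary, num_concepts - num_noise, rng)
```

Passing a seeded `random.Random` instance as `rng` makes the sampled concepts reproducible across runs.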

a.4 Details Concerning Pre-trainable Parameters

Our unsupervised learning approach can be applied to all parameters of the decoder and reviewer except the transformation matrices, which take the fc7 features as input. As noted in Section 3.6, more than 60% of the model parameters can therefore be pre-trained.

a.5 Implementation

Our implementation uses Theano [3] and Caffe [10] and is based on the code of [6] (https://github.com/s-gupta/visual-concepts) and [34] (https://github.com/kelvinxu/arctic-captions). The code will be made available on GitHub after publication of our work. Our models were trained on a Tesla K20Xm graphics card with 6 GB of memory.

Appendix B Appendix - Visualization of Concept Attention & Captions

Figure 8: Additional examples of concept attention
Figure 9: Additional examples of concept attention
Figure 10: Additional examples of captions on the MS-COCO dataset. yellow: without pre-training, green: with pre-training, orange: groundtruth.
Figure 11: Additional examples of captions on the MS-COCO dataset. yellow: without pre-training, green: with pre-training, orange: groundtruth.
Figure 12: Additional examples of captions on the Flickr30K dataset. yellow: without pre-training, green: with pre-training, orange: groundtruth.