Unsupervised Image Captioning

Deep neural networks have achieved great successes on the image captioning task. However, most of the existing models depend heavily on paired image-sentence datasets, which are very expensive to acquire. In this paper, we make the first attempt to train an image captioning model in an unsupervised manner. Instead of relying on manually labeled image-sentence pairs, our proposed model merely requires an image set, a sentence corpus, and an existing visual concept detector. The sentence corpus is used to teach the captioning model how to generate plausible sentences. Meanwhile, the knowledge in the visual concept detector is distilled into the captioning model to guide the model to recognize the visual concepts in an image. In order to further encourage the generated captions to be semantically consistent with the image, the image and caption are projected into a common latent space so that they can be used to reconstruct each other. Given that the existing sentence corpora are mainly designed for linguistic research and thus with little reference to image contents, we crawl a large-scale image description corpus of 2 million natural sentences to facilitate the unsupervised image captioning scenario. Experimental results show that our proposed model is able to produce quite promising results without using any labeled training pairs.





1 Introduction

The research on image captioning has made impressive progress in the past few years. Most of the proposed methods learn a deep neural network model to generate captions conditioned on an input image [8, 12, 15, 22, 23, 41, 42, 43]. These models are trained in a supervised learning manner based on manually labeled image-sentence pairs, as illustrated in Figure 1 (a). However, the acquisition of such paired image-sentence data is a labor-intensive process. The scales of existing image captioning datasets, such as Microsoft COCO [11], are relatively small compared with image recognition datasets, such as ImageNet [35] and OpenImages [27]. The image and sentence varieties within these image captioning datasets are limited to fewer than 100 object categories. As a result, it is difficult for the captioning models trained on such paired image-sentence data to generalize to images in the wild [38]. Therefore, how to relieve the dependency on the paired captioning datasets and make use of other available data annotations to generalize image captioning models is becoming increasingly important, and thus warrants deep investigations.

Figure 1: Conceptual differences between the existing captioning methods: (a) supervised captioning [41], (b) novel object captioning [2, 3], (c) cross-domain captioning [9, 44], (d) pivot captioning [17], (e) semi-supervised captioning [10], and (f) our proposed unsupervised captioning.

Recently, there have been several attempts at relaxing the reliance on paired image-sentence data for image captioning training. As shown in Figure 1 (b), Hendricks et al. [3] proposed to generate captions for novel objects, which are not present in the paired image-caption training data but exist in image recognition datasets, e.g., ImageNet. As such, novel object information can be introduced into the generated captioning sentence without additional paired image-sentence data. A thread of work [9, 44] proposed to transfer and generalize the knowledge learned in existing paired image-sentence datasets to a new domain, where only unpaired data is available, as shown in Figure 1 (c). In this way, no paired image-sentence data is needed for training a new image captioning model in the target domain. Recently, as shown in Figure 1 (d), Gu et al. [17] proposed to generate captions in a pivot language (Chinese) and then translate the pivot language captions to the target language (English), which requires no paired data of images and target language captions. Chen et al. [10] proposed a semi-supervised framework for image captioning, shown in Figure 1 (e), which uses an external text corpus to pre-train their image captioning model. Although these methods have achieved improved results, a certain amount of paired image-sentence data is indispensable for training the image captioning models.

To the best of our knowledge, no work has explored unsupervised image captioning, i.e., training an image captioning model without using any labeled image-sentence pairs. Figure 1 (f) shows this new scenario, where only one image set and one external sentence corpus are used in an unsupervised training setting, which, if successful, can dramatically reduce the labeling work required to create a paired image-sentence dataset. However, it is very challenging to figure out how we can leverage the independent image set and sentence corpus to train a reliable image captioning model.

Recently, several models, relying on only monolingual corpora, have been proposed for unsupervised neural machine translation [4, 28]. The key idea of these methods is to map the source and target languages into a common space by a shared encoder with cross-lingual embeddings. Compared with unsupervised machine translation, unsupervised image captioning is even more challenging. The images and sentences reside in two modalities with significantly different characteristics. A convolutional neural network (CNN) usually acts as the image encoder, while a recurrent neural network (RNN) [20] is naturally suited to encoding sentences. Because of their different structures and characteristics, the image and sentence encoders cannot be shared in the way the source and target language encoders are shared in unsupervised machine translation.

In this paper, we make the first attempt to train image captioning models without any labeled image-sentence pairs. Specifically, three key objectives are proposed. First, we train a language model on the sentence corpus using the adversarial text generation method [14], which generates a sentence conditioned on a given image feature. As illustrated in Figure 1 (f), we do not have the ground-truth caption of a training image in the unsupervised setting. Therefore, we employ adversarial training [16] to generate sentences such that they are indistinguishable from the sentences within the corpus. Second, in order to ensure that the generated captions contain the visual concepts in the image, we distill [19] the knowledge provided by a visual concept detector into the image captioning model. Specifically, a reward will be given when a word, which corresponds to the detected visual concepts in the image, appears in the generated sentence. Third, to encourage the generated captions to be semantically consistent with the image, the image and sentence are projected into a common latent space. Given a projected image feature, we can decode a caption, which can further be used to reconstruct the image feature. Similarly, we can encode a sentence from the corpus to the latent space feature and thereafter reconstruct the sentence. By performing bi-directional reconstruction, the generated sentence is forced to closely represent the semantic meaning of the image, in turn improving the image captioning model.

Moreover, we develop an image captioning model initialization pipeline to overcome the difficulties of training from scratch. We first take the concept words in a sentence as input and train a concept-to-sentence model using the sentence corpus only. Next, we use the visual concept detector to recognize the visual concepts present in an image. Integrating these two components together, we are able to generate a pseudo caption for each training image. The pseudo image-sentence pairs are used to train a caption generation model in the standard supervised manner, which then serves as the initialization for our image captioning model.

Figure 2: The architecture of our unsupervised image captioning model, consisting of an image encoder, a sentence generator, and a discriminator. A CNN encodes a given image into a feature representation, based on which the generator outputs a sentence to describe the image. The discriminator is used to distinguish whether a caption is generated by the model or from the sentence corpus. Moreover, the generator and discriminator are coupled in a different order to perform image and sentence reconstructions. The adversarial reward, concept reward, and image reconstruction reward are jointly introduced to train the generator via policy gradient. Meanwhile, the generator is also updated by gradient descent to minimize the sentence reconstruction loss. For the discriminator, its parameters are updated by the adversarial loss and image reconstruction loss via gradient descent.

In summary, our contributions are four-fold:

  • We make the first attempt to conduct unsupervised image captioning without relying on any labeled image-sentence pairs.

  • We propose three objectives to train the image captioning model. First, adversarial training is used to generate one sentence description for a given image without resorting to its ground-truth caption. Second, we distill the knowledge from a visual concept detector into the image captioning model. Third, we perform the alignment and bi-directional reconstructions between the image and sentence to encourage the generated sentence to be more semantically correlated with the given image.

  • We propose a novel model initialization pipeline exploiting unlabeled data. By leveraging the visual concept detector, we generate a pseudo caption for each image and initialize the image captioning model using the pseudo image-sentence pairs.

  • We crawl a large-scale image description corpus consisting of over 2 million sentences from the Web for the unsupervised image captioning task. Our experimental results demonstrate the effectiveness of our proposed model in producing quite promising image captions. We also compare the proposed method against [17] under the same unpaired image captioning setting, achieving superior performance.

2 Related Work

2.1 Image Captioning

Supervised image captioning has been extensively studied in the past few years. Most of the proposed models use a CNN to encode an image and an RNN to generate a sentence describing the image [41]. These models are trained to maximize the probability of generating the ground-truth caption conditioned on the input image. As paired image-sentence data is expensive to collect, some researchers have tried to leverage other available data to improve the performance of image captioning models. Anderson et al. [2] trained an image captioning model with partial supervision. Incomplete training sequences are represented by finite state automata, which can be used to sample complete sentences for training. Chen et al. [9] developed an adversarial training procedure to leverage unpaired data in the target domain. Although improved results have been obtained, the novel object captioning or domain adaptation methods still need paired image-sentence data for training. Gu et al. [17] proposed to first generate captions in a pivot language and then translate the pivot language caption to the target language. Although no image and target language caption pairs are used, their method depends on image-pivot pairs and a pivot-target parallel translation corpus. In contrast to the aforementioned methods, our proposed method does not need any paired image-sentence data.

2.2 Unsupervised Machine Translation

Unsupervised image captioning is similar in spirit to unsupervised machine translation, if we regard the image as the source language. In the unsupervised machine translation methods [4, 28, 29], the source language and target language are mapped into a common latent space so that the sentences of the same semantic meanings in different languages can be well aligned and the following translation can thus be performed. However, the unsupervised image captioning task is more challenging because images and sentences reside in two modalities with significantly different characteristics.

3 Unsupervised Image Captioning

Unsupervised image captioning relies on a set of images $\mathcal{I} = \{I_1, \ldots, I_{N_i}\}$, a set of sentences $\mathcal{S} = \{\hat{S}_1, \ldots, \hat{S}_{N_s}\}$, and an existing visual concept detector, where $N_i$ and $N_s$ are the total numbers of images and sentences, respectively. Please note that the sentences are obtained from an external corpus, which is not related to the images. For simplicity, we will omit the subscripts and use $I$ and $\hat{S}$ to represent an image and a sentence, respectively. In the following, we first describe the architecture of our image captioning model. Afterwards, we will introduce how to perform the training based on the given data.

3.1 The Model

As shown in Figure 2, our proposed image captioning model consists of an image encoder, a sentence generator, and a sentence discriminator.

Encoder. An image CNN encodes the input image $I$ into a feature representation $f_{im}$:

$$f_{im} = \mathrm{CNN}(I).$$

Common image encoders, such as Inception-ResNet-v2 [37] and ResNet-50 [18], can be used here. In this paper, we simply choose Inception-V4 [37] as the encoder.

Generator. A long short-term memory (LSTM) network [20], acting as the generator, decodes the obtained image representation into a natural sentence to describe the image content. At each time-step, the LSTM outputs a probability distribution over all the words in the vocabulary conditioned on the image feature and previously generated words. The generated word is sampled from the vocabulary according to the obtained probability distribution:

$$
\begin{aligned}
x_{-1} &= \mathrm{FC}(f_{im}), \\
x_t &= W_e s_t, & t &\in \{0, \ldots, n-1\}, \\
[p_{t+1}, h^g_{t+1}] &= \mathrm{LSTM}^g(x_t, h^g_t), & t &\in \{-1, \ldots, n-1\}, \\
s_t &\sim p_t, & t &\in \{1, \ldots, n\},
\end{aligned}
$$

where $f_{im}$ is the image feature produced by the encoder, and FC and $\sim$ denote the fully-connected layer and sampling operation, respectively. $n$ is the length of the generated sentence, with $W_e$ denoting the word embedding matrix. $x_t$, $s_t$, $h^g_t$, and $p_t$ are the LSTM input, a one-hot vector representation of the generated word, the LSTM hidden state, and the probability over the dictionary at the $t$-th time step, respectively. $s_0$ and $s_n$ denote the start-of-sentence (SOS) and end-of-sentence (EOS) tokens, respectively. $h^g_{-1}$ is initialized with zero. For unsupervised image captioning, the image is not accompanied by sentences describing its content. Therefore, one key difference between our generator and the sentence generator in [41] is that $s_t$ is sampled from the probability distribution $p_t$, while the LSTM input word is from the ground-truth caption during training in [41].
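The sampling behaviour described above can be illustrated with a small sketch. It is not the authors' implementation: `step_fn` is a hypothetical stand-in for one LSTM step followed by a softmax, `toy_step` is a made-up deterministic model used only to exercise the loop, and the token strings are assumptions. The point it shows is that each sampled word, rather than a ground-truth word, becomes the next input.

```python
import random


def sample_caption(step_fn, image_feature, sos, eos, max_len=20, rng=None):
    """Roll out a caption by sampling every word from the model's own output.

    step_fn(state, prev_word) -> (probs, new_state), where probs maps each
    word to its probability. step_fn is a hypothetical stand-in for one
    LSTM step plus softmax.
    """
    rng = rng or random.Random(0)
    # Simplification: the image feature is used directly as the initial state,
    # standing in for feeding the projected image feature as the first input.
    state = image_feature
    prev, words = sos, []
    for _ in range(max_len):
        probs, state = step_fn(state, prev)
        vocab = list(probs)
        # The next input is the sampled word, not a word from a reference caption.
        prev = rng.choices(vocab, weights=[probs[w] for w in vocab], k=1)[0]
        if prev == eos:
            break
        words.append(prev)
    return words


def toy_step(state, prev_word):
    """Made-up deterministic model: always continues along a fixed script."""
    script = {"<sos>": "a", "a": "cat", "cat": "<eos>"}
    return {script[prev_word]: 1.0}, state
```

Calling `sample_caption(toy_step, None, "<sos>", "<eos>")` walks the fixed script and stops at the end token; with a real model the distribution at each step would depend on the image feature and the words sampled so far.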

Discriminator. The discriminator is also implemented as an LSTM, which tries to distinguish whether a partial sentence is a real sentence from the corpus or is generated by the model:

$$[q_t, h^d_t] = \mathrm{LSTM}^d(x_t, h^d_{t-1}), \quad t \in \{1, \ldots, n\},$$

where $x_t$ is the embedding of the $t$-th word and $h^d_t$ is the hidden state of the LSTM. $q_t$ indicates the probability that the generated partial sentence $S_t = [s_1, \ldots, s_t]$ is regarded as real by the discriminator. Similarly, given a real sentence $\hat{S}$ from the corpus, the discriminator outputs $\hat{q}_t$, $t \in \{1, \ldots, l\}$, where $l$ is the length of $\hat{S}$. $\hat{q}_t$ is the probability that the partial sentence with the first $t$ words in $\hat{S}$ is deemed as real by the discriminator.

3.2 Training

As we do not have any paired image-sentence data available, we cannot train our model in the supervised learning manner. In this paper, we define three novel objectives to make unsupervised image captioning possible.

Figure 3: The architectures for image reconstruction (a) and sentence reconstruction (b), respectively, with the generator and discriminator coupled in a different order.

3.2.1 Adversarial Caption Generation

The sentences generated by the image captioning model need to be plausible to human readers. Such a goal is usually ensured by training a language model on a sentence corpus. However, as discussed before, the supervised learning approaches cannot be used to train the language model in our setting. Inspired by the recent success of the adversarial text generation method [14], we employ adversarial training [16] to ensure plausible sentence generation. The generator takes an image feature as input and generates one sentence conditioned on the image feature. The discriminator distinguishes whether a sentence is generated by the model or is a real sentence from the corpus. The generator tries to fool the discriminator by generating sentences as real as possible. To achieve this goal, we give the generator a reward at each time-step and name this reward the adversarial reward. The reward value for the $t$-th generated word is the logarithm of the probability estimated by the discriminator:

$$r^{adv}_t = \log(q_t),$$

where $q_t$ is the probability, estimated by the discriminator, that the first $t$ generated words form a real sentence. By maximizing the adversarial reward, the generator gradually learns to generate plausible sentences. For the discriminator, the corresponding adversarial loss is defined as:

$$L_{adv} = -\left[\frac{1}{l}\sum_{t=1}^{l}\log(\hat{q}_t) + \frac{1}{n}\sum_{t=1}^{n}\log(1 - q_t)\right],$$

where $\hat{q}_t$ is the corresponding probability for the first $t$ words of a real corpus sentence of length $l$, and $n$ is the length of the generated sentence.
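As a rough numerical illustration of the two quantities above, the sketch below computes the per-word adversarial reward and a discriminator loss from lists of probabilities. It assumes the discriminator loss is a binary cross-entropy averaged over the time steps of one real and one generated sentence; the function names and the list-of-floats inputs are hypothetical and not taken from the paper.

```python
import math


def adversarial_rewards(q_generated):
    """Reward for each generated word: log of the discriminator's 'real' probability."""
    return [math.log(q_t) for q_t in q_generated]


def adversarial_loss(q_real, q_generated):
    """Discriminator loss: push q towards 1 on corpus prefixes and towards 0 on generated ones.

    q_real and q_generated hold one probability per partial sentence (per time step).
    """
    real_term = sum(math.log(q) for q in q_real) / len(q_real)
    fake_term = sum(math.log(1.0 - q) for q in q_generated) / len(q_generated)
    return -(real_term + fake_term)
```

A generated prefix that the discriminator scores close to 1 receives a reward near zero, while one scored close to 0 receives a large negative reward, which is what pushes the generator towards corpus-like sentences.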


3.2.2 Visual Concept Distillation

The adversarial reward only encourages the model to generate plausible sentences following grammar rules, which may be irrelevant to the input image. In order to generate relevant image captions, the captioning model must learn to recognize the visual concepts in the image and incorporate such concepts into the generated sentence. Therefore, we propose to distill the knowledge from an existing visual concept detector into the image captioning model. Specifically, when the image captioning model generates a word whose corresponding visual concept is detected in the input image, we give a reward to the generated word. Such a reward is called a concept reward, with the reward value indicated by the confidence score of that visual concept. For an image $I$, the visual concept detector outputs a set of concepts and corresponding confidence scores: $\mathcal{C} = \{(c_1, v_1), \ldots, (c_i, v_i), \ldots, (c_{N_c}, v_{N_c})\}$, where $c_i$ is the $i$-th detected visual concept, $v_i$ is the corresponding confidence score, and $N_c$ is the total number of detected visual concepts. The concept reward assigned to the $t$-th generated word $s_t$ is given by:

$$r^{c}_t = \sum_{i=1}^{N_c} \mathbb{I}(s_t = c_i) \cdot v_i,$$

where $\mathbb{I}(\cdot)$ is the indicator function.
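The concept reward can be made concrete with a minimal sketch. It assumes the detector output has already been reduced to a mapping from a concept word to its confidence score, with each concept appearing once; the function name and the example detections are hypothetical.

```python
def concept_rewards(generated_words, detections):
    """Give each generated word the confidence of the matching detected concept.

    detections: dict mapping a concept word to the detector's confidence score.
    Words that match no detected concept receive a reward of 0.
    """
    return [detections.get(word, 0.0) for word in generated_words]


# Hypothetical detector output for one image.
example_detections = {"dog": 0.9, "sofa": 0.6}
example_rewards = concept_rewards(["a", "dog", "on", "a", "sofa"], example_detections)
```

Only "dog" and "sofa" are rewarded here, each with its detector confidence, so words naming confidently detected concepts contribute more to the generator's update than words naming uncertain ones.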

3.2.3 Bi-directional Image-Sentence Reconstruction

With the adversarial training and concept reward, the captioning quality would be largely determined by the visual concept detector because it is the only bridge between images and sentences. However, the existing visual concept detectors can only reliably detect a limited number of object concepts. The image captioning model should understand more semantic concepts of the image for a better generalization ability. To achieve this goal, we propose to project the images and sentences into a common latent space such that they can be used to reconstruct each other. Consequently, the generated caption would be semantically consistent with the image.

Image Reconstruction. The generator produces a sentence conditioned on an image feature, as shown in Figure 3 (a). The sentence caption should contain the gist of the image. Therefore, we can reconstruct the image from the generated sentence, which can encourage the generated captions to be semantically consistent with the image. However, one hurdle for doing so is that it is very difficult to generate high-resolution images containing complex objects, e.g., people, using current techniques [6, 7, 25]. Therefore, in this paper, we turn to reconstructing the image features instead of the full image. As shown in Figure 3 (a), the discriminator can be viewed as a sentence encoder. A fully-connected layer is stacked on the discriminator to project the last hidden state $h^d_n$ to the common latent space for images and sentences:

$$x' = \mathrm{FC}(h^d_n),$$

where $x'$ can be further viewed as the reconstructed image feature from the generated sentence. Therefore, we define an additional image reconstruction loss for training the discriminator:

$$L_{im} = \lVert x_{-1} - x' \rVert_2^2,$$

where $x_{-1}$ is the projected image feature fed to the generator. It can also be observed that the generator together with the discriminator constitutes the image reconstruction process. Therefore, an image reconstruction reward for the generator, which is proportional to the negative reconstruction error, can be defined as:

$$r^{im}_t = -L_{im}.$$
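A small sketch of the reconstruction error and the resulting reward follows. It assumes the error is the squared Euclidean distance between the projected image feature and the feature recovered from the generated sentence, and it represents both as plain lists of floats; the function names are hypothetical.

```python
def image_reconstruction_loss(image_feature, reconstructed_feature):
    """Squared Euclidean distance between the two latent-space vectors."""
    return sum((a - b) ** 2 for a, b in zip(image_feature, reconstructed_feature))


def image_reconstruction_reward(image_feature, reconstructed_feature):
    """Reward for the generator: the negative reconstruction error."""
    return -image_reconstruction_loss(image_feature, reconstructed_feature)
```

A caption from which the image feature can be recovered exactly earns the maximum reward of zero; the further the recovered feature drifts from the original, the more negative the reward becomes.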


Sentence Reconstruction. Similarly, as shown in Figure 3 (b), the discriminator can encode one sentence and project it into the common latent space, which can be viewed as one image representation related to the given sentence. The generator can reconstruct the sentence based on the obtained representation. Such a sentence reconstruction process could also be viewed as a sentence denoising auto-encoder [40]. Besides aligning the images and sentences in the latent space, it also learns how to decode a sentence from an image representation in the common space. In order to make a reliable and robust sentence reconstruction, we add noise to the input sentences by following [28]. The objective of the sentence reconstruction is defined as the cross-entropy loss:

$$L_{sen} = -\sum_{t=1}^{l} \log\left(p(s_t = \hat{s}_t \mid \hat{s}_1, \ldots, \hat{s}_{t-1})\right),$$

where $\hat{s}_t$ is the $t$-th word in sentence $\hat{S}$ and $l$ is the length of $\hat{S}$.
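The cross-entropy objective can be sketched as follows, assuming the generator's per-step output has been collected into one word-to-probability mapping per position while the original sentence is fed in as input. The function name and the toy distributions are hypothetical.

```python
import math


def sentence_reconstruction_loss(step_probs, target_words):
    """Negative log-likelihood of the original sentence under the generator.

    step_probs[t] maps every vocabulary word to its predicted probability at
    step t; target_words[t] is the word of the original sentence at that step.
    """
    return -sum(math.log(probs[word]) for probs, word in zip(step_probs, target_words))


# Toy two-step example with made-up distributions.
example_probs = [{"a": 0.5, "the": 0.5}, {"cat": 0.25, "dog": 0.75}]
example_loss = sentence_reconstruction_loss(example_probs, ["a", "cat"])
```

Unlike the rewards, this loss is differentiable with respect to the generator's output probabilities, so it can be minimized directly by back-propagation.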

3.2.4 Integration

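The three objectives are jointly considered to train our image captioning model. For the generator, as the word sampling operation is not differentiable, we train the generator using policy gradient [36], which estimates the gradients with respect to trainable parameters given the joint reward. More specifically, the joint reward consists of the adversarial reward, concept reward, and image reconstruction reward. Besides the gradients estimated by policy gradient, the sentence reconstruction loss also provides gradients for the generator via back-propagation. These two types of gradients are both employed to update the generator. Let $\theta$ denote the trainable parameters in the generator. The gradient with respect to $\theta$ is given by:

$$\nabla_\theta L(\theta) = -\mathbb{E}\left[\sum_{t=1}^{n}\left(\sum_{s=t}^{n}\gamma^{s}\left(r^{adv}_s + \lambda_c r^{c}_s\right) + \lambda_{im} r^{im}_t - b_t\right)\nabla_\theta \log\left(s_t^{\top} p_t\right)\right] + \lambda_{sen}\nabla_\theta L_{sen}(\theta),$$

where $r^{adv}_s$, $r^{c}_s$, and $r^{im}_t$ are the adversarial, concept, and image reconstruction rewards, $L_{sen}$ is the sentence reconstruction loss, $s_t^{\top} p_t$ is the probability assigned to the word sampled at step $t$, $\gamma$ is a decay factor, and $b_t$ is the baseline reward estimated using self-critic [34]. $\lambda_c$, $\lambda_{im}$, and $\lambda_{sen}$ are the hyper-parameters controlling the weights of different terms.

For the discriminator, the adversarial and image reconstruction losses are combined to update the parameters via gradient descent:

$$L_D = L_{adv} + \lambda_{im} L_{im}.$$

During the training process, the generator and discriminator are updated alternately.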
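To show how the rewards might be combined into a per-word weight for the policy-gradient update, here is a minimal sketch. It assumes the weight for word t is the decayed sum of the adversarial and weighted concept rewards from step t onward, plus the weighted image reconstruction reward, minus a baseline. The function name, argument layout, and the numbers in the example are all hypothetical.

```python
def policy_gradient_weights(r_adv, r_con, r_im, baseline, gamma, lam_c, lam_im):
    """Per-word scalar that multiplies the gradient of the sampled word's log-probability.

    r_adv, r_con, baseline: one value per generated word (lists of equal length).
    r_im: a single image reconstruction reward shared by all words of the sentence.
    """
    n = len(r_adv)
    weights = []
    for t in range(1, n + 1):
        # Decayed sum of word-level rewards from step t to the end of the sentence.
        future = sum(
            gamma ** s * (r_adv[s - 1] + lam_c * r_con[s - 1])
            for s in range(t, n + 1)
        )
        weights.append(future + lam_im * r_im - baseline[t - 1])
    return weights


# Made-up values for a two-word sentence.
example_weights = policy_gradient_weights(
    r_adv=[0.0, 0.0], r_con=[1.0, 0.0], r_im=-2.0,
    baseline=[0.0, 0.0], gamma=0.5, lam_c=2.0, lam_im=0.5,
)
```

In this toy case the first word earns a concept reward that exactly offsets the shared reconstruction penalty, while the second word is left with the penalty alone. Subtracting a baseline does not change the expected gradient but typically reduces its variance.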

3.3 Initialization

It is challenging to adequately train our image captioning model from scratch with the given unpaired data, even with the proposed three objectives. Therefore, we propose an initialization pipeline to pre-train the generator and discriminator.

Regarding the generator, we would like to generate a pseudo caption for each training image, and then use the pseudo image-caption pairs to initialize an image captioning model. Specifically, we first build a concept dictionary consisting of the object classes in the OpenImages dataset [27]. Second, we train a concept-to-sentence (con2sen) model using the sentence corpus only. Given a sentence, we use a one-layer LSTM to encode the concept words within the sentence into a feature representation, and use another one-layer LSTM to decode the representation into the whole sentence. Third, we detect the visual concepts in each image using the existing visual concept detector. With the detected concepts and the concept-to-sentence model, we are able to generate a pseudo caption for each image. Fourth, we train the generator with the pseudo image-caption pairs using the standard supervised learning method as in [41]. Such an image captioner is named as feature-to-sentence (feat2sen) and used to initialize the generator.

Regarding the discriminator, parameters are initialized by training an adversarial sentence generation model on the sentence corpus.
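The data preparation for the concept-to-sentence model can be sketched as below. The sketch only builds the (concept words, sentence) training pairs from a tokenized corpus; the sequence models themselves are omitted. Skipping sentences that mention no concept is an assumption made here, and the function names and example data are hypothetical.

```python
def concept_words(sentence_tokens, concept_dictionary):
    """Keep, in their original order, the tokens that name a known concept."""
    return [word for word in sentence_tokens if word in concept_dictionary]


def con2sen_training_pairs(corpus, concept_dictionary):
    """Build (input concepts, target sentence) pairs from a tokenized corpus.

    Assumption: sentences that mention no known concept are skipped.
    """
    pairs = []
    for tokens in corpus:
        concepts = concept_words(tokens, concept_dictionary)
        if concepts:
            pairs.append((concepts, tokens))
    return pairs


# Made-up dictionary and corpus.
example_dictionary = {"dog", "sofa", "laptop"}
example_corpus = [
    ["a", "dog", "sleeping", "on", "a", "sofa"],
    ["sunset", "over", "the", "sea"],
]
example_pairs = con2sen_training_pairs(example_corpus, example_dictionary)
```

At pseudo-captioning time the same kind of concept list would come from the visual concept detector instead of from a sentence, and the trained model would expand it into a full pseudo caption.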

4 Experiments

In this section, we evaluate the effectiveness of our proposed method. To quantitatively evaluate our unsupervised captioning method, we use the images in the MSCOCO dataset [11] as the image set (excluding the captions). The sentence corpus is collected by crawling the image descriptions from Shutterstock (https://www.shutterstock.com). The object detection model [21] trained on OpenImages [27] is used as the visual concept detector. We first introduce sentence corpus crawling and experimental settings. Next, we present the performance comparisons as well as the ablation studies.

Figure 4: Two images and their accompanying descriptions from Shutterstock.

4.1 Shutterstock Image Description Corpus

We collect a sentence corpus by crawling the image descriptions from Shutterstock for the unsupervised image captioning research. Shutterstock is an online stock photography website, which provides hundreds of millions of royalty-free stock images. Each image is uploaded with a description written by the image composer. Some images and description samples are shown in Figure 4. We hope that the crawled image descriptions are, to some extent, related to the training images. Therefore, we directly use the names of the eighty object categories in the MSCOCO dataset as the search keywords. For each keyword, we download the search results of the top one thousand pages. If fewer than one thousand pages are available, we download all the results. There are roughly a hundred images per page, resulting in up to about 100,000 descriptions for each object category. After removing the sentences with fewer than eight words, we collect over 2 million distinct image descriptions in total.
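The cleaning step at the end of this paragraph can be sketched in a few lines. The sketch assumes words are separated by whitespace and that two descriptions are "the same" when they match after whitespace normalization; both choices, and the function name, are assumptions rather than details given in the text.

```python
def filter_descriptions(descriptions, min_words=8):
    """Drop descriptions shorter than min_words and remove exact duplicates.

    Assumption: words are whitespace-separated; duplicates are compared after
    collapsing runs of whitespace.
    """
    seen = set()
    kept = []
    for text in descriptions:
        normalized = " ".join(text.split())
        if len(normalized.split()) < min_words or normalized in seen:
            continue
        seen.add(normalized)
        kept.append(normalized)
    return kept


# Made-up crawl results.
example_crawl = [
    "a small dog sleeping on a red sofa at home",
    "a small dog  sleeping on a red sofa at home",
    "dog on sofa",
]
example_kept = filter_descriptions(example_crawl)
```

Because the same stock image can be returned for several keywords, deduplication matters as much as the length filter for arriving at a set of distinct descriptions.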

Method           B1    B2    B3    B4    M     R     C     S
Ours w/o init    35.3  18.2  8.6   4.4   10.5  24.8  20.9  6.1
Ours             39.4  21.1  10.0  4.8   11.4  27.2  23.3  7.0

con2sen          35.9  18.7  8.7   4.1   11.5  26.3  17.6  7.0
feat2sen         38.0  20.4  9.6   4.7   11.6  27.2  19.5  6.6
adv              34.8  16.6  6.9   3.3   9.1   24.5  12.5  3.9
adv + con        36.6  18.4  8.3   3.9   10.7  25.5  19.7  6.3
adv + con + im   35.5  17.4  8.0   3.9   10.6  25.4  19.9  6.3

Table 1: Performance comparisons of unsupervised captioning methods on the test split [24] of the MSCOCO dataset. B1 to B4, M, R, C, and S denote BLEU-1 to BLEU-4, METEOR, ROUGE, CIDEr, and SPICE, respectively.

Figure 5: The qualitative results by the unsupervised captioning models trained with different objectives. Best viewed by zooming in.

4.2 Experimental Settings

Following [24], we split the MSCOCO dataset: 113,287 images for training, 5,000 images for validation, and the remaining 5,000 images for testing. Please note that the training images are used to build the image set, with the corresponding captions left unused for any training. All the descriptions in the Shutterstock image description corpus are tokenized by the NLTK toolbox [5]. We build a vocabulary by counting all the tokenized words and removing the words with frequencies lower than 40. The object category names of the used object detection model are then merged into the vocabulary. The final vocabulary also includes the special SOS, EOS, and Unknown tokens. We perform a further filtering process by removing the sentences containing more than 15% Unknown tokens, and retain the remaining sentences.
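The vocabulary construction and the Unknown-token filter can be sketched as follows. The sketch takes already tokenized sentences as input, so the NLTK tokenization step is not reproduced, and the special token strings and function names are assumptions. The thresholds default to the values stated above.

```python
from collections import Counter


def build_vocabulary(tokenized_sentences, min_count=40, extra_words=()):
    """Keep words seen at least min_count times, then add category names and special tokens."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    vocabulary = {word for word, count in counts.items() if count >= min_count}
    vocabulary.update(extra_words)                   # e.g. detector category names
    vocabulary.update({"<sos>", "<eos>", "<unk>"})   # assumed special token strings
    return vocabulary


def keep_sentence(tokens, vocabulary, max_unknown_ratio=0.15):
    """Return True unless more than max_unknown_ratio of the tokens are out of vocabulary."""
    unknown = sum(1 for word in tokens if word not in vocabulary)
    return unknown <= max_unknown_ratio * len(tokens)


# Tiny made-up corpus with a lowered threshold so the example stays small.
example_vocab = build_vocabulary(
    [["a", "cat"], ["a", "cat"], ["rare"]], min_count=2, extra_words=["laptop"]
)
```

Merging the detector's category names after the frequency cut guarantees that every concept the detector can report has a word the generator is able to emit, even if that word is rare in the crawled corpus.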

The LSTM hidden dimension and the shared latent space dimension are both fixed to 512. The weighting hyper-parameters are chosen to make different rewards roughly at the same scale. Specifically, , , and are set to be 10, , and 1, respectively. is set to be . We train our model using the Adam optimizer [26] with a learning rate of 0.0001. During the initialization process, we minimize the cross-entropy loss using Adam with the learning rate 0.001. When generating the captions in the test phase, we use beam search with a beam size of 3.

We report the BLEU [32], METEOR [13], ROUGE [31], CIDEr [39], and SPICE [1] scores computed with the coco-caption code (https://github.com/tylin/coco-caption). The ground-truth captions of the images in the test split are used for computing the evaluation metrics.

4.3 Experimental Results and Analysis

The top region of Table 1 illustrates the unsupervised image captioning results on the test split of the MSCOCO dataset. The captioning model obtained with the proposed unsupervised training method achieves promising results, with CIDEr as 23.3%. Moreover, we also report the results of training our model from scratch (“Ours w/o init”) to verify the effectiveness of our proposed initialization pipeline. Without initialization, the CIDEr value drops to 20.9%, which shows that the initialization pipeline can benefit the model training and thus boost image captioning performances.

Ablation Studies. The results of the ablation studies are illustrated in the bottom region of Table 1. It can be observed that “con2sen” and “feat2sen” generate reasonable results with CIDEr as 17.6% and 19.5%, respectively. As such, “con2sen” can be used to generate pseudo image-caption pairs for training “feat2sen”. And “feat2sen” can make a meaningful initialization of the generator of our captioning model.

When only the adversarial objective is introduced to train the captioning model, “adv” alone leads to much worse results. One cause is that the linguistic characteristics of the crawled image descriptions from Shutterstock are significantly different from those of the COCO captions. Another cause is that the adversarial objective only enforces genuine sentence generation but does not ensure its semantic correlation with the image content. Because of the difference in linguistic characteristics, most metrics also drop even after introducing the concept objective in “adv + con” and further incorporating the image reconstruction objective in “adv + con + im”. Although the generated sentences of these two baselines may look plausible, the evaluation results with respect to the COCO captions are not satisfactory. However, by considering all the objectives, our proposed method substantially improves the captioning performances.

Qualitative Results. Figure 5 shows some qualitative results of unsupervised image captioning. In the top-left image, the object detector fails to detect “laptop”, so the “con2sen” model says nothing about the laptop. In contrast, the other models successfully recognize the laptop and incorporate this concept into the generated caption. In the top-right image, only a small region of the cat is visible. With such a small region, our full captioning model recognizes that it is “a black and white cat”. The object detector cannot provide any information about color attributes. We are pleased to see that the bi-directional reconstruction objective is able to guide the captioning model to recognize and express such visual attributes in the generated description sentence. In the bottom two images, “vehicle” and “hat” are detected in error, which severely affects the results of “con2sen”. In contrast, after training the captioning model with the proposed objectives, the captioning model is able to correct such errors and generate plausible captions. (More qualitative results can be found in the supplemental materials.)

Figure 6: The average number of correct concept words in each sentence generated during the training process.

Effect of Concept Reward. Figure 6 shows the average number of correct concept words in each sentence generated during the training process. It can be observed that the number of “adv” drops quickly in the beginning. The reason is that the adversarial objective is not related to the visual concepts in the image. “Ours w/o init” continuously increases from zero to about 0.6. The concept reward consistently improves the ability of the captioning model to recognize visual concepts. For “adv + con”, “adv + con + im”, and “Ours”, the number is about 0.8. One reason is that the initialization pipeline gives a good starting point. Another possible reason is that the concept reward prevents the captioning model from drifting towards degradation.

4.4 Performance Comparisons under the Unpaired Captioning Setting

The performance of the unsupervised captioning models may seem unsatisfactory in terms of the evaluation metrics on the COCO test split. This is mainly due to the differences in linguistic characteristics between COCO captions and the crawled image descriptions. To further demonstrate the effectiveness of the three proposed objectives, we compare with [17] under the same unpaired captioning setting, where the COCO captions of the training images are used but in an unpaired manner. Specifically, we replace the crawled sentence corpus with the COCO captions of the training images. All the other settings are kept the same as in the unsupervised captioning setting. A new vocabulary is created by counting all the words in the training captions and removing the words with frequencies less than 4.
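The vocabulary construction described above can be sketched as follows. This is a minimal illustration: the lowercasing and whitespace tokenization are assumptions, and the function name `build_vocab` is hypothetical. Only the frequency threshold of 4 comes from the text.

```python
from collections import Counter


def build_vocab(captions, min_freq=4):
    """Keep the words that occur at least `min_freq` times in the captions.

    captions: iterable of caption strings.
    Tokenization (lowercase + whitespace split) is an assumption of this
    sketch, not necessarily the preprocessing used in the paper.
    """
    counts = Counter()
    for caption in captions:
        counts.update(caption.lower().split())
    # Remove words with frequency less than min_freq.
    return sorted(word for word, n in counts.items() if n >= min_freq)


captions = [
    "a cat on a sofa",
    "a dog and a cat",
    "a cat",
    "the dog",
]
vocab = build_vocab(captions, min_freq=4)
```

Here “a” occurs five times and is kept, while “cat” (three times) and all other words fall below the threshold and are removed.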

The results of unpaired image captioning are shown in Table 2. It can be observed that the captioning model is consistently improved on the unpaired data by including the three proposed objectives step by step. Due to exposure bias [33], some of the captions generated by “feat2sen” are poor sentences. The adversarial objective encourages these generated sentences to appear genuine, resulting in improved performance. With adversarial training alone, however, the model tends to generate sentences unrelated to the image. This issue is mitigated by the concept reward, and thus “adv + con” leads to even better performance. Adding only the image reconstruction objective, “adv + con + im” provides a minor improvement. However, once the sentence reconstruction objective is also included, our full captioning model achieves another significant improvement, with the CIDEr score increasing from 49.0 to 54.9. The reason is that the bi-directional image and sentence reconstruction can further leverage the unpaired data to encourage the generated caption to be semantically consistent with the image. The proposed method obtains significantly better results than [17], which may be because the information in the COCO captions is more adequately exploited in our proposed method.

| Method | B1 | B2 | B3 | B4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|
| Pivoting [17] | 46.2 | 24.0 | 11.2 | 5.4 | 13.2 | - | 17.7 | - |
| Ours w/o init | 53.8 | 35.5 | 23.1 | 15.6 | 16.6 | 39.9 | 46.7 | 9.6 |
| Ours | 58.9 | 40.3 | 27.0 | 18.6 | 17.9 | 43.1 | 54.9 | 11.1 |
| con2sen | 50.6 | 30.8 | 18.2 | 11.3 | 15.7 | 37.9 | 33.9 | 9.1 |
| feat2sen | 51.3 | 31.3 | 18.7 | 11.8 | 15.3 | 38.1 | 35.4 | 8.8 |
| adv | 55.6 | 35.5 | 23.1 | 15.7 | 17.0 | 40.8 | 45.8 | 10.1 |
| adv + con | 56.2 | 37.2 | 24.2 | 16.2 | 17.3 | 41.5 | 48.8 | 10.5 |
| adv + con + im | 56.4 | 37.5 | 24.5 | 16.5 | 17.4 | 41.6 | 49.0 | 10.5 |

Table 2: Performance comparisons on the test split [24] of the MSCOCO dataset under the unpaired setting.

5 Conclusion

In this paper, we proposed a novel method to train an image captioning model in an unsupervised manner without using any paired image-sentence data. As far as we know, this is the first attempt to investigate this problem. To achieve this goal, we proposed three training objectives, which encourage that 1) the generated captions are indistinguishable from sentences in the corpus, 2) the image captioning model conveys the object information in the image, and 3) the image and sentence features are aligned in a common latent space and can be reconstructed from each other. A large-scale image description corpus consisting of over 2 million sentences was further collected from Shutterstock to facilitate the proposed unsupervised image captioning method. The experimental results demonstrate that the proposed method can yield quite promising results without leveraging any labeled image-sentence pairs.


1 Word Clouds

Fig. 1 and Fig. 2 illustrate the word clouds generated using the sentences in the Shutterstock Image Description Corpus and the generated captions, respectively. It can be observed that some words, e.g., “young”, appear more frequently in the generated captions than they do in the Shutterstock Image Description Corpus.

2 Lengths of Generated Captions

Fig. 3 shows the distribution of lengths of the generated captions. It can be observed that most of the generated captions consist of about eight words.
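A length distribution of this kind can be computed with a short sketch such as the one below. It assumes that caption length is measured in whitespace-separated words. The function name `length_distribution` and the example captions are hypothetical and are not taken from the experiments.

```python
from collections import Counter


def length_distribution(captions):
    """Map each caption length (in words) to the fraction of captions
    that have that length. Length is counted by whitespace splitting,
    which is an assumption of this sketch."""
    lengths = Counter(len(caption.split()) for caption in captions)
    total = sum(lengths.values())
    return {n: count / total for n, count in sorted(lengths.items())}


captions = [
    "a man riding a wave on a surfboard",
    "a black and white cat on a bed",
    "a dog on the grass",
    "a young woman holding an umbrella in rain",
]
dist = length_distribution(captions)
```

For these four illustrative captions, three have eight words and one has five, so the distribution assigns 0.75 to length 8 and 0.25 to length 5.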

Figure 3: The distribution of lengths of generated captions.

3 More Qualitative Results

More qualitative results are illustrated in Fig. 4. The caption generated by “con2sen” only depends on the detected objects in the input image, while the other models generate captions conditioned on the input image features. For the first image, the sentence generated by “adv” is unrelated to the image because the adversarial objective only enforces the sentence to be genuine. After introducing the other objectives, the generated caption is more closely related to the image. “Ours w/o init” generates “helmet”, which does not appear in the image. However, the caption generated by “Ours” accurately describes the image content.

Figure 4: More qualitative results by the unsupervised captioning models trained with different objectives.

Fig. 5 illustrates some failure cases. In the first case, only “adv + con” recognizes that it is a “hotel” room. Most of the other models regard it as a “bedroom”. The errors in the following two cases are inherited from the object detector. The “head” and “bear” are detected in error, which affects the final sentences generated by most models.

Figure 5: The failure cases by the unsupervised captioning models trained with different objectives.