The research on image captioning has made impressive progress in the past few years. Most of the proposed methods learn a deep neural network model to generate captions conditioned on an input image [8, 12, 15, 22, 23, 41, 42, 43]. These models are trained in a supervised manner on manually labeled image-sentence pairs, as illustrated in Figure 1 (a). However, acquiring such paired image-sentence data is a labor-intensive process. Existing image captioning datasets, such as Microsoft COCO, are relatively small compared with image recognition datasets, such as ImageNet and OpenImages. The image and sentence varieties within these captioning datasets are limited to fewer than 100 object categories. As a result, it is difficult for captioning models trained on such paired data to generalize to images in the wild. Therefore, relieving the dependency on paired captioning datasets and making use of other available data annotations to generalize image captioning models is becoming increasingly important, and thus warrants deep investigation.
Recently, there have been several attempts at relaxing the reliance on paired image-sentence data for image captioning. As shown in Figure 1 (b), Hendricks et al. proposed to generate captions for novel objects, which are not present in the paired image-caption training data but exist in image recognition datasets, e.g., ImageNet. As such, novel object information can be introduced into the generated captions without additional paired image-sentence data. A thread of work [9, 44] proposed to transfer and generalize the knowledge learned on existing paired image-sentence datasets to a new domain where only unpaired data is available, as shown in Figure 1 (c). In this way, no paired image-sentence data is needed to train a new image captioning model in the target domain. Recently, as shown in Figure 1 (d), Gu et al. proposed to generate captions in a pivot language (Chinese) and then translate them into the target language (English), which requires no paired data of images and target-language captions. Chen et al. proposed a semi-supervised framework for image captioning, which uses an external text corpus, shown in Figure 1 (e), to pre-train their image captioning model. Although these methods have achieved improved results, a certain amount of paired image-sentence data remains indispensable for training their image captioning models.
To the best of our knowledge, no work has explored unsupervised image captioning, i.e., training an image captioning model without using any labeled image-sentence pairs. Figure 1 (f) shows this new scenario, where only one image set and one external sentence corpus are used for training. If successful, such unsupervised training can dramatically reduce the labeling effort required to create a paired image-sentence dataset. However, it is very challenging to leverage an independent image set and sentence corpus to train a reliable image captioning model.
Recently, several models relying on only monolingual corpora have been proposed for unsupervised neural machine translation [4, 28]. The key idea of these methods is to map the source and target languages into a common space by a shared encoder with cross-lingual embeddings. Compared with unsupervised machine translation, unsupervised image captioning is even more challenging, as images and sentences reside in two modalities with significantly different characteristics. A convolutional neural network (CNN) usually acts as the image encoder, while a recurrent neural network (RNN) is naturally suitable for encoding sentences. Due to their different structures and characteristics, the image and sentence encoders cannot be shared, as they are in unsupervised machine translation.
In this paper, we make the first attempt to train image captioning models without any labeled image-sentence pairs. Specifically, three key objectives are proposed. First, to ensure that the model generates plausible sentences, we train it on the sentence corpus using adversarial text generation, where the model generates a sentence conditioned on a given image feature. As illustrated in Figure 1 (f), we do not have the ground-truth caption of a training image in the unsupervised setting. Therefore, we employ adversarial training to generate sentences that are indistinguishable from the sentences within the corpus. Second, in order to ensure that the generated captions contain the visual concepts in the image, we distill the knowledge provided by a visual concept detector into the image captioning model. Specifically, a reward is given when a word corresponding to a detected visual concept appears in the generated sentence. Third, to encourage the generated captions to be semantically consistent with the image, the image and sentence are projected into a common latent space. Given a projected image feature, we can decode a caption, which can in turn be used to reconstruct the image feature. Similarly, we can encode a sentence from the corpus into the latent space and thereafter reconstruct the sentence. By performing this bi-directional reconstruction, the generated sentence is forced to closely represent the semantic meaning of the image, in turn improving the image captioning model.
Moreover, we develop an image captioning model initialization pipeline to overcome the difficulties of training from scratch. We first take the concept words in a sentence as input and train a concept-to-sentence model using the sentence corpus only. Next, we use the visual concept detector to recognize the visual concepts present in an image. Integrating these two components together, we are able to generate a pseudo caption for each training image. The pseudo image-sentence pairs are used to train a caption generation model in the standard supervised manner, which then serves as the initialization for our image captioning model.
In summary, our contributions are four-fold:
We make the first attempt to conduct unsupervised image captioning without relying on any labeled image-sentence pairs.
We propose three objectives to train the image captioning model. First, adversarial training is used to generate one sentence description for a given image without resorting to its ground-truth caption. Second, we distill the knowledge from a visual concept detector into the image captioning model. Third, we perform the alignment and bi-directional reconstructions between the image and sentence to encourage the generated sentence to be more semantically correlated with the given image.
We propose a novel model initialization pipeline exploiting unlabeled data. By leveraging the visual concept detector, we generate a pseudo caption for each image and initialize the image captioning model using the pseudo image-sentence pairs.
We crawl a large-scale image description corpus consisting of over 2 million sentences from the Web for the unsupervised image captioning task. Our experimental results demonstrate the effectiveness of our proposed model in producing quite promising image captions. We also compare the proposed method against the language pivoting approach under the same unpaired image captioning setting, achieving superior performance.
2 Related Work
2.1 Image Captioning
Supervised image captioning has been extensively studied in the past few years. Most of the proposed models use a CNN to encode an image and an RNN to generate a sentence describing the image. These models are trained to maximize the probability of generating the ground-truth caption conditioned on the input image. As paired image-sentence data is expensive to collect, some researchers have tried to leverage other available data to improve the performance of image captioning models. Anderson et al. trained an image captioning model with partial supervision, where incomplete training sequences are represented by finite state automata, from which complete sentences can be sampled for training. Chen et al. developed an adversarial training procedure to leverage unpaired data in the target domain. Although improved results have been obtained, the novel object captioning and domain adaptation methods still need paired image-sentence data for training. Gu et al. proposed to first generate captions in a pivot language and then translate the pivot-language captions into the target language. Although no pairs of images and target-language captions are used, their method depends on image-pivot pairs and a pivot-target parallel translation corpus. In contrast to the aforementioned methods, our proposed method does not need any paired image-sentence data.
2.2 Unsupervised Machine Translation
Unsupervised image captioning is similar in spirit to unsupervised machine translation, if we regard the image as the source language. In the unsupervised machine translation methods [4, 28, 29], the source language and target language are mapped into a common latent space so that the sentences of the same semantic meanings in different languages can be well aligned and the following translation can thus be performed. However, the unsupervised image captioning task is more challenging because images and sentences reside in two modalities with significantly different characteristics.
3 Unsupervised Image Captioning
Unsupervised image captioning relies on an image set $\{I_i\}_{i=1}^{N_I}$, a sentence corpus $\{S_j\}_{j=1}^{N_S}$, and an existing visual concept detector, where $N_I$ and $N_S$ are the total numbers of images and sentences, respectively. Please note that the sentences are obtained from an external corpus, which is unrelated to the images. For simplicity, we omit the subscripts and use $I$ and $S$ to represent an image and a sentence, respectively. In the following, we first describe the architecture of our image captioning model. Afterwards, we introduce how to perform training on the given data.
3.1 The Model
As shown in Figure 2, our proposed image captioning model consists of an image encoder, a sentence generator, and a sentence discriminator.
Encoder. An image CNN encodes the input image $I$ into a feature representation $f_{im}$:
$$f_{im} = \mathrm{CNN}(I).$$
Generator. An LSTM, acting as the generator, decodes the obtained image representation into a natural sentence describing the image content. At each time step, the LSTM outputs a probability distribution over all the words in the vocabulary, conditioned on the image feature and the previously generated words. The generated word is sampled from the vocabulary according to the obtained distribution:
$$x_{-1} = \mathrm{FC}(f_{im}), \qquad x_t = W_e s_t, \quad t \in \{0, \ldots, n-1\},$$
$$[p_{t+1}, h^g_{t+1}] = \mathrm{LSTM}^g(x_t, h^g_t), \quad t \in \{-1, \ldots, n-1\}, \qquad s_t \sim p_t, \quad t \in \{1, \ldots, n\},$$
where FC and $\sim$ denote a fully-connected layer and the sampling operation, respectively. $n$ is the length of the generated sentence, and $W_e$ is the word embedding matrix. $x_t$, $s_t$, $h^g_t$, and $p_t$ are the LSTM input, the one-hot vector representation of the generated word, the LSTM hidden state, and the probability distribution over the vocabulary at the $t$-th time step, respectively. $s_0$ and $s_n$ denote the start-of-sentence (SOS) and end-of-sentence (EOS) tokens, respectively, and $h^g_{-1}$ is initialized with zeros. For unsupervised image captioning, the image is not accompanied by sentences describing its content. Therefore, one key difference between our generator and the sentence generators in supervised captioning models is that the LSTM input word $s_t$ is sampled from the probability distribution $p_t$, whereas during supervised training it comes from the ground-truth caption.
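The word-by-word sampling above can be sketched as a loop in which each sampled word is fed back as the next input. This is a minimal illustration, not the actual model: `step` is a hypothetical stand-in for the projected image feature, embedding lookup, and LSTM cell combined.

```python
import random

def sample_caption(image_feature, step, vocab, eos="<eos>", max_len=20):
    """Sample a caption word by word.

    `step(inp, state)` stands in for the LSTM: it maps the current input and
    hidden state to a probability distribution over `vocab` and a new state.
    At t = -1 the (projected) image feature seeds the LSTM; afterwards the
    sampled word itself is fed back, since no ground-truth caption exists.
    """
    words, state, inp = [], None, image_feature
    for _ in range(max_len):
        probs, state = step(inp, state)
        word = random.choices(vocab, weights=probs, k=1)[0]  # s_t ~ p_t
        if word == eos:
            break
        words.append(word)
        inp = word  # feed the sampled word back as the next LSTM input
    return words
```

In supervised training the fed-back word would instead come from the ground-truth caption; sampling from the model's own distribution is what makes the unsupervised setting (and the policy-gradient training later) possible.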
Discriminator. The discriminator is also implemented as an LSTM, which tries to distinguish whether a partial sentence is a real sentence from the corpus or is generated by the model:
$$[q_t, h^d_t] = \mathrm{LSTM}^d(x_t, h^d_{t-1}),$$
where $h^d_t$ is the hidden state of the discriminator LSTM, and $q_t$ indicates the probability that the generated partial sentence $s_1, \ldots, s_t$ is regarded as real by the discriminator. Similarly, given a real sentence $\hat{S}$ from the corpus, the discriminator outputs $\hat{q}_1, \ldots, \hat{q}_l$, where $l$ is the length of $\hat{S}$ and $\hat{q}_t$ is the probability that the partial sentence consisting of the first $t$ words of $\hat{S}$ is deemed real by the discriminator.
3.2 Training
As we do not have any paired image-sentence data available, we cannot train our model in the supervised learning manner. In this paper, we define three novel objectives to make unsupervised image captioning possible.
3.2.1 Adversarial Caption Generation
The sentences generated by the image captioning model need to be plausible to human readers. Such a goal is usually ensured by training a language model on a sentence corpus. However, as discussed before, supervised learning approaches cannot be used to train the language model in our setting. Inspired by the recent success of adversarial text generation, we employ adversarial training to ensure plausible sentence generation. The generator takes an image feature as input and generates a sentence conditioned on it. The discriminator distinguishes whether a sentence is generated by the model or is a real sentence from the corpus. The generator tries to fool the discriminator by generating sentences that are as realistic as possible. To achieve this goal, we give the generator a reward at each time step, named the adversarial reward. The reward value for the $t$-th generated word is the logarithm of the probability estimated by the discriminator:
$$r^{adv}_t = \log(q_t).$$
By maximizing the adversarial reward, the generator gradually learns to generate plausible sentences. For the discriminator, the corresponding adversarial loss is defined as:
$$\mathcal{L}_{adv} = -\left[\frac{1}{l}\sum_{t=1}^{l}\log(\hat{q}_t) + \frac{1}{n}\sum_{t=1}^{n}\log\big(1-q_t\big)\right].$$
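As a rough numeric sketch of these two quantities, the per-word adversarial reward and the discriminator's loss over one real and one generated sentence can be computed as follows (function names are ours, not from the paper):

```python
import math

def adversarial_reward(q_t):
    """Reward for the t-th generated word: log of the discriminator's
    probability that the partial sentence so far looks real."""
    return math.log(q_t)

def discriminator_loss(q_real, q_fake):
    """GAN-style loss sketch: the discriminator should score real partial
    sentences as real (q near 1) and generated ones as fake (q near 0).

    `q_real` / `q_fake` hold per-prefix probabilities for a real and a
    generated sentence, averaged over their lengths l and n.
    """
    real_term = sum(math.log(q) for q in q_real) / len(q_real)
    fake_term = sum(math.log(1.0 - q) for q in q_fake) / len(q_fake)
    return -(real_term + fake_term)
```

A perfect discriminator drives both terms to zero; an undecided one (all probabilities 0.5) pays a positive loss, which is the pressure the generator exploits when maximizing its reward.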
3.2.2 Visual Concept Distillation
The adversarial reward only encourages the model to generate plausible sentences following grammar rules, which may be irrelevant to the input image. In order to generate relevant image captions, the captioning model must learn to recognize the visual concepts in the image and incorporate them into the generated sentence. Therefore, we propose to distill the knowledge of an existing visual concept detector into the image captioning model. Specifically, when the image captioning model generates a word whose corresponding visual concept is detected in the input image, we give a reward to the generated word. Such a reward is called a concept reward, with the reward value given by the confidence score of that visual concept. For an image $I$, the visual concept detector outputs a set of concepts and corresponding confidence scores $\{(c_1, v_1), \ldots, (c_{N_c}, v_{N_c})\}$, where $c_i$ is the $i$-th detected visual concept, $v_i$ is the corresponding confidence score, and $N_c$ is the total number of detected visual concepts. The concept reward assigned to the $t$-th generated word $s_t$ is given by:
$$r^{c}_t = \sum_{i=1}^{N_c} \mathbb{I}(s_t = c_i)\, v_i,$$
where $\mathbb{I}(\cdot)$ is the indicator function.
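The concept reward is a simple indicator-weighted sum and can be sketched directly (the function name is ours):

```python
def concept_reward(word, detections):
    """Reward for a generated word: the detector's confidence score when the
    word matches a detected visual concept, and zero otherwise.

    `detections` is a list of (concept, confidence) pairs produced by the
    visual concept detector for the input image.
    """
    return sum(score for concept, score in detections if word == concept)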
3.2.3 Bi-directional Image-Sentence Reconstruction
With the adversarial training and concept reward, the captioning quality would be largely determined by the visual concept detector because it is the only bridge between images and sentences. However, the existing visual concept detectors can only reliably detect a limited number of object concepts. The image captioning model should understand more semantic concepts of the image for a better generalization ability. To achieve this goal, we propose to project the images and sentences into a common latent space such that they can be used to reconstruct each other. Consequently, the generated caption would be semantically consistent with the image.
Image Reconstruction. The generator produces a sentence conditioned on an image feature, as shown in Figure 3 (a). The generated caption should contain the gist of the image; therefore, we can reconstruct the image from the generated sentence, which encourages the generated captions to be semantically consistent with the image. However, one hurdle is that generating high-resolution images containing complex objects, e.g., people, is very difficult with current techniques [6, 7, 25]. Therefore, in this paper, we reconstruct the image features instead of the full image. As shown in Figure 3 (a), the discriminator can be viewed as a sentence encoder. A fully-connected layer is stacked on the discriminator to project its last hidden state into the common latent space of images and sentences:
$$x' = \mathrm{FC}(h^d_n),$$
where $x'$ can be viewed as the image feature reconstructed from the generated sentence. Accordingly, we define an additional image reconstruction loss for training the discriminator:
$$\mathcal{L}_{im} = \| x' - x_{-1} \|_2^2.$$
It can also be observed that the generator together with the discriminator constitutes the image reconstruction process. Therefore, an image reconstruction reward for the generator, proportional to the negative reconstruction error, can be defined as:
$$r^{im} = -\| x' - x_{-1} \|_2^2.$$
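The reconstruction loss and reward are two signs of the same squared L2 distance between the projected image feature and the feature decoded from the sentence. A minimal sketch (function names are ours; features are plain vectors here):

```python
def image_reconstruction_loss(f_im, f_rec):
    """Squared L2 distance between the original projected image feature and
    the feature reconstructed from the generated sentence (discriminator's
    training signal)."""
    return sum((a - b) ** 2 for a, b in zip(f_im, f_rec))

def image_reconstruction_reward(f_im, f_rec):
    """Generator's reward: the negative reconstruction error, so sentences
    that preserve the image semantics are rewarded."""
    return -image_reconstruction_loss(f_im, f_rec)
```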
Sentence Reconstruction. Similarly, as shown in Figure 3 (b), the discriminator can encode a sentence and project it into the common latent space, yielding a representation that can be viewed as an image feature related to the given sentence. The generator can then reconstruct the sentence from this representation. Such a sentence reconstruction process can also be viewed as a sentence denoising auto-encoder. Besides aligning the images and sentences in the latent space, it also teaches the model how to decode a sentence from an image representation in the common space. To make the sentence reconstruction reliable and robust, we add noise to the input sentences. The objective of the sentence reconstruction is defined as the cross-entropy loss:
$$\mathcal{L}_{sen} = -\sum_{t=1}^{l} \log p\big(\hat{s}_t \mid \hat{s}_1, \ldots, \hat{s}_{t-1}, x'\big),$$
where $\hat{s}_t$ is the $t$-th word in sentence $\hat{S}$.
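The noise injected into input sentences can be sketched in the spirit of denoising auto-encoders as word dropout plus local shuffling. This is an illustrative assumption about the noise model, not the paper's exact procedure; parameter names are ours:

```python
import random

def add_noise(words, drop_prob=0.1, shuffle_window=3, rng=None):
    """Corrupt a sentence so the generator must learn to reconstruct the
    clean version: randomly drop words, then locally shuffle the survivors
    by sorting on index plus a bounded random jitter."""
    rng = rng or random.Random(0)
    kept = [w for w in words if rng.random() >= drop_prob]
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept), key=lambda p: p[0])]
```

With zero dropout and a zero shuffle window the function is the identity, which makes it easy to dial the corruption strength up from a clean auto-encoder.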
3.2.4 Integration
The three objectives are jointly considered to train our image captioning model. For the generator, as the word sampling operation is not differentiable, we train the generator using policy gradient, which estimates the gradients with respect to the trainable parameters given the joint reward. More specifically, the joint reward consists of the adversarial reward, the concept reward, and the image reconstruction reward. Besides the gradients estimated by policy gradient, the sentence reconstruction loss also provides gradients for the generator via back-propagation. These two types of gradients are both employed to update the generator. Let $\theta$ denote the trainable parameters in the generator. The gradient with respect to $\theta$ is given by:
$$\nabla_\theta \mathcal{L}_{g} \approx -\sum_{t=1}^{n}\Big(\sum_{\tau=t}^{n}\gamma^{\tau-t}\big(r^{adv}_\tau + \lambda_c\, r^{c}_\tau\big) + \lambda_{im}\, r^{im} - b_t\Big)\nabla_\theta \log p_t(s_t) + \lambda_{sen}\nabla_\theta \mathcal{L}_{sen},$$
where $\gamma$ is a decay factor and $b_t$ is the baseline reward estimated using self-critic. $\lambda_c$, $\lambda_{im}$, and $\lambda_{sen}$ are the hyper-parameters controlling the weights of the different terms.
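The per-step return that weights each log-probability gradient can be sketched as a discounted sum of the adversarial and concept rewards plus the sentence-level image reconstruction reward. This is a minimal sketch: the self-critic baseline subtraction is omitted, and the weight values are placeholders, not the paper's settings:

```python
def reward_to_go(adv_rewards, con_rewards, im_reward,
                 gamma=0.9, lam_c=1.0, lam_im=1.0):
    """Per-step return for policy gradient: discounted future adversarial
    and concept rewards, plus the sentence-level image reconstruction
    reward added at every step (baseline omitted for brevity)."""
    n = len(adv_rewards)
    returns = [0.0] * n
    future = 0.0
    for t in reversed(range(n)):
        future = adv_rewards[t] + lam_c * con_rewards[t] + gamma * future
        returns[t] = future + lam_im * im_reward
    return returns
```

Each return would then multiply the gradient of the corresponding word's log-probability, exactly as in standard REINFORCE-style training.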
For the discriminator, the adversarial and image reconstruction losses are combined to update its parameters via gradient descent:
$$\mathcal{L}_{d} = \mathcal{L}_{adv} + \lambda_{im}\, \mathcal{L}_{im}.$$
During the training process, the generator and discriminator are updated alternatively.
3.3 Initialization
It is challenging to adequately train our image captioning model from scratch on the given unpaired data, even with the proposed three objectives. Therefore, we propose an initialization pipeline to pre-train the generator and discriminator.
Regarding the generator, we would like to generate a pseudo caption for each training image and then use the pseudo image-caption pairs to initialize an image captioning model. Specifically, we first build a concept dictionary consisting of the object classes in the OpenImages dataset. Second, we train a concept-to-sentence (con2sen) model using the sentence corpus only: given a sentence, a one-layer LSTM encodes the concept words within the sentence into a feature representation, and another one-layer LSTM decodes this representation into the whole sentence. Third, we detect the visual concepts in each image using the existing visual concept detector. With the detected concepts and the concept-to-sentence model, we are able to generate a pseudo caption for each image. Fourth, we train the generator on the pseudo image-caption pairs using the standard supervised learning method. Such an image captioner is named feature-to-sentence (feat2sen) and used to initialize the generator.
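The pseudo-caption pipeline above composes two components: the detector and the concept-to-sentence model. A minimal sketch, where `detect` and `con2sen` are hypothetical stand-ins for the trained models:

```python
def make_pseudo_pairs(images, detect, con2sen):
    """Initialization pipeline sketch: detect concepts in each image, turn
    the concept words into a sentence with the concept-to-sentence model,
    and pair the result with the image as a pseudo caption.

    `detect(img)` returns (concept, confidence) pairs;
    `con2sen(concepts)` returns a sentence built from the concept words.
    """
    pairs = []
    for img in images:
        concepts = [c for c, score in detect(img)]
        pairs.append((img, con2sen(concepts)))
    return pairs
```

The resulting pairs can then be fed to ordinary supervised caption training, giving the generator a sensible starting point before the three unsupervised objectives take over.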
Regarding the discriminator, parameters are initialized by training an adversarial sentence generation model on the sentence corpus.
4 Experiments
In this section, we evaluate the effectiveness of our proposed method. To quantitatively evaluate our unsupervised captioning method, we use the images in the MSCOCO dataset as the image set (excluding the captions). The sentence corpus is collected by crawling image descriptions from Shutterstock (https://www.shutterstock.com). An object detection model trained on OpenImages is used as the visual concept detector. We first introduce the sentence corpus crawling and experimental settings. Next, we present the performance comparisons as well as ablation studies.
4.1 Shutterstock Image Description Corpus
We collect a sentence corpus for unsupervised image captioning research by crawling image descriptions from Shutterstock, an online stock photography website that provides hundreds of millions of royalty-free stock images. Each image is uploaded with a description written by the image composer. Some image and description samples are shown in Figure 4. We hope that the crawled image descriptions are, to some extent, related to the training images. Therefore, we directly use the names of the eighty object categories in the MSCOCO dataset as search keywords. For each keyword, we download the search results of the top one thousand pages; if fewer pages are available, we download all the results. With roughly one hundred images per page, this yields up to 100,000 descriptions per object category. After removing sentences with fewer than eight words, we collect over two million distinct image descriptions in total.
4.2 Experimental Settings
Following the commonly used split, we divide the MSCOCO dataset into 113,287 images for training, 5,000 images for validation, and the remaining 5,000 images for testing. Please note that the training images are used only to build the image set; the corresponding captions are left unused for any training. All the descriptions in the Shutterstock image description corpus are tokenized with the NLTK toolbox. We build a vocabulary by counting all the tokenized words and removing those with frequencies lower than 40, and then merge in the object category names of the object detection model. The resulting vocabulary includes the special SOS, EOS, and unknown tokens. We perform a further filtering step by removing sentences in which more than 15% of the words are unknown tokens; after filtering, over two million sentences are retained.
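The vocabulary construction and unknown-token filtering can be sketched as follows (thresholds taken from the text; the special token names are assumptions):

```python
from collections import Counter

def build_vocab(sentences, min_freq=40, specials=("<sos>", "<eos>", "<unk>")):
    """Keep words whose corpus frequency reaches min_freq; every other word
    will map to the unknown token."""
    counts = Counter(w for s in sentences for w in s)
    return set(specials) | {w for w, c in counts.items() if c >= min_freq}

def keep_sentence(words, vocab, max_unk_ratio=0.15):
    """Drop sentences in which more than 15% of the words fall outside the
    vocabulary (i.e., would become unknown tokens)."""
    unk = sum(1 for w in words if w not in vocab)
    return unk <= max_unk_ratio * len(words)
```

In the actual setup the object-detector category names would be merged into the vocabulary after the frequency cut, so detected concept words are always representable.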
The LSTM hidden dimension and the shared latent-space dimension are both fixed to 512. The weighting hyper-parameters are chosen such that the different rewards are roughly on the same scale. We train our model using the Adam optimizer with a learning rate of 0.0001. During the initialization process, we minimize the cross-entropy loss using Adam with a learning rate of 0.001. When generating captions in the test phase, we use beam search with a beam size of 3.
4.3 Experimental Results and Analysis
The top region of Table 1 shows the unsupervised image captioning results on the test split of the MSCOCO dataset. The captioning model obtained with the proposed unsupervised training method achieves promising results, with a CIDEr score of 23.3%. Moreover, we also report the results of training our model from scratch (“Ours w/o init”) to verify the effectiveness of our proposed initialization pipeline. Without initialization, the CIDEr score drops to 20.9%, which shows that the initialization pipeline benefits model training and thus boosts image captioning performance.
Ablation Studies. The results of the ablation studies are shown in the bottom region of Table 1. It can be observed that “con2sen” and “feat2sen” generate reasonable results, with CIDEr scores of 17.6% and 19.5%, respectively. As such, “con2sen” can be used to generate pseudo image-caption pairs for training “feat2sen”, and “feat2sen” provides a meaningful initialization for the generator of our captioning model.
When only the adversarial objective is used to train the captioning model, “adv” alone leads to much worse results. One cause is the linguistic characteristics of the image descriptions crawled from Shutterstock, which differ significantly from those of the COCO captions. Another cause is that the adversarial objective only enforces genuine-looking sentence generation but does not ensure semantic correlation with the image content. Because of this linguistic difference, most metrics also drop even after introducing the concept objective in “adv + con” and further incorporating the image reconstruction objective in “adv + con + im”. Although the sentences generated by these two baselines may look plausible, the evaluation results with respect to the COCO captions are not satisfactory. However, by considering all the objectives together, our proposed method substantially improves the captioning performance.
Qualitative Results. Figure 5 shows some qualitative results of unsupervised image captioning. In the top-left image, the object detector fails to detect “laptop”, so the “con2sen” model says nothing about the laptop. In contrast, the other models successfully recognize the laptop and incorporate the concept into the generated caption. In the top-right image, only a small region of the cat is visible, yet our full captioning model recognizes that it is “a black and white cat”. The object detector cannot provide any information about color attributes; we are pleased to see that the bi-directional reconstruction objective guides the captioning model to recognize and express such visual attributes in the generated description. In the bottom two images, “vehicle” and “hat” are detected erroneously, which severely affects the results of “con2sen”. In contrast, after training with the proposed objectives, the captioning model is able to correct such errors and generate plausible captions. More qualitative results can be found in the supplemental materials.
Effect of Concept Reward. Figure 6 shows the average number of correct concept words per generated sentence during training. The number for “adv” drops quickly in the beginning, because the adversarial objective is not related to the visual concepts in the image. “Ours w/o init” continuously increases from zero to about 0.6, showing that the concept reward consistently improves the captioning model's ability to recognize visual concepts. For “adv + con”, “adv + con + im”, and “Ours”, the number is about 0.8. One reason is that the initialization pipeline provides a good starting point; another possible reason is that the concept reward prevents the captioning model from drifting towards degradation.
4.4 Performance Comparisons under the Unpaired Captioning Setting
The performance of the unsupervised captioning models may seem unsatisfactory in terms of the evaluation metrics on the COCO test split. This is mainly due to the different linguistic characteristics of the COCO captions and the crawled image descriptions. To further demonstrate the effectiveness of the proposed three objectives, we compare with the language pivoting approach under the same unpaired captioning setting, where the COCO captions of the training images are used, but in an unpaired manner. Specifically, we replace the crawled sentence corpus with the COCO captions of the training images. All other settings are kept the same as in the unsupervised captioning setting. A new vocabulary is created by counting all the words in the training captions and removing those with frequencies below 4.
The results of unpaired image captioning are shown in Table 2. It can be observed that the captioning model is consistently improved on the unpaired data by including the three proposed objectives step by step. Due to exposure bias, some of the captions generated by “feat2sen” are poor sentences. The adversarial objective encourages these generated sentences to appear genuine, resulting in improved performance. With only adversarial training, however, the model tends to generate sentences unrelated to the image; this issue is mitigated by the concept reward, and thus “adv + con” leads to an even better performance. Including only the image reconstruction objective, “adv + con + im” provides a minor improvement. However, adding the sentence reconstruction objective, our full captioning model achieves another significant improvement, with the CIDEr score increasing from 49% to 54.9%. The reason is that the bi-directional image and sentence reconstruction further leverages the unpaired data to encourage the generated captions to be semantically consistent with the images. The proposed method obtains significantly better results than the language pivoting approach, possibly because the information in the COCO captions is more adequately exploited by our method.
5 Conclusions
In this paper, we proposed a novel method to train an image captioning model in an unsupervised manner, without using any paired image-sentence data. As far as we know, this is the first attempt to investigate this problem. To achieve this goal, we proposed three training objectives, which encourage that 1) the generated captions are indistinguishable from sentences in the corpus, 2) the image captioning model conveys the object information in the image, and 3) the image and sentence features are aligned in a common latent space, allowing bi-directional reconstruction from each other. A large-scale image description corpus consisting of over 2 million sentences was further collected from Shutterstock to facilitate unsupervised image captioning. The experimental results demonstrate that the proposed method yields quite promising captions without leveraging any labeled image-sentence pairs.
References
-  P. Anderson, B. Fernando, M. Johnson, and S. Gould. Spice: Semantic propositional image caption evaluation. In ECCV, pages 382–398, 2016.
-  P. Anderson, S. Gould, and M. Johnson. Partially-supervised image captioning. arXiv preprint arXiv:1806.06004, 2018.
-  L. Anne Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell, J. Mao, J. Huang, A. Toshev, O. Camburu, et al. Deep compositional captioning: Describing novel object categories without paired training data. In CVPR, 2016.
-  M. Artetxe, G. Labaka, E. Agirre, and K. Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017.
-  S. Bird, E. Klein, and E. Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
-  A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
-  L. Chen, H. Zhang, J. Xiao, W. Liu, and S.-F. Chang. Zero-shot visual recognition using semantics-preserving adversarial embedding network. In CVPR, volume 2, 2018.
-  L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, pages 6298–6306, 2017.
-  T.-H. Chen, Y.-H. Liao, C.-Y. Chuang, W.-T. Hsu, J. Fu, and M. Sun. Show, adapt and tell: Adversarial training of cross-domain image captioner. In ICCV, volume 2, 2017.
-  W. Chen, A. Lucchi, and T. Hofmann. A semi-supervised framework for image captioning. arXiv preprint arXiv:1611.05321, 2016.
-  X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
-  X. Chen, L. Ma, W. Jiang, J. Yao, and W. Liu. Regularizing rnns for caption generation by reconstructing the past with the present. In CVPR, 2018.
-  M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In ACL workshop, pages 376–380, 2014.
-  W. Fedus, I. Goodfellow, and A. M. Dai. Maskgan: Better text generation via filling in the _. arXiv preprint arXiv:1801.07736, 2018.
-  Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. In CVPR, volume 2, 2017.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
-  J. Gu, S. Joty, J. Cai, and G. Wang. Unpaired image captioning by language pivoting. In ECCV, 2018.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
-  G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
-  W. Jiang, L. Ma, X. Chen, H. Zhang, and W. Liu. Learning to guide decoding for image captioning. arXiv preprint arXiv:1804.00887, 2018.
-  W. Jiang, L. Ma, Y.-G. Jiang, W. Liu, and T. Zhang. Recurrent fusion network for image captioning. In ECCV, pages 510–526, 2018.
-  A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, pages 3128–3137, 2015.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017.
-  G. Lample, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.
-  G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755, 2018.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
-  K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318, 2002.
-  M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
-  S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. In CVPR, volume 1, page 3, 2017.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015.
-  R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057–1063, 2000.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
-  K. Tran, X. He, L. Zhang, J. Sun, C. Carapcea, C. Thrasher, C. Buehler, and C. Sienkiewicz. Rich image captioning in the wild. In CVPR workshop, pages 49–56, 2016.
-  R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, pages 4566–4575, 2015.
-  P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103, 2008.
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164, 2015.
-  T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. Boosting image captioning with attributes. In ICCV, pages 22–29, 2017.
-  Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, pages 4651–4659, 2016.
-  W. Zhao, W. Xu, M. Yang, J. Ye, Z. Zhao, Y. Feng, and Y. Qiao. Dual learning for cross-domain image captioning. In CIKM, pages 29–38, 2017.
1 Word Clouds
2 Lengths of Generated Captions
Fig. 3 shows the length distribution of the generated captions. Most of the generated captions contain around eight words.
3 More Qualitative Results
More qualitative results are illustrated in Fig. 4. The caption generated by “con2sen” depends only on the objects detected in the input image, while the other models generate captions conditioned on the input image features. For the first image, the sentence generated by “adv” is unrelated to the image, because the adversarial objective only enforces the sentence to look genuine. After the other objectives are introduced, the generated captions become more closely related to the image. “Ours w/o init” generates “helmet”, which does not appear in the image, whereas the caption generated by “Ours” accurately describes the image content.
Fig. 5 illustrates some failure cases. In the first case, only “adv + con” recognizes that the scene is a “hotel” room; most of the other models regard it as a “bedroom”. The errors in the following two cases are inherited from the object detector: the “head” and the “bear” are erroneously detected, which in turn corrupts the sentences generated by most models.