Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards

by Yuqing Song et al.

Generating image descriptions in different languages is essential to serve users worldwide. However, it is prohibitively expensive to collect large-scale paired image-caption datasets for every target language, and such data are critical for training decent image captioning models. Previous works tackle the unpaired cross-lingual image captioning problem through a pivot language, with the help of paired image-caption data in the pivot language and pivot-to-target machine translation models. However, such a language-pivoted approach suffers from inaccuracy brought by the pivot-to-target translation, including disfluency and visual irrelevancy errors. In this paper, we propose to generate cross-lingual image captions with self-supervised rewards in the reinforcement learning framework to alleviate these two types of errors. We employ self-supervision from a mono-lingual corpus in the target language to provide a fluency reward, and propose a multi-level visual semantic matching model to provide both sentence-level and concept-level visual relevancy rewards. We conduct extensive experiments for unpaired cross-lingual image captioning in both English and Chinese on two widely used image caption corpora. The proposed approach achieves significant performance improvements over state-of-the-art methods.





1. Introduction

Generating natural language sentences to describe image content, a.k.a. image captioning, has received increasing attention in recent years. It can help visually impaired people better understand the real world, and make it easier to index and retrieve the massive number of images on the web. Thanks to the rapid development of computer vision and natural language generation, remarkable progress has been made in automatic image captioning. However, most previous works have focused on generating English captions for images. There are more than 6.6 billion non-native English speakers in the world, and the benefits of image captioning technology should also be brought to these users. Therefore, it is necessary to generate captions in different languages, which is also called the cross-lingual image captioning task.

Since image captioning models are generally data-hungry, the main challenge for cross-lingual image captioning is the lack of large-scale image caption datasets in the target language, and it is prohibitively expensive to collect such a dataset for each language. Fortunately, great efforts have already been made in collecting large-scale image-caption datasets in English, as well as machine translation datasets from English to other languages. Therefore, for cross-lingual image captioning, a natural way to avoid the demand for paired image-caption data in the target language is to employ another language, such as English, as the pivot to bridge the image and the target language (Gu et al., 2018): the image caption is first generated by an image-to-pivot captioning model, and then translated into the target language by a pivot-to-target machine translation (MT) model. Figure 1 illustrates the idea of utilizing English as the pivot for cross-lingual Chinese image captioning.

Figure 1. Illustration of cross-lingual Chinese image captioning with English as the pivot language. The translated Chinese caption in red suffers from disfluency errors, while the sentence in blue contains visual irrelevancy errors.


The major limitation of such a language-pivoted approach is that translation errors brought by the pivot-to-target MT model cannot be corrected, which seriously affects the quality of the generated captions, especially when the MT model is trained on a different domain from image captions. In order to alleviate this domain mismatch, Gu et al. (Gu et al., 2018) propose to share parameters between the image-to-pivot captioning model and the pivot-to-target MT model and to jointly train the two models, which forces the MT model to adapt its style towards image captions. However, this approach is hard to generalize to state-of-the-art MT models for pivot-to-target translation, which are computationally expensive to jointly train. Lan et al. (Lan et al., 2017) instead directly take advantage of a state-of-the-art translator (Baidu online translation: http://api.fanyi.baidu.com) to generate pseudo image-target caption pairs to train the image-to-target captioning model, and propose to re-weight the translated captions by their language fluency. However, in addition to disfluent sentences, the imperfect translations may also contain fluent but visually irrelevant sentences, as shown in Figure 1, which also greatly affect the accuracy of cross-lingual caption generation.

In this paper, we propose a self-supervised rewarding model (SSR) to deal with both disfluency and visual irrelevancy errors in language-pivoted unpaired image captioning. Our model is based on the reinforcement learning framework, which utilizes two types of rewards learned from self-supervision to encourage the caption generator to correct the above errors. Specifically, to improve caption fluency, we propose a fluency reward based on a target language model, which is trained with a self-supervision loss on mono-lingual sentences in the target language. To improve visual relevancy, we propose a multi-level visual semantic matching model (ML-VSE) to provide relevancy rewards, which employs self-supervised pseudo image-target caption pairs from the pivot-to-target translation model for training. The ML-VSE model performs both sentence-level and concept-level visual semantic matching between images and captions, which provides coarse- and fine-grained rewards respectively. Extensive experiments on two widely used image caption datasets show that our model significantly outperforms prior works on all caption performance metrics.

The main contributions of this work are summarized as follows:

  • We propose to employ the reinforcement learning framework to deal with errors in language-pivoted approaches for unpaired cross-lingual image captioning.

  • Introspective self-supervisions with respect to the fluency and visual relevancy of generated captions are designed as the rewards to improve the quality of cross-lingual captions.

  • Extensive experiments for both unpaired English image captioning and Chinese image captioning demonstrate that our proposed approach achieves significant improvement over previous methods on both objective caption metrics and human evaluation.

2. Related works

2.1. Image Caption Generation

Image caption generation is a challenging task which connects computer vision and natural language processing. With the rapid development of deep learning, great breakthroughs have been made in image captioning (Fang et al., 2015; Jia et al., 2015; Vinyals et al., 2015; Hitschler et al., 2016; Zhang et al., 2017; Liu et al., 2018). Vinyals et al. (Vinyals et al., 2015) first propose an end-to-end image captioning model based on the encoder-decoder framework (Cho et al., 2014). A convolutional neural network (CNN) (Kim, 2014) is used to encode the image into a fixed-dimensional feature vector, and a recurrent neural network (RNN) (Hochreiter and Schmidhuber, 1997) is used as the decoder to generate captions based on the encoder output. The whole model is jointly optimized by maximizing the log probability of the groundtruth descriptions.

Later, many improvements based on such encoder-decoder framework are proposed. Xu et al. (Xu et al., 2015) propose the spatial attention mechanism for image captioning, which divides the image into grids, and teaches the model to attend to the corresponding grid at each decoding step. Anderson et al. (Anderson et al., 2018) replace the grids with detected objects in a bottom-up attention to enhance the previous top-down attention method. You et al. (You et al., 2016) propose semantic attention which pre-defines a list of visual concepts to be attended to in the decoding step. Gu et al. (Gu et al., 2017) propose to explore both long-term and temporal information in captions with a CNN-based image captioning model. Recently, Biten et al. (Biten et al., 2019) propose to integrate contextual information into the captioning pipeline to deal with the out-of-vocabulary named entity generation.

Besides model structures, the training objective also plays an important role in image captioning. Models trained with the traditional maximum likelihood objective suffer from exposure bias and evaluation mismatch. The exposure bias is caused by the training setting called "Teacher Forcing" (Bengio et al., 2015), where the model is never exposed to its own predictions during training, which results in error accumulation at test time. The evaluation mismatch exists because cross entropy is used as the training loss, but metrics such as BLEU, CIDEr and METEOR are instead used for caption evaluation. Therefore, reinforcement learning approaches have been proposed to address these two problems. Rennie et al. (Rennie et al., 2017) propose a training method based on reinforcement learning with a baseline reward called "Self-Critical". They provide rewards for captions sampled from the model distribution, where the reward is directly evaluated by CIDEr, and they use the reward of the caption generated at test time as a baseline to stabilize training. Works in (Luo et al., 2018; Liu, 2018) propose to train the captioning model with discriminability rewards to improve the diversity of generated captions. Our training strategy is similar to Rennie et al. (Rennie et al., 2017), except that we use self-supervision with respect to fluency and relevancy as rewards for model learning.

2.2. Cross-lingual Image Captioning

Cross-lingual image captioning is a more challenging captioning task which has not been well investigated yet, since most previous works have mainly focused on generating English captions. Tsutsui et al. (Tsutsui and Crandall, 2017) propose to generate image captions in Japanese by collecting a large-scale parallel image-caption dataset in Japanese. However, this may not be feasible for many languages due to the expensive cost of dataset collection. Feng et al. (Feng et al., 2018) propose an unsupervised image captioning model with a visual concept detector trained on the Visual Genome dataset (Krishna et al., 2016). Although they do not need a paired image-caption corpus, a large-scale dataset with images and grounded object annotations is also difficult to collect for any language. Recently, cross-modal pivoted approaches have become popular for solving zero-resource learning problems. Chen et al. (Chen et al., 2019b, a) propose to utilize images as pivots for zero-resource machine translation, while Gu et al. (Gu et al., 2018) and Lan et al. (Lan et al., 2017) utilize language as the pivot for cross-lingual image captioning. Gu et al. (Gu et al., 2018) propose to train the English image captioning model on images paired with Chinese captions and English-Chinese parallel translation pairs. The model is performed in two steps through language pivoting, which has an inherent deficiency due to translation error accumulation. Lan et al. (Lan et al., 2017) instead directly take advantage of a state-of-the-art translator to generate pseudo image-target caption pairs to train the captioning model, and re-weight the translated captions by language fluency to alleviate the disfluency errors brought about by the translator. However, in addition to disfluent sentences, the translation errors also include fluent but visually irrelevant sentences, which are ignored in their work.

Figure 2. Illustration of the proposed SSR model framework, which consists of three components: a) the image captioning model trained on pseudo image-caption pairs; b) the language model to provide self-supervised fluency reward for the captioning model; c) the visual semantic matching model to provide self-supervised multi-level relevancy rewards. We add the English translation below the sampled Chinese caption in brackets for better understanding.


3. Unpaired Cross-lingual Image Captioning with Self-supervision

In this section, we describe our self-supervised rewarding (SSR) model for unpaired cross-lingual image captioning. We first present an overview of the model framework in Section 3.1, which is based on reinforcement learning with two types of rewards to address the error accumulation problem in language-pivoted approaches. Then in Section 3.2 and Section 3.3, we describe the proposed self-supervised fluency and relevancy rewards in detail.

3.1. Overview

The goal of unpaired cross-lingual image captioning is to generate a natural language sentence to describe the image content in the target language without any image-target caption pairs for training. We tackle this problem via a pivot language, with supervision from image-caption pairs in the pivot language and a pivot-to-target translation model. We refer to the pivot-to-target translation model as T, and the image caption dataset in the pivot language as D_p = {(x_i, y_i)}_{i=1}^{N}, where x_i refers to an image instance, y_i refers to its corresponding sentence description in the pivot language, and N is the total number of such image-caption pairs. Therefore, although we do not have manually annotated image-caption pairs in the target language, we can generate pseudo pairs D_t = {(x_i, z_i)}_{i=1}^{N} based on T and D_p, where z_i = T(y_i), for training.

If the translation model T were perfect, the pseudo pair (x_i, z_i) could be used as groundtruth to train the target image captioning model in a supervised way, converting unpaired cross-lingual image captioning into a standard image captioning task. In this work, we employ the vanilla image captioning model based on an encoder-decoder framework (Vinyals et al., 2015). The encoder is a deep CNN (Kim, 2014) which encodes the image into a fixed-dimensional feature vector v. The decoder is an RNN (Hochreiter and Schmidhuber, 1997) which generates the description word by word conditioned on v. The whole model is optimized by maximizing the probability of generating each "groundtruth" caption word. The generation loss function can be expressed as:

L_cap(θ) = − Σ_{j=1}^{n} log p(z_j | z_0, …, z_{j−1}, v; θ)    (1)

where z = {z_1, …, z_n} is the pseudo target caption, n is the length of z, z_0 is the sentence beginning signal <BOS>, and θ denotes the parameters of the image captioning model.

However, in reality, T is not perfect and can produce translation errors such as disfluent or visually irrelevant translations, as shown in Figure 1. Such translation errors can greatly deteriorate the image captioning performance, because the training supervision for the captioning model relies on the translated sentences. Therefore, extra supervision is needed to mitigate the negative effects from T and provide accurate guidance for the caption generator. In this paper, we utilize the reinforcement learning framework to improve caption performance by providing various rewards.

In the reinforcement learning framework, caption generation can be seen as a sequential decision process. The decoder of the captioning model acts as an agent, and the generation of each word is an action taken by the agent at each step. When the action decisions are finished, rewards are fed back to the agent to "tell" it how good these actions are. The objective of reinforcement learning is to maximize the expected reward at the end of the decision process. In order to address the disfluency and visual irrelevancy translation errors, we propose a fluency reward function r_flu in Section 3.2 and multi-level visual relevancy reward functions r_sent and r_con in Section 3.3 to "tell" the captioning model how to improve the generated captions at both coarse and fine-grained levels. Specifically, we adopt the "self-critical" (Rennie et al., 2017) reinforcement learning algorithm to train our model. First, we carry out Monte-Carlo sampling to sample a sentence s^s and evaluate its caption quality with the proposed reward functions. Then we utilize greedy search to generate a sentence s^g that provides the baseline reward for the stability of reinforcement training.
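The self-critical update described above can be sketched in a few lines. This is a minimal illustrative sketch (the function and variable names are our own, not from the paper's code): the greedy caption's reward serves as the baseline, so only sampled captions that beat the greedy decode receive a positive learning signal.

```python
def self_critical_loss(logp_sampled, reward_sampled, reward_greedy):
    """Self-critical loss for one sampled caption.

    logp_sampled:   per-token log-probabilities of the Monte-Carlo sample
    reward_sampled: reward of the sampled caption (fluency/relevancy here)
    reward_greedy:  reward of the greedy-decoded caption (the baseline)
    """
    advantage = reward_sampled - reward_greedy
    # Minimizing -advantage * log p(sample) maximizes the expected reward.
    return -advantage * sum(logp_sampled)
```

When the sampled caption scores below the baseline, the advantage is negative and the update lowers that caption's probability, which is exactly the stabilizing effect of the greedy baseline.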

Therefore, the joint optimization loss function to train the image captioning model consists of three parts:

L(θ) = L_cap(θ) + λ1·L_flu(θ) + λ2·L_rel(θ)    (2)

where L_flu and L_rel are the reinforcement learning objectives for the fluency and relevancy aspects respectively, and λ1 and λ2 are hyper-parameters chosen according to the scale of the loss values and the caption performance on the validation set. Figure 2 illustrates the overall framework of our proposed model.

3.2. Self-supervised Fluency Rewards

In order to improve the fluency of generated captions, we employ self-supervision from a mono-lingual corpus in the target language to provide the fluency reward. We pre-train a language model on the mono-lingual corpus to evaluate sentence fluency. We utilize an LSTM as our language model, which is trained to maximize the probability of generating each target sentence s = {w_1, …, w_n}. Its loss function is expressed as:

L_lm(φ) = − Σ_{j=1}^{n} log p(w_j | w_0, …, w_{j−1}; φ)    (3)

where w_0 is the sentence beginning signal <BOS>, n is the length of s, and φ denotes the parameters of the language model.

For a sampled sentence s^s = {w_1, …, w_m}, where m is the sentence length, we take the log probability of generating s^s by the language model as its fluency reward:

r_flu(s^s) = Σ_{j=1}^{m} log p(w_j | w_0, …, w_{j−1}; φ)    (4)

So the self-critical reinforcement loss function for fluency rewarding is formulated as:

L_flu(θ) = − (r_flu(s^s) − r_flu(s^g)) · log p(s^s | v; θ)    (5)

where s^g is the sentence generated by greedy search, whose reward serves as the baseline.
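As a sketch, the fluency reward above just accumulates the language model's per-token conditional log-probabilities over the sampled sentence. Here `lm_logprob` is a hypothetical stand-in interface for the pre-trained LSTM language model; all names are illustrative assumptions:

```python
import math

def fluency_reward(sentence, lm_logprob):
    """Sum of log p(w_j | w_0 .. w_{j-1}) under a pre-trained language model.

    `lm_logprob(prefix, word)` is an assumed interface returning the
    conditional log-probability of `word` given the `prefix` tuple.
    """
    reward, prefix = 0.0, ("<BOS>",)
    for word in sentence:
        reward += lm_logprob(prefix, word)
        prefix = prefix + (word,)
    return reward
```

Disfluent word orders receive low conditional probabilities from the language model and hence a smaller reward, which is what steers the caption generator towards fluent target-language output.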
3.3. Self-supervised Relevancy Rewards

Through the supervision from the fluency reward, the captioning model is "taught" to generate fluent captions in the target language. However, this cannot guarantee that the generated captions are relevant to the given image, especially when the guidance from T is wrong due to semantically inconsistent translation errors. Therefore, an extra relevancy reward is required to let the captioning model know what is relevant to the image content and what is not.

We propose to learn a visual semantic matching model to evaluate the relevancy of the generated captions to the image based on the pseudo image-target caption pairs D_t. Although such pairs are noisy and may contain disfluency and visual irrelevancy translation errors, there are also many images with similar content whose descriptions are correctly translated, which enables accurate visual semantic matching. We call the relevancy reward computed by the visual semantic matching model a "self-supervised" reward, since no annotated image-target caption pairs are used.

In order to further mitigate the noise in the translated sentences, we propose a multi-level visual semantic matching model (ML-VSE) which includes image-sentence matching at the coarse level and image-concept matching at the fine-grained level. We use the nouns and verbs in the caption sentence as concepts, since they play important roles in delivering the semantic information of the sentence. The concepts in the pseudo pairs can be more accurate than full sentences, since concepts do not suffer from disfluency errors and are easy to translate. We describe the sentence-level and concept-level relevancy rewards computed by the two visual semantic matching models in detail below.

Sentence-Level Relevancy Reward. We provide the sentence-level relevancy reward via image-sentence matching. The image x is encoded by E_v, which consists of a pre-trained CNN and a fully connected embedding layer, to generate the image embedding vector v. The caption sentence is encoded by E_s, a bi-directional GRU, to generate the caption embedding vector c. In order to project v and c into a common embedding space, we utilize the contrastive ranking loss with hard negative mining (Faghri et al., 2018) for training:

L_sent = max_{c'} [Δ + f(v, c') − f(v, c)]_+ + max_{v'} [Δ + f(v', c) − f(v, c)]_+    (6)

where Δ serves as a margin hyper-parameter, [x]_+ = max(x, 0), (v, c) is a pseudo image-caption pair, c' is a negative caption given image v, and v' is a negative image given caption c in the mini-batch. f(·, ·) denotes the similarity function between two embedded vectors, which is the cosine similarity in our experiments.

After training, the image-sentence matching model is able to give captions that are relevant to the image higher similarity scores than irrelevant ones. Therefore, our sentence-level visual relevancy reward for the generated caption s^s of image x is:

r_sent(x, s^s) = f(E_v(x), E_s(s^s))    (7)
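The hard-negative ranking loss above can be sketched as follows. This is a simplified single-pair version with illustrative names; a real implementation would operate on batched embedding matrices:

```python
def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(t * t for t in x) ** 0.5
    return dot / (norm(u) * norm(v))

def hard_negative_loss(img, cap, neg_caps, neg_imgs, margin=0.2):
    """VSE++-style contrastive loss: only the hardest negative in the
    mini-batch contributes, in both retrieval directions."""
    pos = cosine(img, cap)
    hardest_cap = max(cosine(img, c) for c in neg_caps)
    hardest_img = max(cosine(v, cap) for v in neg_imgs)
    return max(0.0, margin + hardest_cap - pos) + max(0.0, margin + hardest_img - pos)
```

Taking only the hardest negative (rather than summing over all negatives) is the key design choice of hard negative mining: the gradient focuses on the most confusable caption or image in the batch.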
Concept-Level Relevancy Reward. Similarly to the image-sentence matching model, we utilize E_v to encode the image into the vector v, and encode each concept w into the semantic vector e_w with a concept embedding matrix. A similar contrastive ranking loss is adopted for the joint image-concept embedding space:

L_con = max_{w'} [Δ + f(v, e_{w'}) − f(v, e_w)]_+ + max_{v'} [Δ + f(v', e_w) − f(v, e_w)]_+    (8)

The trained image-concept matching model can then measure the relevancy between a visual concept and the image. However, the learned similarity score can be greatly influenced by the frequency statistics of concepts. Frequent concepts in the pseudo pairs are more likely to obtain high scores than infrequent ones, which biases the captioning model towards frequent concepts. Hence, we normalize the similarity score by the prior probability of the concept in the dataset, so that the concept-level visual relevancy reward is computed as:

r_con(x, w_j) = 1_vc(w_j) · f(v, e_{w_j}) / p(w_j)^β    (9)

where (x, w) is an image-concept pair extracted from the pseudo image-caption pairs, p(w_j) is the prior probability of w_j given by its occurrence frequency, 1_vc(w_j) denotes whether the word w_j is a visual concept, and β is a hyper-parameter.
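Under one reading of the frequency normalization described above, the matching score is divided by the concept prior raised to the hyper-parameter β; the exact functional form and all names below are assumptions for illustration only:

```python
def concept_reward(word, similarity, prior, visual_concepts, beta=0.5):
    """Frequency-normalized concept-level relevancy reward (illustrative).

    similarity:      f(v, e_w) from the image-concept matching model
    prior:           occurrence frequency p(w) of the concept in the pseudo pairs
    visual_concepts: the set of words treated as visual concepts
    Words that are not visual concepts receive zero reward.
    """
    if word not in visual_concepts:
        return 0.0
    return similarity / (prior ** beta)
```

The effect is that a frequent concept (large prior) has its score shrunk relative to a rare concept with the same raw similarity, counteracting the frequency bias of the matching model.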

Therefore, our multi-level self-critical loss to improve the visual relevancy of the generated captions is formulated as:

L_rel(θ) = − (r_rel(s^s) − r_rel(s^g)) · log p(s^s | v; θ),  where  r_rel(s) = r_sent(x, s) + (1/m) Σ_{j=1}^{m} r_con(x, w_j)    (10)
The overall training process of the proposed self-supervised rewarding model is presented in Algorithm 1.

1: Input: pivot image caption dataset D_p; pivot-to-target machine translation model T; target-language sentence corpus M;
2: Generate pseudo image-target caption pairs D_t based on T and D_p;
3: Pre-train the target language model on M with Eq (3);
4: Pre-train the encoders in the ML-VSE model on D_t with Eq (6) and Eq (8) respectively;
5: Initialize the captioning model parameters θ on D_t with Eq (1);
6: repeat
7:     select a mini-batch from D_t;
8:     generate s^s for each image via Monte-Carlo sampling;
9:     generate s^g for each image via greedy search;
10:     compute the fluency self-critical loss for s^s by Eq (5);
11:     compute the relevancy self-critical loss for s^s by Eq (10);
12:     update θ with Eq (2);
13: until θ converges
Algorithm 1 Training algorithm of the proposed self-supervised rewarding model for unpaired cross-lingual image captioning.

4. Experiments

We evaluate the unpaired cross-lingual image captioning models in both English and Chinese languages. For unpaired English image captioning, we utilize Chinese as pivot; while for unpaired Chinese image captioning, we utilize English as pivot.

4.1. Evaluation Setting

Datasets. We conduct experiments on the MSCOCO (Lin et al., 2014) and AIC-ICC (Wu et al., 2017) image caption datasets. The MSCOCO dataset is annotated in English and consists of 123,287 images with 5 manually labeled English captions per image. We follow the public split (Lin et al., 2014), which uses 113,287 images for training, 5,000 for validation and 5,000 for testing. The AIC-ICC (Image Chinese Captioning from AI Challenger) dataset contains 238,354 images with 5 manually annotated Chinese captions per image; there are 208,354 and 30,000 images in the official training and validation sets respectively. Since the annotations of the AIC-ICC testing set are unavailable, we randomly sample 5,000 images from its validation set as our testing set. We use "Jieba" (https://github.com/fxsjy/jieba) to tokenize Chinese captions. Words with a frequency of more than 4 are added to the vocabulary. We truncate English captions longer than 20 words and Chinese captions longer than 16 words. The statistics of the two datasets are presented in Table 1.
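The preprocessing described above (tokenized captions, a vocabulary of words occurring more than 4 times, length truncation) can be sketched as below. The function name and the decision to count frequencies after truncation are our own assumptions:

```python
from collections import Counter

def preprocess(captions, min_freq=5, max_len=16):
    """Truncate tokenized captions to `max_len` tokens and keep words whose
    frequency is at least `min_freq` (i.e. more than 4) in the vocabulary."""
    truncated = [cap[:max_len] for cap in captions]
    counts = Counter(w for cap in truncated for w in cap)
    vocab = {w for w, c in counts.items() if c >= min_freq}
    return truncated, vocab
```

For English captions `max_len` would be 20, for Chinese captions 16, matching the limits stated above.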

Dataset    Lang.    # Images    # Captions    # Vocabulary
AIC-ICC    zh       240K        1,200K        7,654
MSCOCO     en       123K        615K          10,368
Table 1. Statistics of the datasets used in our experiments.

For unpaired English image captioning, the task is to generate captions in English for images from MSCOCO dataset while no English image-caption pairs are used. In this setting, we use Chinese as the pivot language and utilize the AIC-ICC Chinese image caption dataset. For unpaired Chinese image captioning, the task is to generate captions in Chinese for images from AIC-ICC dataset while no Chinese image-caption pairs are used. In this setting, we use English as the pivot language and utilize the MSCOCO English image caption dataset.

Compared Methods. We compare our proposed model with the following four baseline models:

  • Baseline: The vanilla captioning model (Vinyals et al., 2015) trained on pseudo pairs with cross-entropy in Eq (1) without any rewards.

  • Baseline+: The vanilla captioning model trained on pseudo pairs with CIDEr as reward in the reinforcement learning framework (Rennie et al., 2017).

  • 2-Stage pivot Google model (Gu et al., 2018): It utilizes a two-stage pipeline for unpaired cross-lingual image captioning. The image-to-pivot captioning model is the vanilla caption model (Vinyals et al., 2015) and the pivot-to-target MT model is the online Google translator.

  • 2-Stage pivot joint model (Gu et al., 2018): It utilizes a two-stage pipeline, including an image-to-pivot captioning model and a pivot-to-target MT model. The two models share the same word embeddings and are jointly trained to alleviate translation errors in the image caption domain.


Evaluation Metrics. We utilize the standard caption evaluation metrics to assess the quality of caption sentences, including BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015). As an image tells a thousand words, the above objective metrics may not fully measure caption quality from every aspect, so we also carry out human evaluation to further assess captions in terms of fluency and visual relevancy.

4.2. Implementation Details

Image Captioning Model. We extract activations from the last pooling layer of ResNet-101 (He et al., 2016), pre-trained on ImageNet, as our image features. We encode the image feature into a 512-dimensional vector to initialize the hidden state of the LSTM decoder. The LSTM decoder contains 1 layer with 512 hidden units, and the dimensionality of the word embeddings is set to 512. We use the special tokens <BOS> and <EOS> to represent the beginning and end of sentences. At test time, beam-search decoding with a beam size of 10 is used to generate captions. We use the state-of-the-art Baidu translation API (http://api.fanyi.baidu.com) as our translation model T.

Language Model for Fluency Rewards. For unpaired English image captioning, we use the texts in the MSCOCO training set to train the English language model; for unpaired Chinese image captioning, we use the texts in the AIC-ICC training set to train the Chinese language model. However, the mono-lingual corpus is not restricted to these datasets; we present an ablation study in Section 4.4 comparing the performance of our SSR model with language models trained on different corpora. The language model is a one-layer LSTM with 512 hidden units. After training, the language model is fixed and used to evaluate the fluency of target captions.

ML-VSE for Relevancy Rewards. For the image-sentence matching model, we use ResNet-101 (He et al., 2016) pre-trained on ImageNet as the CNN image encoder and a one-layer bi-directional GRU with 512 hidden units as the sentence encoder. The dimensionality of the image-sentence joint space is set to 1024. For the image-concept matching model, we extract nouns and verbs as visual concepts from the translated captions via the Stanford parsing tools (http://nlp.stanford.edu:8080/parser/index.jsp). In total, there are 3,231 visual concepts for unpaired English image captioning and 9,107 visual concepts for unpaired Chinese image captioning. The dimensionality of the image-concept joint space is set to 512.
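Concept extraction as described above keeps the nouns and verbs of each parsed caption. A sketch over Penn-Treebank-style POS tags (the tag convention is an assumption; the text only states that nouns and verbs are used):

```python
def extract_concepts(tagged_caption):
    """Return nouns and verbs (Penn Treebank tags starting with NN or VB)
    from a POS-tagged caption given as (word, tag) pairs."""
    return [word for word, tag in tagged_caption if tag.startswith(("NN", "VB"))]
```

In practice the `(word, tag)` pairs would come from a parser such as the Stanford tools mentioned above, and the extracted words would be deduplicated into the concept vocabulary.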

Training Details. We pre-train the image captioning model, language model and ML-VSE model using the Adam optimizer (Kingma and Ba, 2015) with a batch size of 128. The initial learning rate is 4e-4 for the image captioning model and 2e-4 for the language and ML-VSE models. In the self-critical reinforcement training, we use a learning rate of 4e-5 and a batch size of 256. We set the hyper-parameters λ1, λ2, Δ and β to 0.05, 0.15, 1.0 and 0.5 respectively. A dropout of 0.3 is applied to all models during training to prevent over-fitting.

Task              Method                                        Bleu@1  Bleu@2  Bleu@3  Bleu@4  Meteor  CIDEr
Unpaired English  Baseline                                      42.7    21.4    10.2    5.2     13.5    14.5
Image Captioning  Baseline+                                     44.0    22.0    10.5    5.3     13.0    14.6
                  2-Stage pivot Google model (Gu et al., 2018)  42.2    21.8    10.7    5.3     14.5    17.0
                  2-Stage pivot joint model (Gu et al., 2018)   46.2    24.0    11.2    5.4     13.2    17.7
                  Our SSR                                       52.0    30.0    17.9    11.1    14.2    28.2
Unpaired Chinese  Baseline                                      41.1    23.9    13.0    7.1     21.1    11.5
Image Captioning  Baseline+                                     41.6    24.4    13.3    7.3     21.1    11.6
                  Our SSR                                       46.0    30.9    19.3    12.3    22.8    18.3
Table 2. Performance comparison with baseline methods for unpaired English image captioning evaluated on the MSCOCO dataset and unpaired Chinese image captioning evaluated on the AIC-ICC dataset.
Task              Rewards                   Bleu@1  Bleu@2  Bleu@3  Bleu@4  Meteor  CIDEr
Unpaired English  No Reward                 42.7    21.4    10.2    5.2     13.5    14.5
Image Captioning  r_flu                     45.9    23.4    11.4    5.8     13.4    16.1
                  r_flu + r_sent            50.6    28.7    17.1    10.6    13.8    26.7
                  r_flu + r_sent + r_con    52.0    30.0    17.9    11.1    14.2    28.2
Unpaired Chinese  No Reward                 41.1    23.9    13.0    7.1     21.1    11.5
Image Captioning  r_flu                     45.8    30.3    18.6    11.6    22.5    18.0
                  r_flu + r_sent            46.1    30.7    19.1    12.1    22.6    18.5
                  r_flu + r_sent + r_con    46.0    30.9    19.3    12.3    22.8    18.3
Table 3. The contribution of different rewards for unpaired cross-lingual image captioning on the MSCOCO and AIC-ICC datasets.

4.3. Comparison with the State-of-the-arts

Table 2 presents the unpaired cross-lingual image captioning performance in English and Chinese from our proposed approach and the compared baselines. The proposed self-supervised rewarding (SSR) model achieves the best performance among all methods across languages and evaluation metrics. The "Baseline" method trained with imperfect pseudo pairs is inferior to all other methods, which demonstrates that translation errors in the pseudo pairs can significantly deteriorate captioning performance even when a state-of-the-art translation model is used. In the "Baseline+" method, although the self-critical reinforcement learning algorithm is employed, the improvement over "Baseline" is marginal, since it directly uses the noisy translated captions to provide rewards. Our model instead is enhanced with the fluency reward and both coarse- and fine-grained visual relevancy rewards in the reinforcement learning framework. The comparison of our model with "Baseline+" shows that the contribution mainly comes from the proposed self-supervised rewards rather than from "self-critical" reinforcement training itself.

Our approach also outperforms the 2-stage models of Gu et al. (Gu et al., 2018). The 2-Stage pivot Google model takes advantage of a state-of-the-art translation model but ignores the translation errors in unpaired caption generation. The 2-Stage pivot joint model addresses the translation domain mismatch by joint training but cannot generalize to using a state-of-the-art translation model. Notably, our model is also more efficient than the 2-stage models in the testing phase, since we do not depend on a 2-stage pipeline to generate captions in the target language.

4.4. Ablation Studies

Contributions of different rewards. In Table 3, we ablate the contributions of the different self-supervised rewards to unpaired captioning performance. The fluency reward alone improves over the baseline method on both unpaired English and Chinese image captioning, which demonstrates that the proposed fluency reward effectively improves the quality of generated caption sentences. However, the fluency reward only promotes sentence fluency without considering visual relevancy. Combining the fluency reward with both sentence- and concept-level visual relevancy rewards achieves additional gains in both languages. We notice that the improvements from the visual relevancy rewards are larger for unpaired English image captioning than for unpaired Chinese image captioning. Since the diversity of images in the Chinese pivot-language AIC-ICC dataset is smaller than that in the MSCOCO dataset, unpaired English image captioning trained on pseudo image-caption pairs from the AIC-ICC dataset is more likely to suffer from visual irrelevancy problems. Therefore, our proposed visual relevancy rewards benefit unpaired English image captioning more.
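Combining the rewards in the ablation above can be sketched as a weighted sum; the weights here are placeholders for illustration, not the paper's tuned values:

```python
def combined_reward(r_fluency, r_sentence, r_concept,
                    w_f=1.0, w_s=1.0, w_c=1.0):
    """Weighted sum of the three self-supervised rewards: fluency,
    sentence-level visual relevancy, and concept-level visual relevancy.
    The default weights are hypothetical, not the paper's setting."""
    return w_f * r_fluency + w_s * r_sentence + w_c * r_concept
```

Setting `w_s = w_c = 0` recovers the fluency-only row of the ablation, while the full combination corresponds to the complete SSR model.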

Split          Image-to-Sentence    Sentence-to-Image
               R@1    R@10          R@1    R@10
AIC-ICC val    52.8   85.9          37.7   81.2
MSCOCO test    22.7   58.7          12.8   48.7
Table 4. Cross-modal retrieval performance of the proposed sentence-level semantic matching model trained on the self-supervised pseudo English pairs from the AIC-ICC training set. R@k denotes recall at rank k for cross-modal retrieval.
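The R@k metric reported in Table 4 can be computed as below; the similarity matrix is a toy example, not data from the paper:

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j] is the similarity between query i and candidate j;
    the ground-truth match of query i is candidate i (the diagonal).
    Returns the percentage of queries whose match ranks in the top k."""
    order = np.argsort(-sim, axis=1)  # candidates by descending similarity
    hits = [i in order[i, :k] for i in range(sim.shape[0])]
    return 100.0 * float(np.mean(hits))

sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.2, 0.8],
                [0.1, 0.7, 0.6]])
recall_at_k(sim, 1)  # only query 0 retrieves its match first
```

The same routine covers both directions in Table 4: image-to-sentence retrieval uses image rows against sentence columns, and sentence-to-image retrieval uses the transpose.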

Multi-level visual-semantic matching performance. We empirically evaluate the ML-VSE model to demonstrate the reliability of the self-supervised relevancy rewards at the sentence and concept levels, taking the ML-VSE model trained for English captioning as an example. We randomly select 1K images from the AIC-ICC validation set and the MSCOCO testing set respectively to evaluate the sentence-level semantic matching model, as shown in Table 4. There is a large performance gap between the MSCOCO testing set and the AIC-ICC validation set, which can result from noise in the pseudo pairs and the image domain mismatch; the additional fine-grained relevancy reward is therefore necessary. For the image-concept matching model, we visualize the top-10 predicted visual concepts for images in the MSCOCO testing set in Figure 3. The predicted visual concepts are highly relevant to the image content and cover diverse aspects such as objects, actions and scenes. Both results demonstrate the validity of the proposed relevancy guidance.
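A concept-level relevancy score of this kind can be sketched as a simple overlap between the caption and the predicted concepts; this is an illustrative scoring rule, not the paper's actual reward from the image-concept matching model:

```python
def concept_relevancy_reward(caption_tokens, predicted_concepts):
    """Score a caption by the fraction of its tokens that appear among
    the image's predicted visual concepts (objects, actions, scenes).
    A toy stand-in for the fine-grained concept-level reward."""
    concepts = set(predicted_concepts)
    if not caption_tokens:
        return 0.0
    return sum(t in concepts for t in caption_tokens) / len(caption_tokens)
```

Because the score depends only on the caption and the image's predicted concepts, it penalizes visually irrelevant words even when the sentence-level match is plausible.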

Figure 3. Top-10 predicted concepts (image-to-concept retrieval results) for examples from the MSCOCO test set.

Corpus # Sents B@3 B@4 Meteor CIDEr
MSCOCO 565K 17.9 11.1 14.2 28.2
AIC-MT 483K 14.4 8.2 13.6 25.6
Table 5. English image captioning performance with the language model trained on different mono-lingual corpora.
Figure 4. Examples of English image captioning from the MSCOCO testing set and Chinese image captioning from the AIC-ICC testing set. Errors in generated captions are marked in red.

Language model trained on different mono-lingual corpora. Although we use an in-domain target corpus to train the language model in Table 2, our SSR model can also benefit from out-of-domain mono-lingual corpora, which are easier to obtain in practice. In Table 5, we present the unpaired English captioning performance with the language model trained on an out-of-domain corpus from AIC-MT (https://challenger.ai/competition/ect2018). Though using the out-of-domain mono-lingual corpus is not as effective as using in-domain data, it still achieves significant improvements over the baseline models in Table 2, which demonstrates the ability of the proposed model to exploit different mono-lingual target corpora.
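A common way to turn language-model scores into a per-sentence fluency reward is to length-normalize the log-likelihood, as sketched below; this is an illustrative formulation, not necessarily the paper's exact one:

```python
import math

def fluency_reward(token_logprobs):
    """Map per-token log-probabilities from a target-language LM into a
    (0, 1] fluency score: exp of the mean log-probability, i.e. the
    inverse perplexity of the sentence."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# A sentence whose every token has probability 0.5 under the LM:
fluency_reward([math.log(0.5)] * 4)  # ≈ 0.5
```

Length normalization matters here: without it, longer captions would always score lower regardless of how fluent they are, biasing the reward toward short sentences.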

Comparison with paired target image captioning. Table 6 compares our proposed model with supervised mono-lingual image captioning models trained with different numbers of image-caption pairs. The amount of paired image-caption data is critical for the supervised image captioning model: without sufficient pairs, captioning performance drops significantly. Our model, in contrast, relies on no supervised image-caption pairs, yet achieves performance comparable to the supervised mono-lingual captioning model trained with 4,000 pairs.

Approach # Imgs # Caps B@4 Meteor CIDEr
Baseline (Vinyals et al., 2015) 82,783 414,113 27.7 23.3 83.9
40,000 40,000 24.2 21.8 71.0
10,000 10,000 20.6 18.8 54.6
4,000 4,000 14.0 14.2 28.5
3,000 3,000 10.7 12.6 19.1
Our SSR 0 0 11.1 14.2 28.2
Table 6. Comparison between unpaired English image captioning and supervised English image captioning with different numbers of training pairs from the MSCOCO dataset.

4.5. Human Evaluation and Qualitative Results

Besides the quantitative evaluations in Section 4.3, we also conduct a human evaluation to verify the effectiveness of the proposed SSR model, taking unpaired English image captioning as an example. We randomly select 1,000 images from the MSCOCO testing set and recruit 10 workers with sufficient English skills to evaluate the quality of the captions generated by the “Baseline+” model and our SSR model. Specifically, we measure caption quality in terms of fluency and relevancy. The fluency levels are 1-very poor, 2-poor, 3-barely fluent, 4-fluent, and 5-human like; the relevancy levels are 1-irrelevant, 2-basically irrelevant, 3-partially relevant, 4-relevant, and 5-completely relevant. The results in Table 7 demonstrate that, with the guidance of self-supervised rewards, our approach generates more fluent and visually relevant captions than the baseline model. The example visualizations in Figure 4 for both English and Chinese image captioning also confirm this.

Measure Baseline+ model Our SSR model
Fluency 4.1 4.8
Relevancy 3.3 3.8
Table 7. Human evaluation results on the 1K MSCOCO test images.

5. Conclusions

In this paper, we propose a novel language-pivoted approach for unpaired cross-lingual image captioning. Previous language-pivoted methods mainly suffer from translation errors brought about by the pivot-to-target translation model, such as disfluency and visual irrelevancy errors. We propose to alleviate the negative effects of such errors by providing fluency and visual relevancy rewards as guidance in the reinforcement learning framework. We employ self-supervision from a mono-lingual sentence corpus and machine-translated image-caption pairs to obtain the reward functions. Extensive experiments with both objective and human evaluations on unpaired English and Chinese image captioning tasks demonstrate the effectiveness of the proposed approach.

Acknowledgments. This work was supported by National Natural Science Foundation of China (No. 61772535), Beijing Natural Science Foundation (No. 4192028), and National Key Research and Development Plan (No. 2016YFB1001202).


  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, Cited by: §2.1.
  • S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, Cited by: §2.1.
  • A. F. Biten, L. Gomez, M. Rusiñol, and D. Karatzas (2019) Good news, everyone! Context driven entity-aware captioning for news images. In CVPR, Cited by: §2.1.
  • S. Chen, Q. Jin, and J. Fu (2019a) From words to sentences: a progressive learning approach for zero-resource machine translation with visual pivots. In IJCAI, Cited by: §2.2.
  • S. Chen, Q. Jin, and A. Hauptmann (2019b) Unsupervised bilingual lexicon induction from mono-lingual multimodal data. In AAAI, Cited by: §2.2.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP, Cited by: §2.1.
  • M. Denkowski and A. Lavie (2014) Meteor universal: language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380. External Links: Document, Link Cited by: §4.1.
  • F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler (2018) VSE++: improving visual-semantic embeddings with hard negatives. In spotlight presentation at British Machine Vision Conference (BMVC), External Links: Link Cited by: §3.3.
  • H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig (2015) From captions to visual concepts and back. In CVPR, Cited by: §2.1.
  • Y. Feng, L. Ma, W. Liu, and J. Luo (2018) Unsupervised image captioning. External Links: Link Cited by: §2.2.
  • J. Gu, S. Joty, J. Cai, and G. Wang (2018) Unpaired image captioning by language pivoting. In Computer Vision – ECCV 2018, Cham, pp. 519–535. External Links: ISBN 978-3-030-01246-5 Cited by: §1, §1, §2.2, 3rd item, 4th item, §4.3, Table 2.
  • J. Gu, G. Wang, J. Cai, and T. Chen (2017) An empirical study of language cnn for image captioning. In ICCV, pp. 10. Cited by: §2.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.2, §4.2.
  • J. Hitschler, S. Schamoni, and S. Riezler (2016) Multimodal pivots for image caption translation. In ACL, Cited by: §2.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9, pp. 1735–80. External Links: Document Cited by: §2.1, §3.1.
  • X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars (2015) Guiding long-short term memory for image caption generation. In ICCV, Cited by: §2.1.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. External Links: Link Cited by: §2.1, §3.1.
  • D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.2.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei (2016) Visual genome: connecting language and vision using crowdsourced dense image annotations. External Links: Link Cited by: §2.2.
  • W. Lan, X. Li, and J. Dong (2017) Fluency-guided cross-lingual image captioning. In ACM Multimedia, pp. 9. External Links: Document Cited by: §1, §2.2.
  • T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §4.1.
  • D. Liu, Z. Zha, H. Zhang, Y. Zhang, and F. Wu (2018) Context-aware visual policy network for sequence-level image captioning. In ACM Multimedia, External Links: Document Cited by: §2.1.
  • X. Liu (2018) Show, tell and discriminate: image captioning by self-retrieval with partially labeled data. In ECCV, Cited by: §2.1.
  • R. Luo, B. Price, S. Cohen, and G. Shakhnarovich (2018) Discriminability objective for training descriptive captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §4.1.
  • S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In CVPR, Cited by: §2.1, §3.1, 2nd item.
  • S. Tsutsui and D. Crandall (2017) Using artificial tokens to control languages for multilingual image caption generation. Cited by: §2.2.
  • R. Vedantam, C. L. Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. In CVPR, Cited by: §4.1.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In CVPR, External Links: Document Cited by: §2.1, §3.1, 1st item, 3rd item, Table 6.
  • J. Wu, H. Zheng, B. Zhao, Y. Li, B. Yan, R. Liang, W. Wang, S. Zhou, G. Lin, Y. Fu, Y. Wang, and Y. Wang (2017) AI challenger : a large-scale dataset for going deeper in image understanding. CoRR abs/1711.06475. Cited by: §4.1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. CoRR abs/1502.03044. External Links: Link Cited by: §2.1.
  • Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo (2016) Image captioning with semantic attention. In CVPR, Cited by: §2.1.
  • W. Zhang, B. Ni, Y. Yan, J. Xu, and X. Yang (2017) Depth structure preserving scene image generation. In ACM Multimedia, Cited by: §2.1.