Unpaired Image Captioning via Scene Graph Alignments

by   Jiuxiang Gu, et al.
Nanyang Technological University

Deep neural networks have achieved great success on the image captioning task. However, most of the existing models depend heavily on paired image-sentence datasets, which are very expensive to acquire in most real-world scenarios. In this paper, we propose a scene graph based approach for unpaired image captioning. Our method merely requires an image set, a sentence corpus, an image scene graph generator, and a sentence scene graph generator. The sentence corpus is used to teach the decoder how to generate meaningful sentences from a scene graph. To further encourage the generated captions to be semantically consistent with the image, we employ adversarial learning to align the visual scene graph to the textual scene graph. Experimental results show that our proposed model can generate quite promising results without using any image-caption training pairs, outperforming existing methods by a wide margin.


1 Introduction

Today's image captioning models depend heavily on paired image-caption datasets. Most of them employ an encoder-decoder framework [32, 6, 10, 9], which uses a convolutional neural network (CNN) to encode an image into a feature vector and then a recurrent neural network (RNN) to decode it into a text sequence. However, it is worth noting that the overwhelming majority of image captioning studies are conducted in English [1]. The bottleneck is the lack of large-scale image-caption paired datasets in other languages; collecting such paired data for each target language requires human expertise and is time-consuming and labor-intensive.

Figure 1: Illustration of our graph-based learning method. Our model consists of one visual scene graph detector (Top-Left), one fixed off-the-shelf scene graph language parser (Bottom-Left), a scene graph encoder, a sentence decoder, and a feature mapping module.

Several encoder-decoder models have been proposed in recent years for unsupervised neural machine translation [20, 4]. These methods mainly rely on training denoising auto-encoders for language modeling and on sharing latent representations across the source and target languages for the encoder and the decoder. Despite the promising results achieved by unsupervised neural machine translation, unpaired image-to-sentence translation is far from mature.

Recently, there have been a few attempts at relaxing the requirement of paired image-caption data for this task. The first work in this direction is the pivot-based semi-supervised solution proposed by Gu et al. [11], who take a pivot language as a bridge to connect the source image and the target language caption. Their method requires image-text paired data for the pivot language (Chinese) and a parallel corpus for pivot-to-target translation. Feng et al. [7] move a step further and conduct purely unsupervised image captioning without relying on any labeled image-caption pairs. Their method uses a sentence discriminator along with a visual concept detector to connect the image and the text modalities through adversarial training. Although promising, the results of the existing methods are still far below those of their paired counterparts.

In unsupervised neural machine translation, the encoders can be shared across the source and target languages. Images and sentences, however, have different structures and characteristics, so their encoders cannot be shared to connect the two modalities. The critical challenge in unpaired image captioning is therefore to bridge the information misalignment between images and sentences so that both fit the encoder-decoder framework.

Fortunately, with recent breakthroughs in deep learning and image recognition, higher-level visual understanding tasks such as scene graph construction have become popular research topics with major advancements [35, 14]. Scene graphs, as an abstraction of objects and their complex relationships, provide rich semantic information of an image. The value of scene graph representation has been proven in a wide range of vision-language tasks, such as visual question answering [30] and paired image captioning [35].

Considering the significant challenges that the unpaired image captioning problem poses because of the different characteristics of the visual and textual modalities, in this paper we propose a scene graph based method that exploits the rich semantic information captured by scene graphs. Our framework comprises an image scene graph generator, a sentence scene graph generator, a scene graph encoder, a sentence decoder, and a feature alignment module that maps the features from the image modality to the sentence modality. Figure 1 sketches our solution. We first extract the sentence scene graphs from the sentence corpus and train the scene graph encoder and the sentence decoder on the text modality. To align the scene graphs between images and sentences, we use CycleGAN [39] to build the data correspondence between the two modalities. Specifically, given the unrelated image and sentence scene graphs, we first encode them with the scene graph encoder trained on the sentence corpus. Then, we perform unsupervised cross-modal mapping for feature-level alignment with CycleGAN. By mapping the features, the encoded image scene graph is pushed close to the sentence modality, and it can then be used as input to the sentence decoder to generate meaningful sentences.

The main contributions of this work include: (1) a novel scene graph based framework for unpaired image captioning; (2) an unsupervised feature alignment method that learns the cross-modal mapping without any paired data. Our experimental results demonstrate the effectiveness of our proposed model in producing quite promising image captions. The comparison with recent unpaired image captioning methods validates the superiority of our method.

2 Background

Paired Image Captioning. Image captioning has been extensively studied in the past few years. Most of the existing approaches are under the paired setting, that is, input images come with their corresponding ground-truth captions [32, 16, 37, 10, 9, 36]. One classic work in this setting is [32], in which an image is encoded with a CNN and the sentence is decoded with a Long Short-Term Memory (LSTM) network. Following this, many methods have been proposed to improve this encoder-decoder method. One of the notable improvements is the attention mechanism [34, 10, 3], which allows the sentence decoder to dynamically focus on related image regions during the caption generation process. Some other works explore other architectures for language modeling [12, 29]. For example, Gu et al. [12] introduce a CNN-based language model for image captioning. Another theme of improvements is to use reinforcement learning (RL) to address the exposure bias and loss-evaluation mismatch problems for sequence prediction [26, 10]. The self-critical learning approach proposed in [26] is a pioneering work that addresses both problems.

Our work in this paper is closely related to [35], which uses scene graphs to connect images and sentences and to incorporate inductive language bias. The key difference is that the framework in [35] is based on the paired setting, while in this work we learn the scene graph based network under an unsupervised training setting.

Unpaired Image Captioning. More recently, some researchers have started looking into the problem of image captioning in the unpaired setting [11, 7], where there is no correspondence between images and sentences during training. The first work on this task is the pivot-based solution proposed by Gu et al. [11]. In their setting, although they do not have any correspondence between images and sentences in the target language, they do require a paired image-caption dataset in the pivot language and a machine translation dataset that consists of sentences in the pivot language and the paired sentences in the target language. They connect pivot language sentences in different domains by shared word embeddings. The most recent work on unpaired image captioning is proposed by Feng et al. [7]. They generate pseudo image-sentence pairs by feeding the visual concepts of images to a concept-to-sentence model and perform the alignment between image features and sentence features in an adversarial manner.

While several attempts have been made at the unpaired image captioning problem, this challenging task is far from mature. Arguably, compared to unpaired sentence-to-sentence [20] and image-to-image [39] translations, unpaired image-to-sentence translation is more challenging because of the significantly different characteristics of the two modalities. In contrast to the existing unpaired image captioning methods [11, 7], our proposed method adopts scene graphs as an explicit representation to bridge the gap between the image and sentence domains.

3 Method

In this section, we describe our unpaired image captioning framework. We first revisit the paired setting.

3.1 Paired Image Captioning Revisited

In the paired captioning setting, our training goal is to generate a caption $\hat{S}$ from an image $I$ such that $\hat{S}$ is similar to its ground-truth caption $S$. The popular encoder-decoder framework for image captioning can be formulated as:

$$P(S \mid I) = P(V \mid I)\, P(S \mid V) \tag{1}$$

where the encoder $P(V \mid I)$ encodes the image $I$ into the image features $V$ with a CNN model [13], and the decoder $P(S \mid V)$ predicts the image description $S$ from the image features $V$. The most common training objective is to maximize the probability of the ground-truth caption words given the image:

$$\max_{\theta} \sum_{t} \log p_{\theta}(S_t \mid S_{0:t-1}, I) \tag{2}$$

where $p_{\theta}(S_t \mid S_{0:t-1}, I)$ corresponds to the Softmax output at time step $t$. During inference, the word $S_t$ is drawn from the dictionary according to the Softmax distribution.
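The training objective described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-step logits stand in for the output of a hypothetical decoder, and the function names `softmax` and `caption_log_likelihood` are not from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one time step's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def caption_log_likelihood(step_logits, target_ids):
    """Sum over time steps of the log-probability of each ground-truth word.

    step_logits[t] is a hypothetical decoder output at step t (one logit per
    vocabulary word); target_ids[t] is the index of the ground-truth word.
    """
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total += math.log(softmax(logits)[target])
    return total

# Toy example: a vocabulary of 3 words and a caption of 2 words.
step_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
target_ids = [0, 2]
log_likelihood = caption_log_likelihood(step_logits, target_ids)
xe_loss = -log_likelihood  # maximizing the likelihood = minimizing this loss
```

Maximizing the summed log-probability and minimizing the cross-entropy loss are the same optimization problem, which is why the later training stage is phrased in terms of a loss.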

3.2 Unpaired Image Captioning

In the unpaired image captioning setting, we have a dataset of images $\mathcal{I} = \{I_1, \ldots, I_{N_I}\}$ and a dataset of sentences $\mathcal{S} = \{S_1, \ldots, S_{N_S}\}$, where $N_I$ and $N_S$ are the total numbers of images and sentences, respectively. In this setting, there is no alignment between $\mathcal{I}$ and $\mathcal{S}$. In fact, $\mathcal{I}$ and $\mathcal{S}$ can be completely unrelated, coming from two different domains. Our goal is to train an image captioning model in a completely unsupervised way. In our setup, we assume that we have access to an off-the-shelf image scene graph detector and a sentence (or text) scene graph parser.

As shown in Figure 1, our proposed image captioning model consists of one image scene graph generator, one sentence scene graph generator, one scene graph encoder, one attention-based RNN decoder for sentence generation, and a cycle-consistent feature alignment module. Given an image $I$ as input, our method first extracts an image scene graph $\mathcal{G}^I$ using the scene graph generator. It then maps $\mathcal{G}^I$ to the sentence scene graph $\mathcal{G}^S$, from which the RNN-based decoder generates a sentence $S$. More formally, the image captioner in the unpaired setting can be decomposed into the following submodels,

$$P(S \mid I) = P(\mathcal{G}^I \mid I)\, P(\mathcal{G}^S \mid \mathcal{G}^I)\, P(S \mid \mathcal{G}^S) \tag{3}$$

where $\mathcal{G}^I$ and $\mathcal{G}^S$ are the image scene graph and the sentence scene graph, respectively. The most crucial component in Eq. (3) is the unpaired mapping of image and text scene graphs. In our approach, this mapping is done in the feature space. In particular, we encode the image and sentence scene graphs into feature vectors and learn to map the feature vectors across the two modalities. We reformulate Eq. (3) as follows:

$$P(S \mid I) = P(\mathcal{G}^I \mid I)\, P(f^I \mid \mathcal{G}^I)\, P(f^S \mid f^I)\, P(S \mid f^S) \tag{4}$$

where $P(f^I \mid \mathcal{G}^I)$ is a graph encoder, $P(S \mid f^S)$ is an RNN-based sentence decoder, and $P(f^S \mid f^I)$ is a cross-modal feature mapper in the unpaired setting. In our implementation, we learn the scene graph encoder and the RNN-based decoder on the text modality first, and then we map the image scene graph into a common feature space (i.e., the text space) so that the same sentence decoder can be used to decode the sentence from the mapped image features.

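The sentence encoding and decoding processes can be formulated as the following two steps:

$$S \rightarrow \mathcal{G}^S \rightarrow f^S \tag{5}$$

$$\hat{S} = \arg\max_{S} P(S \mid f^S) \tag{6}$$

where $\mathcal{G}^S$ is the sentence scene graph, $f^S$ is its encoded feature representation, and $\hat{S}$ is the reconstructed sentence. We train the model to enforce $\hat{S}$ to be close to the original sentence $S$.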

In the following, we describe the scene graph generator in Sec. 3.2.1, the scene graph encoder in Sec. 3.2.2, the sentence decoder in Sec. 3.2.3, and our unpaired feature mapping process in Sec. 3.2.4.

3.2.1 Scene Graph Generator

Formally, a scene graph is a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ containing a set of nodes $\mathcal{V}$ and a set of edges $\mathcal{E}$. As exemplified in Figure 1, the nodes can be of three types: object node, attribute node, and relationship node. We denote $o_i$ as the $i$-th object, $r_{i,j}$ as the relation between objects $o_i$ and $o_j$, and $a_i^l$ as the $l$-th attribute of object $o_i$.
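As a concrete picture of this structure, the following minimal sketch stores the three node types in a small Python container. The class name `SceneGraph`, its fields, and its methods are illustrative assumptions and are not taken from the paper or from any scene graph library.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Hypothetical container for object, attribute, and relationship nodes."""
    objects: list = field(default_factory=list)     # object labels, indexed by i
    attributes: dict = field(default_factory=dict)  # object index -> attribute labels
    relations: list = field(default_factory=list)   # (subject idx, relation label, object idx)

    def add_object(self, label, attrs=()):
        """Add an object node with optional attribute nodes; return its index."""
        self.objects.append(label)
        idx = len(self.objects) - 1
        self.attributes[idx] = list(attrs)
        return idx

    def add_relation(self, subj, label, obj):
        """Add a directed relationship node from a subject to an object."""
        self.relations.append((subj, label, obj))

    def triplets(self):
        """Return readable (subject, relation, object) triplets."""
        return [(self.objects[s], r, self.objects[o]) for s, r, o in self.relations]

# Example: "a young man riding a brown horse"
graph = SceneGraph()
man = graph.add_object("man", ["young"])
horse = graph.add_object("horse", ["brown"])
graph.add_relation(man, "riding", horse)
```

The direction of each stored relation matters, because an object is encoded differently depending on whether it acts as the subject or the object of a triplet.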

An image scene graph generator contains an object detector, an attribute classifier, and a relationship classifier. We use Faster-RCNN [25] as the object detector, MOTIFS [38] as the relationship detector, and an additional classifier for attribute identification [35].

To generate the sentence scene graph for a sentence, we first parse the sentence into a syntactic tree using the parser provided by [2], which uses a syntactic dependency tree built by [18]. Then, we transform the tree into a scene graph with a rule-based method [28].

3.2.2 Scene Graph Encoder

We follow [35] to encode a scene graph. Specifically, we represent each node as a $d$-dimensional feature vector, and use three different spatial graph convolutional encoders to encode the three kinds of nodes by considering their neighbourhood information in the scene graph.

Encoding objects.

In a scene graph (image or sentence), an object can play either a subject or an object role in a relation triplet, depending on the direction of the edge. Therefore, for encoding objects, we consider what relations they are associated with and what roles they play in those relations. Let $\langle o_i, r_{i,j}, o_j \rangle$ denote the triplet for relation $r_{i,j}$, where $o_i$ plays a subject role and $o_j$ plays an object role. The encoding for object $o_i$, that is $x_{o_i}$, is computed by

$$x_{o_i} = \frac{1}{N_{r_i}} \Big( \sum_{o_j} g_s(e_{o_i}, e_{o_j}, e_{r_{i,j}}) + \sum_{o_k} g_o(e_{o_k}, e_{o_i}, e_{r_{k,i}}) \Big) \tag{7}$$

where $e_{o_i}$ and $e_{r_{i,j}}$ are the embeddings (randomly initialized) representing the object $o_i$ and the relation $r_{i,j}$, respectively; $g_s$ and $g_o$ are the spatial graph convolution operations for objects as a subject and as an object, respectively; and $N_{r_i}$ is the total number of relation triplets that $o_i$ is associated with in the scene graph.

Encoding attributes.

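An object may have multiple attributes in the scene graph. The encoding of an object $o_i$ based on its attributes, $x_{a_i}$, is computed by:

$$x_{a_i} = \frac{1}{N_{a_i}} \sum_{l} g_a(e_{o_i}, e_{a_i^l}) \tag{8}$$

where $e_{o_i}$ and $e_{a_i^l}$ are the embeddings of the object and of its $l$-th attribute, $N_{a_i}$ is the total number of attributes that object $o_i$ has, and $g_a$ is the spatial convolutional operation for attribute-based encoding.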

Encoding relations.

Each relation $r_{i,j}$ is encoded into $x_{r_{i,j}}$ by considering the objects that the relation connects in the relation triplet,

$$x_{r_{i,j}} = g_r(e_{o_i}, e_{r_{i,j}}, e_{o_j}) \tag{9}$$

where $g_r$ is the associated convolutional operation.

After graph encoding, for each image scene graph or sentence scene graph, we have three sets of embeddings: one for objects, one for relations, and one for attributes. The numbers of embeddings in the three sets can be different from each other. Figure 2 illustrates the encoding process.
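The role-aware object encoding described in this subsection can be sketched as follows. This is a simplified illustration, not the authors' implementation: `relu_sum` is a hypothetical stand-in for a learned fully-connected layer with ReLU, and returning the raw embedding for an object without relations is an assumption made here only to keep the sketch total.

```python
def relu_sum(*vectors):
    """Stand-in for a learned fc+ReLU layer: ReLU of the elementwise sum."""
    return [max(0.0, sum(vals)) for vals in zip(*vectors)]

def encode_object(i, obj_emb, rel_triplets, g_s=relu_sum, g_o=relu_sum):
    """Average role-specific messages over all triplets that involve object i.

    obj_emb: list of object embedding vectors.
    rel_triplets: list of (subject index, relation embedding, object index).
    g_s is applied when object i is the subject, g_o when it is the object.
    """
    involved = [(s, r, o) for s, r, o in rel_triplets if i in (s, o)]
    if not involved:
        # Assumption: an isolated object keeps its own embedding.
        return list(obj_emb[i])
    total = [0.0] * len(obj_emb[i])
    for s, r, o in involved:
        g = g_s if s == i else g_o
        message = g(obj_emb[s], obj_emb[o], r)
        total = [a + b for a, b in zip(total, message)]
    # Normalize by the number of relation triplets associated with object i.
    return [v / len(involved) for v in total]

# Toy example: two objects joined by one relation.
obj_emb = [[1.0, 0.0], [0.0, 1.0]]
rel_triplets = [(0, [1.0, -3.0], 1)]
subject_encoding = encode_object(0, obj_emb, rel_triplets)
```

In a trained model, `g_s` and `g_o` would be two different learned layers, so the same triplet would yield different messages for its subject and its object.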

Figure 2: The architectures for scene graph encoding, attention, and sentence decoding, with one attention module for each kind of feature (object, attribute, and relation).

3.2.3 Sentence Decoder

The goal of the sentence decoder is to generate a sentence from the encoded object, attribute, and relation embeddings. However, these three sets of embeddings are of different lengths and contain different information, so their importance for the sentence decoding task also varies. In order to compute a relevant context for the decoding task effectively, we use three attention modules, one for each type of embedding. Each attention module weights the embeddings in its set using an associated (learnable) weight vector and produces one attention vector; the attention modules over the other two sets are defined in the same way to obtain their respective attention vectors.
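One common way to realize such an attention module is sketched below. The exact scoring function used in the paper is not recoverable from this text, so the dot product between each embedding and a single weight vector, followed by a softmax, is an assumption; the function name `attend` is likewise hypothetical.

```python
import math

def attend(embeddings, weight):
    """Softmax attention over a variable-length set of embeddings.

    Each embedding is scored by its dot product with `weight` (assumed
    scoring function); the output is the attention-weighted average.
    """
    scores = [sum(w * x for w, x in zip(weight, e)) for e in embeddings]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(embeddings[0])
    context = [sum(a * e[d] for a, e in zip(alphas, embeddings)) for d in range(dim)]
    return context, alphas

# With a zero weight vector, the attention reduces to a plain average.
context, alphas = attend([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
```

Because the softmax normalizes over however many embeddings are present, the same module handles object, attribute, and relation sets of different lengths, and a zero weight vector recovers the averaging baseline.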

The attention vectors are then combined by a neural network into a triplet-level embedding, which is fed into an RNN-based decoder to generate the sentence. At each time step, the next word is predicted from the cell output of the decoder.

3.2.4 Training and Inference

We first train the graph encoder and the sentence decoder in the text modality (Eq. (6)), and then perform feature-level alignment for cross-modal unsupervised mapping.

Training in Text Modality. The graph convolutional encoding of a sentence scene graph into a feature representation and the reconstruction of the original sentence from it are shown at the bottom part of Figure 1. We first train the encoder and the decoder by minimizing the cross-entropy (XE) loss over their parameters, that is, the negative log of the output probability that the sentence decoder assigns to each word in the sentence, summed over the sentence.

We further employ a reinforcement learning (RL) loss that takes the entire sequence into account. Specifically, we take the CIDEr [31] score as the reward and optimize the model by minimizing the negative expected reward, where the reward is calculated by comparing the sampled sentence with the ground-truth sentence using the CIDEr metric. In our model, we follow the RL approach proposed in [27, 10].
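A minimal sketch of a policy-gradient loss of this kind is given below. It assumes a REINFORCE-style estimate with a baseline; using the reward of a greedily decoded sentence as the baseline is an assumption made for illustration, and the reward values are placeholders rather than real CIDEr scores.

```python
def rl_loss(sample_log_probs, sample_reward, baseline_reward):
    """Single-sample surrogate loss for the negative expected reward.

    sample_log_probs: log-probabilities of the words in a sampled sentence.
    sample_reward: sentence-level reward of the sampled sentence.
    baseline_reward: reward of a baseline sentence (assumed: greedy decoding).
    """
    advantage = sample_reward - baseline_reward
    return -advantage * sum(sample_log_probs)

# A sample that beats the baseline gets a positive loss, so gradient descent
# increases the log-probability of that sample.
loss = rl_loss([-0.5, -0.5], sample_reward=1.0, baseline_reward=0.4)
```

Minimizing this quantity raises the probability of sampled sentences that score above the baseline and lowers it for those that score below.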

Figure 3: Conceptual illustration of our unpaired feature mapping. For each kind of embedding, there are two mapping functions (image-to-sentence and sentence-to-image) and two associated adversarial discriminators.

Unsupervised Mapping of Scene Graph Features. To adapt the learned model from sentence modality to the image modality, we need to translate the scene graph from the image to the sentence modality. We take the discrepancy in modality of the scene graphs directly into account by aligning the representation of the image scene graph with the sentence scene graph. We propose to use CycleGAN [39] to learn the feature alignment across domains.

Figure 3 illustrates our idea. Given two sets of unpaired features, one from the image modality and one from the sentence modality, for each kind of embedding (object, attribute, or relation), we have two mapping functions and two discriminators. One mapper maps the image features to the sentence features, and the other maps the sentence features to the image features. The discriminators are trained to distinguish the real (original modality) features from the fake (mapped) features. The mappers are trained to fool the respective discriminators through adversarial training.

For image-to-text mapping, an adversarial loss trains the image-to-sentence mapper against the discriminator of the sentence modality. Similarly, for sentence-to-image mapping, an analogous adversarial loss trains the sentence-to-image mapper against the discriminator of the image modality.

Due to the unpaired setting, the mapping from the source to the target modality is highly under-constrained. To make the mapping functions cycle-consistent, CycleGAN introduces a cycle consistency loss to regularize the training: features mapped to the other modality and back should reconstruct the original features in the image and text modalities, respectively.

Formally, our overall objective for unpaired feature mapping (Eq. (21)) combines the two adversarial losses with the cycle consistency loss. It is optimized over the parameters of the two mapping functions and the two discriminators for each kind of embedding, and a hyperparameter controls the strength of the regularization.
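The mapper side of such an objective can be sketched as follows. This is a generic CycleGAN-style loss written under assumptions: the binary cross-entropy adversarial term, the L1 cycle term, and every function name here are illustrative choices, and the toy mappers and discriminators are plain Python callables rather than trained networks. The default weight of 10 mirrors the value reported in the implementation details.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce(prob, target, eps=1e-12):
    """Binary cross-entropy for a single probability."""
    return -(target * math.log(prob + eps) + (1.0 - target) * math.log(1.0 - prob + eps))

def l1(u, v):
    """Mean absolute difference between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)

def mapper_loss(f_img, f_txt, map_i2t, map_t2i, disc_txt, disc_img, lam=10.0):
    """Adversarial loss for both mappers plus weighted cycle consistency."""
    fake_txt = map_i2t(f_img)  # image features pushed to the text space
    fake_img = map_t2i(f_txt)  # text features pushed to the image space
    # Each mapper tries to make its discriminator output "real" (target 1).
    adv = bce(disc_txt(fake_txt), 1.0) + bce(disc_img(fake_img), 1.0)
    # Mapping there and back should reconstruct the original features.
    cyc = l1(map_t2i(fake_txt), f_img) + l1(map_i2t(fake_img), f_txt)
    return adv + lam * cyc

# Toy setup: the two mappers are exact inverses, so the cycle term is zero.
shift_up = lambda v: [x + 1.0 for x in v]
shift_down = lambda v: [x - 1.0 for x in v]
toy_disc = lambda v: sigmoid(sum(v))
f_img, f_txt = [0.5, -0.5], [1.0, 2.0]
loss_inverse = mapper_loss(f_img, f_txt, shift_up, shift_down, toy_disc, toy_disc)
```

The discriminators are trained with the opposite targets (real features labeled 1, mapped features labeled 0), alternating with the mapper updates.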

Cross-modal Inference. During inference, given an image, we generate its corresponding scene graph using a pre-trained image scene graph generator and use the scene graph encoder to get the image features, which are then mapped through the image-to-text mapper. The mapped features are then used for sentence generation by the sentence decoder, using the same module as in Eq. (12).

4 Experiments

In this section, we evaluate the effectiveness of our proposed method. We first introduce the datasets and the experimental settings. Then, we present the performance comparisons as well as ablation studies to understand the impact of different components of our framework.

4.1 Datasets and Setting

Table 1 shows the statistics of the training datasets used in our experiments. We use Visual Genome (VG) dataset [19] to train our image scene graph generator. We filter the object, attribute, and relation annotations by keeping those that appear more than 2,000 times in the training set. The resulting dataset contains 305 objects, 103 attributes, and 64 relations (a total of 472 items).

We collect the image descriptions from the training split of MSCOCO [21] and use them as our sentence corpus to train the scene graph encoder and the sentence decoder. In pre-processing, we tokenize the sentences and convert all the tokens to lowercase. The tokens that appear less than five times are treated as UNK tokens. The maximum caption length is fixed to 16, and all the captions longer than 16 are truncated. This results in a base vocabulary of 9,487 words. For sentence scene graph generation, we generate the graph using the language parser in [2, 35]. We perform a filtering process by removing objects, relations, and attributes which appear less than 10 times in all the parsed scene graphs. After this filtering, we obtain 5,364 objects, 1,308 relations, and 3,430 attributes. This gives an extended vocabulary where the previous 9,487 words are consistent with the base vocabulary. The embeddings for the vocabulary items are randomly initialized.
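The caption pre-processing described above can be sketched as follows. Whitespace tokenization and the helper names `build_vocab` and `encode_caption` are simplifying assumptions; the thresholds (a minimum count of five and a maximum length of 16) come from the text.

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Keep lowercase tokens that appear at least `min_count` times."""
    counts = Counter(tok for cap in captions for tok in cap.lower().split())
    return {tok for tok, c in counts.items() if c >= min_count}

def encode_caption(caption, vocab, max_len=16, unk="UNK"):
    """Lowercase, truncate to `max_len` tokens, and replace rare tokens."""
    tokens = caption.lower().split()[:max_len]
    return [tok if tok in vocab else unk for tok in tokens]

# Toy corpus: "cat" and "sleeps" are too rare to enter the vocabulary.
corpus = ["A dog runs"] * 5 + ["A cat sleeps"]
vocab = build_vocab(corpus)
encoded = encode_caption("A cat runs", vocab)
```

A real pipeline would use a proper tokenizer instead of `str.split`, but the frequency filter and the truncation step work the same way.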

For learning the mapping between the modalities, the unpaired training data is intentionally collected by shuffling the images and the sentences from MSCOCO randomly. We validate the effectiveness of our method on the same test splits as used in [11, 7] for a fair comparison. The widely used CIDEr-D [31], BLEU [24], METEOR [5], and SPICE [2] are used to measure the quality of the generated captions.

Scene Graph          #Object   #Attribute   #Relation
Image (VG)           305       103          64
Sentence (MSCOCO)    5,364     3,430        1,308
Table 1: Statistics of the training datasets (scene graph vocabulary sizes).
Figure 12: Visualization of features in 2D space by t-SNE [22]. We plot the scatter diagrams for 1,500 samples. Panels (a) to (d) show the raw object, relation, attribute, and triplet features; panels (e) to (h) show the corresponding aligned features.

4.2 Implementation Details

We follow [35] to train our image scene graph generator on VG. We first train a Faster-RCNN and use it to identify the objects in each image. We select at least 10 and at most 100 objects for an image. The object features extracted by RoI pooling are used as input to the object detector, the relation classifier, and the attribute classifier. We adopt the LSTM-based relation classifier from MOTIFS [38]. Our attribute classifier is a single-hidden-layer network with ReLU activation (i.e., fc-ReLU-fc-Softmax), and we keep only the three most probable attributes for each object.

For scene graph encoding, we set . We implement the graph convolution operations for objects, attributes, and relations and the triplet embedding network (Eq. 7 - 12) as fully-connected layers with ReLU [23] activations. The two mapping functions in Eq. (17) and Eq. (18) are implemented as fully-connected layers with leaky ReLU [33] activations.

The sentence decoder has two LSTM layers. The input to the first LSTM is the word embeddings and its previous hidden state. The input to the second LSTM is the concatenation of three terms: the triplet embedding , the output from the first LSTM, and its previous hidden state. We set the number of hidden units in each LSTM to 1,000.

During training, we first train the network with the cross-entropy loss (Eq. (15)) for epochs and then fine-tune it with RL loss in Eq. (16). The learning rate is initialized to for all parameters, and decayed by after every epoch. We use Adam  [17] for optimization with a batch size of . During the (unpaired) alignment learning, we freeze the parameters of the scene graph encoder and the sentence decoder, and only learn the mapping functions and the discriminators. We set to 10 in Eq. (21). During inference, we use beam search with a beam size of 5.
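The beam search used at inference can be sketched as follows. The function `toy_model` is a hypothetical stand-in for the sentence decoder: it returns next-word log-probabilities for a given prefix. The end-of-sentence token and the length limit are also illustrative assumptions.

```python
import math

def beam_search(step_log_probs, beam_size=5, max_len=16, eos="</s>"):
    """Return the highest-scoring sequence found with a fixed-size beam.

    step_log_probs(prefix) must return a dict: next token -> log-probability.
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_log_probs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            # Sequences ending in the end token leave the beam.
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished sequences that hit max_len
    return max(finished, key=lambda c: c[1])[0]

def toy_model(prefix):
    """Hypothetical decoder: greedy decoding starts with 'a', yet 'the dog' scores higher."""
    table = {
        (): {"a": math.log(0.6), "the": math.log(0.4)},
        ("a",): {"dog": math.log(0.5), "cat": math.log(0.5)},
        ("the",): {"dog": math.log(0.9), "cat": math.log(0.1)},
    }
    return table.get(prefix, {"</s>": 0.0})

best = beam_search(toy_model, beam_size=5)
```

With a beam of one the search degenerates to greedy decoding, which commits to the locally best first word and can miss a sequence with a higher overall probability.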

For quantifying the efficacy of the proposed framework, we use several baselines for performance comparison.

Graph-Enc-Dec (Avg). This baseline learns the graph encoder and the sentence decoder only on sentence corpus. It takes the average operation (as opposed to an attention) over the three sets of features: , , and . During testing, we directly feed the image scene graph to this model and get the image description.

Graph-Enc-Dec (Att, shared). This model shares the same setting with Graph-Enc-Dec (Avg), but replaces the average operation with a shared attention mechanism for all three sets (i.e., the same attention for object, attribute, and relation).

Graph-Enc-Dec (Att, separate). This model modifies Graph-Enc-Dec (Att, shared) with an independent attention mechanism for each set of features.

Graph-Align. This is our final model. It is initialized with the trained parameters from the Graph-Enc-Dec (Att) model that uses separate attentions, and then it also learns the feature mapping functions using adversarial training.

4.3 Quantitative Results

4.3.1 Investigation on Sentence Decoding

In this experiment, we first train the network with Eq. (15), and then fine-tune it with Eq. (16) on the sentence corpus. Table 2 compares the results of the three baseline models on the sentence corpus. The attention-based models perform better than the average-based model in all metrics, which demonstrates that weighting over features can better model the global dependency of features. Note that using a separate attention module for each set of features further improves the performance significantly (for example, CIDEr rises from 151.4 to 168.8).

The inconsistent alignment of three kinds of features in Figure 12 also supports that we should treat these sets of features separately.

Methods                          B@1    B@2    B@3    B@4    M      C
Graph-Enc-Dec (Avg)              84.3   71.8   58.8   47.1   31.0   129.4
Graph-Enc-Dec (Att, shared)      91.8   80.3   67.5   55.5   34.3   151.4
Graph-Enc-Dec (Att, separate)    94.1   84.6   72.9   61.5   36.3   168.8
Table 2: Results for different sentence scene graph decoders on the MSCOCO test split, where B@n refers to BLEU-n, M refers to METEOR, and C refers to CIDEr. All values are reported in percentage.

Methods                          B@1    B@2    B@3    B@4    M      C
Graph-Enc-Dec (Avg)              52.1   34.1   23.8   17.6   14.9   41.4
Graph-Enc-Dec (Att, shared)      54.3   37.0   26.8   20.3   15.9   47.2
Graph-Enc-Dec (Att, separate)    56.0   33.6   20.1   11.9   17.0   48.5
Table 3: Results for different baselines without GAN training on the test split of MSCOCO.

4.3.2 Investigation on Unpaired Setting without GAN

Table 3 shows the comparisons among different baselines when no explicit cross-modal mapping of the features is done. By feeding the image scene graph directly to the trained scene graph encoder and the sentence decoder, we can achieve promising performance on the test set. Graph-Enc-Dec (Att) with separate attention modules still achieves the best B@1, METEOR, and CIDEr scores, although its B@2 to B@4 scores in Table 3 are lower than those of the other two baselines. This is reasonable since both scene graphs and captions are high-level understandings of the image, and by capturing rich semantic information about objects and their relationships, scene graphs provide an effective way to connect an image to its natural language description. This finding also validates the feasibility of our approach to unpaired image captioning through the use of scene graphs. However, compared to the paired setup (see Table 5), these results are still inferior, meaning that the scene graph alone is not enough to achieve comparable performance.

Figure 13: Qualitative examples of different methods. In each example, the left image is the original input image; the middle is the image scene graph; the right image is the ground-truth sentence scene graph for comparison.

4.3.3 Investigation on Unpaired Setting with GAN

GAN Loss   Discriminator   B@1    B@2    B@3    B@4    C
BCE                        64.9   44.2   28.6   18.1   63.0
BCE                        66.0   46.0   30.3   19.7   65.5
BCE                        65.5   45.4   29.6   18.8   65.2
MSE                        65.3   44.8   28.9   18.3   62.9
MSE                        66.0   45.9   29.7   18.8   63.8
MSE                        58.4   36.3   21.7   12.6   46.7
GP                         66.1   46.1   30.3   19.5   65.5
GP                         67.1   47.8   32.3   21.5   69.5
GP                         64.5   44.2   28.5   17.9   61.1
Table 4: Ablation studies of different GAN losses for Graph-Align model.
Methods                          B@1    B@2    B@3    B@4    M      R      C       S
Paired Setting
SCST [27]                        -      -      -      33.3   26.3   55.3   111.4   -
Stack-Cap [10]                   78.6   62.5   47.9   36.1   27.4   56.9   120.4   20.9
SGAE (base) [36]                 79.9   -      -      36.8   27.7   57.0   120.6   20.9
Unpaired Setting
Language Pivoting [11]           46.2   24.0   11.2   5.4    13.2   -      17.7    -
Adversarial+Reconstruction [7]   58.9   40.3   27.0   18.6   17.9   43.1   54.9    11.1
Graph-Align                      67.1   47.8   32.3   21.5   20.9   47.2   69.5    15.0
Table 5: Performance comparisons on the test split of the MSCOCO dataset, where B@n refers to BLEU-n, M to METEOR, R to ROUGE-L, C to CIDEr, and S to SPICE.

To align the features from the image modality to the text modality, we use CycleGAN with our Graph-Enc-Dec (Att) model that uses separate attentions. Table 4 shows the comparisons of three kinds of GAN loss: binary cross entropy (BCE) loss with logits (the vanilla GAN loss [8]), mean squared error (MSE) loss, and gradient penalty (GP) [15]. We also compare the results for using different output dimensions in the discriminator. (Footnote 1: For example, for a dimension of 64, the output is a 64-dimensional vector, which is compared against an all-one vector of length 64 for a 'Real' input, and with an all-zero vector of length 64 for a 'Fake' input.)
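The footnote's multi-dimensional discriminator target can be sketched as follows. The discriminator outputs here are made-up numbers, and the helper names are illustrative; only the comparison against all-one and all-zero target vectors follows the text.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits, in a numerically stable form."""
    losses = [max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
              for x, t in zip(logits, targets)]
    return sum(losses) / len(losses)

def mse(outputs, targets):
    """Mean squared error between outputs and targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

def discriminator_loss(real_out, fake_out, loss_fn):
    """Compare real outputs with an all-one vector and fake outputs with an all-zero vector."""
    dim = len(real_out)
    return loss_fn(real_out, [1.0] * dim) + loss_fn(fake_out, [0.0] * dim)

# A 64-dimensional discriminator output that is undecided (all logits zero).
undecided = [0.0] * 64
bce_value = discriminator_loss(undecided, undecided, bce_with_logits)
# A perfect discriminator under the MSE loss.
mse_value = discriminator_loss([1.0] * 64, [0.0] * 64, mse)
```

With an output dimension of 1 the same functions apply unchanged; only the length of the target vectors shrinks.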

We can see that most of the CycleGAN variants improve the performance substantially compared to the results in Table 3. The GP loss with a 64-dimensional discriminator output achieves the best performance. Note that, when we set the output dimension to 1, the performance drops. This indicates that a strong discriminator is crucial for unpaired feature alignment. From the bottom row of Figure 12, we can see that with the help of the mapping module, the three kinds of embeddings are aligned very well, especially the attribute embedding (Figure 12(g)). It is also worth noting that the triplet features in Figure 12(h) are better aligned compared to the raw triplet features in Figure 12(d).

Finally, Table 5 compares the results of the Graph-Align model with those of the existing unpaired image captioning methods [11, 7] on the MSCOCO test split. We can notice that our proposed Graph-Align achieves the best performance in all metrics. This demonstrates the effectiveness of our graph-based unpaired image captioning model.

4.4 Qualitative Results

Figure 13 visualizes some examples of our models. We show the generated image descriptions using different models along with the ground-truth captions (bottom part). In the generated image and sentence scene graphs, we mark object, relation, and attribute nodes in orange, blue, and green, respectively. From these examples, we observe that our method can generate reasonable image descriptions by aligning the unpaired visual-textual modalities with the help of scene graphs. We also observe that the number of attributes (words in green) in the sentence scene graph is smaller than that in the image scene graph. This observation potentially explains why there is a huge feature embedding gap between image and text in Figure 12(c).

5 Conclusions

In this paper, we have proposed a novel framework to train an image captioning model in an unsupervised manner without using any paired image-sentence data. Our method uses scene graphs as an intermediate representation of the image and the sentence, and maps the scene graphs in their feature space through cycle-consistent adversarial training. We used graph convolution and attention methods to encode the objects, their attributes, and their relationships in a scene graph. Our quantitative and qualitative evaluations show the effectiveness of our method in generating meaningful captions, and it outperforms existing unpaired methods by a good margin. In the future, we would like to evaluate our method on other datasets and explore other mapping methods such as optimal transport.


References

  • [1] English-speaking world, 2019. [Online; accessed 22-March-2019].
  • [2] P. Anderson, B. Fernando, M. Johnson, and S. Gould. SPICE: Semantic propositional image caption evaluation. In ECCV, 2016.
  • [3] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
  • [4] M. Artetxe, G. Labaka, E. Agirre, and K. Cho. Unsupervised neural machine translation. In ICLR, 2018.
  • [5] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop, 2005.
  • [6] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In CVPR, 2015.
  • [7] Y. Feng, L. Ma, W. Liu, and J. Luo. Unsupervised image captioning. In CVPR, 2019.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [9] J. Gu, J. Cai, S. Joty, L. Niu, and G. Wang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In CVPR, 2018.
  • [10] J. Gu, J. Cai, G. Wang, and T. Chen. Stack-captioning: Coarse-to-fine learning for image captioning. In AAAI, 2017.
  • [11] J. Gu, S. Joty, J. Cai, and G. Wang. Unpaired image captioning by language pivoting. In ECCV, 2018.
  • [12] J. Gu, G. Wang, J. Cai, and T. Chen. An empirical study of language CNN for image captioning. In ICCV, 2017.
  • [13] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, et al. Recent advances in convolutional neural networks. Pattern Recognition, pages 354–377, 2017.
  • [14] J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling. Scene graph generation with external knowledge and image reconstruction. In CVPR, 2019.
  • [15] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
  • [16] J. Hitschler, S. Schamoni, and S. Riezler. Multimodal pivots for image caption translation. In ACL, 2016.
  • [17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [18] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In ACL, 2003.
  • [19] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
  • [20] G. Lample, A. Conneau, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora only. In ICLR, 2018.
  • [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [22] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 2008.
  • [23] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
  • [24] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In ACL, 2002.
  • [25] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [26] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. In CVPR, 2017.
  • [27] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. In CVPR, 2017.
  • [28] S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In ACL, 2015.
  • [29] K. Shuster, S. Humeau, H. Hu, A. Bordes, and J. Weston. Engaging image captioning via personality. In CVPR, 2019.
  • [30] D. Teney, L. Liu, and A. van den Hengel. Graph-structured representations for visual question answering. In CVPR, 2017.
  • [31] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
  • [32] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. PAMI, 2017.
  • [33] B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
  • [34] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  • [35] X. Yang, K. Tang, H. Z. Zhang, and J. Cai. Auto-encoding scene graphs for image captioning. In CVPR, 2019.
  • [36] X. Yang, H. Zhang, and J. Cai. Shuffle-then-assemble: Learning object-agnostic visual relationship features. In ECCV, 2018.
  • [37] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, 2016.
  • [38] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi. Neural motifs: Scene graph parsing with global context. In CVPR, 2018.
  • [39] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.