Convolutional Auto-encoding of Sentence Topics for Image Paragraph Generation

by   Jing Wang, et al.
Nanjing University

Image paragraph generation is the task of producing a coherent story (usually a paragraph) that describes the visual content of an image. The problem is nevertheless not trivial, especially when there are multiple descriptive and diverse gists to be considered for paragraph generation, which often happens in real images. A valid question is how to encapsulate such gists/topics that are worthy of mention from an image, and then describe the image from one topic to another but holistically with a coherent structure. In this paper, we present a new design --- Convolutional Auto-Encoding (CAE) that purely employs a convolutional and deconvolutional auto-encoding framework for topic modeling on the region-level features of an image. Furthermore, we propose an architecture, namely CAE plus Long Short-Term Memory (dubbed CAE-LSTM), that novelly integrates the learnt topics in support of paragraph generation. Technically, CAE-LSTM capitalizes on a two-level LSTM-based paragraph generation framework with an attention mechanism. The paragraph-level LSTM captures the inter-sentence dependency in a paragraph, while the sentence-level LSTM generates each sentence conditioned on one learnt topic. Extensive experiments are conducted on the Stanford image paragraph dataset, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, CAE-LSTM increases CIDEr performance from 20.93% to 25.15%.


1 Introduction

The recent advances in Convolutional Neural Networks (CNN) have successfully pushed the limits and improved the state of the art in image understanding. As such, it has become achievable to recognize an image with a pre-defined set of categories. As a further step toward describing an image in a natural-language utterance, image captioning [22, 24, 18] has expanded the understanding from individual labels to a sentence that reflects the visual content of the image. Nevertheless, considering that a single sentence has an upper bound of descriptive capacity and thus fails to recapitulate every detail in the image, the task of image paragraph generation was recently introduced in [9, 12]. The ultimate goal is to generate a coherent paragraph that describes an image in a finer manner.

Figure 1: Modeling the topics from an image by employing (a) recurrent method and (b) convolutional auto-encoding (ours).

The difficulty of image paragraph generation originates from two aspects: 1) how to distill the gists/topics that are worthy of mention from an image, and 2) how to describe each topic in one sentence while ensuring coherence across sentences to form a paragraph. One straightforward way to alleviate the first issue, proposed in [9], relies on Recurrent Neural Networks (RNN), e.g., the Long Short-Term Memory (LSTM) network. The region-level features are encoded into a global vector via mean pooling, which is subsequently input into an LSTM to decode the topics, as shown in Figure 1(a). In this case, the inherent structure among all regions is not fully exploited, making it difficult to encapsulate the topics in the image. Furthermore, there is no clear evidence that the outputs of the LSTM characterize the topics well. Instead, in this paper we devise a Convolutional Auto-Encoding (CAE) structure to model the topics of an image, as illustrated in Figure 1(b). CAE, on one hand, abstracts the topics in the encoder by employing convolutions over the region-level features and, on the other, steers the deconvolutional decoder through reconstruction from topics back to features. As such, the learnt topics are potentially more representative and contain the information needed. To address the second issue, we remould a two-level LSTM-based paragraph generation framework, in which a paragraph-level LSTM models the dependency holistically across all the sentences in a paragraph and a sentence-level LSTM generates words in sequence conditioned on each learnt topic.

By consolidating the idea of modeling sentence topics for image paragraph generation, we present a new Convolutional Auto-Encoding plus Long Short-Term Memory (CAE-LSTM) architecture, as shown in Figure 2. Specifically, Faster R-CNN is firstly implemented to detect a set of salient image regions. We purely perform convolutions over all the region-level features in the encoder to distill the knowledge and extract the topics in the image. The learnt topics are ensured to capture holistic and representative information through achieving high reconstruction quality by the deconvolutional decoder. After that, we fuse all the region-level features via mean pooling into an image representation, which is fed into a two-level LSTM network for paragraph generation. The paragraph-level LSTM explores the dependency recursively throughout the procedure of paragraph generation and, at each step, outputs a new state that feeds the sentence-level LSTM with the paragraph history. As such, the generation of one sentence is affected by semantic context from the previous sentences when producing the paragraph. The sentence-level LSTM generates the sentence conditioned on each learnt topic, one word at each time step. Please also note that we uniquely design a Coverage metric, which encourages global coverage of objects in the paragraph, and use it as a reward in the self-critical training strategy. The whole CAE-LSTM framework is jointly optimized.

The main contribution of this work is the proposal of modeling gists/topics in an image to boost image paragraph generation. This also leads to the elegant views of how to distill the knowledge and abstract the topics in an image, and how to nicely integrate the learnt topics into sentence generation and ensure inter-sentence dependency in a paragraph, which are problems not yet fully understood in the literature.

2 Related Work

Image Captioning.

The dominant paradigm in modern image captioning is sequence learning methods [5, 11, 22, 24, 28, 13, 23, 27, 26, 25, 18], which utilize a CNN plus RNN model to generate free-form and content-relevant captions. [22] proposes an end-to-end neural network architecture by utilizing an LSTM to generate a sentence for an image. Later, by amending the discrepancy between training and inference distributions for sequence modeling, [18] presents a self-critical sequence training strategy to further enhance image captioning. Most recently, a combined bottom-up and top-down attention mechanism [1] was proposed for image captioning, which enables attention at the level of objects and salient image regions.

Image Paragraph Generation.

The task of image paragraph generation, which describes an image with a long, descriptive, and coherent paragraph, has received increasing attention recently. [9] is one of the early works that leverages a hierarchical RNN for generating a paragraph with the input of region features. In particular, a sentence RNN recursively generates sentence topic vectors conditioned on the global vector (i.e., the mean pooling of region-level features), and a word RNN is subsequently adopted to decode each topic into an output sentence. [12] extends the hierarchical RNN by involving multi-level adversarial discriminators for paragraph generation. The paragraph generator is thus enforced to produce realistic paragraphs with smooth logical transitions between sentence topics. Furthermore, [3] augments the hierarchical RNN with coherence vectors, global topic vectors, and a formulation of Variational Auto-Encoders [8] to further model the inherent ambiguity of associating paragraphs with images.


In short, our approach focuses on the latter scenario, which produces a coherent paragraph to depict an image. Unlike most of the aforementioned methods that recursively decode gists/topics for paragraph generation via an LSTM conditioned on the global vector, our work contributes by exploiting a convolutional and deconvolutional auto-encoding module for gist/topic modeling over region-level features. Such a design not only abstracts the topics in the convolutional encoder by exploiting the inherent structure among all regions, but also steers the deconvolutional decoder through reconstruction to make the distilled topics more representative and informative. Moreover, a Coverage reward is adopted in the self-critical training strategy to encourage global coverage of objects in the paragraph.

Figure 2: An overview of our Convolutional Auto-Encoding plus Long Short-Term Memory (CAE-LSTM) architecture for image paragraph generation. All region-level features of image regions detected by Faster R-CNN are firstly injected into the Convolutional Auto-Encoding (CAE) module for triggering the modeling of topics in the image. In particular, a convolutional encoder is leveraged to distill the knowledge from region-level features and extract the topics. Next, a deconvolutional decoder is steered to reproduce the region-level features from the distilled topics. As such, the learnt topics are ensured to capture the holistic and representative information that is worthy of mention from all regions. After that, we fuse all the region-level features via mean pooling into the image feature, which is fed into a two-level LSTM-based paragraph generator with attention (Att) for paragraph generation. Here the paragraph-level LSTM explores the inter-sentence dependency in a paragraph with the recursive input of the mean-pooled image feature, and the sentence-level LSTM generates the sentence conditioned on each learnt topic. Moreover, we include a Coverage reward to encourage global coverage of objects in the paragraph for self-critical training.

3 Approach

We devise our Convolutional Auto-Encoding plus Long Short-Term Memory (CAE-LSTM) architecture to facilitate image paragraph generation by modeling gists/topics in an image. CAE-LSTM firstly utilizes a convolutional encoder to encapsulate region-level features into topics, which are endowed with holistic and representative information through achieving high reconstruction quality by the deconvolutional decoder. The distilled topics are further integrated into a two-level LSTM-based paragraph generator, enabling inter-sentence dependency modeling in a paragraph via the paragraph-level LSTM and topic-oriented sentence generation through the sentence-level LSTM. The overall training of CAE-LSTM is performed by exploiting the reconstruction loss to pursue high-quality reconstruction from topics to region-level features, and the sequence-level reward (e.g., CIDEr) plus the coverage reward in self-critical training to encourage maximum coverage of objects in the paragraph. An overview of our model is depicted in Figure 2.

3.1 Notation

Suppose we have an image $I$ to be described by a paragraph $P = \{S_1, S_2, \dots, S_M\}$ consisting of $M$ sentences. Each sentence $S_m$ is composed of $N_m$ words, and each word is represented as a $D_s$-dimensional textual feature. Faster R-CNN [17] is firstly leveraged to produce a set of detected objects with $K$ image regions $\{r_i\}_{i=1}^{K}$ in $I$. $\mathbf{v}_i \in \mathbb{R}^{D_v}$ denotes the $D_v$-dimensional feature of each image region $r_i$.

3.2 CAE for Topic Modeling

The most typical way to distill the topics from an image is to encode region-level features into a global vector via mean pooling, with an LSTM-based decoder subsequently utilized to recursively output topics conditioned on the global vector; in this case the inherent structure among all regions is not fully exploited. Furthermore, there is no clear evidence that the outputs of the LSTM characterize the topics well. Here we devise a Convolutional Auto-Encoding (CAE) module for topic modeling. The spirit behind it follows the philosophy that holistic topic abstraction over all region-level features via the convolutional encoder, together with reconstruction from topics back to features via the deconvolutional decoder, enforces the learnt topics to be more representative and informative.

Convolutional Encoder.

Given the set of detected regions $\{r_i\}_{i=1}^{K}$, a convolutional encoder, applied purely with convolutions, is leveraged to encapsulate all the image regions into topic vectors. Specifically, each region $r_i$ is firstly embedded into a $D_r$-dimensional region-level feature $\mathbf{v}_i$ through a linear layer. A region feature map $\mathbf{V}$ is thus constructed by concatenating all region-level features (i.e., $\mathbf{v}_i$ is the $i$-th column of $\mathbf{V}$), which is set as the input of the convolutional encoder. Here $W_v$, $H_v$, and $C_v$ denote the width, height, and number of channels of the input feature map $\mathbf{V}$, respectively. Next, we utilize a convolutional layer ($f_{conv}$) to encode the feature map $\mathbf{V}$ into the topic feature map $\mathbf{T}$ consisting of the distilled topic vectors:

$$\mathbf{T} = f_{conv}(\mathbf{V}).$$
For the convolutional layer $f_{conv}$, the size of the convolutional filter is $k_w \times k_h$, the stride is $s$, and the number of filters is $C_t$. Each vector $\mathbf{t}_j$ along the channel dimension within the topic feature map $\mathbf{T}$ can be regarded as the $j$-th distilled topic vector.

Deconvolutional Decoder.

For the deconvolutional decoder, we perform deconvolution (i.e., the conjugate operation of convolution) to decode the distilled topic feature map $\mathbf{T}$ back to the region feature map. In particular, given $\mathbf{T}$, a deconvolutional layer ($f_{deconv}$) is adopted to reproduce the region feature map $\hat{\mathbf{V}}$:

$$\hat{\mathbf{V}} = f_{deconv}(\mathbf{T}).$$

The filter size and stride in $f_{deconv}$ are set the same as in the convolutional layer $f_{conv}$. Here the $i$-th column $\hat{\mathbf{v}}_i$ of the reconstructed region feature map $\hat{\mathbf{V}}$ corresponds to the reconstructed region-level feature of the $i$-th image region.

Reconstruction Loss.

To measure the quality of reconstruction for our CAE module, a global reconstruction loss is defined as the $\ell_2$ distance between the input region feature map $\mathbf{V}$ and the reconstructed region feature map $\hat{\mathbf{V}}$:

$$L_{rec} = \left\| \mathbf{V} - \hat{\mathbf{V}} \right\|_2.$$

We also experimented with the $\ell_1$ distance as the reconstruction loss, but it did not make a major difference. By minimizing the reconstruction loss, the distilled topic vectors are enforced to capture holistic and representative information from all regions through achieving higher reconstruction quality.
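As a concrete illustration, the encode-reconstruct cycle of the CAE can be sketched in NumPy, treating the region feature map as a sequence of region columns and the convolution as a strided window over non-overlapping groups of regions. All dimensions, the window size, and the random weight matrices below are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative sizes: K regions, region dim D_r, topic dim D_t, window/stride w
K, D_r, D_t, w = 8, 16, 32, 2
V = rng.standard_normal((D_r, K))                  # region feature map (columns = regions)

W_e = 0.1 * rng.standard_normal((D_t, D_r * w))    # encoder filter (acts on a flattened window)
W_d = 0.1 * rng.standard_normal((D_r * w, D_t))    # decoder filter

def encode(V):
    # convolution as a strided window: one topic vector per w-region window
    return np.stack([np.tanh(W_e @ V[:, j:j + w].reshape(-1))
                     for j in range(0, K, w)], axis=1)

def decode(T):
    # deconvolution: expand each topic vector back into w region columns
    return np.concatenate([(W_d @ T[:, j]).reshape(D_r, w)
                           for j in range(T.shape[1])], axis=1)

T = encode(V)                                      # distilled topic feature map
V_hat = decode(T)                                  # reconstructed region feature map
rec_loss = np.linalg.norm(V - V_hat)               # l2 reconstruction loss
```

In training, `rec_loss` would be minimized jointly with the paragraph-generation losses so that the topic vectors retain enough information to reproduce all region features.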

Note that besides acting as the gists to guide sentence generation in the following two-level LSTM network, the distilled topic vectors can also be utilized to determine the number of generated sentences. Specifically, each topic vector is additionally injected into a linear layer to obtain a distribution over two states {CONTINUE=0, STOP=1}, which determines whether the corresponding sentence is the last one in the paragraph.
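A minimal sketch of this stop/continue head follows; the linear layer and the two-state softmax are the only ingredients stated in the text, and all weight shapes here are hypothetical:

```python
import numpy as np

def stop_probability(topic, W_s, b_s):
    """Map a topic vector to P(STOP), i.e., whether its sentence is the last one."""
    logits = W_s @ topic + b_s               # two logits: {CONTINUE=0, STOP=1}
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return (e / e.sum())[1]

rng = np.random.default_rng(1)
p_stop = stop_probability(rng.standard_normal(8),     # an 8-dim toy topic vector
                          rng.standard_normal((2, 8)),
                          rng.standard_normal(2))
```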

Figure 3: Illustration of two-level LSTM-based paragraph generator.

3.3 Two-level LSTM-based Paragraph Generator

Inspired by the two-layer LSTM structure with top-down attention mechanism in [1], we remould a two-level LSTM-based paragraph generator consisting of a paragraph-level LSTM for inter-sentence dependency modeling and a sentence-level LSTM for sentence generation conditioned on each distilled topic, as illustrated in Figure 3.

To be specific, at each time step $t$ when generating the $m$-th sentence in a paragraph, the input vector of the paragraph-level LSTM is defined as the concatenation of the previous output of the sentence-level LSTM $\mathbf{h}^{s}_{t-1}$, the mean-pooled image feature $\bar{\mathbf{v}}$, and the embedding of the previously generated word $w_{t-1}$:

$$\mathbf{x}^{p}_{t} = \left[ \mathbf{h}^{s}_{t-1},\ \bar{\mathbf{v}},\ \mathbf{W}_e w_{t-1} \right],$$

where $\mathbf{W}_e$ is the transformation matrix for the input word. Such an input collects the maximum contextual information for the paragraph-level LSTM to model inter-sentence dependency.

Next, given the output $\mathbf{h}^{p}_{t}$ of the paragraph-level LSTM and the corresponding topic vector $\mathbf{t}_m$, a normalized attention distribution over all regions is measured as

$$a_{i,t} = \mathbf{w}_a^{\top} \tanh\left( \mathbf{W}_v \mathbf{v}_i + \mathbf{W}_h \mathbf{h}^{p}_{t} + \mathbf{W}_t \mathbf{t}_m \right), \qquad \boldsymbol{\alpha}_t = \mathrm{softmax}(\mathbf{a}_t),$$

where $a_{i,t}$ is the $i$-th element of $\mathbf{a}_t$, and $\mathbf{W}_v$, $\mathbf{W}_h$, $\mathbf{W}_t$, and $\mathbf{w}_a$ are transformation matrices. Here the $i$-th element of the attention distribution $\boldsymbol{\alpha}_t$ denotes the attention probability of region $r_i$. The attended image feature $\hat{\mathbf{v}}_t$ is thus calculated by aggregating all region-level features weighted by the attention distribution.
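This attention step can be sketched as follows; the additive (tanh) parameterization is assumed from the top-down attention of [1], and all matrices and dimensions are random placeholders rather than the paper's actual parameters:

```python
import numpy as np

def attend(v_feats, h_p, topic, W_v, W_h, W_t, w_a):
    # additive attention score per region, conditioned on the paragraph-level
    # LSTM output h_p and the current topic vector (assumed parameterization)
    scores = np.array([w_a @ np.tanh(W_v @ v + W_h @ h_p + W_t @ topic)
                       for v in v_feats])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                 # softmax over regions
    v_att = sum(a * v for a, v in zip(alpha, v_feats))   # attended image feature
    return alpha, v_att

rng = np.random.default_rng(2)
D, K, H = 6, 5, 4                                        # toy dimensions
regions = [rng.standard_normal(D) for _ in range(K)]
alpha, v_att = attend(regions, rng.standard_normal(H), rng.standard_normal(H),
                      rng.standard_normal((H, D)), rng.standard_normal((H, H)),
                      rng.standard_normal((H, H)), rng.standard_normal(H))
```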

We further set the concatenation of the attended image feature $\hat{\mathbf{v}}_t$, the corresponding topic vector $\mathbf{t}_m$, and the output of the paragraph-level LSTM as the input of the sentence-level LSTM for topic-oriented sentence generation:

$$\mathbf{x}^{s}_{t} = \left[ \hat{\mathbf{v}}_t,\ \mathbf{t}_m,\ \mathbf{h}^{p}_{t} \right].$$

The output $\mathbf{h}^{s}_{t}$ of the sentence-level LSTM is fed into a softmax layer to predict the next word $w_t$.
Note that the initial state of the sentence-level LSTM is reset to zero at the beginning of every sentence generation. The state of the paragraph-level LSTM is initialized with zero only at the beginning of the entire paragraph generation, since the paragraph-level LSTM should explore inter-sentence dependency recursively throughout the whole procedure of paragraph generation.
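The control flow of the two levels, including this state-initialization rule, can be sketched as follows. The `cell` function is a toy stand-in for a real LSTM cell, the attended feature is approximated by the mean-pooled feature, and all dimensions are illustrative:

```python
import numpy as np

D_h = 4  # toy hidden size

def cell(x, state):
    """Toy recurrent update standing in for an LSTM cell (not a real LSTM)."""
    h, c = state
    h_new = np.tanh(x[:D_h] + h)
    return h_new, (h_new, c)

def generate(topics, v_mean, words_per_sentence=3):
    p_state = (np.zeros(D_h), np.zeros(D_h))       # paragraph-level: init once
    h_s = np.zeros(D_h)                            # previous sentence-level output
    outputs = []
    for t_m in topics:                             # one sentence per topic
        s_state = (np.zeros(D_h), np.zeros(D_h))   # sentence-level: reset per sentence
        w_prev = np.zeros(D_h)                     # embedding of previous word
        sentence = []
        for _ in range(words_per_sentence):
            x_p = np.concatenate([h_s, v_mean, w_prev])
            h_p, p_state = cell(x_p, p_state)      # carries inter-sentence context
            x_s = np.concatenate([v_mean, t_m, h_p])  # attended feature ~ v_mean here
            h_s, s_state = cell(x_s, s_state)      # a softmax over the vocab would follow
            sentence.append(h_s)
        outputs.append(sentence)
    return outputs

rng = np.random.default_rng(3)
para = generate([rng.standard_normal(D_h) for _ in range(2)],
                rng.standard_normal(D_h))
```

The key point the sketch preserves is that `p_state` survives across sentences while `s_state` does not.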

3.4 Self-critical Training with Coverage Reward

To further boost our paragraph generation model by amending the discrepancy between training and inference, we adopt the self-critical training strategy [18] to optimize the two-level LSTM-based paragraph generator. Despite achieving high quantitative scores, qualitative analysis shows that captioning models optimized solely with a sequence-level loss/reward are often limited to describing very generic information about objects, or rely on prior information and correlations from training examples, frequently resulting in undesired effects such as object hallucination [14]. As a result, besides the sequence-level reward (e.g., CIDEr), we additionally include a coverage reward in the self-critical training strategy to encourage the global coverage of objects in the paragraph and thus improve its descriptiveness.

Specifically, we select the top frequent object words from the vocabulary of the training data as the high-frequency objects. The coverage reward is defined as the coverage ratio of high-frequency objects in the generated paragraph relative to the ground-truth paragraph:

$$r_{cov} = \frac{|\mathcal{O}_{gen} \cap \mathcal{O}_{gt}|}{|\mathcal{O}_{gt}|},$$

where $|\cdot|$ denotes the number of elements in a set, and $\mathcal{O}_{gen}$ and $\mathcal{O}_{gt}$ represent the sets of high-frequency objects mentioned in the generated paragraph and the ground-truth paragraph, respectively. Accordingly, the final reward in the self-critical training strategy is measured as the combination of the CIDEr reward $r_{cider}$ and the coverage reward: $r = r_{cider} + \lambda\, r_{cov}$, where $\lambda$ is the tradeoff parameter.
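Under the assumption that the coverage ratio counts ground-truth high-frequency objects also mentioned in the generated paragraph (the precise set definitions were lost in extraction), the reward can be sketched as:

```python
def coverage_reward(gen_paragraph, gt_paragraph, high_freq_objects):
    """Fraction of ground-truth high-frequency objects also mentioned in the
    generated paragraph (assumed definition of the coverage ratio)."""
    o_gen = set(gen_paragraph.lower().split()) & high_freq_objects
    o_gt = set(gt_paragraph.lower().split()) & high_freq_objects
    return len(o_gen & o_gt) / len(o_gt) if o_gt else 0.0

def total_reward(cider_reward, cov_reward, lam=8.0):
    # final self-critical reward: CIDEr plus weighted coverage (lambda = 8 per Sec. 4.1)
    return cider_reward + lam * cov_reward

objs = {"man", "dog", "shirt"}
r_cov = coverage_reward("a man in a shirt", "a man and his dog wear a shirt", objs)
```

Here the generated paragraph mentions "man" and "shirt" out of the three ground-truth objects, giving a coverage ratio of 2/3.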

4 Experiments

4.1 Datasets and Experimental Settings


We conducted the experiments and evaluated our CAE-LSTM on Stanford image paragraph dataset (Stanford) [9], a benchmark in the field of image paragraph generation. The dataset contains 19,551 images and there is one human-annotated paragraph per image. On average, each paragraph has 67.5 words and each sentence consists of 11.91 words. In our experiments, we follow the widely used settings in [9] and take 14,575 images for training, 2,487 for validation and 2,489 for testing.

Compared Methods.

We compare the proposed method with the following state-of-the-art methods: (1) Image-Flat [7] is a standard image captioning model which directly decodes an image into a paragraph word by word, via a single LSTM. (2) Regions-Hierarchical [9] adopts a hierarchical LSTM to generate a paragraph, sentence by sentence. (3) RTT-GAN [12] integrates sentence attention and word attention into the hierarchical LSTM, coupled with an adversarial training strategy. (4) CapG-RevG [3] leverages coherence vectors/global topic vectors to generate coherent paragraphs and maintains the diversity of the paragraphs through a variational auto-encoder formulation. (5) CAE-LSTM is the proposal in this paper. Moreover, one degraded version of CAE-LSTM, i.e., LSTM-ATT, is devised to model topics via an LSTM (instead of CAE) and adopts the same two-level LSTM architecture with attention for paragraph generation; it is trained without the self-critical training strategy.


For each image, we apply Faster R-CNN to detect objects and select the top $K$ regions with the highest detection confidences to represent the image. Each region is represented as the 4,096-dimensional output of the fc7 layer after RoI pooling on the conv5_3 feature map of Faster R-CNN (backbone: VGG16 [19]). The Faster R-CNN is pre-trained on Visual Genome [10], similar to [1]. To build the vocabulary, all the words in the training set are converted to lowercase and those that appear less than 4 times are omitted. Each word is represented as a one-hot vector (binary index vector over the vocabulary). The maximum sentence number is 6 and the maximum word number in a sentence is 20 (padded where necessary). For our CAE, the convolutional filter size in the convolutional layer $f_{conv}$ is set as $k_w \times k_h$ with stride $s$. We denote by $D_r$ and $D_t$ the dimensions of the embedded region-level feature and the distilled topic vector, by $D_h$ the hidden-state dimension of each LSTM in the two-level network, and by $D_a$ the dimension of the hidden layer for measuring the attention distribution. Three popular metrics are adopted: METEOR [2], CIDEr [21], and BLEU-4 [16]. We compute all metrics with the code released by the Microsoft COCO Evaluation Server [4].
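The vocabulary construction described above (lowercasing all training words and dropping those appearing fewer than 4 times, then indexing words as one-hot vectors) can be sketched as:

```python
from collections import Counter

def build_vocab(training_paragraphs, min_count=4):
    # lowercase every word, keep only words seen at least min_count times
    counts = Counter(w for p in training_paragraphs
                       for w in p.lower().split())
    return sorted(w for w, c in counts.items() if c >= min_count)

def one_hot(word, vocab):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1   # "one-hot": binary index vector over the vocab
    return vec

paras = ["A man sits", "a man stands", "A man walks", "the man runs"]
vocab = build_vocab(paras, min_count=4)   # only "man" appears 4 times
```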

Implementation Details.

We follow the two-phase training recipe in [18] to train our CAE-LSTM. In the first phase, the training of CAE-LSTM is performed by jointly leveraging the reconstruction loss in CAE and the cross-entropy loss in the two-level LSTM paragraph generator. We evaluate the model on the validation set at each epoch and select the model with the best CIDEr score as the initialization for the next training phase. In the second phase of self-critical training, CAE-LSTM is further optimized with the combination of the CIDEr reward and the coverage reward. During inference, we adopt the inference constraint in [15] to penalize trigram repetition. The tradeoff parameter $\lambda$ is set as 8 according to the validation performance. Note that batch normalization [6] and dropout [20] are applied in our experiments.

One issue in training our CAE-LSTM is to determine the order of the regions when concatenating region-level features as the input of the CAE module. In the experiments, the region order is determined according to the objectness score of each image region. We additionally studied alternative region orders by sorting all the regions with regard to the confidence score of the bounding box, or simply using a fixed random order. The CIDEr score fluctuates within 0.008 across different concatenation orders in our CAE-LSTM. The results indicate that the performance is not sensitive to the selection of region order.
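The default ordering by objectness score (and its confidence-score or random alternatives) amounts to a simple sort before concatenation; highest-score-first is an assumed convention here:

```python
def order_regions(features, scores):
    """Sort region features by score, highest first (assumed convention)."""
    idx = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [features[i] for i in idx]

# three toy regions with objectness scores
ordered = order_regions(["r0", "r1", "r2"], [0.2, 0.9, 0.5])
```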

Method C M B-4
Image-Flat [7] 11.06 12.82 7.71
Regions-Hierarchical [9] 13.52 15.95 8.69
RTT-GAN [12] 20.36 18.39 9.21
CapG-RevG [3] 20.93 18.62 9.43
LSTM-ATT 20.17 17.72 8.72
CAE-LSTM 25.15 18.82 9.67
Table 1: Performance of our CAE-LSTM and other state-of-the-art methods on Stanford, where C, M, and B-4 are short for CIDEr, METEOR, and BLEU-4. All values are reported as percentage (%).

4.2 Performance Comparison and Analysis

Quantitative Analysis.

The performances of different models on the Stanford dataset are shown in Table 1. Overall, the results across the three evaluation metrics consistently indicate that our proposed CAE-LSTM achieves better performance than other state-of-the-art techniques, including non-attention models (Image-Flat, Regions-Hierarchical, and CapG-RevG) and attention-based approaches (LSTM-ATT and RTT-GAN). Specifically, the CIDEr and METEOR scores of our CAE-LSTM reach 25.15% and 18.82%, relative improvements over the best competitor CapG-RevG of 20.2% and 1.1%, respectively. As expected, by additionally modeling topics/gists in an image via an LSTM, Regions-Hierarchical exhibits better performance than Image-Flat, which leaves inter-sentence dependency unexploited. Moreover, LSTM-ATT leads to a performance boost over Regions-Hierarchical, which directly encodes an image as a global representation by performing mean pooling over all region-level features. The results indicate the advantage of the region-level attention mechanism in the two-level LSTM network, which learns to focus on the image regions that are most indicative for inferring the next word. More specifically, RTT-GAN and CapG-RevG, by modeling the realism and diversity of paragraphs with Generative Adversarial Networks and Variational Auto-Encoders respectively, improve over LSTM-ATT. However, the performances of RTT-GAN and CapG-RevG are still lower than that of our CAE-LSTM, which exploits the inherent structure among all image regions for topic modeling in a convolutional and deconvolutional auto-encoding framework.
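The relative improvements quoted above follow directly from the Table 1 scores:

```python
# Table 1 scores (in %): CAE-LSTM vs. the best competitor CapG-RevG
cider_cae, cider_capg = 25.15, 20.93
meteor_cae, meteor_capg = 18.82, 18.62

rel_cider = (cider_cae - cider_capg) / cider_capg * 100    # relative CIDEr gain
rel_meteor = (meteor_cae - meteor_capg) / meteor_capg * 100  # relative METEOR gain
```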

Figure 4: Examples for paragraphs generated by LSTM-ATT, CAE-LSTM and human-annotated Ground Truth paragraphs (GT) on Stanford dataset (better viewed in color). The words coded with same colors indicate the accurate semantic matches between the generated paragraphs and ground truth paragraphs. We underline the semantically similar phrases/sentences in a paragraph.

Qualitative Analysis.

Figure 4 shows several paragraph examples generated by LSTM-ATT and CAE-LSTM, together with one human-annotated Ground Truth (GT) paragraph per image. From these exemplar results, it is easy to see that all of these paragraph generation models can produce somewhat relevant paragraphs, while our proposed CAE-LSTM generates more coherent and accurate paragraphs by learning to distill the gists/topics from an image via the convolutional auto-encoding module to guide paragraph generation. For example, compared to LSTM-ATT, which repeatedly generates the phrase "holding a tennis racket" in consecutive sentences for the first image, the paragraph produced by our CAE-LSTM describes diverse and accurate topics in the image more coherently. The results again verify that the convolutional auto-encoding module in our CAE-LSTM characterizes the topics better than an LSTM. Moreover, by additionally measuring the coverage of objects as a reward in the self-critical training strategy, our CAE-LSTM is encouraged to produce paragraphs that cover more objects in images, leading to more descriptive paragraphs, e.g., mentioning "shirt" and "shorts."

Figure 5: The performance improvement by considering one more design (i.e., Convolutional Encoding (CE), Reconstruction Loss (RL), and Self-critical training with Coverage (SC)) in CAE-LSTM, over the performance of LSTM-ATT across different metrics.

Ablation study.

Here we study how each design in our CAE-LSTM influences the overall performance. Convolutional Encoding (CE) replaces the LSTM for topic modeling in LSTM-ATT with the convolutional encoder in our devised CAE module. Reconstruction Loss (RL) enforces the distilled topics to capture holistic and representative information via reconstruction from topics to region-level features. Self-critical training with Coverage (SC) further encourages global coverage of objects with a coverage reward in self-critical training. Figure 5 shows the performance improvement obtained by considering one more design for paragraph generation in our CAE-LSTM, over the performance of LSTM-ATT across different metrics. The results consistently indicate that topic modeling via convolutional encoding in CE leads to a performance boost compared to LSTM-ATT, which capitalizes on an LSTM to explore dependency among topics. Furthermore, by integrating the reconstruction loss into the learning of topics, CE+RL exhibits better performance than CE. The results demonstrate the advantage of reconstruction from topics to region-level features in our CAE. By additionally exploiting the coverage reward in self-critical training, CE+RL+SC further boosts performance.

Human Evaluation.

To better understand how satisfactory the paragraphs generated by different methods are, we conducted a Turing Test to evaluate our CAE-LSTM against the baseline LSTM-ATT. In particular, six evaluators were invited and a subset of 1,000 images was randomly selected from the testing set for human evaluation. Each evaluator is shown only one paragraph at a time, generated by one of the approaches or taken from the human annotation, and is asked: can you determine whether the given paragraph has been generated by a system or by a human being? From the evaluators' responses, we calculate the percentage of paragraphs that pass the Turing Test. The results for Human, CAE-LSTM, and LSTM-ATT are 88.5%, 39.8%, and 14.7%, respectively, which clearly show that our CAE-LSTM outperforms LSTM-ATT.

5 Conclusion

We have presented Convolutional Auto-Encoding plus Long Short-Term Memory (CAE-LSTM), which explores the modeling of sentence topics to boost image paragraph generation. In particular, we study the problem from the viewpoint of topic distillation in an auto-encoding manner. To verify our claim, we have devised a purely convolutional structure, in which an encoder abstracts topics by employing convolutions over region-level image features and a deconvolutional decoder validates the topics through high-quality reconstruction from topics back to features. A two-level LSTM-based paragraph generator is then remoulded, on one hand, to model dependency across sentences in a paragraph and, on the other, to generate each sentence conditioned on one learnt topic. Experiments conducted on the Stanford image paragraph dataset verify our proposal and analysis. Performance improvements are observed when comparing to state-of-the-art image paragraph generation techniques.


This work was partially supported by the National Natural Science Foundation of China (Grant No. 61732007 and U1611461).


  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and vqa. In CVPR, Cited by: §2, §3.3, §4.1.
  • [2] S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In ACL workshop, Cited by: §4.1.
  • [3] M. Chatterjee and A. G. Schwing (2018) Diverse and coherent paragraph generation from images. In ECCV, Cited by: §2, §4.1, Table 1.
  • [4] X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick (2015) Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §4.1.
  • [5] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In CVPR, Cited by: §2.
  • [6] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §4.1.
  • [7] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In CVPR, Cited by: §4.1, Table 1.
  • [8] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
  • [9] J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei (2017) A hierarchical approach for generating descriptive image paragraphs. In CVPR, Cited by: §1, §1, §2, §4.1, §4.1, Table 1.
  • [10] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV. Cited by: §4.1.
  • [11] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei (2019) Pointing novel objects in image captioning. In CVPR, Cited by: §2.
  • [12] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing (2017) Recurrent topic-transition gan for visual paragraph generation. In ICCV, Cited by: §1, §2, §4.1, Table 1.
  • [13] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy (2017) Optimization of image description metrics using policy gradient methods. In ICCV, Cited by: §2.
  • [14] J. Lu, J. Yang, D. Batra, and D. Parikh (2018) Neural baby talk. In CVPR, Cited by: §3.4.
  • [15] L. Melas-Kyriazi, A. Rush, and G. Han (2018) Training for diversity in image paragraph captioning. In EMNLP, Cited by: §4.1.
  • [16] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, Cited by: §4.1.
  • [17] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, Cited by: §3.1.
  • [18] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In CVPR, Cited by: §1, §2, §3.4, §4.1.
  • [19] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.
  • [20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. JMLR. Cited by: §4.1.
  • [21] R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) Cider: consensus-based image description evaluation. In CVPR, Cited by: §4.1.
  • [22] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In CVPR, Cited by: §1, §2.
  • [23] J. Wang, J. Fu, J. Tang, Z. Li, and T. Mei (2018) Show, reward and tell: automatic generation of narrative paragraph from photo stream by adversarial training. In AAAI, Cited by: §2.
  • [24] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, Cited by: §1, §2.
  • [25] T. Yao, Y. Pan, Y. Li, and T. Mei (2017) Incorporating copying mechanism in image captioning for learning novel objects. In CVPR, Cited by: §2.
  • [26] T. Yao, Y. Pan, Y. Li, and T. Mei (2018) Exploring visual relationship for image captioning. In ECCV, Cited by: §2.
  • [27] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei (2017) Boosting image captioning with attributes. In ICCV, Cited by: §2.
  • [28] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo (2016) Image captioning with semantic attention. In CVPR, Cited by: §2.