Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning

by Dandan Guo, et al.
Xidian University

Observing a set of images and their corresponding paragraph-captions, a challenging task is to learn how to produce a semantically coherent paragraph to describe the visual content of an image. Inspired by recent successes in integrating semantic topics into this task, this paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework, which couples a visual extractor with a deep topic model to guide the learning of a language model. To capture the correlations between the image and text at multiple levels of abstraction and learn the semantic topics from images, we design a variational inference network to build the mapping from image features to textual captions. To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model, including Long Short-Term Memory (LSTM) and Transformer, and jointly optimized. Experiments on a public dataset demonstrate that the proposed models, which are competitive with many state-of-the-art approaches in terms of standard evaluation metrics, can be used to both distill interpretable multi-layer topics and generate diverse and coherent captions.




1 Introduction

Describing visual content in a natural-language utterance is an emerging interdisciplinary problem, which lies at the intersection of computer vision (CV) and natural language processing (NLP) ((27)). As a sentence-level short image caption ((38, 35, 1)) has limited descriptive capacity, (20) introduce a paragraph-level captioning method that aims to generate a detailed and coherent paragraph describing an image in a finer manner. Recent advances in image paragraph generation focus on building different types of hierarchical recurrent neural networks (HRNNs), typically based on LSTM ((14)), to generate visual paragraphs. In an HRNN, the high-level RNN recursively produces a sequence of sentence-level topic vectors given the image features as input, while the low-level RNN is subsequently adopted to decode each topic vector into an output sentence. By modeling each sentence and coupling the sentences into one paragraph, these hierarchical architectures often outperform flat models. To improve the performance and generate more diverse paragraphs, advanced methods extending the HRNN with a generative adversarial network (GAN) ((11)) or variational auto-encoders (VAE) ((18)) are proposed by (23) and (3). Apart from adopting the output of the high-level RNN to represent the topics, (36) introduce convolutional auto-encoding (CAE) on the region-level features of an image to learn the corresponding topics, which are further integrated into the HRNN-based paragraph generation framework. Despite their effectiveness, a common defect of the above image paragraph captioning methods is that there is no clear evidence that the outputs of the high-level RNN or CAE describe the image's main topic well. In this case, these models may attend to image regions that are visually salient but semantically irrelevant to the image's main topic ((43)). Motivated by models that utilize semantic topics learned from topic models, rather than deterministic neural networks, to generate single-sentence captions ((10, 43, 39)), (25) extract textual topics of images with a commonly used shallow topic model, i.e., Latent Dirichlet Allocation (LDA) ((2)), and generate topic-oriented multiple sentences to describe an image.

Motivated by the successes in integrating textual topic information into image captioning tasks, we present a flexible hierarchical-topic-guided image paragraph generation framework, coupling a visual extractor with a multi-stochastic-layer deep topic model to guide the generation of a language model (LM). Specifically, a convolutional neural network (CNN) coupled with a region proposal network ((30)) is first utilized as the visual extractor to detect a set of salient image regions, a usual practice in image captioning systems. Since having an intuition about the image's semantic topics may help select the most semantically meaningful and topic-relevant image areas as the context for caption generation ((43)), we construct a deep topic model to match the image's visual features to its corresponding semantic topic information. Here we design a deep topic model built on the success of the Poisson gamma belief network (PGBN) ((42)), which can be equivalently represented as deep LDA ((7)) and extracts interpretable multi-layer topics from text data. To capture the correlations between the image and text at multiple levels of abstraction and learn the semantic topics from images, we generalize PGBN into a novel visual-textual coupling model (VTCM). Generally, VTCM encapsulates region-level features into the hierarchical topics via a variational encoder and feeds the topic usage information to the decoder (PGBN) to generate descriptive captions. Different from existing image paragraph captioning methods that compute topic vectors via deterministic neural networks (RNN or CAE) or utilize shallow topic models (typically in a two-stage manner) to learn the topics of a given image, our proposed VTCM can relate semantic topics to visual concepts and distill multi-stochastic-layer topic information in an end-to-end manner.

To guide paragraph generation, both the visual features and the mined hierarchical semantic topics from the VTCM are fed into either an LSTM-based or Transformer-based ((33)) language generator. We refer to these as VTCM-LSTM and VTCM-Transformer. Following (36) and (1), the LM in VTCM-LSTM capitalizes on both paragraph-level and sentence-level LSTMs. In this case, the feedback of the paragraph-level LSTM is fed into the attention module together with the topic information to select critical image visual features. The sentence-level LSTM generates a sequence of words conditioned on the learned topics and attended image features. For Transformer-based image captioning systems, while the original Transformer architecture can be directly adopted as the LM in our framework, the multi-modal nature of image captioning demands specialized architectures different from those employed for understanding a single modality. (8) thus introduce a Meshed-Transformer with memory for image captioning, which learns a multi-level representation of the relationships between image regions via a memory-augmented encoder and uses mesh-like connectivity at the decoding stage to exploit both low- and high-level features. In this work, we aim to improve the Meshed-Transformer with the multi-stochastic-layer topic information, which is hierarchically coupled with the visual features extracted by the encoder and further interpolated into the feedback of each decoding layer to guide the caption generation. Absorbing the multi-layer semantic topics as additional guidance, both VTCM-LSTM and VTCM-Transformer produce a caption closely related to the given image and semantic topics. Unlike previous works that adopt a GAN or VAE specifically as the framework to generate diverse captions for a given image ((23, 3)), our paragraph captioning systems can generate diverse paragraph-level captions for an image since we feed the multi-stochastic-layer latent topic representation of VTCM to the language generator as the source of randomness, moving beyond existing methods that often use pre-trained topics learned from LDA ((10, 25, 39, 4)). Moreover, by designating different topics as high-level guiding information, our proposed framework is controllable, making the generated captions not only related to the image but also reflective of what the user wants to emphasize. Our main contributions include: 1) VTCM is proposed to extract and relate the hierarchical semantic topics with image features, with the distilled topics integrated into both LSTM-based and Transformer-based LMs, guiding paragraph-level caption generation; 2) a variational-inference-based end-to-end training procedure is introduced to jointly optimize the VTCM and LM, which helps better capture and relate visual and semantic concepts; 3) extensive experiments are performed, with quantitative and qualitative results showing the benefits of extracting multi-layer topics for generating descriptive paragraphs.

2 Related Work

2.1 Image Paragraph Captioning

Regions-Hierarchical ((20)) designs a hierarchical RNN to produce a generic paragraph for an image. To generate diverse and semantically coherent paragraphs, (23) extend the hierarchical RNN by proposing an adversarial framework between a structured paragraph generator and multi-level paragraph discriminators. Considering the difficulties associated with training GANs and the lack of an explicit coherence model, (3) augment the hierarchical RNN with coherence vectors and a Variational Auto-Encoder (VAE) formulation. To encapsulate the region-level features of an image into topics, (36) design a convolutional auto-encoding (CAE) module for topic modeling, where the extracted topics are further integrated into a two-level LSTM-based paragraph generator. Motivated by models that utilize semantic topics to generate single-sentence captions ((10, 43, 39)), the authors of (25) first pre-train LDA ((2)) on the caption corpus of the training images, then train a topic classifier for semantic regularization and topic prediction based on the learned topics. In short, most existing paragraph captioning models focus on constructing topics via a deterministic high-level RNN or convolutional encoder, or on learning single-layer topics of images with a pre-trained LDA, in a two-stage manner. In this work, we focus on distilling multi-stochastic-layer semantic topics with a deep topic model and feeding them into commonly used LMs.

2.2 Language Models and Topic Models

Our framework integrates both LMs and topic models (TMs). Existing LMs are often built on either recurrent units, as used in recurrent neural networks (RNNs) (6, 14), or purely attention-based modules, as used in the Transformer and its various generalizations (33, 29). RNN-based LMs have been successfully used in image paragraph captioning systems (discussed above), while single-sentence Transformer-based image captioning systems have only started to attract attention (16, 22, 8), with little research yet on Transformer-based image paragraph captioning. Different from LMs, which capture the distribution of a word sequence, TMs (2, 32, 15, 42) often represent each document as a bag of words (BoW), capturing global semantic coherency in semantically meaningful topics. As a deep extension of LDA ((7)), PGBN ((42)) can extract hierarchical topics and capture the relationships between latent topics across multiple stochastic layers, while providing rich interpretation. Here, we generalize PGBN to build the VTCM, which mines the relation between visual features and semantic topics that are then assimilated into the LSTM-based or Transformer-based LM in support of paragraph generation. Although the idea of matching visual features and semantic topics is similar to the work in (41), where the topics are fed into a GAN-based image generator, here we focus on integrating the learned hierarchical topics from VTCM into LMs to guide paragraph generation, a task distinct from image generation.

3 Proposed Models

Denoting Img as the given image, image paragraph captioning systems aim to generate a paragraph consisting of multiple sentences, where each sentence consists of words drawn from a fixed vocabulary. We introduce a plug-and-play hierarchical-topic-guided image paragraph captioning system, with an overview of VTCM-LSTM depicted in Fig. 1. It consists of three major components: a visual extractor for extracting image features, the proposed VTCM for distilling multi-layer topics of a given image, and the LM for interpreting the extracted image features and topics into captions. Following a usual practice in image captioning systems, we implement the visual extractor by adopting a CNN coupled with a region proposal network (RPN) ((30)). The process is expressed as {v_1, ..., v_R} = f_ve(Img), where f_ve denotes the visual extractor, R the number of regions, and v_r the feature of the r-th salient region of Img. To more compactly describe the content of the image, we subsequently aggregate these vectors into a single average-pooled vector v̄ = (1/R) Σ_r v_r. Below, we give more details about the other two components, the VTCM and LM.

Figure 1: Architecture of our proposed VTCM-LSTM. (a) The visual extractor, consisting of a CNN and an RPN, produces the region feature vectors and the average-pooled vector v̄. (b) The visual-textual coupling model, where the right part (from the topic weight vectors θ^(l) to the BoW of the paragraph caption) is the generative model with three hidden layers (the decoder), and the left part (from the average-pooled vector v̄ to the Weibull parameters k^(l) and λ^(l)) is the variational encoder. (c) The LSTM-based LM, including the paragraph-level LSTM, the attention module, and the sentence-level LSTM, where y_{s,t} is the t-th word in the s-th sentence of a paragraph and W_e is the word embedding matrix.

3.1 Visual-Textual Coupling Model

There are two mainstream ways to learn topics from an image. One is to encode the visual image features into a global vector with a high-level RNN, which is used as the topic information to guide a low-level RNN. The other, a two-stage method, first applies LDA to the training caption data and then trains a downstream topic classifier over image features, where the semantic image-topic pair is assimilated into the LM for sentence or paragraph generation. In contrast, we design an end-to-end variational deep topic model that captures the correlations between image features and descriptive text by distilling semantic topics, jointly trained with the LM. The basic idea follows the philosophy that generating descriptive captions from topics via the topic decoder, and extracting topics from image visual features via the variational encoder, together enforce that the mined multi-layer topics relate to the visual features.
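Before detailing the decoder, the summarization of a caption into a count vector, which the decoder consumes, can be sketched in a few lines; the tokenizer and stop-word list below are simplified placeholders, not the paper's actual preprocessing:

```python
from collections import Counter

STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "and"}  # toy list

def bow_vector(paragraph, vocab):
    """Count non-stop-word occurrences over a fixed vocabulary order."""
    tokens = [w for w in paragraph.lower().split() if w not in STOPWORDS]
    counts = Counter(tokens)
    return [counts.get(w, 0) for w in vocab]
```

Word order is deliberately discarded: the topic model only needs paragraph-level word co-occurrence statistics, while the language model handles sequential structure.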
Topic Decoder: As a multi-stochastic-layer deep generalization of LDA ((7)), PGBN ((42)) is selected as the topic decoder. For the given Img in the training set, we summarize its ground-truth paragraph caption into a bag-of-words (BoW) count vector x in Z_+^V, where V is the size of the vocabulary excluding stop words, Z_+ denotes the non-negative integers, and each element of x counts the number of times the corresponding word occurs in the paragraph. As shown in Fig. 1(b), the generative process of PGBN with L hidden layers, from top to bottom, is expressed as

θ^(L) ~ Gamma(r, 1/c^(L+1)),
θ^(l) ~ Gamma(Φ^(l+1) θ^(l+1), 1/c^(l+1)), for l = L−1, ..., 1,
x ~ Poisson(Φ^(1) θ^(1)),

where the shape parameters of the gamma-distributed hidden units θ^(l) are factorized into the product of the connection weight matrix Φ^(l+1) and the hidden units θ^(l+1) of the next layer. We place a Dirichlet prior on each column of Φ^(l) at each layer. The global semantics of the image captions in the training dataset are compressed into {Φ^(l)}_{l=1}^L, and the k-th topic at hidden layer l can be visualized as the k-th column of Φ^(1) Φ^(2) ... Φ^(l), which is very specific at the bottom layer and becomes more general in higher layers. θ^(l) denotes a semantic representation of the input x, indicating its topic proportions at the l-th layer. It provides a desired opportunity to build a better paragraph generator by coupling the semantically meaningful latent hierarchy with image features.
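The top-down generative process can be sketched with NumPy; the vocabulary size, layer widths, and the gamma and Dirichlet-style hyperparameters below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: vocabulary size V; topics per layer, bottom to top.
V, K = 50, [16, 8, 4]
L = len(K)

# Phi[l] maps layer-(l+1) units down to layer-l units; columns sum to 1,
# mimicking the Dirichlet prior on each topic column.
Phi, rows = [], [V] + K[:-1]
for l in range(L):
    M = rng.gamma(0.1, size=(rows[l], K[l]))
    Phi.append(M / M.sum(0, keepdims=True))

# Top-down sampling (NumPy's gamma uses shape/scale, so scale = 1/rate).
r, c = np.ones(K[-1]), 1.0
theta = rng.gamma(r, 1.0 / c)                   # theta^(L)
for l in range(L - 1, 0, -1):
    theta = rng.gamma(Phi[l] @ theta, 1.0 / c)  # theta^(l)
x = rng.poisson(Phi[0] @ theta)                 # BoW count vector
```

Each pass through the loop pushes more general high-layer topic usage down into more specific low-layer usage, until the Poisson layer emits word counts.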
Variational Topic Encoder: Given the topics shared by all training captions, the topic weights inferred from the descriptive caption are not suitable for the image captioning task, where only images are given at the testing stage. To match visual features to the hierarchical topic weight vectors {θ^(l)}_{l=1}^L, we adopt the variational hetero-encoder in (41) to build a topic encoder, with

q(θ^(l) | v̄) = Weibull(k^(l), λ^(l)),

where the Weibull distribution is used to approximate the gamma-distributed conditional posterior, and its parameters k^(l) and λ^(l) are deterministic nonlinear transformations of the image features v̄, as described in Appendix A.1 and shown in Fig. 1(b). By transforming standard uniform noises u^(l), we can sample θ^(l) as

θ^(l) = λ^(l) (−ln(1 − u^(l)))^(1/k^(l)),  u^(l) ~ Uniform(0, 1).
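The reparameterized Weibull draw can be sketched directly; because the sample is a deterministic function of the noise u and the parameters k and λ, gradients can flow back into the encoder:

```python
import numpy as np

def sample_weibull(k, lam, rng):
    """Reparameterized Weibull sample: theta = lam * (-ln(1 - u))**(1/k),
    with u ~ Uniform(0, 1); log1p(-u) computes ln(1 - u) stably."""
    u = rng.uniform(size=np.shape(lam))
    return lam * (-np.log1p(-u)) ** (1.0 / k)
```

For shape k = 2 and scale λ = 1, the Weibull mean is Γ(1 + 1/2) ≈ 0.886, which a Monte Carlo average of such samples recovers.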
We denote Ω as the set of encoder parameters, which can be updated via stochastic gradient descent (SGD) by maximizing a lower bound of the log marginal likelihood of the caption x in (3.1), formulated as

L_topic = E_q[ln p(x | Φ^(1) θ^(1))] − Σ_{l=1}^{L} E_q[ln q(θ^(l) | v̄) − ln p(θ^(l) | Φ^(l+1) θ^(l+1))].

Optimizing the above lower bound encourages the multi-stochastic-layer topic weight vectors to capture holistic and representative information from the image and its corresponding caption. Serving as the bridge between the two modalities, the hierarchical topics are further used to guide caption generation in the language model, which can be flexibly chosen for our plug-and-play system. Below we investigate how to integrate the topic information into not only LSTM-based LMs, but also Transformer-based LMs.
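The reconstruction term of this bound, under the Poisson BoW likelihood, can be sketched as follows; the Weibull-to-gamma KL terms are omitted here (they have a closed form in practice), and the small eps and the dropped ln(x!) constant are implementation conveniences:

```python
import numpy as np

def poisson_loglik(x, rate, eps=1e-8):
    """ln p(x | rate) for a Poisson BoW likelihood, up to the ln(x!)
    constant, which does not depend on the model parameters."""
    rate = np.asarray(rate, dtype=float)
    return float(np.sum(x * np.log(rate + eps) - rate))
```

With rate = Φ^(1) θ^(1) computed from a reparameterized sample of θ^(1), this term is differentiable with respect to the encoder parameters.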

3.2 LSTM-based Language Generation Model

Inspired by (36) integrating the topics learned from CAE into a two-level LSTM-based paragraph generation framework with the attention mechanism in (1), we construct a paragraph generator with a hierarchy constructed by a paragraph-level LSTM, a sentence-level LSTM, and an attention module, shown in Fig. 1(c). The paragraph-level LSTM first encodes the semantic regions based on all previous words into the paragraph state. Then the attention module selects semantic regions with the guidance of the current paragraph state and semantic topics of the image. Finally, the sentence-level LSTM incorporates the topics, attended image features, and current paragraph state to facilitate word generation.
Paragraph-level LSTM: To generate y_{s,t}, the t-th word of the s-th sentence in a paragraph caption, we set x_t as the input vector of the paragraph-level LSTM. Concatenating the previous output h^{sent}_{t−1} of the sentence-level LSTM, the image feature v̄, and the embedding of the previously generated word, the input is formulated as x_t = [h^{sent}_{t−1}; v̄; W_e y_{s,t−1}], where [;] indicates concatenation and W_e is a word embedding matrix whose dimensions are the embedding size and the vocabulary size of the LM. This input provides the paragraph-level LSTM with the maximum contextual information, capturing both the visual semantics of the image and long-range inter-sentence dependency within a paragraph caption (1). The hidden state of the paragraph-level LSTM is then computed as

h^{par}_t = LSTM_par(x_t, h^{par}_{t−1}),

where the hidden and cell states are carried across sentences to explore inter-sentence dependency.
Attention Module: Given the paragraph state h^{par}_t and the concatenation of the multi-layer topic weight vectors {θ^(l)}_{l=1}^L, denoted as θ, we further build an attention module to select the most information-carrying regions of the visual features for predicting the next word, defined as

e_{t,r} = w_a^T tanh(W_v v_r + W_h h^{par}_t + W_θ θ),
α_t = softmax(e_t),

where α_{t,r} is the r-th element of α_t, and w_a, W_v, W_h, and W_θ are learned parameters. Defined in this way, α_t is a probability vector over all regions in the image. The attended image feature is calculated as v̂_t = Σ_{r=1}^{R} α_{t,r} v_r, providing a natural way to integrate the multi-layer image topics as auxiliary guidance when generating attention.
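The region-scoring step can be sketched as follows, where all weight matrices are hypothetical stand-ins for the learned parameters:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def topic_guided_attention(V, h, theta, Wv, Wh, Wt, wa):
    """Score every region from its feature, the paragraph state, and the
    topic vector, then average the region features by those weights.
    V: (R, Dv) region features; h: (Dh,) state; theta: (Dt,) topics."""
    scores = np.tanh(V @ Wv.T + h @ Wh.T + theta @ Wt.T) @ wa  # (R,)
    alpha = softmax(scores)                                    # sums to 1
    return alpha @ V, alpha                                    # (Dv,), (R,)
```

Because the topic vector enters the score of every region, two images with identical salient regions but different inferred topics can attend to different regions.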
Sentence-level LSTM: The input vector to the sentence-level LSTM at each time step consists of the output of the paragraph-level LSTM concatenated with the attended image feature v̂_t, stated as u_t = [h^{par}_t; v̂_t]. The sentence-level LSTM in turn produces a sequence of hidden states at each of its layers, one for each sentence in the paragraph, stated as

h^{sent,(l)}_t = LSTM^{(l)}_sent(u_t, h^{sent,(l)}_{t−1}),  g^{(l)}_t = f_g(θ^(l), h^{sent,(l)}_t),

where g^{(l)}_t is the coupling vector combining the topic weight vectors and the hidden output of the sentence-level LSTM at each time step t. Following (13), we realize f_g with a gating unit similar to the gated recurrent unit ((5)), described in Appendix A.2. The probability over words in the dictionary can be predicted by taking a linear projection and a softmax operation over the concatenation of g^{(l)}_t across all layers. This method enhances the representation power and, with skip connections from all hidden layers to the output (12), mitigates the vanishing gradient problem. We denote the parameters of the LSTM-based LM as Θ_LM.
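The exact gating unit is given in Appendix A.2; a GRU-style sketch consistent with the description (all weight matrices are hypothetical) looks like:

```python
import numpy as np

def topic_gate(theta, h, Wz, Uz, Wg, Ug):
    """GRU-style gate deciding, per hidden unit, how much topic-derived
    content to mix into the sentence-level LSTM output h."""
    z = 1.0 / (1.0 + np.exp(-(theta @ Wz.T + h @ Uz.T)))  # update gate
    g = np.tanh(theta @ Wg.T + h @ Ug.T)                  # candidate
    return (1.0 - z) * h + z * g                          # coupled vector
```

The gate lets the model fall back to the plain LSTM state when the topic signal is uninformative, rather than forcing topics into every prediction.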
3.3 Transformer-based Language Generation Model

We also explore how to integrate the multi-layer topic information into existing Transformer-based LMs, given the representation power and computational efficiency of pure attention mechanisms. Typically, attention operates on a set of queries Q, keys K, and values V, defined as Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, where Q is a matrix of query vectors, K and V both contain the same number of keys and values, all with the same dimensionality d, and sqrt(d) is a scaling factor. On the basis of the Transformer-based architecture designed for image captioning ((8)), our LM is conceptually divided into an encoder and a decoder module, shown in Fig. 2. The encoder processes region-level image features and devises the relationships between them, and the decoder reads from the output of each encoding layer and the topic information to generate the paragraph caption word by word; both are made of stacks of attentive layers.
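The scaled dot-product operator can be sketched for 2-D query, key, and value matrices:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V with a max-shift for numerical stability."""
    d = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V
```

When every key scores equally (e.g., all-zero keys), the weights are uniform and the output reduces to the mean of the value vectors, a handy sanity check.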
Memory-Augmented Encoder: Denoting the aforementioned set of region features as X for clarity, we adopt a memory-augmented attention operator to encode image regions and their relationships, defined as

M_mem(X) = Attention(W_q X, [W_k X; M_k], [W_v X; M_v]),

where W_q, W_k, and W_v are matrices of learnable weights, a usual practice in the original Transformer, M_k and M_v are additional keys and values implemented as plain learnable memory matrices, and [·;·] denotes concatenation. Following the implementation of (33), the memory-augmented attention can be applied in a multi-head fashion, whose output is fed into a feed-forward layer, denoted as F(·). Both the attention and feed-forward layers are encapsulated within a residual connection and a layer norm operation, denoted as AddNorm(·). For the encoder with N layers, its i-th encoding layer is therefore defined as

X^(i) = AddNorm(F(AddNorm(M_mem(X^(i−1))))),

where X^(0) = X. A stack of N encoding layers produces a multilevel output (X^(1), ..., X^(N)).
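A single-head sketch of the memory-augmented operator, with Mk and Mv playing the role of the learnable memory slots (multi-head splitting and the AddNorm wrapper are omitted):

```python
import numpy as np

def memory_attention(X, Wq, Wk, Wv, Mk, Mv):
    """Self-attention whose keys/values are extended with memory slots,
    letting the encoder attend to learned priors as well as regions."""
    Q = X @ Wq
    K = np.vstack([X @ Wk, Mk])   # (R + M, d): regions plus memory keys
    V = np.vstack([X @ Wv, Mv])   # (R + M, d): regions plus memory values
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V
```

The memory rows receive attention mass exactly like region features do, which is how the operator can encode a priori relationships not present in the current image.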
Meshed Decoder: Given the region encodings (X^(1), ..., X^(N)) and the topic information θ, our decoder focuses on generating the paragraph caption, denoted as Y, word by word. Inspired by (8), we construct a Meshed Attention operator to connect Y to all elements in (X^(1), ..., X^(N)) and θ hierarchically through gated cross-attentions, formulated as

M_mesh(X, Y) = Σ_{i=1}^{N} β_i ⊙ C(X̃^(i), Y),  X̃^(i) = [X^(i); f(θ)],

where we combine the topic information and the hidden output of the memory-augmented encoder at each layer (although other choices are also available), f(θ) is projected from θ into the encoder embedding space, and C(·, ·) stands for the cross-attention, computed using queries from the decoder and keys and values from the encoder and topic information:

C(X̃^(i), Y) = Attention(W_q Y, W_k X̃^(i), W_v X̃^(i)).

By computing β_i = σ(W_i [Y; C(X̃^(i), Y)] + b_i), we can measure the relevance between cross-attention results, where W_i is a learned weight matrix. Similar to the encoding layer, the final structure of each decoding layer is written as

Ỹ = AddNorm(F(AddNorm(M_mesh(X, AddNorm(S_mask(Y)))))),

where S_mask(·) is the masked self-attention used in the traditional Transformer (33), since the prediction of a word should depend only on previously predicted words. After taking a linear projection and a softmax operation over the output of the last decoding layer, our decoder finally predicts a probability over words in the dictionary. Similar to the LSTM-based LM, the Transformer-based LM is also guided by the topic and attended image features when generating the caption; we denote its parameters as Θ_LM.
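The gated, layer-wise mesh can be sketched single-headed as follows; the projection of topics into the encoder space, the per-layer query/key/value projections, and the AddNorm wrappers are omitted, and all weights are hypothetical:

```python
import numpy as np

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def meshed_attention(Y, encoder_outputs, Ws, bs):
    """Cross-attend the decoder state Y to every encoding layer, gate
    each result elementwise, and sum the gated contributions."""
    out = np.zeros_like(Y)
    for Xi, Wi, bi in zip(encoder_outputs, Ws, bs):
        Ci = attention(Y, Xi, Xi)                                   # (T, d)
        beta = 1.0 / (1.0 + np.exp(-(np.concatenate([Y, Ci], axis=-1) @ Wi + bi)))
        out = out + beta * Ci
    return out
```

The sigmoid gate beta depends on both the decoder state and each layer's cross-attention result, so the decoder can weight low- and high-level encoder features differently at every generation step.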

Figure 2: The overview of VTCM-Transformer, where VTCM is the same topic model used in VTCM-LSTM and omitted here.

3.4 Joint Learning

Under the deep topic model described in Section 3.1 and the LSTM-based LM in Section 3.2, the joint likelihood of the target ground-truth paragraph Y of Img and its corresponding BoW count vector x is defined as

p(Y, x | {Φ^(l)}) = ∫ p(Y | {θ^(l)}, v) p(x | Φ^(1) θ^(1)) Π_{l=1}^{L} p(θ^(l) | Φ^(l+1) θ^(l+1)) d{θ^(l)},

which is similar to the likelihood of the topic-guided Transformer-based captioning system, described in Appendix A.3. As discussed in Section 3.1, we introduce a variational topic encoder to learn the multi-layer topic weight vectors in (2) with the image features as input. Thus, a lower bound of the log of (3.4) can be constructed as

L = E_q[ln p(x | Φ^(1) θ^(1))] − Σ_{l=1}^{L} E_q[ln q(θ^(l) | v̄) − ln p(θ^(l) | Φ^(l+1) θ^(l+1))] + E_q[ln p(Y | {θ^(l)}, v)],

which unites the first two terms, primarily responsible for training the topic model component, and the last term, for training the LM component. The parameters Ω of the variational topic encoder and the parameters Θ_LM of the LSTM-based LM can be jointly updated by maximizing L. Besides, the global parameters {Φ^(l)} of the topic decoder can be sampled with TLASGR-MCMC (7). The training strategy is outlined in Algorithm 1.

To sum up, as shown in Fig. 1, the proposed framework couples the topic model (VTCM) with a visual extractor: it takes the visual features of the given image as input and maps them to the hierarchical topic weight vectors. The learned topic vectors at different layers are then used to reconstruct the BoW vector of the given image's paragraph caption and serve as additional features for the LSTM-based LM to generate the paragraph. Moreover, our proposed model introduces randomness into the topic weight vectors, which captures the uncertainty about what is depicted in an image and hence encourages diversity of generation.

  Set the mini-batch size, the number of layers L, and the width of each layer; initialize the topic encoder parameters Ω, the LSTM-based LM parameters Θ_LM, and the topic decoder parameters {Φ^(l)}.
  for iter = 1, 2, ... do
     Randomly select a mini-batch of images and their paragraph captions to form a subset; compute the image features with the visual extractor; draw random noise u from the uniform distribution and sample latent states {θ^(l)} from (3), which are fed into the LSTM with (3.2) and (7); compute the lower bound according to (3.4), and update Ω and Θ_LM;
     Update {Φ^(l)} with TLASGR-MCMC, described in Appendix A.4;
  end for
Algorithm 1 Inference for our proposed VTCM-LSTM.

4 Experiments

We conduct experiments on the public Stanford image-paragraph dataset ((20)), where 14,575 image-paragraph pairs are used for training, 2,487 for validation, and 2,489 for testing. Following the standard evaluation protocol, we employ the full set of captioning metrics: METEOR ((9)), CIDEr ((34)), and BLEU ((28)). Different from the BLEU scores, which primarily measure n-gram precision, METEOR and CIDEr are known to provide more robust evaluations of language generation algorithms ((34)). Thus, in our experiments, the hyper-parameters and model checkpoints are chosen by optimizing the average of the METEOR and CIDEr scores on the validation set.
Implementation Details: Following the publicly available implementations of (1) and (36), we use Faster R-CNN ((30)) with the VGG16 network ((31)) as the visual extractor, which is pre-trained on Visual Genome ((21)). The top detected regions are selected to represent the image features, and each region feature vector is embedded into a lower-dimensional vector before being fed into our topic model. For our LMs, we tokenize words and sentences using Stanford CoreNLP ((24)), lowercase all words, and filter out infrequent words. We set a maximum number of sentences per paragraph and a maximum length per sentence (padded where necessary) for VTCM-LSTM. For our topic model, all the words from the training dataset, excluding stopwords and the most frequent words, are used to obtain a BoW caption for the corresponding image. The hidden sizes of the paragraph-LSTM, sentence-LSTM, and attention module are all set to the same value; for our Transformer-based LM, we likewise fix the dimensionality of each layer, the number of heads, and the number of memory vectors. Both our Transformer-based and LSTM-based LMs are three-layer models, matching the number of layers of the topic model. We use the Adam optimizer ((19)), with separate learning rates for VTCM-LSTM and VTCM-Transformer. The gradients of both models are clipped if the norm of the parameter vector exceeds a threshold. Dropout is adopted in both the input and output layers to avoid overfitting, with a rate of 0.9 for VTCM-Transformer. During inference, we adopt the penalty on trigram repetition proposed by (26) and choose the penalty hyperparameter on the validation set.
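The trigram repetition penalty can be sketched as a rescoring rule applied to candidate continuations during decoding; the penalty weight here is an illustrative assumption, not the paper's value:

```python
from collections import Counter

def trigram_penalty(tokens, logprob, weight=2.0):
    """Down-weight a candidate sequence in proportion to how many of its
    trigrams are repeats, discouraging degenerate loops in paragraphs."""
    tris = Counter(zip(tokens, tokens[1:], tokens[2:]))
    repeats = sum(c - 1 for c in tris.values() if c > 1)
    return logprob - weight * repeats
```

A repetition-free sequence keeps its log-probability unchanged, while a looping one ("a b c a b c a b c ...") is pushed down the beam.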

Baselines: For fair comparison, we consider the following baselines: 1) Image-Flat ((17)), directly decoding a paragraph word-by-word via a single LSTM; 2) Flat-repetition-penalty ((26)), training a non-hierarchical LSTM-based LM with an integrated penalty on trigram repetition to improve diversity in image paragraph captioning; 3) Regions-Hierarchical ((20)), using a hierarchical LSTM to generate a paragraph, sentence by sentence; 4) RTT-GAN ((23)), training Regions-Hierarchical in a GAN setting, coupled with an attention mechanism; 5) TOMS ((25)), generating multiple sentences under topic guidance by training a downstream topic classifier to predict the topics mined by LDA; 6) Diverse-VAE ((3)), leveraging coherence vectors and global topic vectors to generate paragraphs under a VAE framework; 7) IMAP ((37)), introducing an interactive key-value memory-augmented attention into the hierarchical LSTM; 8) LSTM-ATT ((1, 36)), a degraded version of VTCM-LSTM, which models topics via an LSTM instead of our proposed VTCM and adopts the same two-level LSTM architecture with attention; 9) M²-Transformer ((8)), a Transformer-based architecture for single-sentence image captioning and a degraded version of VTCM-Transformer.

Method METEOR CIDEr B-1 B-2 B-3 B-4
Image-Flat ((17)) 12.82 11.06 34.04 19.95 12.20 7.71
Flat-repetition-penalty ((26)) 15.17 22.68 35.68 22.40 14.04 8.70
TOMS ((25)) 18.6 20.8 43.1 25.8 14.3 8.4
Regions-Hierarchical ((20)) 15.95 13.52 41.90 24.11 14.23 8.69
RTT-GAN ((23)) 17.12 16.87 41.99 24.86 14.89 9.03
Diverse-VAE ((3)) 18.62 20.93 42.38 25.52 15.15 9.43
IMAP ((37)) 16.56 20.76 42.38 25.87 15.51 9.42
LSTM-ATT ((1, 36)) 17.40 20.11 40.8 24.75 14.81 8.95
Our VTCM-LSTM 17.52 22.82 42.80 25.50 15.69 9.63
M²-Transformer ((8)) 15.4 16.1 37.5 22.3 13.7 8.4
Our VTCM-Transformer 16.88 26.15 40.93 25.51 15.94 9.96
Human 19.22 28.55 42.88 25.68 15.55 9.66
Table 1: Main results for generating paragraphs. Our models are compared with competing baseline models along six language metrics. The human performance is included for providing a better understanding of all metrics following (20).

4.1 Quantitative Evaluation

Main Results: The results of different models on the Stanford dataset are shown in Table 1, where we only report results of models trained with cross-entropy rather than self-critical sequence training, to eliminate the influence of different training strategies. As can be observed, our proposed VTCM-LSTM surpasses all the other LSTM-based captioning systems in terms of BLEU-4, BLEU-3, and CIDEr, while being competitive on BLEU-1 and BLEU-2 with the best performer and slightly worse on METEOR with respect to RTT-GAN (23). Moreover, on all metrics, our proposed VTCM-LSTM and VTCM-Transformer improve over their corresponding baselines, i.e., LSTM-ATT and Transformer, respectively. These results demonstrate the effectiveness of integrating the semantic topics mined by VTCM into language generation, in terms of both topical semantics and descriptive completeness. Moreover, VTCM-Transformer leads to a performance boost over VTCM-LSTM on almost all metrics, indicating the advantage of the memory-augmented and meshed cross-attention operators in a Transformer-like layer. We also replace our Transformer-based LM with an original Transformer pretrained on a diverse set of unlabeled text, which, however, produces poor performance, suggesting the importance of designing specialized architectures for multi-modal image captioning. Of particular note is the large improvement of both VTCM-LSTM and VTCM-Transformer on CIDEr, which was proposed specifically for image description evaluation and measures n-gram accuracy weighted by term frequency-inverse document frequency (TF-IDF). Interestingly, by bridging the visual features to the textual descriptions, our proposed VTCM is well suited for extracting paragraph-level word co-occurrence patterns into latent topics, which capture the main aspects of the scene and the image descriptions. The assimilation of topic information into the language models thus leads to a large improvement in CIDEr, which correlates well with human judgment. This is often not the case in other image captioning systems unless the CIDEr score is treated as the reward and directly optimized with policy-gradient-based reinforcement learning techniques to finetune the model (36, 8, 26, 37).
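As a concrete illustration of the TF-IDF-weighted n-gram matching that CIDEr (34) is built on, the following is a minimal, self-contained sketch with a toy corpus and unigrams only; the actual metric additionally averages over n = 1..4, clips candidate counts, and applies a length penalty.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, n, doc_freq, num_docs):
    """TF-IDF weights for the n-grams of one sentence."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    return {g: (c / total) * math.log(num_docs / max(doc_freq.get(g, 0), 1))
            for g, c in counts.items()}

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

# Toy corpus: document frequencies computed over reference paragraphs.
refs = [["a", "train", "on", "the", "tracks"],
        ["a", "man", "on", "a", "snowboard"]]
n = 1
df = Counter(g for r in refs for g in set(ngrams(r, n)))

cand = ["a", "train", "on", "the", "tracks"]
score = cosine(tfidf_vector(cand, n, df, len(refs)),
               tfidf_vector(refs[0], n, df, len(refs)))
```

Note how common words such as "a" and "on" receive zero IDF weight here, so the score is driven by content words, which is why CIDEr rewards topical correctness.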
Ablation study: First, we investigate the impact of the number of topic layers on captioning performance. As can be seen in Table 2, our proposed VTCM-LSTM produces the desired improvement as its number of layers increases, showcasing the benefits of extracting multi-layer semantic topics for generating descriptive paragraphs. In addition, to evaluate the effectiveness of our way of integrating the topic information into the LMs, we provide two simple variants of our proposed models, i.e., Topic+LSTM and Topic+Transformer, where the topic information is directly concatenated to the output of the LM ahead of the softmax at each time step, based on our adopted hierarchical LSTM and Transformer. The proposed LSTM-based and Transformer-based models both outperform their corresponding base variants, which clearly indicates the usefulness of our proposed ways of incorporating topic information into the language decoding process.

Method METEOR CIDEr B-1 B-2 B-3 B-4
VTCM-LSTM L=1 16.50 19.10 42.39 25.42 15.41 9.33
VTCM-LSTM L=2 16.66 19.98 42.45 25.39 15.46 9.34
VTCM-LSTM L=3 17.52 22.82 42.80 25.50 15.69 9.63
Topic+LSTM L=3 15.47 18.02 41.80 24.61 14.74 9.10
VTCM-Transformer L=1 15.87 22.71 39.61 22.92 14.21 8.65
VTCM-Transformer L=2 16.31 23.86 40.17 23.74 15.01 9.16
VTCM-Transformer L=3 16.88 26.15 40.93 25.51 15.94 9.96
Topic+Transformer L=3 15.66 23.45 38.77 23.14 14.51 8.87
Table 2: Ablation study on Stanford dataset. Here, B-N is short for BLEU-N.
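The Topic+LSTM / Topic+Transformer baselines above can be sketched in a few lines of numpy (all sizes and weights below are toy stand-ins, not trained parameters): the topic vector is simply concatenated to the LM output before the softmax, so it can only additively shift the logits.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, K = 50, 32, 16          # vocab size, hidden size, topic dims (toy values)

h_t = rng.normal(size=H)      # LM hidden state at step t
theta = rng.random(size=K)    # topic weight vector from the topic encoder
theta /= theta.sum()

W = rng.normal(size=(V, H + K)) * 0.1  # output projection over [h_t; theta]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Topic+* baseline: concatenate topics to the LM output ahead of the softmax.
logits = W @ np.concatenate([h_t, theta])
p_word = softmax(logits)      # next-word distribution conditioned on topics
```

In contrast, our full models inject the topics through gating and attention inside the decoder, which is what the ablation isolates.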
Figure 3: Examples of paragraphs generated by LSTM-ATT, our VTCM-LSTM, Transformer, our VTCM-Transformer, and human-annotated ground-truth paragraphs on the Stanford dataset (better viewed in color; red denotes novel words, blue denotes key words shared by the generated and ground-truth paragraphs).

4.2 Qualitative Evaluation

Generated caption given the image: To qualitatively show the effectiveness of our proposed methods, we show descriptions of different images generated by different methods in Fig. 3. As we can see, all of these paragraph generation models produce paragraphs related to the given images, while our proposed VTCM-LSTM and VTCM-Transformer generate coherent and accurate paragraphs by learning to distill the semantic topics of an image via the VTCM module to guide paragraph generation. The descriptions generated by our models are highly related to the given images in terms of their semantic meaning, but not necessarily through the keywords of the original caption. Taking the first row as an example, our VTCM-Transformer generates a coherent and meaningful paragraph describing the image, while capturing meta-concepts like “steam train” and “driving on” based on scene elements including “train” and “smoke,” which are not even described in the ground truth but are closely related to the whole image. These observations suggest that our VTCM has successfully captured hierarchical semantic topics by matching visual features to descriptive texts with a VAE-like structure, and that our proposed ways of assimilating the topic information into the LSTM or Transformer can successfully guide the paragraph generation. Similar to Fig. 3, we provide more examples of generated captions for test images in Appendix A.5.

Figure 4: Visualization of the learned topics given the test images, where the top words of each topic at layers 3, 2, and 1 are shown in blue, green and yellow boxes respectively. We also present the corresponding ground truth caption for each image, which is not visible at the testing stage.
Figure 5: Generated captions given different topics for an image from Stanford image-paragraph dataset (id = 2349394). The caption generated by VTCM-LSTM properly describes the image with the correct topic information to guide the language generation. To see the influence of topic information, we replace the original topics with two different topic vectors at layer 1, whose indexes are 64 and 13, respectively. The train will be described from different perspectives given different topic vectors.

Learned topics given the image: One benefit of introducing topics learned by VTCM is enhanced model interpretability. To examine whether the topic model can learn the desired topics from the input image, in Fig. 4 we visualize the hierarchical topics learned by our topic model given input images from the test set, where each topic at each layer is shown as a list of representative words in decreasing rank order. It is clear that the extracted multi-layer topics are highly correlated with the chosen image and its corresponding text; in other words, the topic model successfully captures semantically related topics given the image features. Besides, we can see that the topics become more and more specific when moving from the top layer to the bottom layer. Furthermore, we note that the same word appearing in different topics, such as “brown,” is used from multiple perspectives, making it possible to describe the input image from different topic perspectives. Under the guidance of the visual features and the corresponding interpretable hierarchical topics, our model can thus produce a more relevant description for the image.
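The per-layer topic word lists in Fig. 4 are obtained by ranking each topic's weights over the vocabulary; the following is a minimal sketch, where a random matrix stands in for a learned loading matrix whose columns are topics (distributions over words).

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = np.array(["train", "tracks", "smoke", "sky", "brown", "tree",
                  "grass", "man", "snow", "building"])
V, K = len(vocab), 6           # toy vocabulary and number of topics

Phi = rng.random(size=(V, K))
Phi /= Phi.sum(axis=0)         # each column is a topic: a distribution over words

def top_words(Phi, topic, n=3):
    """Representative words of one topic, in decreasing rank order."""
    order = np.argsort(Phi[:, topic])[::-1]
    return list(vocab[order[:n]])

words = top_words(Phi, topic=0)
```

For higher layers, the same ranking is applied after projecting the topics down to the word level through the product of the loading matrices below them.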
Effect of topics on paragraph generation: We hypothesize that the topic information learned from the image visual features guides the language paragraph generation model in describing the images. Moreover, our model supports personalized paragraph generation by manipulating the topic information fed into the LSTM-based or Transformer-based LMs, which can be conveniently achieved by replacing the topics learned from a test image with other designated topics. We can then compare the captions generated conditioned on the topics predicted by VTCM against those conditioned on the distorted topics. Fig. 5 illustrates how the generated captions of our proposed model change with the designated topics. The image shows a train on the tracks, and our proposed model identifies its most related topics as #36, #37, and #26 at layers 1, 2, and 3, respectively. Changing the topics leads to different paragraph captions, where for simplicity we only designate different topics at layer 1 and set the other topics to zero. Clearly, topic #64 at layer 1 is about rooms, and the caption changes to describe the train as sitting in a room. Similarly, topic #13 at layer 1 corresponds to snow and skiing, and the caption changes to describe the train as covered by snow. These observations suggest that the learned topics have a significant semantic impact on caption generation. In addition, it is easy for users to select the topics predicted by VTCM or to designate other topics, since every topic is interpretable via its list of representative words.
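The topic-manipulation experiment of Fig. 5 can be sketched as follows; the layer sizes and the `designate_topic` helper are hypothetical stand-ins used only to illustrate how the learned topic vectors are replaced before being fed to the LM.

```python
import numpy as np

rng = np.random.default_rng(1)
K = [100, 80, 50]                       # topics per layer (hypothetical sizes)

# Topic weight vectors predicted by the variational encoder for a test image.
theta = [rng.random(k) for k in K]
theta = [t / t.sum() for t in theta]

def designate_topic(theta, layer, index):
    """Replace the learned topics with a single designated topic at one
    layer, zeroing everything else (as in the Fig. 5 experiment)."""
    new = [np.zeros_like(t) for t in theta]
    new[layer][index] = 1.0
    return new

theta_room = designate_topic(theta, layer=0, index=64)  # topic #64 at layer 1
theta_snow = designate_topic(theta, layer=0, index=13)  # topic #13 at layer 1
```

Feeding `theta_room` or `theta_snow` to the decoder in place of `theta` yields the "room" and "snow" variants of the caption.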

Figure 6: Different paragraphs generated by our VTCM-LSTM and VTCM-Transformer by sampling different topic information each time, given the same set of images.

Diversity: To verify the capacity of our proposed VTCM-LSTM and VTCM-Transformer to generate diverse paragraphs, we sample two descriptions for the same set of inputs. Specifically, we can generate different captions for the same image by simply sampling different uniform noises, and thus different topic information, following Equation (3). As shown in Fig. 6, our proposed models generate diverse and coherent paragraphs while ensuring that the “big picture” underlying the image does not get lost in the details. The reason is likely that our frameworks feed the multi-stochastic-layer latent topic representation of VTCM to the language generator as a source of randomness. Therefore, benefiting from the assimilation of multi-stochastic-layer topic information into the language generator, our proposed topic-guided image paragraph captioning systems produce diverse outputs even without a specialized diversity module.
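The per-sample randomness comes from the Weibull reparameterization used in WHAI-style topic encoders (40), where a uniform noise is transformed into a topic weight vector; a minimal sketch for a single layer (the parameter values here are random stand-ins for the encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100                                  # topics at one layer (toy size)

# Weibull shape/scale parameters produced by the variational topic encoder
# for an image (random stand-ins for the actual encoder outputs).
k_shape = rng.uniform(0.5, 2.0, size=K)
lam = rng.uniform(0.1, 1.0, size=K)

def sample_topic_weights(k_shape, lam, rng):
    """Reparameterized Weibull draw: theta = lam * (-log(1 - eps))**(1/k),
    with eps ~ Uniform(0, 1); each fresh eps yields a fresh topic vector."""
    eps = rng.uniform(size=k_shape.shape)
    return lam * (-np.log1p(-eps)) ** (1.0 / k_shape)

theta_a = sample_topic_weights(k_shape, lam, rng)  # one caption's topics
theta_b = sample_topic_weights(k_shape, lam, rng)  # a second, distinct draw
```

Because the noise enters through a differentiable transform, the same mechanism supports both training (low-variance gradients) and diverse decoding at test time.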

Figure 7: Example of generated captions by VTCM-LSTM showing attended image regions. Different regions and the corresponding words are shown in the same color.

The attention mechanism in VTCM-LSTM: To evaluate the effectiveness of the attention mechanism in VTCM-LSTM for image captioning, in Fig. 7 we visualize, for each predicted word, the image region with the largest attention weight. As we can see, our proposed VTCM-LSTM can reason about where to focus at different time steps. Taking image 1 as an example, when predicting “person,” the attention module precisely chooses the bounding box covering the main part of the body, while when predicting the word “snowboard,” our model attends to the area surrounding the snowboard. This shows that our proposed VTCM-LSTM captures the alignment between the attended area and the predicted word, which reflects human intuition in object description.
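The visualization simply highlights, for each word, the region with the largest attention weight; a generic additive-attention sketch of that selection (the weight matrices here are random stand-ins, not our trained parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
R, D, H, A = 36, 64, 32, 16     # regions, feature dim, hidden dim, attn dim (toy)

V = rng.normal(size=(R, D))     # region features from the visual extractor
h_t = rng.normal(size=H)        # LSTM hidden state when predicting one word

# Additive attention parameters (random stand-ins for learned weights).
Wv = rng.normal(size=(A, D)) * 0.1
Wh = rng.normal(size=(A, H)) * 0.1
w = rng.normal(size=A) * 0.1

scores = np.tanh(V @ Wv.T + Wh @ h_t) @ w      # one score per region
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                            # attention weights over regions

best_region = int(np.argmax(alpha))             # the box highlighted in Fig. 7
context = alpha @ V                             # attended visual context vector
```

The `context` vector is what feeds the word prediction, while `best_region` indexes the bounding box drawn in the qualitative figure.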

5 Conclusion

We develop a plug-and-play hierarchical-topic-guided image paragraph generation pipeline, which couples a visual extractor with a deep topic model to guide the learning of a language paragraph generation model. As a visual-textual coupling model, the deep topic model captures the correlations between the image and text at multiple levels of abstraction and learns the semantic topics from images. Serving as the bridge between the two modalities, the distilled hierarchical topics are used to guide caption generation in the language model, where we remould LSTM-based and Transformer-based LMs, respectively. Experimental results on the Stanford paragraph dataset show that our proposed models outperform a variety of competing paragraph captioning models, while inferring interpretable hierarchical latent topics and generating semantically coherent paragraphs for the given images.

Appendix A Appendix

a.1 The Variational Topic Encoder of VTCM

Inspired by (40), to approximate the gamma distributed topic weight vector with a Weibull distribution, we take the topic encoder at each layer to be a Weibull distribution, whose shape and scale parameters are deterministically and nonlinearly transformed from the image pooled representation through layer-specific neural networks.

a.2 The gating unit in VTCM-LSTM

Note that the input of the sentence-level LSTM at each layer combines the topic weight vector and the hidden output of the sentence-level LSTM at each time step. To realize this combination, we adopt a gating unit similar to the gated recurrent unit (GRU) (5), defined as




Defining the concatenation of the gated outputs across all layers, together with a weight matrix whose number of rows matches the vocabulary size, the conditional probability of each word becomes


There are two advantages to combining the outputs at all layers for language generation. First, the combination enhances representation power, because different stochastic layers of the deep topic model have different statistical properties. Second, owing to the “skip connections” from all hidden layers to the output one, the number of processing steps between the bottom of the network and the top is reduced, mitigating the “vanishing gradient” problem (12).
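The skip-connection readout described above can be sketched as follows (all sizes are hypothetical): the per-layer gated outputs are concatenated and mapped to the vocabulary by a single weight matrix, so every layer has a direct path to the output.

```python
import numpy as np

rng = np.random.default_rng(3)
V = 50                                  # vocabulary size (toy)
dims = [32, 24, 16]                     # gated output sizes at layers 1..3 (toy)

# Gated topic-aware outputs at each layer (stand-ins for the GRU-style unit).
g = [rng.normal(size=d) for d in dims]

concat = np.concatenate(g)              # "skip connections" from every layer
W = rng.normal(size=(V, sum(dims))) * 0.1

logits = W @ concat
p = np.exp(logits - logits.max())
p /= p.sum()                            # conditional next-word distribution
```

Because the gradient of the loss reaches each layer's output through a single matrix multiply, no layer is separated from the objective by the full depth of the network.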

a.3 Likelihood and Inference of VTCM-Transformer

Given an image Img, we can also represent the paragraph as a flat word sequence, which is suitable for flat language models such as Transformer-based ones. Under the deep topic model (VTCM) and the Transformer-based LM, the joint likelihood of the target ground-truth paragraph of Img and its corresponding BoW count vector is defined as


We introduce a variational topic encoder to learn the multi-layer topic weight vectors with the image features as input. Thus, a lower bound of the log of (A.3) can be constructed as


where the first two terms are primarily responsible for training the topic model component, and the last term for training the Transformer-based LM component. The parameters of the variational topic encoder and the parameters of the Transformer-based LM can be jointly updated by maximizing this lower bound. Besides, the global parameters of the topic decoder can be sampled with the TLASGR-MCMC of (7), presented below. The training strategy of VTCM-Transformer is similar to that of VTCM-LSTM.

Figure 8: Examples for paragraphs generated by LSTM-ATT, our VTCM-LSTM, Transformer, our VTCM-Transformer, and human-annotated Ground Truth paragraphs on Stanford dataset.

a.4 Inference of Global Parameters of VTCM

For scale identifiability and ease of inference and interpretation, a Dirichlet prior is placed on each column of the loading matrix at every layer. To allow for scalable inference, we apply the topic-layer-adaptive stochastic gradient Riemannian (TLASGR) MCMC algorithm described in (7, 40), which can sample simplex-constrained global parameters (7) in a mini-batch based manner. It improves sampling efficiency via the use of the Fisher information matrix (FIM), with adaptive step sizes for the topics at different layers. Here, we discuss how to update the global parameters of VTCM in detail and give the complete procedure in Algorithm 1.
Sample the auxiliary counts: This step is the “upward” pass. For the given mini-batch of the training set, we take the bag-of-words (BoW) count vector of the paragraph for each input image together with the latent features of that image. By transforming standard uniform noises, we can sample the topic weight vectors as


Working upward through the layers, we can propagate the latent counts of each layer upward to the next layer as


where the latent counts at each layer are defined over the vocabulary of VTCM.
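In gamma belief networks (42), this kind of upward propagation of latent counts is commonly realized with Chinese restaurant table (CRT) augmentation; the following is a minimal sketch of a single CRT draw, a generic stand-in for the exact augmentation used here.

```python
import numpy as np

def sample_crt(n, r, rng):
    """Draw l ~ CRT(n, r): the number of tables occupied by n customers in a
    Chinese restaurant process with concentration r, equal to a sum of
    Bernoulli draws with probability r / (r + i - 1) for i = 1..n."""
    if n == 0:
        return 0
    i = np.arange(1, n + 1)
    return int((rng.random(n) < r / (r + i - 1)).sum())

rng = np.random.default_rng(0)
tables = sample_crt(20, 1.5, rng)   # always between 1 and 20 for n = 20
```

Summing such draws over words and topics yields the layer-(l+1) counts from the layer-l counts, which is what makes mini-batch Gibbs-style updates of the loading matrices tractable.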
Sample the hierarchical components: For each column of the loading matrix at each layer, the sampling can be efficiently realized as


where one factor denotes the learning rate at the current iteration and another the ratio of the dataset size to the mini-batch size; the preconditioner is calculated using the estimated FIM; and the remaining quantities come from the augmented latent counts in (23) and the Dirichlet prior on each column, under a simplex constraint. More details about TLASGR-MCMC for our proposed model can be found in Equations (18-19) of (7).

a.5 Additional experimental results on the Stanford image paragraph dataset

To qualitatively show the effectiveness of our proposed methods, we additionally show descriptions of different images generated by different methods in Fig. 8.


  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086. Cited by: §1, §1, §3.2, Table 1, §4.
  • [2] D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022. Cited by: §1, §2.1, §2.2.
  • [3] M. Chatterjee and A. G. Schwing (2018) Diverse and coherent paragraph generation from images. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II, Vol. 11206, pp. 747–763. Cited by: §1, §1, §2.1, Table 1, §4.
  • [4] F. Chen, S. Xie, X. Li, S. Li, J. Tang, and T. Wang (2019) What topics do images say: a neural image captioning model with topic representation. In 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 447–452. Cited by: §1.
  • [5] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Cited by: §A.2, §3.2.
  • [6] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pp. 1724–1734. Cited by: §2.2.
  • [7] Y. Cong, B. Chen, H. Liu, and M. Zhou (2017) Deep latent dirichlet allocation with topic-layer-adaptive stochastic gradient riemannian mcmc. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 864–873. Cited by: §A.3, §A.4, §1, §2.2, §3.1, §3.4.
  • [8] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara (2020) Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10578–10587. Cited by: §1, §2.2, §3.3, §4.1, Table 1, §4.
  • [9] M. J. Denkowski and A. Lavie (2014) Meteor universal: language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, WMT@ACL 2014, June 26-27, 2014, Baltimore, Maryland, USA, pp. 376–380. Cited by: §4.
  • [10] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang (2017) Aligning where to see and what to tell: image captioning with region-based attention and scene-specific contexts. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2321–2334. Cited by: §1, §1, §2.1.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [12] A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In ICASSP, pp. 6645–6649. Cited by: §A.2, §3.2.
  • [13] D. Guo, B. Chen, R. Lu, and M. Zhou (2020) Recurrent hierarchical topic-guided RNN for language generation. In ICML. Cited by: §3.2.
  • [14] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §1, §2.2.
  • [15] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley (2013) Stochastic variational inference. The Journal of Machine Learning Research 14 (1), pp. 1303–1347. Cited by: §2.2.
  • [16] L. Huang, W. Wang, J. Chen, and X. Wei (2019) Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4634–4643. Cited by: §2.2.
  • [17] A. Karpathy and F. Li (2015) Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 3128–3137. Cited by: Table 1, §4.
  • [18] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR. Cited by: §1.
  • [19] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Cited by: §4.
  • [20] J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei (2017) A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 317–325. Cited by: §1, §2.1, Table 1, §4.
  • [21] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123 (1), pp. 32–73. Cited by: §4.
  • [22] G. Li, L. Zhu, P. Liu, and Y. Yang (2019) Entangled transformer for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8928–8937. Cited by: §2.2.
  • [23] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing (2017) Recurrent topic-transition GAN for visual paragraph generation. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 3382–3391. Cited by: §1, §1, §2.1, §4.1, Table 1, §4.
  • [24] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky (2014) The stanford corenlp natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, System Demonstrations, pp. 55–60. Cited by: §4.
  • [25] Y. Mao, C. Zhou, X. Wang, and R. Li (2018) Show and tell more: topic-oriented multi-sentence image captioning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp. 4258–4264. Cited by: §1, §1, §2.1, Table 1, §4.
  • [26] L. Melas-Kyriazi, A. M. Rush, and G. Han (2018) Training for diversity in image paragraph captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 757–761. Cited by: §4.1, Table 1, §4.
  • [27] V. Ordonez, X. Han, P. Kuznetsova, G. Kulkarni, M. Mitchell, K. Yamaguchi, K. Stratos, A. Goyal, J. Dodge, A. Mensch, et al. (2016) Large scale retrieval and generation of image descriptions. International Journal of Computer Vision 119 (1), pp. 46–59. Cited by: §1.
  • [28] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pp. 311–318. Cited by: §4.
  • [29] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §2.2.
  • [30] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 91–99. Cited by: §1, §3, §4.
  • [31] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Cited by: §4.
  • [32] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei (2006) Hierarchical Dirichlet processes. Publications of the American Statistical Association 101 (476), pp. 1566–1581. Cited by: §2.2.
  • [33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: §1, §2.2, §3.3.
  • [34] R. Vedantam, C. L. Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 4566–4575. Cited by: §4.
  • [35] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §1.
  • [36] J. Wang, Y. Pan, T. Yao, J. Tang, and T. Mei (2019) Convolutional auto-encoding of sentence topics for image paragraph generation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 940–946. Cited by: §1, §1, §2.1, §3.2, §4.1, Table 1, §4.
  • [37] C. Xu, Y. Li, C. Li, X. Ao, M. Yang, and J. Tian (2020) Interactive key-value memory-augmented attention for image paragraph captioning. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 3132–3142. Cited by: §4.1, Table 1, §4.
  • [38] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, Vol. 37, pp. 2048–2057. Cited by: §1.
  • [39] N. Yu, X. Hu, B. Song, J. Yang, and J. Zhang (2018) Topic-oriented image captioning based on order-embedding. IEEE Transactions on Image Processing 28 (6), pp. 2743–2754. Cited by: §1, §1, §2.1.
  • [40] H. Zhang, B. Chen, D. Guo, and M. Zhou (2018) WHAI: Weibull hybrid autoencoding inference for deep topic modeling. In ICLR. Cited by: §A.1, §A.4.
  • [41] H. Zhang, B. Chen, L. Tian, Z. Wang, and M. Zhou (2020) Variational hetero-encoder randomized generative adversarial networks for joint image-text modeling. In International Conference on Learning Representations, Cited by: §2.2, §3.1.
  • [42] M. Zhou, Y. Cong, and B. Chen (2016) Augmentable gamma belief networks. Journal of Machine Learning Research 17 (163), pp. 1–44. Cited by: §1, §2.2, §3.1.
  • [43] Z. Zhu, Z. Xue, and Z. Yuan (2018) Topic-guided attention for image captioning. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 2615–2619. Cited by: §1, §1, §2.1.