Curiosity-driven Reinforcement Learning for Diverse Visual Paragraph Generation

08/01/2019 · by Yadan Luo et al., The University of Queensland

Visual paragraph generation aims to automatically describe a given image from different perspectives and organize sentences in a coherent way. In this paper, we address three critical challenges for this task in a reinforcement learning setting: mode collapse, delayed feedback, and the time-consuming warm-up of policy networks. Specifically, we propose a novel Curiosity-driven Reinforcement Learning (CRL) framework to jointly enhance the diversity and accuracy of the generated paragraphs. First, by modeling paragraph captioning as a long-term decision-making process and measuring the prediction uncertainty of state transitions as intrinsic rewards, the model is incentivized to memorize precise but rarely-spotted, context-specific descriptions, rather than being biased towards frequent fragments and generic patterns. Second, since the extrinsic reward from evaluation is only available once the complete paragraph has been generated, we estimate its expected value at each time step with temporal-difference learning, by considering the correlations between successive actions. The estimated extrinsic rewards are then complemented by dense intrinsic rewards produced by the derived curiosity module, in order to encourage the policy to fully explore the action space and find a global optimum. Third, discounted imitation learning is integrated for learning from human demonstrations, without separately performing the time-consuming warm-up in advance. Extensive experiments conducted on the Stanford image-paragraph dataset demonstrate the effectiveness and efficiency of the proposed method, improving the performance by 38.4




1. Introduction

With the rapid growth of multimedia data (Luo et al., 2018; Li et al., 2019), understanding visual content and interpreting it in natural language have become important yet challenging tasks, which could benefit a wide range of real-world applications, such as storytelling (Huang et al., 2016; Wang et al., 2018a; Ravi et al., 2018), poetry creation (Liu et al., 2018b, c; Xu et al., 2018; Yang et al., 2018a) and support for the disabled.

Figure 1. An illustration of the mode collapse issue in paragraph captioning. The baseline model describes two distinct images with generic descriptions and similar patterns (highlighted in blue), whereas the proposed CRL method generates diverse captions that are more specific to the context (highlighted in orange).

While deep learning techniques have made remarkable progress in describing visual content via image captioning (Liu et al., 2018a; Rennie et al., 2017; Zhang et al., 2017; Dai et al., 2017), the obtained results are generally sentence-level, with fewer than twenty words. Such a short description can hardly convey adequate messages about all the subtle objects and relationships, to say nothing of further pursuing coherence and diversity. Consequently, it is more natural to depict images in paragraph format (Yu et al., 2016; Li et al., 2015; Krause et al., 2017; Liang et al., 2017; Chatterjee and Schwing, 2018), which has been investigated recently.

Generally, existing mainstream paragraph captioning models (Krause et al., 2017) follow the encoder-decoder architecture, where the language decoder is fully supervised to maximize the posterior probability of predicting each word, given the previous ground-truth sequence and the image representations extracted by the visual encoder. The word-level cross-entropy objective, in this way, encourages the use of exactly the same n-grams that appear in the ground-truth samples, making the paragraph captions lack completeness and consistency. Motivated to achieve more diverse and natural descriptions, an emerging line of work (Liang et al., 2017; Dai et al., 2017; Chatterjee and Schwing, 2018) combines supervised learning with generative adversarial models (Liang et al., 2017; Dai et al., 2017) or auto-encoders (Chatterjee and Schwing, 2018), aiming to capture the inherent ambiguity of captions with a low-dimensional Gaussian manifold and to model the structure of paragraphs hierarchically. Nevertheless, existing paragraph captioning approaches are far from optimal due to two major issues, i.e., mode collapse and exposure bias. First, the simple Gaussianity assumption is not sufficient to fully preserve the ground-truth distribution, many modes of which are underrepresented or missing. For instance, given two distinct pictures of baseball games as Figure 1 shows, the paragraphs generated by the baseline model only describe objects and actions with vague and general words (e.g., “two men”, “standing”) rather than specific noun entities (e.g., “baseball player”, “catcher”) or vivid verbs (e.g., “squatting”). Second, the language decoder conditions on different inputs during training and testing, i.e., the ground-truth sub-sequences during training but its own predictions during testing. This discrepancy, known as exposure bias, severely hurts the model performance.

Recently, another line of work tackles the exposure bias and takes advantage of non-differentiable evaluation feedback by applying reinforcement learning, especially the REINFORCE (Williams, 1992) algorithm, to the sentence-level captioning task (Liu et al., 2018a; Zhang et al., 2017; Rennie et al., 2017; Liu et al., 2017; Chen et al., 2018). This strategy reformulates image captioning as a sequential decision-making process, where the language policy is directly optimized based on its previous decisions. Besides, optimizing sentence-level evaluation metrics such as BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015) instead of the cross-entropy loss encourages long-term vision and less greedy behaviors.

Unfortunately, it is challenging to extend this success to the paragraph captioning task: (1) Mode Collapse: optimizing evaluation metrics still does not alleviate the fixed-pattern issue, where the strategy can easily be tricked by repetition of frequent phrases and generic representations, yielding less variety of expressions; (2) Delayed Feedback: the language policy only receives feedback from the evaluator when the entire sequence has been produced, resulting in high training variance, especially with long sequence data like paragraphs; (3) Time-consuming Warm-up: current reinforcement learning suffers from low sample efficiency, which causes an unbearable time and computational cost for trial and error, so it usually requires long-term supervised pre-training for policy warm-up.

To address the above-mentioned issues, we propose in this paper a novel Curiosity-driven Reinforcement Learning (CRL) framework for diverse visual paragraph generation. First, we design an intrinsic reward function (the curiosity module) that encourages the policy to explore uncertain behaviors and visit unfamiliar states, thereby increasing expression variety rather than pursuing local phrase matching. In particular, the curiosity module consists of two self-supervised sub-networks on top of the language policy network, i.e., the Action Prediction Network (AP-Net) and the State Prediction Network (SP-Net). The SP-Net measures the state prediction error at each time step as a dense intrinsic reward, which complements the delayed extrinsic reward. Different from conventional reinforcement learning (Rennie et al., 2017; Liu et al., 2018a) that simply averages the extrinsic reward over states, we further adopt temporal-difference learning (Sutton et al., 2017) to correct the extrinsic reward estimation, considering the correlations of successive actions. Lastly, to avoid time-consuming warm-up of the policy network, our algorithm seamlessly integrates discounted imitation learning to stabilize learning for fast convergence, and gradually weakens the supervision signal without shrinking the action space.

Overall, our contributions can be briefly summarized as follows:

  • To the best of our knowledge, this is the first attempt to tackle the visual paragraph generation problem with pure reinforcement learning. Different from the conventional REINFORCE algorithm, our CRL framework motivates visual reasoning and language decoding both intrinsically and extrinsically.

  • The intrinsic curiosity complements sparse and delayed extrinsic rewards with prediction error measurement, guiding the agent to fully explore and achieve a better policy.

  • Instead of pre-training policy networks with supervised learning, we jointly stabilize reinforcement learning with discounted imitation learning for fast convergence.

  • We show the effectiveness of the proposed strategy through extensive experiments on the Stanford paragraph captioning benchmark and demonstrate the diversity of the generated paragraphs with a visualization of semantic network graphs.

2. Related Work

2.1. Sentence-level Captioning with Reinforcement Learning

Inspired by the recent advances in reinforcement learning, several attempts have been made to apply policy gradient algorithms to the image captioning task (Zhang et al., 2019; Bin et al., 2017; Yang et al., 2018b), which can generally be categorized into two groups: policy based and actor-critic based. Policy based methods (e.g., DISC (Dai et al., 2017), SCST (Rennie et al., 2017), PG-SPIDEr (Liu et al., 2017), CAVP (Liu et al., 2018a), TD (Chen et al., 2018)) utilize the unbiased REINFORCE (Williams, 1992) algorithm, which optimizes the gradient of the expected reward by sampling a complete sequence from the model during training. To suppress the high variance of Monte-Carlo sampling, Self-critical Sequence Training (SCST) (Rennie et al., 2017) subtracts a baseline from the return to reduce the variance of the gradient estimation. Rather than obtaining a single reward at the end of sampling, actor-critic based algorithms (e.g., Embedded Reward (Ren et al., 2017), Actor-Critic (Zhang et al., 2017), Adapt (Chen et al., 2017), HAL (Wang et al., 2018b)) learn both a policy and a state-value function (the “critic”), which is used for bootstrapping, i.e., updating a state estimate from subsequent estimates, to reduce variance and accelerate learning (Sutton et al., 2017). Different from existing work, the proposed CRL algorithm learns a critic from the inner environment, complementing the extrinsic reward from the perspective of agent learning.

2.2. Paragraph-level Captioning

While sentence-level captioning has been extensively studied, the problem of generating paragraph-level descriptions still remains under-explored. Existing solutions include (1) generating sentences individually with detected region proposals (DenseCap (Johnson et al., 2016)), or with topic learning via the Latent Dirichlet Allocation (TOMS (Mao et al., 2018)); (2) preserving semantic content and linguistic order with hierarchical structure (Region-Hierarchical (Krause et al., 2017)). To further encourage the coherence and naturalness among successive sentences, this model was further extended by Liang et al. (Liang et al., 2017) and Dai et al. (Dai et al., 2017) by adopting adversarial learning. To tackle the training difficulties of Generative Adversarial Networks (GANs), Chatterjee et al. (Chatterjee and Schwing, 2018) modeled the inherent ambiguity of paragraphs via a variational auto-encoder formulation. From another perspective, Wang et al. (Wang et al., 2018c) leveraged depth estimation to discriminate objects at various depths and capture subtle interactions.

2.3. Intrinsically Motivated Reinforcement Learning

In reinforcement learning area, much theoretical work has been done on improving agent exploration and shaping sparse rewards via intrinsic motivation (Singh et al., 2004), where an information-theoretic critic measures the agent’s surprisal (Achiam and Sastry, 2017; Pathak et al., 2017; Burda et al., 2018a, b) (based on prediction error) or state novelty (Bellemare et al., 2016; Lopes et al., 2012; Tang et al., 2017; Houthooft et al., 2016) (based on counts of visited states), motivating the agent from the inner environment. Our proposed algorithm shares the same spirit with the former group. But instead of testing on simulated games, we, for the first time, adapt the intrinsic reward and validate its effectiveness and efficiency on a more practical task, i.e., paragraph captioning.

Figure 2. The general flowchart of the proposed image paragraph captioning model.

3. Curiosity-driven Learning

The overview of the proposed paragraph captioning framework is illustrated in Figure 2. We first formulate the task of visual paragraph generation, followed by an introduction of the language policy network and policy learning. To enhance agent exploration, two sub-networks of the curiosity module are trained in a self-supervised manner: the state prediction network (SP-Net) and the action prediction network (AP-Net). Then, we explain the detailed reward calculation with temporal-difference learning, discounted imitation learning, and the collaborative optimization of all objectives.

3.1. Problem Formulation

Our target is to generate a paragraph caption for any given image, where the caption is a word sequence over a fixed vocabulary. The proposed framework follows the general encoder-decoder structure (see Figure 2), where visual features for local regions are extracted by the Faster R-CNN (Ren et al., 2015) at the encoding stage. Language sequences are decoded by Recurrent Neural Networks (RNNs) step by step. Different from traditional settings that directly force the language decoder to mimic ground truths by using the cross-entropy loss, our work casts the problem as reinforcement learning in order to optimize non-differentiable evaluation metrics (e.g., CIDEr) and suppress the exposure bias. Playing the role of the “policy” in a finite Markov decision process (MDP), the language decoder predicts the next word (the “action”) at each time step based on the hidden “state”. To suppress the sparsity and delay issues of the extrinsic reward, the derived curiosity critic predicts the expected “intrinsic reward” at each time step, which greatly complements and shapes the final reward.

3.2. Visual-Language Policy Network

To adaptively control visual signals and generate context-aware descriptions, we adopt a double-layer LSTM structure coupled with an attention mechanism as the policy network (see Figure 2). The first LSTM layer serves as a top-down visual attention model, taking as input the embedding of the current word, concatenated with the mean-pooled image features and the previous hidden state of the language LSTM at each time step. The state transition for the visual LSTM is,

    h^1_t = LSTM_1([h^2_{t-1}; v̄; W_e Π_t], h^1_{t-1}),

where W_e and the LSTM parameters are the learnable weights and Π_t is the one-hot embedding of the input word at time step t. To further attend to local visual features based on the language policy, the weighted visual features can be calculated as,

    a_{i,t} = w_a^T tanh(W_v v_i + W_h h^1_t),   α_t = softmax(a_t),   v̂_t = Σ_i α_{i,t} v_i,

where w_a, W_v and W_h are the weights to be learned. Obtaining the weighted visual features v̂_t and the hidden state h^1_t from the attention LSTM, the top-level language LSTM gives the conditional distribution for the next word prediction, i.e.,

    p(y_t | y_{1:t-1}) = softmax(W_p h^2_t),   with   h^2_t = LSTM_2([v̂_t; h^1_t], h^2_{t-1}),

where W_p is the learnable weight. To encourage the agent to explore rarely attended areas and words, we concatenate the hidden states of the two layers, which will be used in policy learning and discounted imitation learning. The distribution over the complete paragraph is the product of the per-step conditional distributions,

    p(y_{1:T} | I) = ∏_{t=1}^{T} p(y_t | y_{1:t-1}, I).
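As a concrete sketch of this two-layer policy network, the following PyTorch module wires up the attention LSTM, the additive attention, and the language LSTM. All layer names and default sizes here are illustrative assumptions, not taken from the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpDownPolicy(nn.Module):
    """Minimal sketch of the double-layer (up-down) visual-language policy."""
    def __init__(self, vocab_size, feat_dim=2048, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid)
        # First LSTM: top-down visual attention model. Its input concatenates
        # the word embedding, mean-pooled image features, and the previous
        # hidden state of the language LSTM.
        self.att_lstm = nn.LSTMCell(hid + feat_dim + hid, hid)
        # Additive attention over region features.
        self.att_v = nn.Linear(feat_dim, hid)
        self.att_h = nn.Linear(hid, hid)
        self.att_a = nn.Linear(hid, 1)
        # Second LSTM: the language model, fed the attended visual feature.
        self.lang_lstm = nn.LSTMCell(feat_dim + hid, hid)
        self.logit = nn.Linear(hid, vocab_size)

    def forward(self, word, feats, state):
        # word: (B,) token ids; feats: (B, R, feat_dim) region features
        (h1, c1), (h2, c2) = state
        x = torch.cat([self.embed(word), feats.mean(1), h2], dim=1)
        h1, c1 = self.att_lstm(x, (h1, c1))
        # Attention weights over the R regions.
        a = self.att_a(torch.tanh(self.att_v(feats) + self.att_h(h1).unsqueeze(1)))
        alpha = F.softmax(a, dim=1)                    # (B, R, 1)
        v_hat = (alpha * feats).sum(1)                 # weighted visual feature
        h2, c2 = self.lang_lstm(torch.cat([v_hat, h1], dim=1), (h2, c2))
        logp = F.log_softmax(self.logit(h2), dim=1)    # log p(y_t | y_<t, I)
        return logp, ((h1, c1), (h2, c2))
```

At each decoding step the returned log-distribution is either sampled from (during RL training) or maximized (greedy/beam search at inference).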
3.3. Policy Learning

In reinforcement learning, the policy network leverages the experiences obtained from interacting with an environment to learn behaviors that maximize a reward signal. Generally, the RL loss can be presented as,

    L_RL(θ) = -E_{a_t ~ π_θ} [ Σ_t A(s_t, a_t) ],   A(s_t, a_t) = Q(s_t, a_t) - V(s_t),

where A(s_t, a_t) is the advantage function, Q(s_t, a_t) stands for the state-action function estimating the long-term value instead of the instantaneous reward, and V(s_t) indicates the state value function, which serves as the inner critic. The core idea is to incentivize the policy to increase the probability of actions that are correct and rarely seen. Here s_t denotes the concatenated hidden states of the policy network. Based on the policy gradient theorem (Sutton et al., 2017), the gradient of the non-differentiable reward-based loss function can be derived as,

    ∇_θ L_RL(θ) ≈ -E_{a_t ~ π_θ} [ Σ_t A(s_t, a_t) ∇_θ log π_θ(a_t | s_t) ].
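In code, this surrogate loss is a weighted sum of the sampled words' log-probabilities, with the advantages treated as constants. A minimal sketch (function name ours):

```python
import torch

def reinforce_loss(logprobs, advantages):
    """REINFORCE-style surrogate: -sum_t A_t * log pi(a_t | s_t).

    logprobs:   (B, T) log-probabilities of the sampled words
    advantages: (B, T) advantage estimates, detached so that gradients
                flow only through the policy's log-probabilities.
    """
    return -(advantages.detach() * logprobs).sum(dim=1).mean()
```

Back-propagating through this loss reproduces the policy gradient above, since d/dθ of each term is -A_t ∇_θ log π_θ(a_t | s_t).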
3.4. Self-supervised State Prediction (SP-Net)

Before estimating the advantage function, we first detail the two sub-networks behind the state value function V(s_t). The SP-Net is trained to predict the future state embedding φ(s_{t+1}) based on the input action a_t and the current embedding φ(s_t), where φ(·) indicates the state embedding layer and helps filter irrelevant memory for the prediction of the next state. The mean-squared error is used as the objective function for SP-Net,

    L_SP(θ_SP) = (1/2) || f_SP(φ(s_t), a_t; θ_SP) - φ(s_{t+1}) ||²_2,

where f_SP denotes the nonlinear transformation of SP-Net parameterized by θ_SP. In this way, the state value function can be obtained as,

    V(s_t) = (η/2) || f_SP(φ(s_t), a_t; θ_SP) - φ(s_{t+1}) ||²_2,

where η is the hyper-parameter. The prediction error quantifies the agent's uncertainty about the environment. A policy network trained to maximize the state prediction error will explore transitions with less experience and high confusion, so rarely attended areas and infrequent expressions can be well captured.
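A minimal PyTorch sketch of such a forward model follows; the layer sizes and class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SPNet(nn.Module):
    """Forward model: predict the next state embedding from (state, action).

    The per-sample squared prediction error plays two roles: averaged, it
    is the SP-Net training loss; detached and scaled by eta, it acts as
    the intrinsic reward / inner critic.
    """
    def __init__(self, state_dim, vocab_size, emb=512):
        super().__init__()
        self.act_embed = nn.Embedding(vocab_size, emb)
        self.net = nn.Sequential(
            nn.Linear(state_dim + emb, emb),
            nn.LeakyReLU(),
            nn.Linear(emb, state_dim),
        )

    def forward(self, state, action, next_state):
        pred = self.net(torch.cat([state, self.act_embed(action)], dim=1))
        return 0.5 * (pred - next_state).pow(2).sum(dim=1)  # per-sample error
```

During training, `err.mean()` would be minimized as the SP-Net loss, while `eta * err.detach()` is fed back to the policy as the dense intrinsic signal.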

3.5. Self-supervised Action Prediction (AP-Net)

Given the transition tuple (s_t, a_t, s_{t+1}), the action prediction network targets predicting the action a_t based on the state transition. The objective of AP-Net can be defined as,

    L_AP(θ_AP) = -Σ_t a_t log â_t,   â_t = f_AP(φ(s_t), φ(s_{t+1}); θ_AP),

where â_t is the prediction of the current action, given as a softmax distribution over all possible words, a_t is the real distribution of the action, and f_AP denotes the nonlinear transformation of AP-Net parameterized by θ_AP. The intuition of AP-Net is to learn a state embedding that corresponds to meaningful patterns of human writing behaviors, suppressing the impact of outliers.
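This inverse model can be sketched as follows in PyTorch; the shared embedding `phi` mirrors the one-FC-plus-LeakyReLU state embedding described in the implementation details, while the classifier head and sizes are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class APNet(nn.Module):
    """Inverse model: predict which word (action) moved s_t to s_{t+1}."""
    def __init__(self, state_dim, vocab_size, emb=512):
        super().__init__()
        # Shared state-embedding layer phi(.): one FC + LeakyReLU.
        self.phi = nn.Sequential(nn.Linear(state_dim, emb), nn.LeakyReLU())
        self.cls = nn.Linear(2 * emb, vocab_size)

    def forward(self, state, next_state, action):
        z = torch.cat([self.phi(state), self.phi(next_state)], dim=1)
        return F.cross_entropy(self.cls(z), action)  # L_AP
```

Because the embedding must retain whatever distinguishes one chosen word from another, it learns to keep decision-relevant structure and discard noise.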

Figure 3. An illustration of the curiosity module. The blue and green dotted lines denote the loss calculations for the SP-Net and the AP-Net, respectively. The orange line shows the calculation of the intrinsic reward.

3.6. Reward Shaping

To encourage the agent to explore its environment to acquire new knowledge, and to guide it to generate accurate and diverse paragraphs, the overall reward consists of two parts: a dense intrinsic curiosity reward and a sparse extrinsic reward. The policy network is expected to maximize the weighted sum of the two rewards.

3.6.1. Extrinsic Reward

To improve the fidelity and interpretability of the learned paragraph, the extrinsic reward is defined as a linear combination of linguistic measures. Specifically, we select the most representative and commonly used metrics, i.e., BLEU-4 (Papineni et al., 2002) and CIDEr (Vedantam et al., 2015),

    r^e = β₁ · BLEU-4 + β₂ · CIDEr.

In our case, the hyper-parameters β₁ and β₂ are set empirically. Under such a reward setting, we adopt temporal-difference learning TD(λ) (Sutton et al., 2017) to estimate the action-state function for each time step, where the n-step expected return is defined as the sum of expected future rewards over the next n steps. The parameter λ trades off the future estimation against the current estimation, and the discount factor γ enables variance reduction by down-weighting extrinsic rewards. For simplicity, we set λ to 1, under which TD(λ) reduces to the Monte-Carlo return, and the overall function can be formulated as,

    Q(s_t, a_t) = Σ_{k=0}^{T-t-1} γ^k r^e_{t+k},

which, since the evaluator emits a non-zero reward only at the final step, equals the terminal reward discounted back to step t.
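Concretely, the reward combination and the λ = 1 return estimate can be sketched in a few lines of plain Python (the two combination weights are elided in the text, so they are left as parameters here):

```python
def extrinsic_reward(bleu4, cider, beta1, beta2):
    # Linear combination of linguistic measures; the paper sets the two
    # weights empirically (their exact values are elided in the text).
    return beta1 * bleu4 + beta2 * cider

def extrinsic_returns(final_reward, T, gamma=0.9):
    """With lambda = 1, TD(lambda) collapses to the Monte-Carlo return.
    The evaluator scores only the finished paragraph, so the return at
    step t is the terminal reward discounted back: gamma^(T-1-t) * r_e."""
    return [gamma ** (T - 1 - t) * final_reward for t in range(T)]
```

For example, with `gamma=0.5` and a terminal reward of 1.0 over three steps, the per-step returns are `[0.25, 0.5, 1.0]`: earlier actions receive a smaller, discounted share of the delayed feedback.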
3.6.2. Intrinsic Reward

As discussed in Section 3.4, we train a state prediction network and use its prediction error as the intrinsic reward (see Equation (8)). Therefore, the gradient of the policy network can be rewritten as,


3.7. Discounted Imitation Learning

A major challenge for a reinforced agent to converge well is that it must start with a good policy at the beginning stage. The low sample-efficiency issue (Yu, 2018) incurs a huge amount of time and computational cost for trial and error. Existing sentence-level captioning methods with reinforcement learning (Rennie et al., 2017; Liu et al., 2018a, 2017; Zhang et al., 2017) apply the cross-entropy loss to the language decoder for warm-up, which is defined as,

    L_XE(θ) = -Σ_{t=1}^{T} log π_θ(y*_t | y*_{1:t-1}),

where y*_{1:T} is the human-labeled ground truth. Although supervised learning is essential to initialize the policy network, it usually consumes a long period of time (e.g., 40 epochs on the Stanford dataset) and highly restricts the agent's search space, which probably leads to a local minimum. Therefore, we apply discounted imitation learning from the first epoch of training, then gradually decrease the loss coefficient to weaken the supervision.
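The decaying supervision weight is simple to state explicitly; assuming a starting weight of 1.0 and the ten-percent-per-epoch decay mentioned in Section 3.8:

```python
def imitation_weight(epoch, mu0=1.0, decay=0.9):
    """Coefficient on the cross-entropy (imitation) loss: full supervision
    at epoch 0, then a ten-percent decay every epoch, so the ground-truth
    signal fades without ever shrinking the action space."""
    return mu0 * decay ** epoch
```

After two epochs the weight is 0.81; by epoch 20 it is below 0.13, leaving the policy gradient and curiosity terms to dominate.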

3.8. Collaborative Optimization

To collaboratively optimize the objectives of reinforcement learning, the curiosity module and discounted imitation learning, the overall loss function can be formulated as,

    L = L_RL + λ₁ L_SP + λ₂ L_AP + μ L_XE,

where λ₁ and λ₂ are constant loss coefficients and μ is a dynamic scaling factor that decays by ten percent every epoch. Notably, we dynamically estimate the intrinsic reward of agent behavior to shape the reward signal, which avoids an additional baseline calculation in the advantage function. The overall procedure is shown in Algorithm 1.
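As a worked example, the joint objective can be combined as below. Which constant coefficient multiplies which curiosity loss is our assumption; the text states only that the two constants are fixed (at 0.2 and 0.8 in Section 4.3.3) and that the imitation weight decays by ten percent per epoch.

```python
def crl_loss(l_rl, l_sp, l_ap, l_xe, epoch, lam1=0.2, lam2=0.8, decay=0.9):
    # Joint objective: RL surrogate + curiosity losses + decaying imitation.
    # The lam1/lam2 assignment to SP-Net/AP-Net is an assumption.
    return l_rl + lam1 * l_sp + lam2 * l_ap + (decay ** epoch) * l_xe
```

At epoch 0 with all component losses equal to 1.0, the combined loss is 1 + 0.2 + 0.8 + 1 = 3.0; as training proceeds, only the imitation term shrinks.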

Input: Annotated image-paragraph set; visual-language policy network; hyper-parameters; visual features extracted from regions; minibatch size and learning rate.
1: for each of K epochs do
2:     for T time steps do
3:         Sample an action based on the current policy;
4:     end for
5:     Calculate the intrinsic rewards and the state-value function as in Equation (8);
6:     Calculate the extrinsic rewards and the action-state function as in Equation (12);
7:     Update the parameters of the policy network, SP-Net and AP-Net by descending their stochastic gradients;
8:     Update the dynamic factor of the imitation loss;
9: end for
Algorithm 1 Pseudo-code of the proposed CRL learning.

4. Experiments

4.1. Settings

4.1.1. Dataset

All state-of-the-art methods and our proposed method are evaluated on the Stanford image-paragraph dataset (Krause et al., 2017), where 14,579 image-paragraph pairs from the Visual Genome and MS COCO datasets are used for training, 2,490 for validation and 2,492 for testing. The vocabulary contains 12,186 unique words. All images are annotated with human-labeled paragraphs of 67.5 words on average.

4.1.2. Evaluation Metrics

We report the performance of all models on six widely used automatic evaluation metrics, i.e., BLEU-{1,2,3,4} (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015). BLEU-n is defined as the geometric mean of n-gram precision scores, and CIDEr measures n-gram accuracy weighted by term frequency-inverse document frequency (TF-IDF). METEOR is defined as the harmonic mean of precision and recall of exact, stem, synonym, and paraphrase matches between paragraphs.
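To make the BLEU-n definition concrete, here is a toy single-reference implementation (geometric mean of clipped n-gram precisions with a brevity penalty); real evaluations use the standard corpus-level scripts, so this is only illustrative.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(cand, ref, N=4):
    """Toy sentence-level BLEU-N against a single reference."""
    precisions = []
    for n in range(1, N + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c[g], r[g]) for g in c)   # clipped n-gram matches
        precisions.append(overlap / max(1, sum(c.values())))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else exp(1 - len(ref) / max(1, len(cand)))
    return bp * exp(sum(log(p) for p in precisions) / N)
```

A candidate identical to its reference scores 1.0; a candidate sharing no 4-grams with the reference scores 0.0, which is why BLEU-4 is sensitive to longer phrase matches than BLEU-1.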

Methods Language Decoder Beam Search METEOR CIDEr BLEU-1 BLEU-2 BLEU-3 BLEU-4
Sentence-Concat (Neuraltalk(Karpathy and Li, 2015)) 1*LSTM 2-beam 12.05 6.82 31.11 15.10 7.56 3.98
Sentence-Concat (NIC(Vinyals et al., 2015)) 1*LSTM 2-beam 9.27 7.09 22.31 10.72 4.91 2.32
Image-Flat (NIC(Vinyals et al., 2015)) 1*LSTM 2-beam 13.44 14.71 34.80 19.42 10.91 6.03
DAM-Att ((Wang et al., 2018c)) 2*LSTM Greedy 13.91 17.32 35.02 20.24 11.68 6.57
TOMS ((Mao et al., 2018)) 1*LSTM 3-beam 18.60 20.80 43.10 25.80 14.30 8.40
Region-Hierarchical ((Krause et al., 2017)) 2*LSTM 2-beam 13.85 10.64 35.58 17.94 9.08 4.49
RTT-GAN ((Liang et al., 2017)) 2*LSTM 2-beam 17.12 16.87 41.99 24.86 14.89 9.03
VAE ((Chatterjee and Schwing, 2018)) 2*GRU Greedy 18.62 20.93 42.38 25.52 15.15 9.43
SCST ((Rennie et al., 2017)) 1*LSTM Greedy 16.01 22.74 40.89 23.71 14.43 8.38
CRL 1*LSTM Greedy 17.71 25.03 43.10 26.93 16.65 9.91
CRL 1*LSTM 2-beam 17.42 31.47 43.12 27.03 16.72 9.95
Humans (as in (Krause et al., 2017)) - - 19.22 28.55 42.88 25.68 15.55 9.66
Table 1. Performance comparisons using BLEU-{1,2,3,4}, METEOR and CIDEr on the Stanford image-paragraph dataset. Human performance is provided for reference. In our setting, the proposed CRL only optimizes CIDEr and BLEU-4, and achieves the highest scores compared with the state-of-the-art.

4.2. Baselines

We compare our approach with several state-of-the-art paragraph captioning methods and one RL-based method.

Sentence-Concat: Two sentence-level captioning models (Neuraltalk (Karpathy and Li, 2015) and NIC (Vinyals et al., 2015)) pre-trained on the MS COCO dataset are adopted to predict five sentences for each given image, which are then concatenated into a paragraph.

Image-Flat: Different from the sentence-concat group, the Image-Flat (Vinyals et al., 2015) method directly generates a paragraph word by word, with the ResNet-152 network (He et al., 2016) for visual encoding and a single LSTM layer for language decoding. DAM-Att (Wang et al., 2018c) couples the encoder-decoder architecture with an attention mechanism, and additionally introduces depth information to enhance recognition of spatial object-object relationships. TOMS (Mao et al., 2018) learns topic transitions among multiple sentences with Latent Dirichlet Allocation (LDA).

Hierarchical: Region-Hierarchical (Krause et al., 2017) leverages a hierarchical recurrent network to learn sentence topic transition and decode language sentence by sentence. RTT-GAN (Liang et al., 2017) implements the hierarchical learning in a GAN (Goodfellow et al., 2014) setting, where the generator mimics the human-annotated paragraphs and tries to fool the discriminator. VAE (Chatterjee and Schwing, 2018) models the paragraph distribution with variational auto-encoder (Kingma and Welling, 2013), which preserves the coherence and global topics of paragraphs. Notably, Liang et al. took advantage of the local phrases that are predicted by the dense-captioning model (Johnson et al., 2016), which additionally used training data from the MS-COCO dataset.

REINFORCE: For a fair comparison, we compare the proposed framework with the RL-based image captioning method SCST (Rennie et al., 2017). The model shares the same backbone encoder-decoder structure but uses a different reinforcement learning strategy and reward functions. As it requires supervised warm-up for the policy network, we pre-train SCST with the cross-entropy (XE) objective using the ADAM optimizer (Kingma and Ba, 2015).

Methods | Policy | ResNet Features (METEOR, CIDEr, BLEU-1, BLEU-2, BLEU-3, BLEU-4) | Region Features (METEOR, CIDEr, BLEU-1, BLEU-2, BLEU-3, BLEU-4)
CRL w/o RL FC 13.10 11.18 36.19 17.67 8.41 3.93 13.33 11.83 37.01 18.32 8.77 4.35
Att 13.77 12.34 37.40 20.94 11.48 6.10 13.35 12.23 36.50 19.11 9.28 4.39
Up-Down 14.02 11.46 37.68 19.17 9.34 4.34 14.28 14.10 38.07 20.42 10.57 5.26
CRL w/o intrinsic FC 15.12 18.31 39.38 21.81 11.84 6.23 15.67 20.14 40.98 23.04 13.44 7.58
Att 15.89 19.07 41.32 24.72 14.04 7.96 15.92 19.12 40.31 24.82 14.36 8.27
Up-Down 15.91 20.45 41.41 24.77 14.40 8.14 16.01 22.74 40.89 23.71 14.43 8.38
CRL FC 15.53 20.31 39.68 22.66 12.69 6.89 15.87 21.13 40.98 24.30 14.12 7.96
Att 16.13 19.67 41.17 24.18 14.92 8.92 16.11 19.21 41.17 25.00 15.03 8.82
Up-Down 16.71 24.99 41.88 25.24 15.25 9.03 17.71 25.03 43.10 26.93 16.65 9.91
Table 2. Ablative performance comparisons on the Stanford image-paragraph dataset. “w/o” indicates “without”. The best performances are shown in boldface.

4.3. Implementation Details

Our source code is based on PyTorch (Paszke et al., 2017), and all experiments are conducted on a server with two GeForce GTX 1080 Ti GPUs.

4.3.1. Data Pre-processing

For textual pre-processing, we first tokenize all annotated paragraphs and replace words that appear fewer than five times with the unknown token <unk> in the vocabulary. For ResNet feature extraction, we encode each image with ResNet-101 (He et al., 2016) into a 2048-D vector, while for Region Features we select the top salient regions with Faster R-CNN (Ren et al., 2015).
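The vocabulary-building step above amounts to a frequency cutoff; a minimal sketch (function names ours, whitespace tokenization standing in for the real tokenizer):

```python
from collections import Counter

def build_vocab(paragraphs, min_count=5):
    """Words appearing fewer than min_count times map to <unk>."""
    counts = Counter(w for p in paragraphs for w in p.split())
    vocab = {"<unk>": 0}
    for w, c in counts.items():
        if c >= min_count:
            vocab[w] = len(vocab)
    return vocab

def encode(paragraph, vocab):
    # Out-of-vocabulary words fall back to the <unk> index.
    return [vocab.get(w, vocab["<unk>"]) for w in paragraph.split()]
```

This cutoff is what yields the 12,186-word vocabulary reported for the Stanford image-paragraph dataset.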

4.3.2. Module Architecture.

The AP-Net maps the input state into a state embedding with one fully connected layer and one LeakyReLU layer. The SP-Net takes the state embedding and a 512-D embedding of the action as input, then passes them through a sequence of two fully connected layers with 512 units and 12,186 units.

4.3.3. Parameter Settings.

The hidden size and all embedding sizes for images and words are fixed to 512. The batch size is 32 for non-attention-based models and 16 for attention-based models. The learning rate is decayed by a factor of 0.8 every three epochs. The discount factor for discounted imitation learning is set to 0.9. The hyper-parameter λ and the discount coefficient γ are set to 1 and 0.9, respectively. The loss coefficients λ₁ and λ₂ are fixed at 0.2 and 0.8. For the compared models, the embedding size of the topic vector is set to 100.
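The step decay of the learning rate can be sketched as follows; the initial learning-rate value is elided in the text, so `base_lr=5e-4` is purely an assumed placeholder.

```python
def learning_rate(epoch, base_lr=5e-4, gamma=0.8, step=3):
    """Step decay: multiply the rate by 0.8 every three epochs.
    base_lr is an assumed placeholder; the paper's initial value is elided."""
    return base_lr * gamma ** (epoch // step)
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.8)`.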

4.4. Comparisons with State-of-The-Art

4.4.1. Quantitative Analysis.

In this section, we quantitatively evaluate various paragraph captioning methods using the standard metrics on the Stanford image-paragraph dataset. Here we report the best performance for every model, along with the specification of the language model and the search method at the inference stage. Greedy denotes greedy search (equal to 1-beam search) and k-beam indicates beam search keeping the k most probable sub-sequences (Ranzato et al., 2016). Generally, more beams used at inference lead to better performance but a higher time cost. From Table 1, we can observe that our CRL is superior to all the compared paragraph-based and sentence-based image captioning methods in most cases, especially on CIDEr (Vedantam et al., 2015). With only a single-layer language decoder, we achieve a significant performance boost over hierarchical methods. Since we select metrics that optimize paragraph-level quality (e.g., CIDEr), the proposed CRL achieves relatively lower performance on the uni-gram metric with synonymous substitution (e.g., METEOR). Regarding the compared methods, non-hierarchical methods (e.g., Image-Flat (Vinyals et al., 2015), DAM-Att (Wang et al., 2018c) and TOMS (Mao et al., 2018)) perform much better than a simple concatenation of sentence-level outputs (e.g., Neuraltalk (Karpathy and Li, 2015), NIC (Vinyals et al., 2015)), yet they fail to capture the overall structure and topic transitions of paragraphs, thus obtaining lower performance than hierarchical approaches (e.g., Region-Hierarchical (Krause et al., 2017), RTT-GAN (Liang et al., 2017) and VAE (Chatterjee and Schwing, 2018)). Different from ‘Region-Hierarchical’, which simply concatenates sentences from the bottom LSTM, ‘RTT-GAN’ and ‘VAE’ preserve better consistency among sentences.
The RL-based method SCST (Rennie et al., 2017), with a single-layer language decoder, achieves competitive outcomes compared with the hierarchical model (Krause et al., 2017), which demonstrates the power of policy optimization. The human results, as reported in (Krause et al., 2017), were obtained by collecting additional paragraphs for 500 randomly chosen images. We can see a large gap between automatically synthesized captions and natural language, whereas our proposed CRL with 2-beam search mitigates the gap and achieves competitive outcomes. Besides, the experimental results verify that the CIDEr metric aligns better with human judgment than the other evaluation metrics.

4.4.2. Qualitative Analysis

To intuitively understand the performance of the proposed CRL training, we showcase some outputs with greedy search for randomly selected images in Figure 5, i.e., the paragraphs generated by the canonical paragraph captioning method ‘Region-Hierarchical’ (Krause et al., 2017), the proposed CRL method and the RL-based method ‘SCST’ (Rennie et al., 2017). Compared with its counterparts, our CRL model generates the paragraph in a coherent order: the first sentence (in red) tends to cover the global topic or major actions in the visual content, followed by several sentences (in blue) describing the details of the scene. Generally, the last sentence describes objects or the environment in the background, which closely matches human writing styles. Notably, our synthetic paragraphs capture more subtle and accurate words and relationships, such as ‘platform’ and ‘standing behind the man’. In contrast, both ‘Region-Hierarchical’ and ‘SCST’ can barely guarantee the completeness and richness of the generated paragraphs.

Figure 4. Curves of language measures on the validation set and average rewards on the Stanford image-paragraph dataset.
Figure 5. Paragraphs generated for images from the Stanford image-paragraph dataset. Our generated paragraphs are organized with logic and coherence, i.e., starting with a global sentence marked in red, followed by details in blue and the background description in green.
(a) SCST
(b) CRL
(c) Ground-truth
Figure 6. Visualization of the diversity of paragraphs generated by (a) the RL-based method SCST (Rennie et al., 2017), (b) our proposed CRL method, and (c) human beings, shown as semantic network graphs. Each node color indicates a unique token with its part-of-speech (POS) tag. Edges show the proximity relationship between tokens.

4.5. Ablation Study

In this section, we study the impact of the RL training strategy, the policy network architecture, and the visual features, respectively. The major experimental results are shown in Table 2, and the detailed curves of evaluation metrics and average rewards are illustrated in Figure 4.

4.5.1. RL Training Strategy

By comparing the performance of each policy in Table 2, we can observe that pure supervised learning (CRL w/o RL) drops greatly on aggregative metrics like CIDEr and multi-gram metrics like BLEU-{3,4}. Removing the intrinsic reward, CRL w/o intrinsic tightly follows the learned policy and lacks essential exploration, thus leading to sub-optimal performance. Moreover, in Figure 4, we show detailed curves of BLEU and CIDEr per training step and curves of the average reward on the validation set, based on the 'Up-Down' decoder and the 'Region Features'. From Figure 4, we observe that 'CRL w/o intrinsic' (blue line) needs a long warm-up period by 'CRL w/o RL' (green line), sharply boosts the evaluation scores after the warm-up, and gradually converges afterwards. In contrast, our CRL strategy (orange line) gains a smooth increase during training, since it benefits from combining discounted imitation learning and policy gradient training. Different from 'CRL w/o intrinsic', our CRL method avoids tedious and time-consuming initialization, and obtains fuller exploration and a better policy network. With respect to the average reward curve, the extrinsic reward achieved by 'CRL w/o intrinsic' climbs fast after pre-training, whereas our reward signal first moves downhill and then slowly uphill. This phenomenon is probably caused by the intrinsic reward, which decreases at the very beginning as the agent learns to control state transitions and linguistic patterns. Regarding the variance shown in Figure 4, we infer it is mainly introduced by iterative action optimization, where the average reward fluctuates correspondingly; it can be alleviated by gradient clipping or adjusting the learning rate. As the extrinsic rewards gradually accumulate, the overall reward slowly increases until the model converges. Moreover, 'CRL' clearly achieves faster convergence (around 25 epochs) compared with 'CRL w/o intrinsic' and 'CRL w/o RL'.
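The reward design underlying this comparison can be summarized with a small sketch: the sparse extrinsic reward (e.g., CIDEr, available only once the paragraph is complete) is densified by an intrinsic curiosity bonus, and per-step returns are discounted backwards in time. The function names and the weighting factor `beta` below are illustrative assumptions, not the exact formulation used in our model.

```python
import numpy as np

def combined_rewards(extrinsic, intrinsic, beta=0.1):
    """Per-step reward: sparse extrinsic signal plus a scaled curiosity bonus
    (beta is an assumed weighting factor for illustration)."""
    return [re + beta * ri for re, ri in zip(extrinsic, intrinsic)]

def discounted_returns(rewards, gamma=0.95):
    """Compute G_t = r_t + gamma * G_{t+1} over a finished paragraph,
    propagating the terminal evaluation reward back to earlier actions."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

Without the intrinsic term, every step before the last receives zero reward, which matches the slow warm-up observed for 'CRL w/o intrinsic'.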

4.5.2. Policy Network Architecture

Regarding the backbone language decoder, we switch the policy network from FC (vanilla LSTM (Hochreiter and Schmidhuber, 1997)) to Att (attention-based LSTM (Xu et al., 2015)) to Up-Down (attention-based LSTM + language LSTM (Anderson et al., 2018)) for comparison. Comparing each architecture under different training policies, it is clear that the 'Up-Down' model achieves higher performance across all metrics, as it dynamically attends to local areas of images and captures more visual details. In particular, the 'Up-Down' model trained with the proposed curiosity-driven RL increases the CIDEr and BLEU-4 scores on average, compared with 'CRL w/o intrinsic' and 'CRL w/o RL', respectively.
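The attention mechanism that distinguishes 'Att' and 'Up-Down' from the vanilla 'FC' decoder can be sketched as softmax-weighted pooling over region features. The following is a simplified, parameter-free illustration; real attention modules learn a scoring network rather than using a raw dot product.

```python
import numpy as np

def soft_attention(regions: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Weight each region feature by a softmax over dot-product scores,
    then return the attended context vector fed to the language LSTM."""
    scores = regions @ query                 # one score per region
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ regions                 # weighted sum of region features
```

A query strongly aligned with one region concentrates nearly all the weight on it, which is why attention-based decoders can pick out local details such as 'platform'.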

4.5.3. Visual Features

To evaluate the impact of visual features, we also list the performance based on ResNet Features and Region Features (see Section 4.3.1 for details). From Table 2, we can observe that 'Region Features' contribute positively to the captioning model, as they enrich visual recognition and representation. Compared with the other language networks, the 'Up-Down' architecture is more sensitive to the choice of visual features.

4.6. Diversity Analysis

To shed quantitative light on the linguistic properties of the generated paragraphs, we randomly select 500 images from the test set of the Stanford image-paragraph dataset, and show the statistics of the paragraphs produced by a representative spread of methods in Figure 6. We visualize the semantic graphs of the language distribution with the d3-force package in JavaScript. In each semantic graph, a node indicates a unique token from the vocabulary, colored by its part-of-speech (POS) tag. For instance, blue indicates singular nouns (NN), red indicates determiners (DT), and orange indicates verbs in gerund or present-participle form (VBG). An edge between two nodes represents the proximity of the two words. Notably, the density of a semantic graph intuitively reflects the diversity and richness of the generated paragraphs. The Ground-truth graph (Figure 6(c)), annotated by human beings, contains the most comprehensive relationships and the most extensive set of object entities. Even though there is still a gap between synthetic paragraphs and real natural language, the paragraphs generated by our CRL (Figure 6(b)) exhibit a much wider vocabulary than those generated by the RL-based method SCST (Figure 6(a)).
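The node and edge statistics behind these semantic graphs can be reproduced with a simple token co-occurrence count. The sketch below, which omits POS tagging for brevity, is only an illustration of how such graphs are assembled, not the exact pipeline used for Figure 6.

```python
from collections import Counter

def semantic_graph(paragraphs):
    """Count unique tokens (graph nodes) and adjacent-token pairs (graph edges),
    mirroring how the semantic graphs summarize lexical diversity."""
    nodes, edges = Counter(), Counter()
    for text in paragraphs:
        tokens = text.lower().split()
        nodes.update(tokens)                 # node size ~ token frequency
        edges.update(zip(tokens, tokens[1:]))  # edge ~ adjacency in text
    return nodes, edges
```

The number of distinct nodes then serves as a crude proxy for the vocabulary width compared across SCST, CRL, and the ground truth.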

5. Conclusion

In this work, we propose an intrinsically motivated reinforcement learning model for visual paragraph generation. Towards generating diverse yet coherent sentences, the proposed CRL mines the human writing patterns behind long narratives and captures precise expressions by modeling the agent's uncertainty about the environment. Unlike conventional policy-based and actor-critic reinforcement learning methods, CRL alleviates the sparse-reward and low-exploration issues, and thus encourages the agent to fully explore rare states and obtain a better policy.

6. Acknowledgement

This work was partially supported by the National Natural Science Foundation of China under Project 61572108 and Project 61632007, Sichuan Science and Technology Program (No. 2018GZDZX0032) and ARC DP 190102353.


  • J. Achiam and S. Sastry (2017) Surprise-based intrinsic motivation for deep reinforcement learning. CoRR abs/1703.01732. External Links: Link, 1703.01732 Cited by: §2.3.
  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. See DBLP:conf/cvpr/2018, pp. 6077–6086. Cited by: §4.5.2.
  • M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. See DBLP:conf/nips/2016, pp. 1471–1479. Cited by: §2.3.
  • Y. Bin, Y. Yang, J. Zhou, Z. Huang, and H. T. Shen (2017) Adaptively attending to visual attributes and linguistic knowledge for captioning. See DBLP:conf/mm/2017, pp. 1345–1353. External Links: Link, Document Cited by: §2.1.
  • Y. Burda, H. Edwards, D. Pathak, A. J. Storkey, T. Darrell, and A. A. Efros (2018a) Large-scale study of curiosity-driven learning. CoRR abs/1808.04355. External Links: Link, 1808.04355 Cited by: §2.3.
  • Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov (2018b) Exploration by random network distillation. CoRR abs/1810.12894. External Links: Link, 1810.12894 Cited by: §2.3.
  • M. Chatterjee and A. G. Schwing (2018) Diverse and coherent paragraph generation from images. See DBLP:conf/eccv/2018-2, pp. 747–763. Cited by: §1, §1, §2.2, §4.2, §4.4.1, Table 1.
  • H. Chen, G. Ding, S. Zhao, and J. Han (2018) Temporal-difference learning with sampling baseline for image captioning. See DBLP:conf/aaai/2018, pp. 6706–6713. Cited by: §1, §2.1.
  • T. Chen, Y. Liao, C. Chuang, W. T. Hsu, J. Fu, and M. Sun (2017) Show, adapt and tell: adversarial training of cross-domain image captioner. See DBLP:conf/iccv/2017, pp. 521–530. Cited by: §2.1.
  • B. Dai, S. Fidler, R. Urtasun, and D. Lin (2017) Towards diverse and natural image descriptions via a conditional GAN. See DBLP:conf/iccv/2017, pp. 2989–2998. Cited by: §1, §1, §2.1, §2.2.
  • M. J. Denkowski and A. Lavie (2014) Meteor universal: language specific translation evaluation for any target language. See DBLP:conf/wmt/2014, pp. 376–380. Cited by: §1, §4.1.2.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. See DBLP:conf/nips/2014, pp. 2672–2680. Cited by: §4.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. See DBLP:conf/cvpr/2016, pp. 770–778. Cited by: §4.2, §4.3.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §4.5.2.
  • R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel (2016) VIME: variational information maximizing exploration. See DBLP:conf/nips/2016, pp. 1109–1117. Cited by: §2.3.
  • T. (. Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin, R. B. Girshick, X. He, P. Kohli, D. Batra, C. L. Zitnick, D. Parikh, L. Vanderwende, M. Galley, and M. Mitchell (2016) Visual storytelling. See DBLP:conf/naacl/2016, pp. 1233–1239. Cited by: §1.
  • J. Johnson, A. Karpathy, and L. Fei-Fei (2016) DenseCap: fully convolutional localization networks for dense captioning. See DBLP:conf/cvpr/2016, pp. 4565–4574. Cited by: §2.2, §4.2.
  • A. Karpathy and F. Li (2015) Deep visual-semantic alignments for generating image descriptions. See DBLP:conf/cvpr/2015, pp. 3128–3137. Cited by: §4.2, §4.4.1, Table 1.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. See DBLP:conf/iclr/2015, Cited by: §4.2.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. CoRR abs/1312.6114. External Links: Link, 1312.6114 Cited by: §4.2.
  • J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei (2017) A hierarchical approach for generating descriptive image paragraphs. See DBLP:conf/cvpr/2017, pp. 3337–3345. Cited by: §1, §1, §2.2, §4.1.1, §4.2, §4.4.1, §4.4.2, Table 1.
  • J. Li, M. Luong, and D. Jurafsky (2015) A hierarchical neural autoencoder for paragraphs and documents. See DBLP:conf/acl/2015-1, pp. 1106–1115. Cited by: §1.
  • Y. Li, Y. Luo, Z. Zhang, S. Sadiq, and P. Cui (2019) Context-aware attention-based data augmentation for POI recommendation. In 35th IEEE International Conference on Data Engineering Workshops, ICDE Workshops 2019, Macao, China, April 8-12, 2019, pp. 177–184. Cited by: §1.
  • X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing (2017) Recurrent topic-transition GAN for visual paragraph generation. See DBLP:conf/iccv/2017, pp. 3382–3391. Cited by: §1, §1, §2.2, §4.2, §4.4.1, Table 1.
  • D. Liu, Z. Zha, H. Zhang, Y. Zhang, and F. Wu (2018a) Context-aware visual policy network for sequence-level image captioning. See DBLP:conf/mm/2018, pp. 1416–1424. Cited by: §1, §1, §1, §2.1, §3.7.
  • L. Liu, X. Wan, and Z. Guo (2018b) Images2Poem: generating chinese poetry from image streams. See DBLP:conf/mm/2018, pp. 1967–1975. Cited by: §1.
  • L. Liu, X. Wan, and Z. Guo (2018c) Images2Poem: generating chinese poetry from image streams. See DBLP:conf/mm/2018, pp. 1967–1975. Cited by: §1.
  • S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy (2017) Improved image captioning via policy gradient optimization of spider. See DBLP:conf/iccv/2017, pp. 873–881. Cited by: §1, §2.1, §3.7.
  • M. Lopes, T. Lang, M. Toussaint, and P. Oudeyer (2012) Exploration in model-based reinforcement learning by empirically estimating learning progress. See DBLP:conf/nips/2012, pp. 206–214. Cited by: §2.3.
  • Y. Luo, Z. Wang, Z. Huang, Y. Yang, and C. Zhao (2018) Coarse-to-fine annotation enrichment for semantic segmentation learning. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22-26, 2018, pp. 237–246. Cited by: §1.
  • Y. Mao, C. Zhou, X. Wang, and R. Li (2018) Show and tell more: topic-oriented multi-sentence image captioning. See DBLP:conf/ijcai/2018, pp. 4258–4264. Cited by: §2.2, §4.2, §4.4.1, Table 1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. See DBLP:conf/acl/2002, pp. 311–318. Cited by: §1, §3.6.1, §4.1.2.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.3.
  • D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. See DBLP:conf/icml/2017, pp. 2778–2787. Cited by: §2.3.
  • M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2016) Sequence level training with recurrent neural networks. In ICLR, External Links: Link Cited by: §4.4.1.
  • H. Ravi, L. Wang, C. Muñiz, L. Sigal, D. N. Metaxas, and M. Kapadia (2018) Show me a story: towards coherent neural story illustration. See DBLP:conf/cvpr/2018, pp. 7613–7621. Cited by: §1.
  • S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. See DBLP:conf/nips/2015, pp. 91–99. Cited by: §3.1, §4.3.1.
  • Z. Ren, X. Wang, N. Zhang, X. Lv, and L. Li (2017) Deep reinforcement learning-based image captioning with embedding reward. See DBLP:conf/cvpr/2017, pp. 1151–1159. Cited by: §2.1.
  • S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. See DBLP:conf/cvpr/2017, pp. 1179–1195. Cited by: §1, §1, §1, §2.1, §3.7, Figure 6, §4.2, §4.4.1, §4.4.2, Table 1.
  • S. P. Singh, A. G. Barto, and N. Chentanez (2004) Intrinsically motivated reinforcement learning. See DBLP:conf/nips/2004, pp. 1281–1288. Cited by: §2.3.
  • R. S. Sutton, A. G. Barto, F. Bach, et al. (2017) Reinforcement learning: an introduction (2nd edition). MIT press. Cited by: §1, §2.1, §3.3, §3.6.1.
  • H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel (2017) #exploration: A study of count-based exploration for deep reinforcement learning. See DBLP:conf/nips/2017, pp. 2750–2759. Cited by: §2.3.
  • R. Vedantam, C. L. Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. See DBLP:conf/cvpr/2015, pp. 4566–4575. Cited by: §1, §3.6.1, §4.1.2, §4.4.1.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: A neural image caption generator. See DBLP:conf/cvpr/2015, pp. 3156–3164. Cited by: §4.2, §4.2, §4.4.1, Table 1.
  • X. Wang, W. Chen, Y. Wang, and W. Y. Wang (2018a) No metrics are perfect: adversarial reward learning for visual storytelling. See DBLP:conf/acl/2018-1, pp. 899–909. Cited by: §1.
  • X. Wang, W. Chen, J. Wu, Y. Wang, and W. Y. Wang (2018b) Video captioning via hierarchical reinforcement learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 4213–4222. Cited by: §2.1.
  • Z. Wang, Y. Luo, Y. Li, Z. Huang, and H. Yin (2018c) Look deeper see richer: depth-aware image paragraph captioning. See DBLP:conf/mm/2018, pp. 672–680. Cited by: §2.2, §4.2, §4.4.1, Table 1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, pp. 229–256. Cited by: §1, §2.1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. See DBLP:conf/icml/2015, pp. 2048–2057. Cited by: §4.5.2.
  • L. Xu, L. Jiang, C. Qin, Z. Wang, and D. Du (2018) How images inspire poems: generating classical chinese poetry from images with memory networks. See DBLP:conf/aaai/2018, pp. 5618–5625. Cited by: §1.
  • X. Yang, X. Lin, S. Suo, and M. Li (2018a) Generating thematic chinese poetry using conditional variational autoencoders with hybrid decoders. See DBLP:conf/ijcai/2018, pp. 4539–4545. External Links: Link, Document Cited by: §1.
  • Y. Yang, J. Zhou, J. Ai, Y. Bin, A. Hanjalic, H. T. Shen, and Y. Ji (2018b) Video captioning by adversarial LSTM. IEEE Transactions on Image Processing 27 (11), pp. 5600–5611. Cited by: §2.1.
  • H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu (2016) Video paragraph captioning using hierarchical recurrent neural networks. See DBLP:conf/cvpr/2016, pp. 4584–4593. Cited by: §1.
  • Y. Yu (2018) Towards sample efficient reinforcement learning. See DBLP:conf/ijcai/2018, pp. 5739–5743. Cited by: §3.7.
  • L. Zhang, F. Sung, F. Liu, T. Xiang, S. Gong, Y. Yang, and T. M. Hospedales (2017) Actor-critic sequence training for image captioning. In NIPS Workshop on Visually-Grounded Interaction and Language, Cited by: §1, §1, §2.1, §3.7.
  • M. Zhang, Y. Yang, H. Zhang, Y. Ji, H. T. Shen, and T. Chua (2019) More is better: precise and detailed image captioning using online positive recall and missing concepts mining. IEEE Transactions on Image Processing 28 (1), pp. 32–44. Cited by: §2.1.