Video Captioning via Hierarchical Reinforcement Learning

11/29/2017 ∙ by Xin Wang, et al. ∙ The Regents of the University of California

Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g., sequence-to-sequence models) has shown promising results in abstracting a coarse description of a short video, it is still very challenging to caption a video containing multiple fine-grained actions with a detailed description. This paper aims to address the challenge by proposing a novel hierarchical reinforcement learning framework for video captioning, where a high-level Manager module learns to design sub-goals and a low-level Worker module recognizes the primitive actions to fulfill the sub-goal. With this compositional framework to reinforce video captioning at different levels, our approach significantly outperforms all the baseline methods on a newly introduced large-scale dataset for fine-grained video captioning. Furthermore, our non-ensemble model achieves state-of-the-art results on the widely used MSR-VTT dataset.




1 Introduction

For most people, watching a brief video and describing what happened (in words) is an easy task. For machines, extracting the meaning from video pixels and generating a natural-sounding description is a very challenging problem. However, due to its wide range of applications, such as intelligent video surveillance and assistance to visually impaired people, video captioning has recently drawn increasing attention from the computer vision community. Different from image captioning, which aims at describing a static scene, video captioning is more challenging in the sense that a series of coherent scenes needs to be understood in order to jointly generate multiple description segments (e.g., see Figure 1).

Figure 1: Video captioning examples. Top row is an example from MSR-VTT dataset [42], which is summarized by three single captions. Bottom row is an example from Charades [31] dataset, which consists of several dependent human activities and is described by multiple long sentences of complex structure.

Current video captioning tasks can mainly be divided into two families, single-sentence generation [42, 19] and paragraph generation [27]. Single-sentence generation tends to abstract a whole video to a simple and high-level descriptive sentence, while paragraph generation tends to grasp more detailed actions, and generates multiple sentences of descriptions. However, even for paragraph generation, the paragraph is often split into multiple, single-sentence generation scenarios associated with ground truth temporal video intervals.

In many practical cases, human activities are too complex to be described with short, simple sentences, and the temporal intervals are hard to predict ahead of time without a good understanding of the linguistic context. For instance, in the bottom example of Figure 1, there are five human actions in total: sit on a bed and put a laptop into a bag happen simultaneously, followed in order by stand up, put the bag on one shoulder, and walk out of the room. Such fine-grained captioning requires a subtle and expressive mechanism to capture the temporal dynamics of the video content and associate them with semantic representations in natural language.

In order to tackle this issue, we propose a “divide and conquer” solution, which first divides a long caption into many small text segments (e.g. different segments are in different colors as shown in Figure 1), and then employs a sequence model to conquer each segment. Instead of forcing the sequence model to generate the whole sequence in one shot, we propose to guide the model to generate sentences segment by segment. With a higher-level sequence model designing the context of each segment, the low-level sequence model follows the guidance to generate the segment word by word.

In this paper, we propose a novel hierarchical reinforcement learning (HRL) framework to realize this two-level mechanism. The textual and video context can be viewed as the reinforcement learning environment. Our framework is a fully differentiable deep neural network (see Figure 2) and consists of (1) the higher-level sequence model, the manager, which sets goals at a lower temporal resolution, (2) the lower-level sequence model, the worker, which selects primitive actions at every time step by following the goals from the manager, and (3) an internal critic that determines whether a goal is accomplished or not. More specifically, by exploiting the context from both the environment and finished goals, the manager emits a new goal for a new segment, and the worker receives the goal as guidance to generate the segment by producing words sequentially. Moreover, the internal critic is employed to evaluate whether the current textual segment is accomplished.

Furthermore, we equip both the manager and worker with an attention module over the video features (Sec 3.2) to introduce hierarchical attention internally so that the manager will focus on a wider range of temporal dynamics while the worker’s attention is narrowed down to local dynamics conditioned on the goals. To the best of our knowledge, this is the first work that strives to develop a hierarchical reinforcement learning approach to reinforce video captioning at different levels. Our main contributions are four-fold:

  • We propose a hierarchical deep reinforcement learning framework to efficiently learn the semantic dynamics when captioning a video.

  • We formulate a novel alternating training approach over stochastic and deterministic policy gradients.

  • We introduce a new large-scale dataset for fine-grained video captioning, Charades Captions, and validate the effectiveness of the proposed method on it. (Footnote 1: Charades Captions was obtained by preprocessing the raw Charades dataset [31]. The processed Charades Captions dataset can be downloaded here:)

  • We further evaluate our approach on MSR-VTT dataset and achieve the state-of-the-art results even when training on a single type of features.

Figure 2: Overview of the HRL framework for video captioning. Please see Sec. 3.1 for explanation.

2 Related Work

Video Captioning

S2VT [37] first generalized LSTM to video captioning and proposed a sequence-to-sequence model for it. Since then, numerous improvements have been introduced, such as attention [43, 45], hierarchical recurrent neural networks (RNNs) [44, 18, 3, 33, 40], C3D features [29], joint embedding space [23], language fusion [8], multi-task learning [20], etc. But most of them use the maximum-likelihood algorithm, which maximizes the probability of the current ground-truth output given the previous ground-truth output, while the previous ground truth is in general unknown at test time. This inconsistency issue, known as exposure bias, has largely hindered system performance.

In order to address the inconsistency issue, Ranzato et al. [24] proposed to directly optimize non-differentiable metric scores using the REINFORCE algorithm [41]. But the problem persisted that the expected gradient computed using policy gradient typically exhibited high variance and was often unstable without proper context-dependent normalization. Naturally, the variance could be reduced by adding a baseline [16, 26] or even an actor-critic method that trained an additional critic to estimate the value of each generated word [1, 25, 48]. Pasunuru and Bansal [21] applied policy gradient with a baseline to video captioning and presented a textual entailment loss to adjust the CIDEr reward. Unfortunately, these previous works on image/video captioning fail to grasp the high-level semantic flow. Our HRL model aims to address this issue with a hierarchical reinforcement learning framework.

Another line of work is dense video captioning [12], which focuses on detecting multiple events that occur in a video and describing each of them. It does not aim to solve the single-sentence generation scenario, whereas our method aims to generate one or multiple sentences for a sequence of one or more continuous actions.

Hierarchical Reinforcement Learning

Recent work has revealed the effectiveness of hierarchical reinforcement learning frameworks on Atari games [14, 39]. Peng et al. built a composite dialogue policy using hierarchical Q-learning to fulfill complex dialogue tasks like travel planning [22]. In the typical HRL setting, there is a high-level agent that operates at a lower temporal resolution to set a sub-goal, and a low-level agent that selects primitive actions by following the sub-goal from the high-level agent. Our proposed HRL framework for video captioning is aligned with these studies but has a key difference from the typical HRL setting: instead of having the internal critic provide an intrinsic reward to encourage the low-level agent to accomplish the sub-goal, we focus on exploiting the extrinsic rewards in different time spans. Besides, we are the first to consider HRL at the intersection of vision and language.

3 Our Approach

3.1 Overview

Our proposed HRL framework follows the general encoder-decoder framework (see Figure 2). In the encoding stage, video frame features $\{v_i\}$ are first extracted by a pretrained convolutional neural network (CNN) [13] model, where $i \in \{1, \dots, n\}$ indexes the frames in temporal order. Then the frame features are passed through a low-level Bi-LSTM (bidirectional long short-term memory) encoder and a high-level LSTM (long short-term memory [10]) encoder successively to obtain the low-level encoder output $\{h^{E_w}_i\}$ ($E_w$ denotes the encoder associated with the Worker) and the high-level encoder output $\{h^{E_m}_i\}$ ($E_m$ denotes the encoder associated with the Manager), where $i \in \{1, \dots, n\}$. In the decoding stage, our HRL agent plays the role of a decoder and outputs a language description $a_1 a_2 \dots a_T \in V^T$, where $T$ is the length of the generated caption and $V$ is the vocabulary set.

The HRL agent is composed of three components: a low-level worker, a high-level manager, and an internal critic. The manager operates at a lower temporal resolution and emits a goal when needed for the worker to accomplish, and the worker generates a word for each time step by following the goal proposed by the manager. In other words, the manager asks the worker to generate a semantic segment, and the worker generates the corresponding words in the next few time steps in order to fulfill the job. The internal critic determines if the worker has accomplished the goal and sends a binary segment signal to the manager to help it update goals. The whole pipeline terminates once an end of sentence token is reached.
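The control flow described above can be sketched as a short decoding loop. This is only an illustrative sketch: `decode`, `toy_manager`, `toy_worker`, and `toy_critic` are hypothetical stand-ins (the real components are LSTM/RNN networks), and the toy worker simply replays a scripted caption so that the interaction between the three components is easy to follow.

```python
def decode(manager, worker, critic, max_len, eos="<eos>"):
    """Generate words until EOS; ask the manager for a new goal per segment."""
    words, goals = [], []
    goal = None
    segment_done = True  # no goal exists yet, so the manager must act first
    for _ in range(max_len):
        if segment_done:
            goal = manager(words, goals)   # manager acts at a lower temporal resolution
            goals.append(goal)
        word = worker(words, goal)         # worker emits one word per time step
        words.append(word)
        if word == eos:
            break                          # pipeline terminates at end of sentence
        segment_done = critic(words)       # binary segment signal from the critic
    return words, goals


# Toy stand-ins (assumptions for illustration only, not the paper's networks).
SCRIPT = "the person then tidies his area after he is done eating <eos>".split()
SEGMENT_ENDS = {2, 6}  # caption lengths at which a segment is considered finished


def toy_manager(words, goals):
    return len(goals)              # the "goal" is just the segment index


def toy_worker(words, goal):
    return SCRIPT[len(words)]      # replay the scripted caption


def toy_critic(words):
    return len(words) in SEGMENT_ENDS


words, goals = decode(toy_manager, toy_worker, toy_critic, max_len=20)
print(goals)   # one goal per generated segment
```

In this toy run the manager is consulted once per segment, while the worker is called at every step, which mirrors the two temporal resolutions described in the text.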

3.2 Policy Network

Attention Module

As mentioned above, the CNN-RNN encoder receives the video inputs to generate two sequences of vectors, $\{h^{E_w}_i\}$ and $\{h^{E_m}_i\}$. One may directly take them as the inputs to the worker and the manager. We instead adopt an attention mechanism to better capture the temporal dynamics and form the context vector for their use. In our model, both the manager and the worker are equipped with an attention module.

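The left-hand side of Figure 3 shows the attention module for the worker. At each time step $t$, the context vector $c^W_t$ is computed as a weighted sum over all of the encoder's hidden states $\{h^{E_w}_i\}$:

$$c^W_t = \sum_{i=1}^{n} \alpha^W_{t,i}\, h^{E_w}_i \tag{1}$$

These attention weights $\{\alpha^W_{t,i}\}$ act as an alignment mechanism by giving higher weights to the encoder hidden states that match the worker's current status, and are defined as

$$\alpha^W_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{n} \exp(e_{t,k})} \tag{2}$$

$$e_{t,i} = w^{\top} \tanh\left(W_a h^{E_w}_i + U_a h^W_{t-1} + b_a\right) \tag{3}$$

where $w$, $W_a$, $U_a$ and $b_a$ are learned parameters; $h^W_{t-1}$ is the worker LSTM's hidden state at the previous step.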
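The worker attention described above can be sketched in plain Python. This is a minimal sketch that assumes additive (Bahdanau-style) scoring, which Appendix C names as the basis of the attention module; the function name `additive_attention` and the list-based matrix layout are illustrative assumptions, not the authors' implementation.

```python
import math


def additive_attention(enc_states, dec_state, W_a, U_a, b_a, w):
    """Return (context, weights) for one decoding step.

    enc_states: list of n encoder hidden vectors.
    dec_state:  decoder (worker) hidden state from the previous step.
    W_a, U_a:   matrices given as lists of rows; b_a, w: vectors.
    """
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    Uh = matvec(U_a, dec_state)          # shared across all encoder positions
    scores = []
    for h in enc_states:
        Wh = matvec(W_a, h)
        hidden = [math.tanh(a + b + c) for a, b, c in zip(Wh, Uh, b_a)]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))

    # Softmax over positions, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder states.
    dim = len(enc_states[0])
    context = [sum(a * h[d] for a, h in zip(weights, enc_states)) for d in range(dim)]
    return context, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
ctx, att = additive_attention(
    enc_states=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    dec_state=[0.5, -0.5],
    W_a=identity, U_a=identity, b_a=[0.0, 0.0], w=[1.0, 1.0],
)
```

The weights always sum to one, so each component of the context vector stays within the range spanned by the encoder states. The manager's module would use the same computation with its own encoder outputs and parameters.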

The manager’s attention module follows the same paradigm as the worker’s, which can be described by replacing the corresponding terms in Equation 1, 2, and 3.

Figure 3: An example of the unrolled HRL agent in the decoding stage over several consecutive time steps. The yellow region shows how the attention module is incorporated into the encoder-decoder framework.
Manager and Worker

As shown in Figure 3, the concatenation $[c^M_t, h^W_{t-1}]$ is fed as the input to the manager LSTM to produce the semantically meaningful goal. With the help of the context and the sentence state at previous time steps, the manager can obtain knowledge of the environment status. The output of the manager LSTM is then projected into a latent continuous goal vector $g_t$. Formally,

$$h^M_t = S^M\left(h^M_{t-1}, [c^M_t, h^W_{t-1}]\right) \tag{4}$$

$$g_t = u_M(h^M_t) \tag{5}$$

where $S^M$ denotes the non-linear function of the manager LSTM and $u_M$ is a function that projects hidden states into the goal space.

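The worker receives the goal $g_t$, takes the concatenation $[c^W_t, g_t, a_{t-1}]$ as the input, and outputs the probabilities $\pi_t$ over all actions $a_t$ after a series of computations:

$$h^W_t = S^W\left(h^W_{t-1}, [c^W_t, g_t, a_{t-1}]\right) \tag{6}$$

$$x_t = u_W(h^W_t) \tag{7}$$

$$\pi_t = \mathrm{SoftMax}(x_t) \tag{8}$$

where $S^W$ is the non-linear function of the worker LSTM and $u_W$ is also a function, one that projects hidden states into the input to the softmax layer.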

Internal Critic

In order to determine whether the worker has accomplished a goal $g_t$, we employ an internal critic to evaluate the worker's progress. The internal critic uses an RNN structure, which takes a word sequence as the input to discriminate whether an end has been reached. Let $z_t$ denote the signal of the internal critic and $h^I_t$ denote the hidden state of the RNN at time step $t$. Formally, we describe the probability $p(z_t)$ as follows:

$$h^I_t = \mathrm{RNN}(h^I_{t-1}, a_t), \qquad p(z_t) = \mathrm{sigmoid}(W_z h^I_t + b_z) \tag{9}$$

where $a_t$ is the action taken by the worker and $\{W_z, b_z\}$ denotes the parameters of the feed-forward neural network. In order to train the parameters of the linear layer and the recurrent network, we propose to maximize the likelihood of the ground-truth signal $\{z^*_t\}$:

$$\max \sum_t \log p(z_t = z^*_t \mid a_1, \dots, a_t) \tag{10}$$

Once the critic model is optimized, we fix it and use it to serve the manager.

3.3 Learning

As described in Sec. 3.2, the manager policy is deterministic and can be written as $g_t = \mu_{\theta_m}(s_t)$, with $\theta_m$ representing the parameters of the manager, while the worker policy is a stochastic policy denoted by $a_t \sim \pi_{\theta_w}(a_t; s_t, g_t)$, where $\theta_w$ represents the parameters of the worker. The worker policy is stochastic because its action is selecting a word from the vocabulary $V$. For the manager, however, the generated goal is latent and cannot be directly supervised. Thus, with a deterministic manager policy, we can warm start both the manager and the worker simultaneously by viewing them as a composite agent.

In this section, we first derive the reinforcement learning methods for the two policies separately (Sec. 3.3.1 and 3.3.2), and then introduce the training algorithm of the proposed HRL method (Sec. 3.3.4). We also discuss the reward definitions (Sec. 3.3.3) and imitation learning of our HRL policy (Sec. 3.3.5).


3.3.1 Stochastic Worker Policy Learning

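We consider a standard reinforcement learning setup. At each step $t$, the worker selects an action $a_t$ (a word from the vocabulary $V$) conditioned on the goal $g_t$ from the manager. The environment responds with a new state $s_{t+1}$ and a scalar reward $r_t$. The process continues until an EOS token is generated. The objective of the worker is to maximize the discounted return $R(a_t) = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$. Thus its loss function can be written as

$$L(\theta_w) = -\mathbb{E}_{a_t \sim \pi_{\theta_w}}\left[R(a_t)\right] \tag{11}$$

which minimizes the negative expected return. Based on the REINFORCE algorithm [41], the gradient of this non-differentiable, reward-based loss function can be derived as

$$\nabla_{\theta_w} L(\theta_w) = -\mathbb{E}_{a_t \sim \pi_{\theta_w}}\left[R(a_t)\, \nabla_{\theta_w} \log \pi_{\theta_w}(a_t)\right] \tag{12}$$

In practice, $L(\theta_w)$ is typically estimated with a single sample from $\pi_{\theta_w}$:

$$\nabla_{\theta_w} L(\theta_w) \approx -R(a_t)\, \nabla_{\theta_w} \log \pi_{\theta_w}(a_t) \tag{13}$$

The policy gradient given by REINFORCE can be further generalized to reduce the variance without changing the expected gradient, by subtracting a baseline from the reward [35]:

$$\nabla_{\theta_w} L(\theta_w) \approx -\left(R(a_t) - b^w_t\right) \nabla_{\theta_w} \log \pi_{\theta_w}(a_t) \tag{14}$$

where $b^w_t$ is the estimated baseline, which can be a function of $\theta_w$ or $t$ [24]. In our case, the baseline is estimated by a linear regressor with the worker's hidden state $h^W_t$ as the input. During back-propagation, the gradient is cut off between the worker LSTM and the baseline estimator.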

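For a better understanding of the policy gradient, we can further expand the gradient of the loss function using the chain rule:

$$\nabla_{\theta_w} L(\theta_w) = \frac{\partial L(\theta_w)}{\partial x_t} \frac{\partial x_t}{\partial \theta_w} \tag{15}$$

where $x_t$ is the input to the SoftMax layer (see Equation 7). Using REINFORCE with a baseline, the estimate of $\partial L(\theta_w) / \partial x_t$ is given by [46]:

$$\frac{\partial L(\theta_w)}{\partial x_t} \approx \left(R(a_t) - b^w_t\right)\left(\pi_{\theta_w}(s_t, g_t) - 1_{a_t}\right) \tag{16}$$

where $1_{a_t}$ is the one-hot vector of the sampled word. This means that if the return of the sampled word $a_t$ is greater than the baseline $b^w_t$, the gradient component for that word is negative, so a gradient descent step increases the probability of the word; otherwise, the probability is decreased accordingly.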
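The sign argument above can be checked numerically. The sketch below uses the standard identity that, for the single-sample surrogate loss -(R - b) log π(a), the gradient with respect to the softmax inputs is (R - b)(π - one_hot(a)), and compares it against finite differences. The function names and the toy numbers are illustrative assumptions.

```python
import math


def softmax(x):
    top = max(x)
    exps = [math.exp(v - top) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def surrogate_loss(logits, action, ret, baseline):
    """Single-sample REINFORCE loss with baseline: -(R - b) * log pi(a)."""
    return -(ret - baseline) * math.log(softmax(logits)[action])


def reinforce_logit_grad(logits, action, ret, baseline):
    """Analytic gradient of the surrogate loss w.r.t. the softmax inputs."""
    pi = softmax(logits)
    advantage = ret - baseline
    return [advantage * (p - (1.0 if i == action else 0.0)) for i, p in enumerate(pi)]


def numeric_grad(logits, action, ret, baseline, eps=1e-6):
    """Central finite differences, used only to sanity-check the formula."""
    grads = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += eps
        down[i] -= eps
        diff = surrogate_loss(up, action, ret, baseline) - surrogate_loss(down, action, ret, baseline)
        grads.append(diff / (2 * eps))
    return grads


logits = [0.2, -1.0, 0.7, 0.1]   # toy softmax inputs over a 4-word vocabulary
analytic = reinforce_logit_grad(logits, action=2, ret=1.5, baseline=1.0)
numeric = numeric_grad(logits, action=2, ret=1.5, baseline=1.0)
```

With a return above the baseline, the entry for the sampled word is negative and all other entries are positive, so a descent step shifts probability mass toward the sampled word.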

3.3.2 Deterministic Manager Policy Learning

The key to our HRL framework is to effectively learn the goals generated by the manager, which then guide the worker to achieve the latent objective. The difficulty of training the manager is that it does not directly interact with the environment: the action it takes is to produce a latent vector in a continuous high-dimensional space, which only indirectly influences the environment by directing the worker's behavior. Therefore, we are especially interested in finding ways to encourage the manager towards more effective caption generation.

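Inspired by the deterministic policy gradient algorithms [32, 15], we propose to learn the deterministic policy $\mu_{\theta_m}$ from trajectories generated by the stochastic worker policy $\pi$. When training the target manager policy, we fix the worker policy as an Oracle behavior policy. More specifically, the manager outputs a goal $g_t$ at step $t$ and the worker then runs $c$ steps to generate the expected segment $e_{t,c} = a_t a_{t+1} \dots a_{t+c-1}$ by following the goal ($c$ is the length of the generated segment). Since the worker is fixed as an Oracle behavior policy, we only need to consider the training of the manager. The environment then responds with a new state $s_{t+c}$ and a scalar reward $r(e_{t,c})$. Thus the objective becomes minimizing the negative discounted return $R(e_{t,c})$, that is,

$$L(\theta_m) = -\mathbb{E}_{g_t}\left[R(e_{t,c})\right] \tag{17}$$

After applying the chain rule to the loss function with respect to the manager's parameters $\theta_m$, the manager is updated with

$$\nabla_{\theta_m} L(\theta_m) = -\mathbb{E}_{g_t}\left[\nabla_{g_t} R(e_{t,c})\, \nabla_{\theta_m} \mu_{\theta_m}(s_t)\right] \tag{18}$$

The above gradient can be approximated from a single sampled segment $e_{t,c}$ after adopting policy gradient on the worker policy:

$$\nabla_{\theta_m} L(\theta_m) \approx -R(e_{t,c})\, \nabla_{g_t} \log \pi(e_{t,c})\, \nabla_{\theta_m} \mu_{\theta_m}(s_t) \tag{19}$$

Since the worker LSTM is indeed a Markov decision process and the probability of the current action $a_t$ is conditioned on the action at the previous step (see Equations 6, 7, and 8), we have

$$\log \pi(e_{t,c}) = \log \prod_{i=t}^{t+c-1} \pi(a_i) = \sum_{i=t}^{t+c-1} \log \pi(a_i) \tag{20}$$

Combining Equations 19 and 20, the gradient becomes

$$\nabla_{\theta_m} L(\theta_m) \approx -R(e_{t,c}) \left[\sum_{i=t}^{t+c-1} \nabla_{g_t} \log \pi(a_i)\right] \nabla_{\theta_m} \mu_{\theta_m}(s_t) \tag{21}$$

The final gradient for training the manager is obtained by adding a baseline estimator to reduce the variance:

$$\nabla_{\theta_m} L(\theta_m) \approx -\left(R(e_{t,c}) - b^m_t\right) \left[\sum_{i=t}^{t+c-1} \nabla_{g_t} \log \pi(a_i)\right] \nabla_{\theta_m} \mu_{\theta_m}(s_t) \tag{22}$$

where $b^m_t$ is the baseline estimator, which is a linear regressor with the manager's hidden state $h^M_t$ as the input.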

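A major challenge of learning in continuous action spaces is exploration. We follow DDPG [15] in constructing an exploration policy by adding a perturbation $\epsilon$, sampled from a Gaussian distribution $\mathcal{N}(0, \sigma^2)$, to our manager policy:

$$\tilde{g}_t = \mu_{\theta_m}(s_t) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2) \tag{23}$$

The variance $\sigma^2$ of the Gaussian noise can be chosen to suit the environment.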
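As a small illustration of the exploration step, the sketch below perturbs a goal vector with independent zero-mean Gaussian noise. The function name `explore_goal`, the per-dimension independence, and the value of `sigma` are assumptions made for the example; the text does not specify them.

```python
import random


def explore_goal(goal, sigma, rng):
    """Return the goal perturbed by independent N(0, sigma^2) noise per dimension."""
    return [g + rng.gauss(0.0, sigma) for g in goal]


rng = random.Random(0)                   # seeded for reproducibility
goal = [0.3, -0.1, 0.8, 0.0]             # toy latent goal (hypothetical values)
noisy_goal = explore_goal(goal, sigma=0.1, rng=rng)
greedy_goal = explore_goal(goal, sigma=0.0, rng=rng)   # sigma = 0 disables exploration
```

Setting `sigma` to zero recovers the deterministic goal, which corresponds to the setting used when the worker is trained and at test time.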

3.3.3 Reward Definition

Recent work on image captioning [26] has shown that CIDEr as a reward performs the best among the traditional evaluation metrics (e.g., CIDEr, BLEU or METEOR) for image/video captioning and can bring improvements on all other metrics. In our model, we also use the CIDEr score to compute the reward. But instead of directly using the final CIDEr score of the whole generated caption as the reward for each word $a_t$, we adopt the delta CIDEr score as the immediate reward. Let $f(x) = \mathrm{CIDEr}(sent + x) - \mathrm{CIDEr}(sent)$, where $sent$ is the previously generated caption. Then the discounted return for the worker is

$$R(a_t) = \sum_{k=0}^{\infty} \gamma^k f(a_{t+k}) \tag{24}$$

where $t$ denotes the time step at the worker's temporal resolution, and the discounted return for the manager is

$$R(e_t) = \sum_{n=0}^{\infty} \gamma^n f(e_{t+n}) \tag{25}$$

where $t$ is the time step at the manager's lower temporal resolution, so that $e_t$ denotes the $t$-th segment. Note that our approach is not limited to the CIDEr score; other reasonable rewards (e.g., deltaBLEU [7]) can also be applied to the HRL framework.
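The delta reward and the two discounted returns can be worked through on a toy example. In the sketch below, `prefix_scores[k]` stands for the metric score of the caption after its first k words; the numbers are made up for illustration and are not real CIDEr values. The helper names are hypothetical, and the discount factor 0.95 is the value reported in Appendix D.

```python
def delta_rewards(prefix_scores):
    """Immediate reward of each word: score after adding it minus score before."""
    return [prefix_scores[k + 1] - prefix_scores[k] for k in range(len(prefix_scores) - 1)]


def segment_rewards(prefix_scores, boundaries):
    """Immediate reward of each segment; boundaries are cumulative segment end positions."""
    rewards, start = [], 0
    for end in boundaries:
        rewards.append(prefix_scores[end] - prefix_scores[start])
        start = end
    return rewards


def discounted_returns(rewards, gamma):
    """Return R_t = r_t + gamma * R_{t+1}, computed backwards over a finite episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


prefix_scores = [0.0, 0.1, 0.1, 0.4, 0.5]     # toy scores for a 4-word caption
word_rewards = delta_rewards(prefix_scores)    # worker resolution: one reward per word
seg_rewards = segment_rewards(prefix_scores, boundaries=[2, 4])  # manager resolution

worker_returns = discounted_returns(word_rewards, gamma=0.95)
manager_returns = discounted_returns(seg_rewards, gamma=0.95)
```

Because the delta rewards telescope, the undiscounted sum of all word rewards (or of all segment rewards) equals the final score of the whole caption, so the delta formulation redistributes the final score over time rather than changing its total.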

3.3.4 Training Algorithm

Above we illustrate the learning methods to train the manager and the worker. In Algorithm 1 we present the pseudo-code of our HRL training algorithm for video captioning. The manager policy and the worker policy are trained alternately. Basically, when training the worker, we assume the manager is well-posed, so we disable the goal exploration and only update the worker policy according to Equation 14; when training the manager, we treat the worker as the Oracle behavior policy, so we generate the caption by greedy decoding and only update the manager policy following Equation 22.

During testing, goal exploration is disabled, and beam search is employed to generate the results. Only one forward pass is needed at test time.

Require: Training pairs (video, GT caption)
Randomly initialize the model parameters
Load the pretrained CNN model and internal critic
for iteration = 1, ..., M do
    Randomly sample a minibatch
    if Train-Worker then
        Disable the goal exploration
        Run a forward pass to get the sampled caption
        Calculate R(a_t) for each t
        Freeze the manager
        Update the worker policy using Equation 14
    else if Train-Manager then
        Initialize a random process for goal exploration
        Run a forward pass to get the greedily decoded caption
        Calculate R(e_t) for each t
        Freeze the worker
        Update the manager policy using Equation 22
    end if
end for
Algorithm 1 HRL training algorithm

3.3.5 Imitation Learning

A major challenge for a reinforcement learning agent to have good convergence properties is that the agent must start with a good policy at the beginning stage. For our model, we apply cross-entropy loss optimization to warm start both the worker and the manager simultaneously, where the manager is treated entirely as latent parameters. Let $\theta$ be the parameters of the whole model and $a^*_1, \dots, a^*_T$ be the ground-truth word sequence; then the cross-entropy loss is defined as

$$L(\theta) = -\sum_{t=1}^{T} \log \pi_{\theta}\left(a^*_t \mid a^*_1, \dots, a^*_{t-1}\right) \tag{26}$$

4 Experimental Results

4.1 Datasets


MSR-VTT

MSR-VTT [42] is a dataset for general video captioning, which is derived from a wide variety of video categories (7,180 videos from 20 general categories) and contains 10,000 video clips (6,513 for training, 497 for validation, and the remaining 2,990 for testing). Each clip has 20 human-annotated reference captions collected through Amazon Mechanical Turk (AMT).

Charades Captions

Charades [31] is a large-scale dataset composed of 9,848 videos of daily indoor activities collected through AMT. 267 different users were presented with a sentence script (e.g., a person fixes the bed then throws pillow on it) that included objects and actions from a fixed vocabulary, and the users recorded a video following the script using the provided objects and actions. The original dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos.

While the Charades dataset is mainly used for action recognition and segmentation, one should note that the collected textual descriptions are very detailed and depict the fine-grained human activities happening in long videos. Thus, we preprocessed the raw Charades dataset by combining the textual descriptions and the sentence scripts verified through AMT, and built a new large-scale dataset for detailed video captioning, Charades Captions, which consists of 6,963 videos for training, 500 for validation and 1,760 for testing. Each video clip is annotated with multiple (typically 2-5) captions. The captions are more detailed and longer than those of MSR-VTT (average caption length: 24.13 vs 9.28 words), which makes the dataset more suitable for fine-grained video captioning. (Footnote 4: For example, the sentence script of a video can be "A person is taking a picture of a light while sitting in a chair.", and the textual description is "A person in a bedroom appears to use their phone to film or take a picture of the light fixture on the ceiling." The latter is usually more detailed.)

Caption Segmentation

In order to train the internal critic that determines if a goal is accomplished, we preprocessed the ground-truth captions of the training sets of both datasets by breaking each caption into multiple semantic chunks. We segmented the captions mainly based on the Noun Phrase (NP) and Verb Phrase (VP) labels provided by the constituency parsing results (we used the open-source toolkits Stanford CoreNLP [17] and NLTK for constituency parsing). For instance, the caption "The person then tidies his area after he is done eating" was segmented into three sub-phrases, "The person", "then tidies his area" and "after he is done eating", with labels NP, VP and VP respectively. However, all we needed to train the internal critic were the chunks; the labels were not used.
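One way to turn the chunks into training targets for the internal critic is sketched below. The encoding is an assumption made for illustration: the text says only that the critic emits a binary segment signal and is trained on the chunks, so marking the last word of each chunk with 1 and every other word with 0 is a plausible choice rather than the documented one. The helper name `segment_end_labels` is hypothetical.

```python
def segment_end_labels(chunks):
    """Flatten chunks into words and mark the last word of each chunk with 1."""
    words, labels = [], []
    for chunk in chunks:
        tokens = chunk.split()
        for i, token in enumerate(tokens):
            words.append(token)
            labels.append(1 if i == len(tokens) - 1 else 0)
    return words, labels


# The example caption from the text, already split into its three chunks.
chunks = ["The person", "then tidies his area", "after he is done eating"]
words, labels = segment_end_labels(chunks)
print(list(zip(words, labels)))
```

For this caption the eleven words receive three positive labels, one at the end of each chunk.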

Method          BLEU@4  METEOR  ROUGE-L  CIDEr
Mean-Pooling    30.4    23.7    52.0     35.0
Soft-Attention  28.5    25.0    53.3     37.1
S2VT            31.4    25.7    55.9     35.2
v2t_navigator   40.8    28.2    60.9     44.8
Aalto           39.8    26.9    59.8     45.7
VideoLAB        39.1    27.7    60.6     44.1
XE-baseline     41.3    27.6    59.9     44.7
RL-baseline     40.6    28.5    60.7     46.3
HRL (Ours)      41.3    28.7    61.7     48.0

Table 1: Comparison with the state of the art on the MSR-VTT dataset.

4.2 Experimental Setup

Evaluation Metrics

We adopted four diverse automatic evaluation metrics: BLEU, METEOR, ROUGE-L, and CIDEr-D. We used the standard evaluation code from MS-COCO server [5] to obtain the results.

Training Details

All the hyperparameters were tuned on the validation set. For both datasets, we sampled frames from each video at a fixed rate and extracted ResNet-152 features [9] from these sampled frames without fine-tuning. More training details can be found in the supplementary material.

4.3 Results and Analysis

Comparison with state of the arts on MSR-VTT

In Table 1, we compared our single-sentence captioning results with the state-of-the-art methods on the MSR-VTT dataset. We listed the results of Mean-Pooling [38], Soft-Attention [43] and S2VT [37] as reported in previous work [29]. We also compared with the top-3 results from the MSR-VTT challenge: v2t_navigator [11], Aalto [30], and VideoLAB [23].

We implemented two baseline methods: an attention-based sequence-to-sequence model trained with cross-entropy loss (XE-baseline), and the same model trained with policy gradient and the CIDEr score as the RL reward (RL-baseline). As shown in Table 1, our XE-baseline achieved results comparable to the state of the art, and our RL-baseline further improved on METEOR, ROUGE-L and CIDEr at the cost of a slightly lower BLEU@4. Moreover, our HRL method matched the best BLEU@4 and outperformed all the other methods listed in the table on METEOR, ROUGE-L and CIDEr, which supports the effectiveness of our proposed method.

Result Analysis on Charades Captions

Method       B@1   B@2   B@3   B@4   M     R     C
XE-baseline  55.0  36.4  23.6  15.0  18.7  39.0  16.7
RL-baseline  57.6  41.4  28.0  18.8  17.7  39.8  21.6
HRL-16       64.4  44.3  29.4  18.8  19.5  41.4  23.2
HRL-32       64.0  43.4  28.4  17.9  19.2  41.0  21.3
HRL-64       61.7  43.0  28.8  18.8  18.7  31.2  23.6

Table 2: Results on the Charades Captions dataset. We reported BLEU (B), METEOR (M), ROUGE-L (R) and CIDEr (C) scores of our HRL method and two baselines for comparison.

Figure 4: Qualitative comparison with the baseline methods. The given examples were from the test set of the Charades Captions dataset.

Since there were no other papers reporting results on Charades Captions, we mainly compared our HRL model with our implementation of XE-baseline and RL-baseline. Meanwhile, we explored the dimension of the latent goal vector (we use HRL-$d$ to denote the HRL model with a goal dimension of $d$). As can be observed from Table 2, our HRL models outperformed the baseline methods on most evaluation metrics, and HRL-16 matched or exceeded both baselines on every metric. Note that our HRL model achieved a bigger improvement over the baseline methods on Charades Captions than on MSR-VTT. Given that the average caption length of Charades Captions is much longer than that of MSR-VTT (24.13 vs 9.28 words), the difference in the improvement gaps suggests that our HRL model gains more on detailed descriptions of longer videos.

Among the HRL models, HRL-16 achieved the best results on almost all metrics (its CIDEr score was second-best, slightly below that of HRL-64). Even though HRL-64 obtained better results than HRL-32 on BLEU@4 and CIDEr, its results on most other metrics were worse (in particular, its ROUGE-L score was much lower). Thus, comparing the results of the different HRL models, we concluded that HRL-16 > HRL-32 > HRL-64. This result accorded with our speculation: a higher dimension does not guarantee better performance. On the contrary, the exploration space grows exponentially as the dimension increases, making the learning even harder. A latent vector of small dimension, such as 16, is able to represent the semantically meaningful goal well.

Qualitative Comparison with Baseline Methods

In Figure 4, we illustrated two examples from the Charades Captions test set. In these examples, the captions generated by our HRL model matched the ground-truth captions better than those of the baseline methods. Moreover, due to the segment-by-segment generation manner, our HRL model was able to output a sequence of semantically meaningful phrases (different phrases are shown in different colors and separated by delimiters, as in Figure 4).

Learning Curve

For a more intuitive view of the models, we drew the learning curves of the CIDEr scores on the validation set (see Figure 5). Note that the RL-baseline model was first warmed up with cross-entropy loss and then improved using the REINFORCE algorithm. In particular, after we trained the XE-baseline model, we switched to policy gradient and continued training it as the RL-baseline model. The HRL models resumed training after a shorter warm-start period. As shown in Figure 5, the HRL models converged faster and achieved better peak points than the baseline methods. HRL-16 reached the highest point.

Figure 5: Learning curves of the CIDEr scores of different captioning models, including XE-baseline, RL-baseline and HRL models with goal dimension of 16, 32 and 64.

5 Conclusion

In this paper, we propose a hierarchical reinforcement learning framework for video captioning, which aims at improving the fine-grained generation of video descriptions with rich activities. Our HRL model obtains the state-of-the-art performance on both the widely used MSR-VTT dataset and the newly introduced Charades Captions dataset for fine-grained video captioning.

In the future, we plan to explore the attention space and utilize features from multiple modalities to boost our HRL agent. We believe that the results of our method can be further improved by employing different types of features, e.g., C3D features [36], optical flows, etc. Meanwhile, we will investigate the HRL framework in other similar sequence generation tasks like video/document summarization.


We would like to thank Ramakanth Pasunuru and Ruotian Luo for clarifying the technical details of their paper/code, and Wenhan Xiong for his help on debugging the model. Personally, Xin would like to express his appreciation for the care of his girlfriend (now his wife) while he was busy working on the paper.

Supplementary Material

Appendix A Attention Visualization

Fig. 6 shows a visualization example in which the attention of the learned text segments over the video frames is plotted. Clearly, when generating different text segments, the HRL model attended to different temporal frames. For example, when the model was producing the segment "is cooking on the stove", the first half of the video, which contained the action cooking, played a more important role with larger attention values.

Appendix B Qualitative Examples on MSR-VTT

In the main paper, we showed some generated results on the Charades Captions dataset. Here we show more qualitative examples on the MSR-VTT dataset in Figure 7.

In particular, Examples (a) and (b) revealed that our HRL method was able to capture more details of the video content and generate more fine-grained descriptions. For example, our HRL model provided both the event (a group of people are dancing) and the scene (on the beach) in Example (a), while the baseline methods failed to depict where the event was happening. Examples (c), (d), (e) and (f) further illustrated the correctness and accuracy of our HRL results. For instance, in Example (c), only the result of our HRL method described the video correctly. The ground-truth caption was "a group of men are racing around a track" and our result was "a group of people are running on a track", whereas the XE-baseline and RL-baseline mistakenly captioned the video with "a group of people are playing a game" and "a man is playing a football game" respectively. Compared with the results of the baseline methods, our results were in general more accurate and descriptive.

Appendix C Network Architecture

In this section, we illustrate the exact architecture used for the experiments (see Figure 2 in the main paper).


For both datasets, we sampled frames from each video at a fixed rate and used ResNet-152 [9] (a CNN model pretrained on ImageNet) to extract frame features without fine-tuning. Then the 2048-dim frame features were projected to 512-dim. The low-level encoder was a Bi-LSTM with hidden size 512, and the high-level encoder was an LSTM with hidden size 256.


The worker network consisted of a worker LSTM with hidden size 1024, an attention module similar to the one proposed by Bahdanau et al. [2], a word embedding of size 512, and a projection module (Linear → Tanh → Linear → SoftMax) that produced the probabilities over all tokens in the vocabulary.


The manager network was composed of a manager LSTM with hidden size 256, an attention module, and a linear layer that projected the output of the LSTM into latent goal space.

Internal Critic

The internal critic was also an RNN, which contained a GRU [6], a built-in word embedding, a linear layer, and a Sigmoid function. The hidden size of the GRU and the word embedding size were both 128 for MSR-VTT and 64 for Charades Captions.

Figure 6: A visualization demo of the attentions. Different text segments were in different colors, and the associated attentions were provided below the corresponding segments. We also showed the keyframe in the top row, which was selected from the most noticeable area for each segment.

Appendix D Training Details

All the hyperparameters were tuned on the validation set, including the dimension sizes in Sec. C. Moreover, we adopted Dropout [34] with a value of 0.5 for regularization. All the gradients were clipped into the range [-10, 10]. We initialized all the parameters with a uniform distribution in the range [-0.1, 0.1]. For the MSR-VTT dataset, we used a fixed step size of 50 for the encoder LSTMs and a maximum length of 30 for the captions. For the Charades Captions dataset, they were set to 150 and 60 respectively.

To train the cross-entropy (XE) models, the Adadelta optimizer [47] was used with batch size 64. The learning rate was initially set to 1 and then reduced by a factor of 0.5 when the current CIDEr score did not surpass the previous best for 4 epochs. Scheduled sampling [4] was employed to train the XE models. When training the RL and HRL models, we used the pretrained XE models to warm start and then continued training them with a learning rate of 0.1. The discount factors of the Manager and the Worker were both 0.95. At test time, we used beam search of size 5.
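The learning-rate schedule and gradient clipping described above can be sketched as follows. This is one reading of the description, under stated assumptions: the counter of non-improving epochs is reset after each reduction, and clipping is applied element-wise. The class and function names are hypothetical, and the validation scores in the example are invented.

```python
class PlateauHalver:
    """Multiply the learning rate by `factor` after `patience` epochs without a new best score."""

    def __init__(self, lr=1.0, factor=0.5, patience=4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.stale = 0

    def step(self, score):
        if score > self.best:
            self.best = score
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0      # assumption: start counting again after a reduction
        return self.lr


def clip_gradients(grads, limit=10.0):
    """Clip each gradient value into [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


scheduler = PlateauHalver(lr=1.0, factor=0.5, patience=4)
val_scores = [0.30, 0.40, 0.39, 0.38, 0.40, 0.37]   # invented validation scores
lrs = [scheduler.step(s) for s in val_scores]
clipped = clip_gradients([-25.0, 3.0, 12.0])
```

In this run the best score is set at the second epoch, and the learning rate is halved only after the fourth consecutive epoch that fails to beat it; a score that merely ties the best counts as non-improving.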

Figure 7: Qualitative comparison with the baseline methods on the MSR-VTT dataset. For each video example, we listed two ground-truth captions, the result generated by XE-baseline (cross entropy), the result by RL-baseline (policy gradient), and the result by our HRL method (hierarchical reinforcement learning). In our HRL results, different segments are shown in different colors and separated by delimiters.