Streamlined Dense Video Captioning

04/08/2019 · Jonghwan Mun et al. · Amazon · POSTECH · ByteDance Inc. · Wormpex Technology · Seoul National University

Dense video captioning is an extremely challenging task since an accurate and coherent description of events in a video requires a holistic understanding of video contents as well as contextual reasoning about individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then captioning a subset of the proposals. As a result, the generated sentences are prone to be redundant or inconsistent since they fail to consider temporal dependency between events. To tackle this challenge, we propose a novel dense video captioning framework, which explicitly models temporal dependency across events in a video and leverages visual and linguistic context from prior events for coherent storytelling. This objective is achieved by 1) integrating an event sequence generation network to select a sequence of event proposals adaptively, and 2) feeding the sequence of event proposals to our sequential video captioning network, which is trained by reinforcement learning with two-level rewards, at both event and episode levels, for better context modeling. The proposed technique achieves outstanding performance on the ActivityNet Captions dataset in most metrics.


1 Introduction

Understanding video contents is an important topic in computer vision. Through the introduction of large-scale datasets [9, 31] and recent advances in deep learning, research on video content understanding is no longer limited to activity classification or detection and addresses more complex tasks including video caption generation [1, 4, 13, 14, 15, 22, 23, 26, 28, 30, 33, 35, 36].

Video captions are effective for holistic video description. However, since videos usually contain multiple interdependent events in the context of a video-level story (i.e., an episode), a single sentence may not be sufficient to describe them. Consequently, the dense video captioning task [8] has been introduced and has recently gained popularity. This task is conceptually more complex than simple video captioning since it requires detecting individual events in a video and understanding their context. Fig. 1 presents an example of dense video captioning for a busking episode, which is composed of four ordered events.

Figure 1: An example of dense video captioning about a busking episode, which is composed of four interdependent events.

Despite the complexity of the problem, most existing methods [8, 10, 27, 37] decompose it into two subtasks: event detection and event description. An event proposal network detects events, and a captioning network generates a caption for each selected proposal independently.

We propose a novel framework for dense video captioning, which considers the temporal dependency of the events. Contrary to existing approaches shown in Fig. 2(a), our algorithm detects event sequences from videos and generates captions sequentially, where each caption is conditioned on prior events and captions as illustrated in Fig. 2(b). Our algorithm proceeds as follows. First, given a video, we obtain a set of candidate event proposals from an event proposal network. Then, an event sequence generation network selects a series of ordered events adaptively from the candidates. Finally, we generate captions for the selected event proposals using a sequential captioning network. The captioning network is trained via reinforcement learning using both event- and episode-level rewards; the event-level reward helps the model capture the specific content of each event precisely, while the episode-level reward encourages the generated captions to form a coherent story.

The main contributions of the proposed approach are summarized as follows:

  • We propose a novel framework of detecting event sequences for dense video captioning. The proposed event sequence generation network allows the captioning network to model temporal dependency between events and generate a set of coherent captions to describe an episode in a video.

  • We present reinforcement learning with two-level rewards, episode and event levels, which drives the captioning model to boost coherence across generated captions and quality of description for each event.

  • The proposed algorithm achieves state-of-the-art performance on the ActivityNet Captions dataset by large margins compared to methods based on the existing framework.

The rest of the paper is organized as follows. We first discuss works related to ours in Section 2. The proposed method and its training scheme are described in detail in Sections 3 and 4, respectively. We present experimental results in Section 5 and conclude the paper in Section 6.

2 Related Work

Figure 2: Comparison between the existing approaches and ours for dense video captioning. Our algorithm generates captions for events sequentially conditioned on the prior ones by detecting an event sequence in a video.

2.1 Video Captioning

Recent video captioning techniques often incorporate the encoder-decoder framework inspired by its success in image captioning [11, 16, 17, 25, 32]. Basic algorithms [22, 23] encode a video using Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), and decode the representation into a natural sentence using RNNs. Various techniques have since been proposed to enhance the quality of generated captions by integrating temporal attention [33], a joint embedding space of sentences and videos [14], hierarchical recurrent encoders [1, 13], attribute-augmented decoders [4, 15, 36], multimodal memory [28], and a reconstruction loss [26]. Despite their impressive performance, these methods describe a video with a single sentence and can be applied only to short videos containing a single event. Thus, Yu et al. [35] propose a hierarchical recurrent neural network to generate a paragraph for a long video, while Xiong et al. [30] introduce a paragraph generation method based on event proposals, where an event selection module progressively determines which proposals to use for caption generation. In contrast to these tasks, which simply generate a sentence or a paragraph for an input video, dense video captioning requires localizing and describing events at the same time.

2.2 Dense Video Captioning

Recent dense video captioning techniques typically attempt to solve the problem through two subtasks, event detection and caption generation [8, 10, 27, 37]: an event proposal network finds a set of candidate proposals, and a captioning network generates a caption for each proposal independently. The performance of these methods depends on manual thresholding strategies for selecting the final event proposals used in caption generation.

Based on this framework, Krishna et al. [8] adopt a multi-scale action proposal network [3] and introduce a captioning network that exploits visual context from past and future events with an attention mechanism. In [27], a bidirectional RNN is employed to improve the quality of event proposals, and a context gating mechanism is proposed to adaptively control the contribution of surrounding events during caption generation. Li et al. [10] incorporate temporal coordinate and descriptiveness regressions for precise localization of event proposals, and adopt the attribute-augmented captioning network of [34]. Zhou et al. [37] utilize self-attention [20] for the event proposal and captioning networks, and propose a masking network that converts event proposals into differentiable masks, enabling end-to-end learning of the two networks.

In contrast to the prior works, our algorithm identifies a small set of representative event proposals (i.e., event sequences) for sequential caption generation, which enables us to generate coherent and comprehensive captions by exploiting both visual and linguistic context across selected events. Note that the existing works fail to take advantage of linguistic context since the captioning network is applied to event proposals independently.

3 Our Framework

This section describes our main idea and the deep neural network architecture for our algorithm in detail.

Figure 3: Overall framework of the proposed algorithm. Given an input video, our algorithm first extracts a set of candidate event proposals using the Event Proposal Network (Section 3.2). From the candidate set, the Event Sequence Generation Network detects an event sequence by selecting one of the candidate event proposals at a time (Section 3.3). Finally, the Sequential Captioning Network takes the detected event sequence and sequentially generates captions conditioned on preceding events (Section 3.4). The three models are trained in a supervised manner (Section 4.1), and the Sequential Captioning Network is then additionally optimized with reinforcement learning using two-level rewards (Section 4.2).

3.1 Overview

Let a video V contain a set of events E = {e_1, ..., e_K} with corresponding descriptions C = {c_1, ..., c_K}, where each event is temporally localized by its starting and ending time stamps. Existing methods [8, 10, 27, 37] typically divide the whole problem into two steps: event detection followed by description of the detected events. These algorithms train models by minimizing the sum of negative log-likelihoods of event and caption pairs as follows:

L = - Σ_{k=1}^{K} [ log P(e_k | V) + log P(c_k | e_k, V) ]    (1)

However, events in a video have temporal dependency and should form a story about a single topic. Therefore, it is critical to identify an ordered list of events that describes a coherent story corresponding to the episode, i.e., the composition of the events. With this in consideration, we formulate dense video captioning as detection of an event sequence followed by sequential caption generation:

P(E, C | V) = P(E | V) Π_{k=1}^{K} P(c_k | c_1, ..., c_{k-1}, E, V)    (2)

The overall framework of the proposed algorithm is illustrated in Fig. 3. For a given video, a set of candidate event proposals is generated by the event proposal network. Then, our event sequence generation network provides a series of events by selecting one of the candidate event proposals at a time, where the selected proposals correspond to the events comprising an episode in the video. Finally, we generate captions from the selected proposals using the proposed sequential captioning network, where each caption is generated conditioned on preceding proposals and their captions. The captioning network is trained via reinforcement learning using event- and episode-level rewards.
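
As a rough illustration of this pipeline, the sketch below wires the three stages together; the function arguments (extract_proposals, select_event_sequence, generate_caption) are hypothetical placeholders standing in for EPN, ESGN and SCN, not the authors' implementation.

```python
# Hypothetical end-to-end sketch of the three-stage pipeline described above.
# The function names are placeholders for illustration, not the authors' code.

def dense_caption(video_segments,
                  extract_proposals,      # EPN: video -> candidate proposals (Section 3.2)
                  select_event_sequence,  # ESGN: candidates -> ordered subset (Section 3.3)
                  generate_caption):      # SCN: (event, episode state) -> caption (Section 3.4)
    """Detect an event sequence, then caption its events sequentially."""
    candidates = extract_proposals(video_segments)
    event_sequence = select_event_sequence(candidates)
    captions, episode_state = [], None
    for event in event_sequence:
        # each caption is conditioned on the episode state built from prior events
        caption, episode_state = generate_caption(event, episode_state)
        captions.append(caption)
    return list(zip(event_sequence, captions))
```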

3.2 Event Proposal Network (EPN)

EPN plays a key role in selecting event candidates. We adopt Single-Stream Temporal action proposals (SST) [2] due to its good performance and efficiency in finding semantically meaningful temporal regions with a single scan of a video. SST divides an input video into a set of non-overlapping segments of fixed length (e.g., 16 frames), where the representation of each segment is given by a 3D convolution (C3D) network [19]. Treating each segment as the ending point of an event proposal, SST identifies its matching starting points among the preceding segments, which are represented by the output vectors of a Gated Recurrent Unit (GRU) at each time step. After extracting the top 1,000 event proposals, we obtain the final set of candidate proposals by eliminating highly overlapping ones using non-maximum suppression. Note that EPN provides a representation of each proposal, given by the concatenation of the two SST hidden states at its starting and ending segments. This visual representation is utilized by the other two networks.
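
For illustration, a minimal temporal non-maximum suppression routine over scored (start, end) proposals is sketched below; the overlap threshold is an assumed value, and the closing comment mirrors the feature-concatenation step described above.

```python
# Illustrative temporal non-maximum suppression for event proposals.
# Proposals are (start, end) pairs in seconds; thresholds are assumptions.

def tiou(a, b):
    """Temporal IoU between two proposals given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, scores, threshold=0.7, top_k=1000):
    """Keep high-scoring proposals, dropping those that overlap a kept one above threshold."""
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)[:top_k]
    kept = []
    for i in order:
        if all(tiou(proposals[i], proposals[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Each kept proposal can then be represented by concatenating the SST hidden states
# at its starting and ending segments, e.g. feature = start_hidden + end_hidden.
```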

3.3 Event Sequence Generation Network (ESGN)

Given a set of candidate event proposals, ESGN selects a series of events that are highly correlated and make up an episode of the video. To this end, we employ a Pointer Network (PtrNet) [24], which is designed to produce a distribution over the input set using a recurrent neural network with an attention module. PtrNet is well suited to selecting an ordered subset of proposals, enabling coherent caption generation that takes their temporal dependency into account.

As shown in Fig. 3, we first encode the set of candidate proposals by feeding them to an encoder RNN in increasing order of their starting times, and initialize the first hidden state of PtrNet with the encoded representation to guide proposal selection. At each time step of PtrNet, we compute likelihoods over the candidate event proposals and select the proposal with the highest likelihood among the available ones. The procedure is repeated until PtrNet selects the END event proposal, a special proposal that indicates the end of an event sequence.

The whole process is summarized as follows:

(3)
(4)
(5)

where the recurrence updates the hidden state of PtrNet, an attention function computes confidence scores over the candidate proposals, and the representation of each proposal in PtrNet combines its visual feature with its location feature. The event proposal selected at each time step is the candidate with the highest likelihood:

(6)

Note that the location feature is a binary mask vector whose elements are set to 1 within the temporal interval of the corresponding proposal and to 0 otherwise. This helps the network identify and disregard proposals that overlap strongly with previously selected ones.

Our ESGN has clear benefits for dense video captioning. Specifically, it determines the number and order of events adaptively, which facilitates compact, comprehensive and context-aware caption generation. Notably, existing approaches produce far too many detected events through manual thresholding, whereas ESGN detects only 2.85 events on average, which is comparable to the average number of events per video in the ActivityNet Captions dataset, 3.65. Although sorting event proposals is an ill-defined problem due to their two time stamps (starting and ending points), ESGN naturally learns the number and order of proposals based on the semantics and contexts of individual videos in a data-driven manner.
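
The following sketch, under simplified assumptions, captures the selection loop of ESGN: candidates plus a special END item are scored at every step, the most likely one is chosen, and selection stops at END. The score_fn argument stands in for the pointer-network attention, and the location-mask helper reflects the binary mask described above.

```python
# Conceptual sketch of the ESGN selection loop (not the full pointer network).

def make_location_mask(start, end, num_segments):
    """Binary mask over video segments covered by a proposal (1 inside, 0 outside).
    In the full model this mask is part of each proposal's representation."""
    return [1 if start <= t < end else 0 for t in range(num_segments)]

def select_event_sequence(candidates, score_fn, max_events=10):
    """candidates: list of (start, end) proposals.
    score_fn(selected_indices, choice) -> float stands in for the attention module;
    'choice' is a candidate index or the special string 'END'."""
    # Candidates are considered in increasing order of their starting times.
    order = sorted(range(len(candidates)), key=lambda i: candidates[i][0])
    selected = []
    for _ in range(max_events):
        remaining = [i for i in order if i not in selected]
        best = max(remaining + ["END"], key=lambda c: score_fn(selected, c))
        if best == "END":
            break  # the END item terminates the event sequence
        selected.append(best)
        # in the full model, the decoder state is updated here with the chosen
        # proposal's visual feature and location mask
    return [candidates[i] for i in selected]
```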

3.4 Sequential Captioning Network (SCN)

SCN employs a hierarchical recurrent neural network to generate coherent captions for the detected event sequence. As shown in Fig. 3, SCN consists of two RNNs: an episode RNN and an event RNN. The episode RNN takes the proposals of a detected event sequence one by one and implicitly models the state of the episode, while the event RNN generates the words of a caption sequentially for each event proposal conditioned on the implicit representation of the episode, i.e., based on the current context of the episode.

Formally, the caption generation process for each event proposal in the detected event sequence is given by

(7)
(8)

where the episode RNN produces an episodic feature for the current event proposal, and the caption feature is given by the last hidden state of the unrolled event RNN; the event RNN also consumes the set of C3D features for all segments lying within the temporal interval of the event proposal. The episode RNN provides the current episodic feature so that the event RNN generates context-aware captions, whose features are fed back to the episode RNN.

Although both networks can conceptually be implemented with any RNNs, we adopt a single-layer Long Short-Term Memory (LSTM) with a 512-dimensional hidden state as the episode RNN, and the captioning network with temporal dynamic attention and context gating (TDA-CG) presented in [27] as the event RNN. TDA-CG generates words from a feature computed by gating a visual feature and an attended feature obtained from the segment-level feature descriptors.

Note that the sequential caption generation scheme enables the model to exploit both visual context (i.e., how other events look) and linguistic context (i.e., how other events are described) across events, and allows us to generate captions with explicit context. Although existing methods [8, 27] also utilize context for caption generation, they are limited to visual context and cannot model linguistic dependency due to the architectural constraint of their independent caption generation scheme, which may result in inconsistent and redundant captions.
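
A simplified PyTorch-style sketch of the hierarchical structure under teacher forcing follows; the dimensions, the plain LSTMCell used as the event RNN (the paper uses TDA-CG [27] for this role), and the forward signature are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class SequentialCaptioner(nn.Module):
    """Minimal sketch of the episode/event hierarchy (not the full TDA-CG model)."""

    def __init__(self, feat_dim, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.episode_rnn = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)  # episode-level state
        self.event_rnn = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)     # word-level decoder
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, event_feats, captions):
        """event_feats: (K, feat_dim); captions: (K, T) word ids used for teacher forcing."""
        K, T = captions.shape
        h_ep = c_ep = torch.zeros(1, self.episode_rnn.hidden_size)
        caption_feat = torch.zeros(1, self.event_rnn.hidden_size)
        logits = []
        for k in range(K):
            # episode RNN consumes the event feature and the previous caption feature
            h_ep, c_ep = self.episode_rnn(
                torch.cat([event_feats[k:k + 1], caption_feat], dim=1), (h_ep, c_ep))
            # event RNN starts from the episodic state and decodes the caption word by word
            h_ev, c_ev = h_ep, c_ep
            for t in range(T):
                word = self.embed(captions[k:k + 1, t])
                h_ev, c_ev = self.event_rnn(
                    torch.cat([word, event_feats[k:k + 1]], dim=1), (h_ev, c_ev))
                logits.append(self.classifier(h_ev))
            caption_feat = h_ev  # last hidden state summarizes the generated caption
        return torch.stack(logits, dim=1)
```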

4 Training

We first learn the event proposal network and fix its parameters during training of the other two networks. We train the event sequence generation network and the sequential captioning network in a supervised manner, and further optimize the captioning network based on reinforcement learning with two-level rewards—event and episode levels.

4.1 Supervised Learning

Event Proposal Network

Consider the confidence predicted for each event proposal at each time step in EPN, which is SST [2] in our algorithm. The ground-truth label of a proposal is set to 1 if the proposal has a temporal Intersection-over-Union (tIoU) larger than 0.5 with a ground-truth event, and to 0 otherwise. Then, for a given video and its ground-truth labels, we train EPN by minimizing the following weighted binary cross entropy loss:

(9)

where the loss is computed over the proposals considered at each ending segment and accumulated over all segments in the video, with weights balancing the binary labels.
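
A minimal sketch of such a weighted binary cross entropy term is given below; since the exact weighting of Eq. (9) is not recoverable from this copy, generic positive/negative balancing weights are assumed.

```python
import math

def weighted_bce(confidences, labels, w_pos=1.0, w_neg=1.0):
    """Weighted binary cross entropy over proposal confidences at one ending segment.

    confidences: predicted probabilities for the proposals ending here (e.g. 128 values).
    labels:      1 if a proposal has tIoU > 0.5 with a ground-truth event, else 0.
    w_pos/w_neg: assumed balancing weights for positive/negative labels.
    """
    eps = 1e-8
    loss = 0.0
    for p, y in zip(confidences, labels):
        loss -= w_pos * y * math.log(p + eps) + w_neg * (1 - y) * math.log(1 - p + eps)
    return loss / max(len(labels), 1)
```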

Event Sequence Generation Network

For a video with a ground-truth event sequence and a set of candidate event proposals, the goal of ESGN is to select, at each step, a proposal that highly overlaps with the corresponding ground-truth event, which is achieved by minimizing the following sum of binary cross entropy losses:

(10)

where each candidate proposal contributes its temporal Intersection-over-Union with the ground-truth event together with the likelihood that the proposal is selected at that step.
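
The sketch below shows one reading of this objective for a single selection step, with the tIoU of each candidate against the ground-truth event used as a soft target for the predicted selection likelihood; this interpretation is an assumption.

```python
import math

def esgn_step_loss(selection_probs, candidates, gt_event):
    """Binary cross entropy for one selection step.

    selection_probs: likelihood of selecting each candidate at this step.
    candidates:      list of (start, end) proposals.
    gt_event:        (start, end) of the corresponding ground-truth event.
    The tIoU with the ground-truth event is used as a soft target (assumed reading)."""
    def tiou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    eps, loss = 1e-8, 0.0
    for prob, cand in zip(selection_probs, candidates):
        target = tiou(cand, gt_event)
        loss -= target * math.log(prob + eps) + (1 - target) * math.log(1 - prob + eps)
    return loss / max(len(candidates), 1)
```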

Sequential Captioning Network

We utilize the ground-truth event sequence and its descriptions to learn our SCN via the teacher forcing technique [29]. Specifically, to learn the two RNNs in SCN, we provide the episode RNN and the event RNN with ground-truth events and captions as their inputs, respectively. Then, the captioning network is trained by minimizing the negative log-likelihood over the words of the ground-truth captions as follows:

(11)

where the event RNN produces a predictive distribution over the word vocabulary at each step, and the loss is summed over the ground-truth words of each event description up to its length.
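
A minimal sketch of this teacher-forced negative log-likelihood is given below; representing the per-step predictive distributions as word-to-probability mappings is purely for illustration.

```python
import math

def caption_nll(word_probs, gt_caption):
    """Negative log-likelihood of a ground-truth caption under the event RNN outputs.

    word_probs: list of dicts mapping word -> probability, one per decoding step
                (produced with teacher forcing, i.e. ground-truth words as inputs).
    gt_caption: list of ground-truth words for the event.
    """
    eps = 1e-8
    return -sum(math.log(dist.get(word, eps)) for dist, word in zip(word_probs, gt_caption))

# The total supervised loss of SCN sums this quantity over all events
# in the ground-truth event sequence of a video.
```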

| Method | Recall @0.3 | Recall @0.5 | Recall @0.7 | Recall @0.9 | Recall Avg. | Precision @0.3 | Precision @0.5 | Precision @0.7 | Precision @0.9 | Precision Avg. |
| MFT [30] | 46.18 | 29.76 | 15.54 | 5.77 | 24.31 | 86.34 | 68.79 | 38.30 | 12.19 | 51.41 |
| ESGN (ours) | 93.41 | 76.40 | 42.40 | 10.10 | 55.58 | 96.71 | 77.73 | 44.84 | 10.99 | 57.57 |
Table 1: Event detection performance, including recall and precision at four thresholds of temporal Intersection-over-Union (@tIoU), on the ActivityNet Captions validation set. The bold-faced numbers mean the best performance for each metric.
| Method | B@1 (GT) | B@2 (GT) | B@3 (GT) | B@4 (GT) | C (GT) | M (GT) | B@1 (learned) | B@2 (learned) | B@3 (learned) | B@4 (learned) | C (learned) | M (learned) |
| DCE [8] | 18.13 | 8.43 | 4.09 | 1.60 | 25.12 | 8.88 | 10.81 | 4.57 | 1.90 | 0.71 | 12.43 | 5.69 |
| DVC [10] | 19.57 | 9.90 | 4.55 | 1.62 | 25.24 | 10.33 | 12.22 | 5.72 | 2.27 | 0.73 | 12.61 | 6.93 |
| Masked Transformer [37] | 23.93 | 12.16 | 5.76 | 2.71 | 47.71 | 11.16 | 9.96 | 4.81 | 2.42 | 1.15 | 9.25 | 4.98 |
| TDA-CG [27] | - | - | - | - | - | 10.89 | 10.75 | 5.06 | 2.55 | 1.31 | 7.99 | 5.86 |
| MFT [30] | - | - | - | - | - | - | 13.31 | 6.13 | 2.82 | 1.24 | 21.00 | 7.08 |
| SDVC (ours) | 28.02 | 12.05 | 4.41 | 1.28 | 43.48 | 13.07 | 17.92 | 7.99 | 2.94 | 0.93 | 30.68 | 8.82 |
Table 2: Dense video captioning results, including Bleu@N (B@N), CIDEr (C) and METEOR (M), for our model and other state-of-the-art methods on the ActivityNet Captions validation set. We report performance obtained from both ground-truth (GT) proposals and learned proposals. Asterisk stands for methods re-evaluated using the newer evaluation tool and star indicates methods exploiting additional modalities (e.g., optical flow and attributes) for video representation. The bold-faced numbers mean the best for each metric.

4.2 Reinforcement Learning

Inspired by its success in image captioning [16, 17], we further employ reinforcement learning to optimize SCN. Similar to the self-critical sequence training approach [17], the learning objective of our captioning network is revised to minimize the negative expected reward of sampled captions. The loss is formally given by

(12)

where the descriptions are sampled from the detected event sequence provided by ESGN, and each sampled description receives a reward value. The expected gradient over the sample set is then given by

(13)

We adopt a reward function with two levels: event and episode. This encourages the model to generate coherent captions by reflecting the overall context of a video, while facilitating better word choices for describing individual events depending on that context. Also, motivated by [6, 16, 17], we use the rewards obtained from captions generated with ground-truth proposals as baselines, which helps reduce the variance of the gradient estimate. This drives the model to generate captions at least as competitive as the ones generated from ground-truth proposals, even though the intervals of predicted event proposals are not exactly aligned with those of the ground-truth. Specifically, for a sampled event sequence, we find a reference event sequence and its descriptions, where each reference event is given by the ground-truth proposal with the highest overlap with the corresponding sampled event. Then, the reward for a sampled description is given by

(14)

where the similarity function returns a score between two captions or two sets of captions, and the baselines are the descriptions generated from the reference event sequence. Both terms in Eq. (14) encourage our model to increase the probability of sampled descriptions whose scores are higher than those of captions generated from the ground-truth event proposals. Note that the first and second terms are computed on the current event and the episode, respectively. We use two widely used captioning metrics, METEOR and CIDEr, to define the similarity function.
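
The reward computation can be sketched as follows, assuming a generic score function (METEOR or CIDEr) and hypothetical argument names; the baseline captions are those generated from the reference (ground-truth) proposals, and the reference captions are the ground-truth descriptions.

```python
def two_level_reward(sampled_caption, sampled_episode,
                     baseline_caption, baseline_episode,
                     reference_caption, reference_episode,
                     score):
    """Illustrative two-level reward with a ground-truth-proposal baseline.

    score(candidate, reference) is a caption similarity metric such as METEOR or CIDEr.
    'baseline_*' are captions generated from the reference (ground-truth) proposals,
    'reference_*' are the ground-truth descriptions. Argument names are assumptions.
    """
    event_term = score(sampled_caption, reference_caption) - score(baseline_caption, reference_caption)
    episode_term = score(sampled_episode, reference_episode) - score(baseline_episode, reference_episode)
    return event_term + episode_term

# In self-critical training, the policy gradient scales the log-probability of each
# sampled caption by this reward, so samples that beat the baseline are reinforced.
```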

5 Experiments

5.1 Dataset

We evaluate the proposed algorithm on the ActivityNet Captions dataset [8], which contains 20k YouTube videos with an average length of 120 seconds. The dataset consists of 10,024, 4,926 and 5,044 videos for the training, validation and test splits, respectively. Each video has 3.65 temporally localized events with descriptions on average, and the average length of the descriptions is 13.48 words.

5.2 Metrics

We use the performance evaluation tool provided by the 2018 ActivityNet Captions Challenge (https://github.com/ranjaykrishna/densevid_eval), which measures the capability to localize and describe events. (On 11/02/2017, the official evaluation tool fixed a critical issue: previously, only one out of multiple incorrect predictions for each video was counted, which overestimated the performance reported in [27, 37]. We therefore obtained raw results from the authors and report the scores measured by the new metric.) For evaluation, we measure recall and precision of event proposal detection, and METEOR, CIDEr and BLEU of dense video captioning. The scores are summarized by their averages over tIoU thresholds of 0.3, 0.5, 0.7 and 0.9, given the identified proposals and generated captions. We use METEOR as the primary metric for comparison, since it is known to correlate better with human judgments than other metrics when only a small number of reference descriptions are available [21].

5.3 Implementation Details

For EPN, we use a two-layer GRU with 512-dimensional hidden states and generate 128 proposals at each ending segment, which makes the dimensionality of the label vector in Eq. (9) equal to 128. In our implementation, EPN based on SST takes the whole span of a video as its input during training; this allows the network to consider all ground-truth proposals, whereas the original SST [2] is trained with densely sampled clips given by a sliding window.

For ESGN, we adopt a single-layer GRU and a single-layer LSTM as the encoder RNN and the decoder of PtrNet, respectively, where the dimensions of the hidden states are both 512. We represent the location feature of proposals with a 100-dimensional vector. When training SCN with reinforcement learning, we sample 100 event sequences for each video and generate one caption for each event in the event sequence with greedy decoding. In all experiments, we use Adam [7] to learn the models with a mini-batch size of one video and a learning rate of 0.0005.

5.4 Comparison with Other Methods

| Method | Ensemble | METEOR |
| RUC+CMU | yes | 8.53 |
| YH Technologies | no | 8.13 |
| Shandong Univ. | yes | 8.11 |
| SDVC (ours) | no | 8.19 |
Table 3: Results on the ActivityNet Captions evaluation server.
| Method | Number of proposals | Recall | Precision | METEOR |
| EPN-Ind | 77.99 | 84.97 | 28.10 | 4.58 |
| ESGN-Ind | 2.85 | 55.58 | 57.57 | 6.73 |
| ESGN-SCN | 2.85 | 55.58 | 57.57 | 6.92 |
| ESGN-SCN-RL (SDVC) | 2.85 | 55.58 | 57.57 | 8.82 |
Table 4: Ablation results: recall, precision and METEOR averaged over four tIoU thresholds of 0.3, 0.5, 0.7 and 0.9 on the ActivityNet Captions validation set. We also present the average number of proposals. The bold-faced number means the best performance.

We compare the proposed Streamlined Dense Video Captioning (SDVC) algorithm with several existing state-of-the-art methods, including DCE [8], DVC [10], Masked Transformer [37] and TDA-CG [27]. We additionally report the results of MFT [30], which was originally proposed for video paragraph generation but whose event selection module can also generate an event sequence from the candidate event proposals; it chooses between selecting each proposal for caption generation and skipping it, and thus constructs an event sequence implicitly. For MFT, we compare performance in both event detection and dense captioning.

Table 1 presents the event detection performance of ESGN and MFT on the ActivityNet Captions validation set. ESGN outperforms the progressive event selection module of MFT at most tIoU thresholds by large margins, especially in recall. This validates the effectiveness of our event sequence selection algorithm.

Table 2 presents the performance of dense video captioning algorithms evaluated on the ActivityNet Captions validation set. We measure the scores with both ground-truth proposals and learned ones, where the number of predicted proposals differs across algorithms; DCE, DVC, Masked Transformer and TDA-CG use 1,000, 1,000, 226.78 and 97.61 proposals on average, respectively, while the average number of proposals in SDVC is only 2.85. According to Table 2, SDVC improves the quality of captions significantly compared to all other methods. Masked Transformer achieves performance comparable to ours with ground-truth proposals, but does not work well with learned proposals. Note that it uses optical flow features in addition to visual features, while SDVC is trained on visual features only. Since motion information from optical flow consistently improves performance in other video understanding tasks [12, 18], incorporating it into our model may lead to additional gains. MFT has the highest METEOR score among the existing methods, which is partly because MFT considers temporal dependency across captions.

Table 3 shows the test split results from the evaluation server. SDVC achieves competitive performance based only on basic visual features while other methods exploit additional modalities (e.g., audio and optical flow) to represent videos and/or ensemble models to boost accuracy as described in [5].

5.5 Ablation Studies

We perform several ablation studies on the ActivityNet Captions validation set to investigate the contributions of individual components of our algorithm. In this experiment, we train the following four variants of our model: 1) EPN-Ind: generating captions independently from all candidate event proposals, which is a baseline similar to most existing frameworks, 2) ESGN-Ind: generating captions independently using the event RNN only, from the events within the event sequence identified by our ESGN, 3) ESGN-SCN: generating captions sequentially using our hierarchical RNN from the detected event sequence, and 4) ESGN-SCN-RL: our full model (SDVC), which uses reinforcement learning to further optimize the captioning network.

Table 4 summarizes the results of this ablation study, from which we make the following observations. First, the approach based on ESGN (ESGN-Ind) is more effective than the baseline that simply relies on all event proposals (EPN-Ind). ESGN also reduces the number of candidate proposals significantly, from 77.99 to 2.85 on average, with a substantial increase in METEOR score, which indicates that ESGN successfully identifies event sequences from the candidate event proposals. Second, context modeling through the hierarchical structure of the captioning network (ESGN-SCN) enhances performance compared to independent caption generation without context (ESGN-Ind). Finally, ESGN-SCN-RL successfully integrates reinforcement learning to further improve the quality of generated captions.

Event-level reward Episode-level reward METEOR
8.73
8.29
8.82
Table 5: Performance comparison varying reward levels in reinforcement learning on the ActivityNet Captions dataset.
Figure 4: Qualitative results on the ActivityNet Captions dataset. The arrows represent ground-truth events (red) and events in the predicted event sequence from our event sequence generation network (blue) for input videos. Note that the events in the event sequence are selected in the order of their indices. For the predicted events, we show the captions generated independently (ESGN-Ind) and sequentially (SDVC). More consistent captions are obtained by our sequential captioning network; words for comparison are marked in bold-faced black.

We also analyze the impact of two reward levels—event and episode—used for reinforcement learning. The results are presented in Table 5, which clearly demonstrates the effectiveness of training with rewards from both levels.

5.6 Qualitative Results

Fig. 4 illustrates qualitative results, where the detected event sequences and generated captions are presented together. We compare the captions generated by our model (SDVC), which generates captions sequentially, with those of the model (ESGN-Ind) that generates descriptions independently from the detected event sequences. The proposed ESGN effectively identifies event sequences for input videos, and our sequential caption generation strategy helps describe events more coherently by exploiting both visual and linguistic contexts. For instance, in the first example in Fig. 4, SDVC captures the linguistic context ('two men' in an earlier caption is referred to as 'they' in the following ones) as well as the temporal dependency between events (the expression 'continue'), while ESGN-Ind simply recognizes and describes the events as if they occurred independently.

6 Conclusion

We presented a novel framework for dense video captioning, which considers visual and linguistic contexts for coherent caption generation by explicitly modeling temporal dependency across events in a video. Specifically, we introduced the event sequence generation network to detect a series of event proposals adaptively. Given the detected event sequence, a sequence of captions is generated by conditioning on preceding events in our sequential captioning network. We trained the captioning network in a supervised manner and further optimized it via reinforcement learning with two-level rewards for better context modeling. Our algorithm achieved state-of-the-art accuracy on the ActivityNet Captions dataset in terms of METEOR.

Acknowledgments

This work was partly supported by Snap Inc., Korean ICT R&D program of the MSIP/IITP grant [2016-0-00563, 2017-0-01780], and SNU ASRI.

References

  • [1] Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. Hierarchical Boundary-Aware Neural Encoder for Video Captioning. In CVPR, 2017.
  • [2] Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. SST: Single-Stream Temporal Action Proposals. In CVPR, 2017.
  • [3] Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. DAPs: Deep Action Proposals for Action Understanding. In ECCV, 2016.
  • [4] Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic Compositional Networks for Visual Captioning. In CVPR, 2017.
  • [5] Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Khrisna, Shyamal Buch, and Cuong Duc Dao. The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary. arXiv preprint arXiv:1808.03766, 2018.
  • [6] Jiuxiang Gu, Jianfei Cai, Gang Wang, and Tsuhan Chen. Stack-Captioning: Coarse-to-Fine Learning for Image Captioning. In AAAI, 2018.
  • [7] Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
  • [8] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-Captioning Events in Videos. In ICCV, 2017.
  • [9] Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. TGIF: A New Dataset and Benchmark on Animated GIF Description. In CVPR, 2016.
  • [10] Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. Jointly Localizing and Describing Events for Dense Video Captioning. In CVPR, 2018.
  • [11] Jonghwan Mun, Minsu Cho, and Bohyung Han. Text-Guided Attention Model for Image Captioning. In AAAI, 2017.
  • [12] Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. Weakly Supervised Action Localization by Sparse Temporal Pooling Network. In CVPR, 2018.
  • [13] Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning. In CVPR, 2016.
  • [14] Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. Jointly Modeling Embedding and Translation to Bridge Video and Language. In CVPR, 2016.
  • [15] Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. Video Captioning with Transferred Semantic Attributes. In CVPR, 2017.
  • [16] Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. Deep Reinforcement Learning-Based Image Captioning with Embedding Reward. In CVPR, 2017.
  • [17] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-Critical Sequence Training for Image Captioning. In CVPR, 2017.
  • [18] Karen Simonyan and Andrew Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. In NIPS, 2014.
  • [19] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In ICCV, 2015.
  • [20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In NIPS, 2017.
  • [21] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-Based Image Description Evaluation. In CVPR, 2015.
  • [22] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to Sequence-Video to Text. In ICCV, 2015.
  • [23] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. Translating Videos to Natural Language using Deep Recurrent Neural Networks. In NAACL-HLT, 2015.
  • [24] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer Networks. In NIPS, 2015.
  • [25] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and Tell: A Neural Image Caption Generator. In CVPR, 2015.
  • [26] Bairui Wang, Lin Ma, Wei Zhang, and Wei Liu. Reconstruction Network for Video Captioning. In CVPR, 2018.
  • [27] Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, and Yong Xu. Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning. In CVPR, 2018.
  • [28] Junbo Wang, Wei Wang, Yan Huang, Liang Wang, and Tieniu Tan. M3: Multimodal Memory Modelling for Video Captioning. In CVPR, 2018.
  • [29] Ronald J Williams and David Zipser. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural computation, 1(2):270–280, 1989.
  • [30] Yilei Xiong, Bo Dai, and Dahua Lin. Move Forward and Tell: A Progressive Generator of Video Descriptions. In ECCV, 2018.
  • [31] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In CVPR, 2016.
  • [32] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML, 2015.
  • [33] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing Videos by Exploiting Temporal Structure. In ICCV, 2015.
  • [34] Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. Boosting Image Captioning with Attributes. In ICCV, 2017.
  • [35] Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. Video Paragraph Captioning using Hierarchical Recurrent Neural Networks. In CVPR, 2016.
  • [36] Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering. In CVPR, 2017.
  • [37] Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. End-to-End Dense Video Captioning with Masked Transformer. In CVPR, 2018.

Appendix A Details of Event RNN

As described in Section 3.4 of the main paper, the event RNN is in charge of generating a description given an event and context information, and of returning a feature for the generated caption. This section discusses the caption generation process of the event RNN in detail.

Following [27], we provide two kinds of event information to the event RNN: a set of segment-level feature descriptors within the interval of an event, and a visual representation obtained from the event proposal network, SST. We first set the initial hidden state of the event RNN to the episode context vector given by the episode RNN. Then, at each time step of the event RNN, we perform Temporal Dynamic Attention (TDA) to obtain an attentive segment-level feature from the descriptors, followed by Context Gating (CG) to adaptively model the relative contributions of the attentive segment-level feature and the visual feature and to return a gated event feature. Based on the gated event feature, the event RNN generates a word, and it returns its hidden state as the caption feature when generating the END token.

The whole caption generation process in the event RNN is summarized by the following sequence of operations:

(15)
(16)
(17)
(18)
(19)
(20)

where the learnable parameters produce the output distribution from the hidden state of the event RNN, and the remaining quantities denote the input word, its word embedding vector, the attentive segment-level feature vector, the gated event feature vector and the probability distribution over the vocabulary at each time step, respectively. At each time step, given an event and its segments, TDA computes the attentive vector by

(21)
(22)
(23)

where the attention scores are computed with learnable parameters over the segments of the event. Once the attentive segment-level feature is obtained, CG computes a gating vector and the gated event vector as follows:

(24)
(25)
(26)
(27)

where the gating is computed with learnable parameters and a sigmoid nonlinearity, operating on the concatenation of its inputs with element-wise multiplication.
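
A simplified PyTorch-style sketch of TDA followed by CG is given below; the parameterization and dimensions are assumptions, and the gating acts on the concatenation of the attended and visual features rather than reproducing Eqs. (21)-(27) exactly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TDACG(nn.Module):
    """Simplified sketch of temporal dynamic attention (TDA) followed by context gating (CG).

    The exact formulation follows [27]; this module only illustrates the two steps."""

    def __init__(self, seg_dim, vis_dim, hid_dim):
        super().__init__()
        self.att_hidden = nn.Linear(hid_dim, hid_dim)
        self.att_segment = nn.Linear(seg_dim, hid_dim)
        self.att_score = nn.Linear(hid_dim, 1)
        self.gate = nn.Linear(seg_dim + vis_dim + hid_dim, seg_dim + vis_dim)

    def forward(self, seg_feats, vis_feat, hidden):
        """seg_feats: (N, seg_dim); vis_feat: (vis_dim,); hidden: (hid_dim,)."""
        # TDA: score each segment against the current decoder hidden state.
        scores = self.att_score(torch.tanh(
            self.att_segment(seg_feats) + self.att_hidden(hidden)))        # (N, 1)
        attn = F.softmax(scores, dim=0)                                     # attention weights
        attended = (attn * seg_feats).sum(dim=0)                            # (seg_dim,)
        # CG: gate the relative contribution of the attended and visual features.
        fused = torch.cat([attended, vis_feat], dim=0)                      # (seg_dim + vis_dim,)
        gate = torch.sigmoid(self.gate(torch.cat([fused, hidden], dim=0)))  # element-wise gate
        return gate * fused                                                 # gated event feature
```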

Figure 5: Examples of the selected event proposals (red) out of the candidates (black) in the proposed event sequence generation network and the ground-truth events (blue).

Appendix B Visualization of Event Selection

Fig. 5 illustrates the event selection results. Our event sequence generation network successfully identifies, out of the candidates, event proposals that highly overlap with the ground truth.