Dense video captioning aims to describe a video with a sequence of sentences rather than a single caption as in traditional video captioning. To generate informative video descriptions, it is important to first detect meaningful events in the untrimmed video. Previous works mainly adopt a two-stage method to detect events, consisting of a candidate proposal generation stage and a proposal selection stage. Sliding windows or neural networks such as SST are first used to propose a large number of event candidates. Event classifiers are then designed to predict an event confidence for each candidate, and the proposals with confidence higher than a fixed threshold are selected as the final events. Such a two-stage approach has two major drawbacks. First, a large number of candidates (about 1000) must be generated to cover all possible events, which is inefficient and computationally expensive. Second, it does not consider the temporal relationships between events, which may lead to highly redundant candidates.
However, the event sequence in a video usually follows a temporal order, as shown in Figure 1. We compute statistics on the sequential order of events in the ActivityNet dataset based on the start and end timestamps of each event. As shown in Table 1, about 81.5% of the videos in the ActivityNet dataset contain events in a clearly sequential order (one after another), while 16.94% of the videos contain events in a “Summary-Details” order. Only 1.16% of the videos contain events that follow no order. Therefore, event detection can be viewed as a sequence generation problem that directly generates event boundaries one by one.
In this work, we propose a novel and simple event sequence generation model, which fully exploits the bidirectional temporal dependency of each event to generate event boundaries directly. At each decoding step, it takes the previous event as input and, conditioned on the encoded video, predicts the distribution of the next event over the whole video timeline. To exploit both past and future event contexts, we generate the event sequence in both the forward and backward directions and then fuse the distribution maps of the two directions to generate the final event boundaries. Experiments on the ActivityNet Captions dataset demonstrate that the proposed event sequence generation model generates more accurate and diverse events with much less redundancy. Given the generated event sequence, intra-event captioning models with contexts, as in our previous work, are further employed to generate a description for each event. To take advantage of different captioning models, we use a video-semantic matching model to evaluate and choose the more relevant captions from different models for each event. The whole dense video captioning pipeline achieves state-of-the-art performance on the challenge testing set.
The paper is organized as follows. In Section 2, we describe the whole dense video captioning system, which contains the event sequence generation module, the event captioning module and the re-ranking module. Section 3 presents the experimental results and analysis. Finally, we conclude the paper in Section 4.
2 Dense Video Captioning System
The whole framework of our dense video captioning system in ActivityNet Challenge 2020 consists of four components: 1) segment feature extraction; 2) event sequence generation; 3) event caption generation; and 4) event and caption re-ranking.
2.1 Segment Feature Extraction
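We extract three kinds of features for each video segment: 1) ResNet features from the image modality, pretrained on the ImageNet dataset; 2) I3D features from the motion modality, pretrained on the Kinetics dataset; and 3) VGGish features from the acoustic modality, pretrained on the Youtube8M dataset. These three features are temporally aligned and concatenated as the feature $f_i$ of the $i$-th segment. The video is thus converted into a feature sequence $\{f_1, \dots, f_T\}$, where $T$ is the number of segments, which is then used in the following modules.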
2.2 Event Sequence Generation
To generate the event sequence with bidirectional video contexts, we further encode the segment-level features into context-aware features. We employ a bidirectional GRU on the segment-level feature sequence to capture the visual context in both the forward and backward directions. At each encoding step, the hidden states of the two directions, $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$, are concatenated and added to the segment-level feature $f_i$, followed by a ReLU. The context-aware feature $v_i$ of the $i$-th segment can therefore be expressed as:

$$v_i = \mathrm{ReLU}\left(f_i + [\overrightarrow{h}_i; \overleftarrow{h}_i]\right)$$
Conditioned on the context-aware video features, we then generate events one by one with another GRU. We represent the $j$-th event as a $T$-dimensional binary vector $e_j$, where $T$ is the number of video segments. The $i$-th element of $e_j$ is set to 1 if the $i$-th segment is included in the interval of the $j$-th event; otherwise it is set to 0. We use the all-zeros vector as both the special start event and the special end event.
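The binary event representation can be illustrated with a minimal sketch. It assumes fixed-length segments and treats a segment as included when it overlaps the event interval; the function name, the `segment_sec` parameter and the overlap rule are illustrative assumptions, not details given in this report.

```python
def encode_event(start_sec, end_sec, num_segments, segment_sec):
    """Return a T-dimensional 0/1 list marking the segments covered by an event.

    Assumption: segment i spans [i * segment_sec, (i + 1) * segment_sec) and
    counts as included if it overlaps the event interval at all.
    """
    vector = []
    for i in range(num_segments):
        seg_start = i * segment_sec
        seg_end = seg_start + segment_sec
        overlaps = seg_start < end_sec and seg_end > start_sec
        vector.append(1 if overlaps else 0)
    return vector


def special_event(num_segments):
    """All-zeros vector, used for both the start and the end event."""
    return [0] * num_segments


event = encode_event(2.0, 5.0, num_segments=6, segment_sec=1.0)
```

With six one-second segments, an event spanning 2.0 s to 5.0 s marks the third to fifth segments.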
We initialize the hidden state of the event decoder with the global video feature. At each step, the event decoder takes the previously generated event as input and predicts the distribution $p_j$ of the $j$-th event over the whole video timeline, using a multilayer perceptron (MLP), an indicator vector and the sigmoid function $\sigma$. The element $p_{j,i}$ denotes the probability that the $i$-th segment is included in the $j$-th event. The timestamp of the $j$-th event is therefore represented as $[t^{s}_j, t^{e}_j]$, where $t^{s}_j$ and $t^{e}_j$ are the positions of the first and last probabilities over 0.5. In this way, the event decoder generates the event sequence one by one until the predicted event is the special end event.
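Reading event boundaries off a predicted distribution can be sketched as follows. This is a minimal illustration that works on segment indices; returning `None` when no probability exceeds the threshold, and interpreting that as the all-zeros end event, is an assumption of this sketch.

```python
def decode_boundaries(probs, threshold=0.5):
    """Return (first, last) segment indices whose probability exceeds threshold.

    Returns None when no segment passes the threshold, which this sketch
    treats as the special all-zeros end event (an assumption).
    """
    above = [i for i, p in enumerate(probs) if p > threshold]
    if not above:
        return None
    return above[0], above[-1]


span = decode_boundaries([0.1, 0.7, 0.4, 0.9, 0.2])
```

Note that segments between the first and last index belong to the event even if their own probability is below the threshold, as with the third segment in the usage line.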
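The binary cross entropy is used to optimize the event distribution as follows:

$$L(\theta) = -\sum_{k}\sum_{j=1}^{N_k}\sum_{i=1}^{T}\left[e^{(k)}_{j,i}\log p^{(k)}_{j,i} + \left(1-e^{(k)}_{j,i}\right)\log\left(1-p^{(k)}_{j,i}\right)\right]$$

where $N_k$ is the number of events in the $k$-th video, $e^{(k)}_{j,i}$ and $p^{(k)}_{j,i}$ are the ground-truth label and the predicted probability of the $i$-th segment for the $j$-th event of that video, and $\theta$ represents all the learnable parameters in the event sequence generation module. For faster learning, we use the teacher forcing training strategy, feeding the ground-truth event at each step.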
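As a small numerical illustration of the objective, the sketch below computes the binary cross entropy of one predicted event distribution against its binary ground-truth vector. Summing over segments without normalization and clipping probabilities with `eps` are choices made for this sketch only.

```python
import math


def event_bce(pred, target, eps=1e-7):
    """Binary cross entropy between a predicted event distribution and a 0/1 target.

    pred and target are equal-length sequences over the video segments.
    Probabilities are clipped to [eps, 1 - eps] to keep the logarithm finite.
    """
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


good = event_bce([0.9, 0.9, 0.1], [1, 1, 0])
bad = event_bce([0.1, 0.1, 0.9], [1, 1, 0])
```

A prediction that agrees with the ground-truth event (`good`) yields a lower loss than one that contradicts it (`bad`).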
In the forward direction, events are generated one by one depending only on past events, which ignores future event contexts. Therefore, we train another event generator on the reversed video and generate the event sequence in the backward direction, which exploits future events for each event prediction. Finally, we match corresponding events from the forward and backward directions with a temporal IoU (tIoU) over 0.5, and average the predicted event distribution vectors of the two directions to obtain the final event boundaries. Figure 2 illustrates the framework of the event sequence generation module.
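The fusion step can be sketched as below. The sketch assumes that the backward distributions have already been flipped back onto the original timeline (for example with `probs[::-1]`), matches each forward event greedily to its best backward event, and drops unmatched events; the greedy matching and the handling of unmatched events are assumptions, since they are not specified here.

```python
def boundaries(probs, threshold=0.5):
    """First and last segment index with probability over threshold, or None."""
    idx = [i for i, p in enumerate(probs) if p > threshold]
    return (idx[0], idx[-1]) if idx else None


def segment_tiou(a, b):
    """Temporal IoU of two inclusive segment-index intervals (start, end)."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    # For overlapping intervals the union equals the covered span.
    union = max(a[1], b[1]) - min(a[0], b[0]) + 1
    return inter / union


def fuse(forward, backward, match_tiou=0.5):
    """Average matched forward/backward distributions and decode boundaries."""
    fused = []
    for f in forward:
        f_span = boundaries(f)
        if f_span is None:
            continue
        best, best_score = None, match_tiou
        for b in backward:
            b_span = boundaries(b)
            if b_span is None:
                continue
            score = segment_tiou(f_span, b_span)
            if score > best_score:
                best, best_score = b, score
        if best is not None:
            averaged = [(x + y) / 2 for x, y in zip(f, best)]
            fused.append(boundaries(averaged))
    return fused


events = fuse([[0.9, 0.8, 0.7, 0.2, 0.1]], [[0.6, 0.8, 0.9, 0.9, 0.1]])
```

In the usage line, the forward event covers segments 0 to 2 and the backward event covers segments 0 to 3; after averaging, the fourth segment also passes the 0.5 threshold, so the fused event extends to it.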
2.3 Event Caption Generation
Context also plays an important role in event caption generation. Besides the basic segment-level video features described in Section 2.1, we also capture the contextual information of the whole video with an RNN. We train an LSTM on the segment-level feature sequence, with the objective of predicting concepts for each segment. We take the hidden state of the LSTM as the context feature of the $i$-th segment and concatenate it with the segment-level feature for the following caption generation.
As analysed in our previous work, intra-event captioning models are faster and perform better than inter-event captioning models. Therefore, we similarly adopt an intra-event captioning model with local contexts for event caption generation. We adopt a two-layer stacked GRU as the decoder. The first GRU layer is the attention GRU, which takes the previously generated caption word and the previous hidden state of the second GRU layer as inputs to calculate a query vector. The hidden state is initialized as the mean pooling of the video features in the current event, concatenated with local contexts. The query vector is used to select relevant temporal contexts with an attention mechanism. The second GRU layer then predicts the next caption word from the attended temporal context, with caption words represented through a word embedding matrix.
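The selection of temporal contexts by the query vector can be illustrated with a generic soft-attention sketch. The dot-product scoring function used here is an assumption made for illustration; the scoring function of the actual decoder is not specified in this section.

```python
import math


def attend(query, contexts):
    """Soft attention: softmax over dot-product scores, then a weighted sum.

    query is a vector, contexts is a list of vectors of the same length.
    Returns the attention weights and the attended context vector.
    """
    scores = [sum(q * c for q, c in zip(query, ctx)) for ctx in contexts]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(contexts[0])
    attended = [sum(w * ctx[d] for w, ctx in zip(weights, contexts))
                for d in range(dim)]
    return weights, attended


weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the usage line, the first context vector is aligned with the query and therefore receives the larger weight.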
2.4 Event and Caption Re-ranking
In order to improve the system robustness and further improve performance, we train different captioning models and propose the following re-ranking approach to ensemble different models.
Event Re-rank: Since the precision of event proposals is vital to the final dense captioning performance, we combine the events generated by our proposed ESG module with the proposals in our previous work . We adopt the same proposal re-rank policy in  to get the final events.
Caption Re-rank: With the event proposals fixed, we further re-rank the captions generated by different captioning models for each event. To select more accurate and visually relevant captions, we train a video-semantic matching model on the ActivityNet Captions dataset to evaluate the quality of the generated captions. Finally, we choose the best caption for each event based on the predicted score and the number of unique words.
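A minimal sketch of the caption selection is given below. How the matching score and the number of unique words are combined is not specified here; the sketch assumes the matching score is compared first and the unique-word count breaks ties, which is only one possible policy.

```python
def pick_caption(candidates):
    """Choose one caption from (caption, matching_score) pairs.

    Assumption: rank by matching score first, then by the number of
    unique words in the caption.
    """
    def key(item):
        caption, score = item
        unique_words = len(set(caption.lower().split()))
        return (score, unique_words)

    return max(candidates, key=key)[0]


best = pick_caption([
    ("a man is is running", 0.8),
    ("a man is running fast", 0.8),
    ("a dog", 0.3),
])
```

Here the two top candidates have the same score, and the one without the repeated word is selected.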
|Methods|with GT proposals|with generated proposals|
|---|---|---|
|GT events on val|15.00|16.10|
|Generated events on val|11.28|12.27|
|Generated events on test|9.336|9.894|
3 Experiments
3.1 Dataset
We use the ActivityNet Dense Caption dataset for dense video captioning, which consists of 20k videos in total with 3.65 event proposals per video on average. In the experiments, we follow the official split with 10,009 videos for training, 4,917 videos for validation and 5,044 videos for testing, except for our final testing submission. In the final submission, we enlarge the training set with about 80% of the validation set (4,000 videos), which results in 14,009 videos for training and 917 videos for validation. Each video in the training set has one set of event proposal segmentations, while each video in the validation set has two.
3.2 Evaluation of Event Proposal Generation
Implementation Details: We set the number of hidden units of the GRU to 512. The MLP has two hidden layers with ReLU activation. The maximum length of the event sequence is set to 8. Dropout of 0.5 is adopted to avoid over-fitting. We train the event proposal generation module for 30 epochs with a mini-batch size of 8 videos and a learning rate of 1e-4.
Evaluation Metrics: We evaluate the predicted events by measuring the recall and precision of the proposals at tIoU thresholds of 0.3, 0.5, 0.7 and 0.9 with the ground truth.
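The proposal metrics can be sketched as follows. This is a simplified illustration rather than the official evaluation code: a ground-truth event counts as recalled if any proposal reaches the tIoU threshold with it, and a proposal counts as precise if it reaches the threshold with any ground-truth event.

```python
def tiou(a, b):
    """Temporal IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_precision(proposals, ground_truth, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Return {threshold: (recall, precision)} for one video."""
    results = {}
    for th in thresholds:
        hit_gt = sum(any(tiou(g, p) >= th for p in proposals)
                     for g in ground_truth)
        hit_pr = sum(any(tiou(p, g) >= th for g in ground_truth)
                     for p in proposals)
        recall = hit_gt / len(ground_truth) if ground_truth else 0.0
        precision = hit_pr / len(proposals) if proposals else 0.0
        results[th] = (recall, precision)
    return results


scores = recall_precision([(0.0, 10.0), (20.0, 30.0)],
                          [(0.0, 8.0), (50.0, 60.0)])
```

In the usage line, only the first proposal overlaps a ground-truth event, with a tIoU of 0.8, so it counts at the thresholds 0.3, 0.5 and 0.7 but not at 0.9.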
Experimental Results: Table 2 presents our performance on event proposal generation. Our proposed bidirectional event sequence generation model performs better than the two-stage method, and it is also simpler and more efficient. It generates 2.89 events per video on average with a self-tIoU of 0.07, which is close to the self-tIoU of 0.05 of the ground-truth events. Generating the event sequence in either the forward or the backward direction alone achieves competitive proposal performance. Combining the two directions achieves the best performance on both average recall and precision, which shows that past and future events are both helpful for predicting the current event.
3.3 Evaluation of Event Captioning
Implementation Details: For the caption decoder, we set the number of hidden units of the GRU to 1024. The dimensionality of the word embedding layer is set to 300. We initialize the embedding matrix with pre-trained GloVe vectors. We train the captioning model for 30 epochs and select the model with the best METEOR score on the validation set.
For the video-semantic matching model, the dimensionality of video-sentence joint embedding space is set as 1024. Contrastive ranking loss with hard negative mining  is utilized for training.
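The contrastive ranking loss with hard negative mining can be sketched in the VSE++ style, where only the hardest negative in the mini-batch contributes for each matched pair. The similarity matrix layout and the margin value of 0.2 are assumptions of this sketch, not settings reported here.

```python
def hard_negative_ranking_loss(sim, margin=0.2):
    """Max-of-hinges ranking loss over a batch similarity matrix.

    sim[i][j] is the similarity between video i and caption j; matched
    pairs lie on the diagonal. For each pair, only the hardest negative
    caption and the hardest negative video are penalized.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        hardest_caption = max((sim[i][j] for j in range(n) if j != i),
                              default=float("-inf"))
        hardest_video = max((sim[j][i] for j in range(n) if j != i),
                            default=float("-inf"))
        total += max(0.0, margin + hardest_caption - pos)
        total += max(0.0, margin + hardest_video - pos)
    return total


loss = hard_negative_ranking_loss([[0.9, 0.8],
                                   [0.1, 0.7]])
```

In the usage line, the mismatched caption with similarity 0.8 violates the margin for both matched pairs, once as a negative caption for the first video and once as a negative video for the second caption.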
Evaluation Metrics: To evaluate the event captioning performance in isolation, we fix the event proposals to the ground-truth proposals. Since the ground-truth proposals are used, we employ the official evaluation process with a tIoU threshold of 0.9, and evaluate on common captioning metrics including BLEU, METEOR and CIDEr. When evaluating the captions of generated events, we compute the caption performance for proposals at tIoU thresholds of 0.3, 0.5, 0.7 and 0.9 with the ground truth.
Experimental Results: Table 3 shows our dense captioning performance on the ground-truth events and the generated events. The intra-event captioning model trained with cross-entropy loss achieves competitive performance, with a METEOR score of 12.42. Fine-tuning the model in a reinforcement learning framework, with rewards computed from the CIDEr and METEOR metrics, further improves the captioning model significantly. Ensembling various captioning models with caption re-ranking achieves additional improvements over all the single models. Compared with the performance on ground-truth events, the captioning performance on the generated proposals is considerably lower, which indicates the importance of event proposal generation.
The performance of our last two submitted models is presented in Table 4. In our final submission, we enlarge the training data with 80% of the validation set. More training data brings substantial improvement, and our model achieves a METEOR score of 9.894 on the testing set.
4 Conclusion
In this work, we explore the temporal order of the event sequence in a video. By fully exploiting the temporal dependency in two directions, we propose a novel and simple event sequence generation model that avoids the traditional two-stage process. For event captioning, we adopt intra-event captioning models as in our previous work and employ a video-semantic matching model to re-rank the captions for each event. Our proposed system achieves state-of-the-art performance on the dense video captioning challenge 2020. In the future, we will further explore the coherence of multiple captions for the event sequence.
References
- (2017) SST: single-stream temporal action proposals. pp. 6373–6382.
- (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 4724–4733.
- (2019) Activitynet 2019 task 3: exploring contexts for dense captioning events in videos. CoRR abs/1907.05092.
- Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Empirical Methods in Natural Language Processing, pp. 1724–1734.
- (2017) Vse++: improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612 2 (7), pp. 8.
- (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778.
- (2017) CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017, pp. 131–135.
- (2017) Dense-captioning events in videos. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 706–715.
- (2018) Jointly localizing and describing events for dense video captioning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 7492–7500.
- (2019) Streamlined dense video captioning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 6588–6597.
- (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing, pp. 1532–1543.
- (2017) Self-critical sequence training for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1179–1195.
- (2018) Bidirectional attentive fusion with context gating for dense video captioning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 7190–7198.