The task of dense video captioning aims to generate a sequence of sentences describing the series of events in a video. A typical framework for dense video captioning consists of two stages: 1) event proposal generation to detect potential events in the video, and 2) event caption generation to produce a sentence for each detected event.
Though dense-captioning events in a video is similar to the traditional video captioning task, which generates a sentence for a video clip, directly deploying traditional video captioning models leads to poor performance due to the lack of context in the video. Being aware of the context of an event not only provides holistic information to understand the event more accurately, but also highlights the differences between the target event and its context, enabling more diverse captions. Therefore, previous endeavors have employed different contexts for event captioning. Krishna et al. propose to generate event proposals first and then dynamically select neighboring events as context for target event captioning. Our previous work proposed to implicitly encode the global video context into each segment feature via an LSTM, as well as to explicitly employ local temporal regions around the target event as context. Besides visual contexts, Mun et al. further consider sentence context from previously generated captions to improve diversity and coherency. However, there has been no comprehensive evaluation of the contributions of different contexts to event caption generation.
In this work, we systematically explore and compare different contexts for dense-captioning events in videos. We design five types of context: the segment-level contextual feature, local context, global context, event context, and sentence context. We propose two broad categories of event captioning models to employ the different contextual information, namely intra-event and inter-events captioning models. We carry out extensive experiments on the ActivityNet Captions dataset to evaluate the contributions of different contexts and captioning models from both the accuracy and diversity aspects. Our preliminary experiments suggest that inter-events models can generate more diverse event descriptions than intra-event models, but are slower and achieve slightly worse accuracy in terms of the METEOR metric. Therefore, intra-event models are more suitable for the challenge evaluation, which focuses on METEOR performance. We plug the proposed intra-event models with contexts into our dense video captioning pipeline and achieve state-of-the-art performance on the challenge testing set.
The paper is organized as follows. In Section 2, we introduce the proposed contextual types and event captioning models. Then in Section 3 we describe the overall pipeline of our dense video captioning system for the challenge. Section 4 presents the experimental results and analysis. Finally, we conclude the paper in Section 5.
2 Event Caption Generation with Contexts
Context plays an important role in dense video captioning. Although previous works [1, 2, 3] have employed different contexts for event caption generation, no systematic evaluation of the contribution of different contextual information has been conducted. In this section, we make a thorough comparison of different contexts and modeling approaches for dense event captioning.
We mainly divide the contexts for dense event captioning into five categories as follows:
Segment-level contextual feature: enhances the segment-level feature with local or global video contexts, such as the LSTM features described in Section 3.1. Such features are event-agnostic and have larger temporal receptive fields than isolated segment-level features.
Local context: encodes the temporally neighboring video content of the target event. The local neighboring regions can provide the necessary antecedents and consequences for understanding an event. This context depends only on the target event itself.
Global context: encodes the global video content, excluding the target event, as context. It provides an overall picture of the whole video outside the event.
Event context: encodes the neighboring events of the target event as context, thereby considering the correlations among different event proposals. The difference from local context is that local context need not consist of potential events capturing complete actions; hence local context does not require knowledge of other events in the video, while event context does.
Sentence context: encodes previously generated event descriptions as context. It considers what has been said about past events when captioning the target event, aiming to improve the diversity and coherency of the generated captions. It therefore requires knowledge of the past events and their generated captions.
The first three types of context rely only on the target event, while the latter two require awareness of the other detected events in the video. Therefore, in order to employ the above contexts, the event captioning models can be categorized into two types, namely intra-event models and inter-events models.
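Assuming the video has been split into T segments with one feature vector each (as in Section 3.1), the first three context types reduce to simple temporal slices around the event boundaries. The function and window size below are illustrative, not the exact configuration used in the system:

```python
def split_contexts(features, start, end, window=2):
    """Slice per-segment features of one video into event, local-context and
    global-context parts for a single target event.

    features : list of per-segment feature vectors for the whole video.
    start, end : segment indices of the target event (end exclusive).
    window : number of neighboring segments on each side used as local context.
    """
    event = features[start:end]
    # Local context: segments immediately before and after the event.
    local = features[max(0, start - window):start] + features[end:end + window]
    # Global context: every segment outside the event.
    global_ctx = features[:start] + features[end:]
    return event, local, global_ctx
```

Note that only the event and proposal-free slices are needed here; event context and sentence context additionally depend on the other detected events and their captions.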
The intra-event model is illustrated in Figure 1. It can employ the segment-level contextual feature, local context, global context, or their combinations. It requires only the target event proposal for generation, so different event proposals of one video can be processed in parallel for speed-up. The inter-events model, in contrast, requires the existence of other events in the video for target event captioning, as shown in Figure 2. We propose an event encoder to capture contexts from different events, which can be uni-directional in the online setting or bi-directional in the offline setting. A shared sentence decoder then generates a description for each event. To capture sentence context, the generated caption of the previous event can further be fed into the event encoder, a structure similar to that of Mun et al. Therefore, inter-events models with sentence context must be processed in sequential order, which cannot be parallelized in training or inference and leads to slower speed. In this work, we utilize the temporal attention sentence decoder, which dynamically attends to the relevant temporal segments of the target event during captioning.
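A minimal sketch of one decoding step of such a temporal attention decoder follows. A plain dot-product score is used here for brevity; the cited decoder scores segments with a small learned network:

```python
import math

def temporal_attention(query, segments):
    """One decoding step of temporal attention: the decoder state (query)
    induces a softmax distribution over the event's segment features and
    returns the attention-weighted context vector."""
    scores = [sum(q * s for q, s in zip(query, seg)) for seg in segments]
    m = max(scores)                                    # numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]                    # softmax over time
    dim = len(segments[0])
    context = [sum(w * seg[d] for w, seg in zip(weights, segments))
               for d in range(dim)]
    return context, weights
```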
3 Dense Video Captioning System
The framework of our dense video captioning system is based on our previous endeavor in the ActivityNet Challenge 2018, which consists of four components: 1) segment feature extraction; 2) event proposal generation; 3) event caption generation; and 4) proposal and caption re-ranking.
3.1 Segment Feature Extraction
We divide each video into non-overlapping segments of 64 frames each, and extract four sets of features per segment: 1) a basic feature set that captures the global content of each isolated segment from different modalities; 2) an object feature set that captures fine-grained object-level features of the segment; 3) a semantic feature set that represents each segment as a semantic concept distribution; and 4) a context feature set that captures contextual information for each segment. In the following, we describe each feature set in detail.
The basic feature set consists of three features: 1) ResNet from the image modality, pretrained on the ImageNet dataset; 2) I3D from the motion modality, pretrained on the Kinetics dataset; and 3) VGGish from the acoustic modality, pretrained on the YouTube-8M dataset. The three features are temporally aligned and concatenated as the representation of each segment, so the video is converted into a sequence of segment-level features.
The object feature set aims to capture more detailed spatial information within the segment. We utilize Faster R-CNN pretrained on the Visual Genome dataset to detect objects in the mid-frame of each segment. We only keep object classes that overlap with frequent nouns in the ActivityNet Captions dataset, which leads to fewer than 10 objects per image. We then apply mean pooling over the detected object features of each segment to obtain its object feature.
The semantic feature set aims to decrease the semantic gap between multimodal video representations and language. We first select frequent nouns and verbs from the training set of the ActivityNet Captions dataset as the concept vocabulary. Since the concept annotations of each event proposal are weakly supervised, i.e., the correspondence between the segments of a proposal and the concepts is unknown, we formulate concept prediction as a multi-instance multi-label task. Our concept predictor is based on the above segment-level features. For each event proposal, we evenly sample segments and utilize the concept predictor to generate concept probabilities for the sampled segments. Max pooling is employed to obtain the proposal-level concept predictions, and the binary cross entropy loss is used for training. After training, the concept predictor generates semantic concept probabilities for each segment.
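The max-pooling aggregation of this multi-instance multi-label concept predictor can be sketched as follows; the logits and function names are illustrative:

```python
import math

def proposal_concept_probs(segment_logits):
    """Multi-instance aggregation for the weakly supervised concept predictor:
    per-segment concept probabilities (sigmoid of the logits) are max-pooled
    over the sampled segments into one proposal-level prediction."""
    probs = [[1.0 / (1.0 + math.exp(-l)) for l in seg] for seg in segment_logits]
    n_concepts = len(segment_logits[0])
    return [max(seg[c] for seg in probs) for c in range(n_concepts)]

def bce_loss(pred, label, eps=1e-8):
    """Binary cross entropy over the concept vocabulary, averaged per concept."""
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(pred, label)) / len(pred)
```

Max pooling suits the weak supervision: a concept counts as present in the proposal if any sampled segment predicts it confidently.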
The above three feature sets only represent the content of each isolated segment without considering its context in the whole video. Therefore, we further propose the contextual feature set to enhance the segment representation with video contexts. We train an LSTM on top of the segment-level feature sequences, with the objective of predicting the concepts of each segment; the segment-level concepts are set to be the same as the corresponding proposal-level concepts. The hidden state of the LSTM is taken as the contextual feature of each segment.
3.2 Event Proposal Generation
Accuracy and coverage are both important for event proposal generation. The proposal ranking model proposed in our previous work carefully designs a set of features to score densely sampled proposals and can generate proposals with high precision. However, this approach ignores the correlations among different event proposals, which makes the top-ranked proposals similar to each other and leads to low coverage. A recently proposed event sequence generation network (ESGN) utilizes a pointer network to select event proposals step by step, which largely avoids redundancy among the generated proposals. Though it achieves high coverage with few proposals, its precision is inferior to that of the proposal ranking model.
Therefore, we propose to combine the two models at inference time to mutually make up for their deficiencies. We first train the proposal ranking model and the ESGN model. For the proposal ranking model, we rank proposals densely sampled by the sliding-window approach following our previous work. For the ESGN model, we utilize the top-80 proposals generated by the proposal ranking model as candidates and adopt the training algorithm of ESGN. We then utilize Algorithm 1 to generate event proposals based on the two pretrained models.
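Algorithm 1 itself is not reproduced in this text, so the following is only a plausible sketch of such a fusion under the stated goal: start from the high-coverage ESGN proposals, then fill in non-overlapping top-ranked proposals. The overlap threshold and proposal budget are assumptions:

```python
def tiou(a, b):
    """Temporal IoU between two (start, end) proposals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def fuse_proposals(esgn_props, ranked_props, max_total=10, dup_thresh=0.7):
    """Keep the high-coverage ESGN proposals, then append top-ranked
    proposals that do not heavily overlap an already selected one."""
    selected = list(esgn_props)
    for prop in ranked_props:
        if len(selected) >= max_total:
            break
        if all(tiou(prop, q) < dup_thresh for q in selected):
            selected.append(prop)
    return selected
```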
3.3 Event Caption Generation
The event captioning models with different contexts from Section 2 are utilized for event caption generation. We first train the captioning model on groundtruth event proposals with the cross entropy loss, and then finetune it with the self-critical REINFORCE algorithm using rewards from METEOR and CIDEr. To improve the generalization of the captioning models to predicted proposals, we further augment the training data with predicted event proposals whose tIoU with a groundtruth proposal is larger than 0.3. The training caption for such a predicted proposal is the caption of its best-matched groundtruth proposal.
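The tIoU-based data augmentation can be sketched as follows, where proposals are (start, end) pairs in seconds; the helper names are ours:

```python
def tiou(a, b):
    """Temporal IoU between two (start, end) proposals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def augment_training_pairs(pred_props, gt_props, gt_captions, thresh=0.3):
    """Pair each predicted proposal with the caption of its best-matched
    groundtruth proposal, keeping only matches with tIoU above the threshold."""
    pairs = []
    for prop in pred_props:
        ious = [tiou(prop, gt) for gt in gt_props]
        best = max(range(len(gt_props)), key=ious.__getitem__)
        if ious[best] > thresh:
            pairs.append((prop, gt_captions[best]))
    return pairs
```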
Table 1: Accuracy and diversity metrics of intra-event and inter-events captioning models with different contexts and location (Loc) information.
3.4 Proposal and Caption Re-ranking
In order to further improve performance, we train different captioning models and propose the following re-ranking approach to ensemble different models.
Proposal Re-rank: We first re-rank all candidate proposals and choose the top 5 as our final proposals. We consider four factors in proposal re-ranking: 1) the proposal quality from the proposal generation models; 2) the describability of the proposal, represented by the probability of its generated sentence from the captioning models; 3) the proposal position; and 4) the proposal length.
Caption Re-rank: After selecting proposals, we re-rank captions of these proposals from different caption models. Two factors are considered in caption re-ranking, which are the number of unique words in a caption and the matching of generated words with predicted concepts in Section 3.1. We select the best caption for each proposal.
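A minimal sketch of the two re-ranking steps, assuming each factor has already been normalized to a scalar score; the field names and weights below are illustrative assumptions, not the tuned values used in the system:

```python
def rerank_proposals(candidates, weights=(1.0, 1.0, 0.5, 0.5), top_k=5):
    """Weighted-sum re-ranking over the four proposal factors: quality score,
    caption log-probability, position prior and length prior."""
    def score(c):
        return (weights[0] * c["prop_score"] + weights[1] * c["caption_logprob"]
                + weights[2] * c["position_prior"] + weights[3] * c["length_prior"])
    return sorted(candidates, key=score, reverse=True)[:top_k]

def rerank_captions(captions, concept_words):
    """Pick the caption scoring highest on the two factors above: number of
    unique words plus overlap with the predicted concepts of Section 3.1."""
    def score(caption):
        words = set(caption.split())
        return len(words) + len(words & concept_words)
    return max(captions, key=score)
```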
4 Experiments
4.1 Dataset
We utilize the ActivityNet Captions dataset for dense video captioning, which consists of 20k videos in total with 3.7 event proposals per video on average. We follow the official split with 10,009 videos for training, 4,917 for validation and 5,044 for testing in all experiments except our final testing submission. For the final submission, we enlarge the training set with 80% of the validation set, resulting in 14,009 videos for training and 917 for validation. Each video in the training set contains one set of event proposal annotations, while each video in the validation set contains two sets.
4.2 Evaluation of Contexts for Event Captioning
Experimental Setting: To isolate event captioning performance, we fix the event proposals to the groundtruth proposals. We use the basic feature set as the segment-level isolated feature and the contextual feature set described in Section 3.1 as the segment-level contextual feature. For the intra-event models, we set the number of LSTM hidden units to 1,024; for the inter-events models it is set to 512. We train each model for 100 epochs and select the model with the best METEOR score on the validation set.
Evaluation Metrics: We evaluate caption quality from both the accuracy and diversity aspects. For accuracy, we employ the official evaluation protocol with a tIoU threshold of 0.9, since we utilize the groundtruth proposals, and report common captioning metrics including BLEU4, METEOR and CIDEr; higher scores indicate more accurate captions. For diversity, we evaluate Self-BLEU (SelfB) and Repetition Evaluation (RE). SelfB measures the similarity of each sentence against the remaining sentences of the same video via BLEU4, and RE computes a redundancy score over the n-grams within the video; lower scores indicate more diverse captions across event proposals. Since the validation set contains two sets of event proposals, we report two types of diversity scores. The first evaluates each set separately and averages over the two sets, denoted SelfB and RE. The second combines the two sets of event proposals for the diversity measure, denoted SelfB2 and RE2. Since proposals from the two sets can cover similar events, the video-level diversity of the second type is expected to be lower than that of the first.
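Since SelfB is essentially BLEU4 of each caption against the other captions of the same video, it can be sketched with a minimal sentence-level BLEU-4 (clipped n-gram precision with a brevity penalty). This is an illustrative implementation, not the official evaluation script:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(hyp, refs):
    """Minimal sentence-level BLEU-4: clipped n-gram precision for n=1..4,
    geometric mean, and the standard brevity penalty."""
    precisions = []
    for n in range(1, 5):
        hyp_counts = Counter(ngrams(hyp, n))
        if not hyp_counts:
            return 0.0
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in hyp_counts.items())
        precisions.append(clipped / sum(hyp_counts.values()))
    if min(precisions) == 0:
        return 0.0
    closest = min(refs, key=lambda r: abs(len(r) - len(hyp)))
    bp = 1.0 if len(hyp) > len(closest) else math.exp(1 - len(closest) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

def self_bleu(captions):
    """Average BLEU-4 of each caption against the other captions of the same
    video; lower values mean more diverse event captions."""
    caps = [c.split() for c in captions]
    scores = [bleu4(caps[i], caps[:i] + caps[i + 1:]) for i in range(len(caps))]
    return sum(scores) / len(scores)
```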
Table 2: Event proposal generation performance: number of proposals (# p), precision and recall at different tIoU thresholds.
Experimental Results: Table 1 shows the captioning performance with different contexts based on groundtruth event proposals. The first row in the intra-event block corresponds to a traditional video captioning model that considers neither context nor the location of the event, and it achieves poor captioning performance in both accuracy and diversity. The second row employs the segment-level contextual feature and outperforms the first row, which demonstrates the importance of context for improving caption quality. Explicitly encoding the local context for target event captioning further improves the performance, especially on diversity. We also find that being aware of the location information is beneficial. However, the global context is not complementary to the above contexts and in particular deteriorates diversity, which might result from introducing too much irrelevant contextual information. We will explore dynamically attending to different contexts in future work.
For the inter-events models, the first row of the block is similar to the captioning model of Mun et al. discussed in Section 2, which utilizes both the visual event context and the textual sentence context, while the second row employs only the event context. Our results suggest that the sentence context might not be as useful as the event context in terms of accuracy and diversity. Moreover, using sentence context slows down training and inference due to its sequential nature. The event context is particularly promising for improving the diversity of event captions, especially on the SelfB2 and RE2 metrics. Since SelfB2 and RE2 are evaluated over two sets of event proposal segmentations that contain more similar proposals, the lower diversity scores on these metrics indicate that the model can distinguish fine-grained events across different event segmentations.
Finally, we compare the performance of intra-event and inter-events models enhanced by different contexts. In terms of accuracy measured by METEOR, intra-event models are slightly better than or comparable to inter-events models. However, the inter-events models can generate more diverse event captions. Since the evaluation metric of the ActivityNet Captions challenge mainly considers accuracy via METEOR, we adopt the intra-event models as our event captioning model. In the future, we will explore inter-events models for dense video captioning in more depth.
4.3 Evaluation of Dense Video Captioning System
Evaluation Metrics: We utilize the official evaluation metrics to evaluate captions of predicted proposals, which compute the captioning performance for proposals with tIoU thresholds of 0.3, 0.5, 0.7 and 0.9 against the groundtruth.
Experimental Results: Table 2 presents our performance for event proposal generation. Note that precision and recall are evaluated against the union of the two sets of event proposal annotations in the validation set, instead of selecting the best of the two sets. The fusion of the two proposal generation models balances precision and recall better than either single model. In Table 3, we present the improvements from different training methods using the best intra-event model from Table 1, which demonstrates the effectiveness of reinforcement learning and data augmentation. The re-ranking performance is presented in Table 4, which ensembles different captioning models trained with different combinations of features and contexts. The performance of our submitted model is presented in Table 5. More training data brings substantial improvement, and our model achieves a 9.91 METEOR score on the testing set.
Table 4: Re-ranking results with top-5 proposals. Proposal re-rank: 10.96 (+0.80) and 11.59 (+0.70); caption re-rank: 11.46 (+0.50) and 12.24 (+0.65).
5 Conclusion
In this work, we systematically evaluate the contributions of different contextual information for dense video captioning. Our preliminary experiments show that the segment-level contextual feature, local context and event context are the most beneficial context types. The inter-events models are promising for generating more diverse event captions, while the intra-event models are faster and achieve slightly better accuracy in terms of METEOR for event captioning. Our proposed system achieves significant improvements on the dense video captioning challenge. In the future, we will explore dynamically encoding different contexts and further improving both intra-event and inter-events captioning models.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant No. 61772535 and the National Key Research and Development Plan under Grant No. 2016YFB1001202.
References
- Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
- Shizhe Chen, Yuqing Song, Yida Zhao, Jiarong Qiu, Qin Jin, and Alexander Hauptmann. RUC+CMU: System report for dense captioning events in videos. arXiv preprint arXiv:1806.08854, 2018.
- Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, and Bohyung Han. Streamlined dense video captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision, pages 4507–4515, 2015.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733, 2017.
- Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135, 2017.
- Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
- Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, 2017.
- Yilei Xiong, Bo Dai, and Dahua Lin. Move forward and tell: A progressive generator of video descriptions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 468–483, 2018.