Semantic-Aware Pretraining for Dense Video Captioning

by Teng Wang, et al.
The University of Hong Kong

This report describes the details of our approach for the event dense-captioning task in ActivityNet Challenge 2021. We present a semantic-aware pretraining method for dense video captioning, which empowers the learned features to recognize high-level semantic concepts. Diverse video features of different modalities are fed into an event captioning module to generate accurate and meaningful sentences. Our final ensemble model achieves a 10.00 METEOR score on the test set.



1 Approach

The goal of dense video captioning is to detect and describe all the events in an untrimmed video, which relies on rich spatial-temporal video features. We extract the clip-level features of the raw video by various encoders with different modalities and different pretraining tasks. The final video representation is obtained by the concatenation of selected features along the channel axis. Afterward, we adopt an off-the-shelf dense video captioning model [9] to generate the locations and captions for multiple event proposals. Finally, we ensemble the generated captions from different models for a further performance boost.

1.1 Semantic-Aware Pretraining

Mainstream dense video captioning methods adopt video encoders pre-trained on action classification datasets, where action-oriented supervision guides the model to focus on the motion and appearance features related to a limited number of action classes. However, these methods fail to explicitly model fine-grained semantic components, such as objects, numbers, and colors, which are essential for caption generation. In this report, we propose a pre-training task named semantic concept classification (SCC), which serves as an auxiliary objective for a semantic-aware video feature extractor.

We build the video encoder based on TSP [1], a supervised pretraining paradigm for temporally-sensitive representation learning for untrimmed videos. Given a video clip, the video encoder aims to predict 1) the foreground class of the clip (action classification), 2) whether the clip is in the foreground or the background of the video (temporal region classification), and 3) the semantic concepts in the clip. The overall pretraining objective is:


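$$\mathcal{L} = \mathcal{L}_{ac} + \lambda_1 \mathcal{L}_{trc} + \lambda_2 \mathcal{L}_{scc} \tag{1}$$

where $\mathcal{L}_{ac}$ and $\mathcal{L}_{trc}$ represent the cross-entropy losses for action classification and temporal region classification, $\mathcal{L}_{scc}$ is the multi-label classification loss for SCC, and $\lambda_1$, $\lambda_2$ are trade-off factors. We use R(2+1)D-34 [8] as the backbone of the video encoder. The predicted probability of semantic concepts is obtained by applying an FC+Sigmoid layer to the local clip features encoded by R(2+1)D-34. The overall architecture is shown in Figure 1.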


To construct semantic labels, we select the most frequent nouns, verbs, adjectives, and adverbs from the ground-truth sentences in ActivityNet Captions as the vocabulary of semantic concepts. Since most clips contain few positive semantic labels, the negative ones would dominate the loss and hurt the representation ability of the model. To tackle this positive-negative imbalance problem, we employ the asymmetric loss [2] as the SCC loss $\mathcal{L}_{scc}$.
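The effect of the asymmetric loss on the positive-negative imbalance can be illustrated with a minimal pure-Python sketch. It follows the general form of the asymmetric loss in [2]; the hyperparameter values (`gamma_pos`, `gamma_neg`, `margin`) are illustrative assumptions and are not taken from this report.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Asymmetric loss for one clip, summed over all semantic concepts.

    probs   -- sigmoid outputs in [0, 1], one per concept
    targets -- 0/1 labels, one per concept
    The default hyperparameters are illustrative assumptions.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: focal-style down-weighting of easy positives.
            total -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative term: shift the probability by the margin so that
            # very easy negatives (p <= margin) contribute exactly zero,
            # then down-weight the remaining ones with a stronger focus.
            p_m = max(p - margin, 0.0)
            total -= p_m ** gamma_neg * math.log(max(1.0 - p_m, eps))
    return total
```

With these settings, a negative concept predicted at probability 0.5 contributes far less than it would under plain binary cross-entropy, so the many negative labels of a clip no longer swamp its few positive ones.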

Figure 1: The proposed pretraining strategy. We use R(2+1)D to produce the local clip features of a clip and the global features of the entire video. Then the local and global features are leveraged to perform three pretraining tasks, i.e., action classification, semantic concept classification, and temporal region classification. By doing so, the learned representation is semantic-aware and temporal-sensitive.

| Video Features | Modality | Pretraining task | Pretraining Dataset | METEOR | BLEU4 | CIDEr |
|---|---|---|---|---|---|---|
| VGGish [6] | Audio | Audio Cls. | YouTube-8M | 9.23 | 1.72 | 35.49 |
| ResNet152 [5] | RGB | Image Cls. | ImageNet | 10.87 | 2.55 | 45.79 |
| I3D [3] | RGB + Flow | Action Cls. | Kinetics | 11.43 | 2.79 | 49.77 |
| TSN [10] | RGB + Flow | Action Cls. | Kinetics + Anet1.3 | 11.49 | 2.85 | 49.34 |
| SlowFast [4] | RGB | Action Cls. | Kinetics | 11.59 | 2.86 | 49.72 |
| TSP [1] | RGB | Action Cls. + Temp. Cls. | IG65m + Kinetics + Anet1.3 | 11.56 | 2.93 | 50.68 |
| SA-TSP (ours) | RGB | Action Cls. + Temp. Cls. + Semantic Cls. | IG65m + Kinetics + Anet1.3 | 11.70 | 3.10 | 52.16 |

Table 1: Comparison of different pretraining strategies for video feature encoder. The METEOR/BLEU4/CIDEr scores with ground-truth proposals are calculated on the validation set by the official evaluation toolkit.

| | with GT proposals | with learned proposals |
|---|---|---|
| Feature combination | 12.28 | 8.16 |
| + Enlarged training set | 12.33 | 8.03 |
| + SCST | 15.89 | 11.42 |
| + Model ensemble | 16.19 | 11.50 |

Table 2: Performance on the validation set.

1.2 Feature Combination

To increase the diversity of video features, we extract different types of video features across various input modalities (including RGB, optical flow, and audio) and pretraining tasks (including image classification, audio classification, action classification, temporal region classification, and semantic concept classification). Then, we find the best feature combination from the feature pool according to their captioning performance. The best feature combination is “SA-TSP + I3D + VGGish”.
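A minimal sketch of the combination step, under stated assumptions: features are plain Python lists of clip vectors that are already temporally aligned, the search over the pool is exhaustive (the report does not say which search procedure was used), and `score_fn` is a hypothetical callable standing in for "train a captioning model on these features and return its validation score".

```python
from itertools import combinations


def concat_features(feature_streams):
    """Concatenate clip-level features along the channel axis.

    feature_streams -- list of streams; each stream is a list of T clip
                       vectors (lists of floats). All streams must share T.
    """
    num_clips = len(feature_streams[0])
    if any(len(stream) != num_clips for stream in feature_streams):
        raise ValueError("all feature streams must have the same number of clips")
    return [
        [value for stream in feature_streams for value in stream[t]]
        for t in range(num_clips)
    ]


def best_combination(feature_pool, score_fn, max_size=3):
    """Exhaustively score every subset of up to max_size features.

    feature_pool -- dict mapping a feature name to its feature stream
    score_fn     -- hypothetical callable (names, combined) -> score,
                    e.g. the validation METEOR of a model trained on them
    """
    best_names, best_score = None, float("-inf")
    for size in range(1, max_size + 1):
        for names in combinations(sorted(feature_pool), size):
            combined = concat_features([feature_pool[n] for n in names])
            score = score_fn(names, combined)
            if score > best_score:
                best_names, best_score = names, score
    return best_names, best_score
```

Because each evaluation requires training a captioning model, an exhaustive search is only practical for a small pool such as the seven feature types listed in Table 1.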

1.3 Model Ensemble

We train multiple event captioning models by varying the video feature combination strategy and the random seed. For each proposal, the models produce a set of candidate captions, and the best caption is selected by two criteria: the number of unique semantic concepts it contains and the maximum inverse document frequency (IDF) over all its unigrams.
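The selection rule can be sketched as follows. The report does not state how the two criteria are combined, so this sketch assumes a lexicographic ordering (concept count first, maximum IDF as the tie-breaker). The whitespace tokenization and the helper names `build_idf` and `select_caption` are likewise assumptions made for illustration.

```python
import math


def build_idf(reference_sentences):
    """IDF of each unigram, treating every reference sentence as a document."""
    doc_freq = {}
    for sentence in reference_sentences:
        for word in set(sentence.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    num_docs = len(reference_sentences)
    return {word: math.log(num_docs / df) for word, df in doc_freq.items()}


def select_caption(candidates, concept_vocab, idf):
    """Pick one caption from the candidates produced by the ensemble members.

    candidates    -- list of caption strings for a single proposal
    concept_vocab -- set of semantic concept words
    idf           -- dict mapping a unigram to its IDF
    """
    def score(caption):
        words = caption.lower().split()
        num_concepts = len(set(words) & concept_vocab)
        # Words unseen in the references get IDF 0 here (an assumption),
        # so that out-of-vocabulary tokens cannot dominate the ranking.
        max_idf = max((idf.get(w, 0.0) for w in words), default=0.0)
        # Assumed combination: concept count first, max IDF breaks ties.
        return (num_concepts, max_idf)

    return max(candidates, key=score)
```

Both criteria favor captions that are specific: the first rewards mentioning more distinct concepts, and the second rewards containing at least one rare, informative word.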

1.4 Implementation Details

In the semantic-aware pretraining, we set the trade-off factors $\lambda_1$ and $\lambda_2$ in Eqn. (1) to 1 and 0.1, respectively. The size of the semantic concept vocabulary is 1000. We first train the R(2+1)D backbone on the IG65m+Kinetics dataset for action classification, then finetune the model with the proposed pretraining objective in Eqn. (1) on the ActivityNet1.3 dataset.
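As a rough illustration of how a 1000-word concept vocabulary and the corresponding multi-label targets could be built, consider the sketch below. It simplifies in two ways: tokenization is by whitespace, and the part-of-speech filter (nouns, verbs, adjectives, adverbs) is replaced by a caller-supplied `ignore` set, since a real filter would need a POS tagger. How sentences are assigned to individual clips is not specified in the report and is left to the caller.

```python
from collections import Counter


def build_concept_vocab(sentences, vocab_size=1000, ignore=frozenset()):
    """Keep the vocab_size most frequent words that are not in `ignore`.

    `ignore` is a stand-in for part-of-speech filtering (an assumption).
    """
    counts = Counter(
        word
        for sentence in sentences
        for word in sentence.lower().split()
        if word not in ignore
    )
    return [word for word, _ in counts.most_common(vocab_size)]


def multi_hot(sentence, vocab):
    """Binary label vector: 1 if the concept appears in the sentence."""
    words = set(sentence.lower().split())
    return [1 if concept in words else 0 for concept in vocab]
```

For a typical caption only a handful of the 1000 entries are 1, which is the label sparsity that motivates an imbalance-aware loss for SCC.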

For temporal action proposals, we directly use the predicted results provided by [9], which achieve 40.09% precision and 66.63% recall on the ActivityNet validation set. For event captioning, we use a vocabulary with a size of 8340. The size of all hidden layers is set to 512. We adopt the Adam optimizer with an initial learning rate of 1e-4. The batch size is set to 1 and the maximum number of training epochs is 40. The number of ensemble models is 5.

2 Evaluation Results

Table 1 shows the captioning performance with a single type of feature. We observe that the similarity between the pretraining task and the downstream task matters. The unsatisfactory performance of VGGish and ResNet152 is probably caused by the large discrepancy between audio/image classification and dense video captioning. The proposed SA-TSP achieves the best results among all compared features, which verifies the effectiveness of the semantic concept classification loss.

Table 2 shows several techniques to boost the dense video captioning system. Feature combination yields a clear performance gain (11.70 → 12.28) compared with the single-model performance. We further extend the official training set by appending around 80% of the videos from the validation set. Then we finetune the event captioning module with self-critical sequence training (SCST) [7] on the enlarged training set, which gives a considerable performance gain both with ground-truth proposals and with learned proposals. The final submission is obtained by ensembling five different event captioning models. On the test set, our final submission achieves a 10.00 METEOR score.
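For readers unfamiliar with SCST, its core idea can be sketched in a few lines: the reward of a sampled caption is compared against the reward of the greedily decoded caption, and the difference weights the log-likelihood of the sample. This is a generic sketch of the technique in [7]; the reward metric used for finetuning is not stated in this report, so the rewards here are plain numbers supplied by the caller.

```python
def scst_loss(sample_token_logprobs, sample_reward, greedy_reward):
    """Self-critical policy-gradient loss for one sampled caption.

    sample_token_logprobs -- log-probabilities of the sampled tokens
    sample_reward         -- reward of the sampled caption (metric assumed)
    greedy_reward         -- reward of the greedy caption, used as baseline
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the likelihood of samples that beat the
    # greedy baseline and lowers it for samples that fall short.
    return -advantage * sum(sample_token_logprobs)


def scst_batch_loss(batch):
    """Mean SCST loss over (logprobs, sample_reward, greedy_reward) triples."""
    return sum(scst_loss(*item) for item in batch) / len(batch)
```

Using the model's own greedy output as the baseline avoids training a separate value estimator and directly optimizes the evaluation metric rather than word-level cross-entropy.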