Dense video captioning, which aims to localize and describe all events in an untrimmed video, has attracted increasing attention in recent years. Unlike traditional video captioning, which describes only a single event, dense video captioning requires a comprehensive understanding of the long-term temporal structure and the semantic relationships among a sequence of events. Previous methods [1, 2, 7] have employed different types of context to construct event representations, e.g., neighboring regions within an expanded receptive field, event-level semantic attention, and clip-level recurrent features [7, 1, 10]. Although promising progress has been achieved, their context modeling does not perceive the temporal structure of the event sequence, i.e., the temporal positions and lengths of other events. As a consequence, the temporal relationships between events are not fully exploited in the captioning stage. In this work, we propose to explore the temporal relations between events in the encoding phase. Furthermore, we design a cross-modal gating (CMG) block for the hierarchical RNN, which adaptively estimates the weights of linguistic and visual information for better caption generation.
The overall framework contains three parts, i.e., the feature extractor, the temporal event proposal model, and the event captioning model.
2.1 Feature Extractor
We divide the video into non-overlapping clips with a stride of 0.5s and extract frame-level features with a TSN pretrained on the ActivityNet dataset for the action recognition task. We concatenate the feature vectors of the optical flow and RGB modalities to construct the frame-level representation, which is used by the temporal event proposal (TEP) module and the event captioning (EC) module.
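The clip sampling and two-stream concatenation described above can be sketched as follows. This is a minimal illustration that assumes the per-clip RGB and optical-flow features have already been extracted; the function names and feature dimensions are hypothetical and not taken from the original system.

```python
import numpy as np

def clip_timestamps(duration_s, stride_s=0.5):
    """Start times (in seconds) of non-overlapping clips covering the video."""
    n_clips = int(duration_s // stride_s)
    return [i * stride_s for i in range(n_clips)]

def build_frame_features(rgb_feats, flow_feats):
    """Concatenate RGB and optical-flow features along the channel axis.

    rgb_feats:  (T, D_rgb) array, one row per clip (assumed precomputed).
    flow_feats: (T, D_flow) array for the same T clips.
    """
    rgb_feats = np.asarray(rgb_feats)
    flow_feats = np.asarray(flow_feats)
    if rgb_feats.shape[0] != flow_feats.shape[0]:
        raise ValueError("both modalities must cover the same clips")
    return np.concatenate([rgb_feats, flow_feats], axis=1)
```

The resulting (T, D_rgb + D_flow) array would then serve as the shared input of both downstream modules.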
2.2 Temporal Event Proposal
Accurate event proposal generation is the basis for further captioning. For TEP, we adopt an off-the-shelf DBG to detect the top 100 proposals for each video. Since the number of proposals in the ground-truth annotation is small (around 3.7 per video on average), we follow Chen et al. and apply a modified event sequence selection network (ESGN) to predict a subset of candidate proposals. The selection process can achieve a good balance between precision and recall, especially when the number of proposals is relatively small. After selection, the number of output proposals per video is around 2.4 on average. The average precision and recall on the validation set across tIoU thresholds are 66.63% and 40.09%, respectively.
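To make the precision and recall figures concrete, the following sketch computes both quantities for one video and averages them over tIoU thresholds. It is an illustration only: the matching rule (a proposal counts as correct if it overlaps any ground-truth event above the threshold) and the function names are assumptions, and the official evaluation tool may differ in detail.

```python
def tiou(a, b):
    """Temporal IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(proposals, ground_truth, threshold):
    """Precision: share of proposals that match some ground-truth event.
    Recall: share of ground-truth events matched by some proposal."""
    hit_p = sum(any(tiou(p, g) >= threshold for g in ground_truth) for p in proposals)
    hit_g = sum(any(tiou(p, g) >= threshold for p in proposals) for g in ground_truth)
    precision = hit_p / len(proposals) if proposals else 0.0
    recall = hit_g / len(ground_truth) if ground_truth else 0.0
    return precision, recall

def averaged_precision_recall(proposals, ground_truth,
                              thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Average precision and recall over a set of tIoU thresholds."""
    pairs = [precision_recall(proposals, ground_truth, t) for t in thresholds]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n
```

Keeping fewer proposals per video tends to raise precision at the cost of recall, which is the trade-off the selection step is meant to balance.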
2.3 Event Captioning
The event captioning model follows an encoder-decoder architecture. For the visual encoder, we follow Wang et al. and adopt the temporal-semantic relation module (TSRM) to capture rich relationships between events in terms of both temporal structure and semantic meaning. For the language decoder, we develop a gated hierarchical RNN, aided by a cross-modal gate that balances visual and linguistic information when captioning the next sentence. The encoder and the decoder are illustrated in Fig. 1 and Fig. 2, respectively.
Visual Encoder. TSRM contains two branches, i.e., a temporal relation branch and a semantic relation branch. Each branch estimates the relation score between the target proposal and the other proposals, and we then fuse the two types of scores by addition. The relational features of an event are obtained by a weighted summation of the features of all proposals, conditioned on the relation scores.
For the temporal relation branch, we encode the relative length and the relative distance between each pair of proposals, and then feed the position encoding into a non-linear embedding followed by an FC layer with an output size of 1. For the semantic relation branch, we first obtain the proposals’ initial representations by mean pooling of the frame-level features, and then adopt scaled dot-product attention to calculate the semantic similarity between proposals.
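The two-branch scoring and fusion can be sketched in NumPy as below. This is a rough illustration under stated assumptions rather than the actual TSRM implementation: the exact form of the relative distance and length encoding, the tanh non-linearity, the query/key projections, the softmax normalization of the fused scores, and all parameter names (`w_embed`, `w_fc`, `w_q`, `w_k`) are hypothetical choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def pairwise_position_encoding(segments):
    """Relative distance and log relative length for every proposal pair.

    segments: (N, 2) array-like of (start, end). Returns an (N, N, 2) array.
    The specific encoding is an assumption made for illustration.
    """
    seg = np.asarray(segments, dtype=float)
    center = seg.mean(axis=1)
    length = seg[:, 1] - seg[:, 0]
    rel_dist = (center[None, :] - center[:, None]) / length[:, None]
    rel_len = np.log(length[None, :] / length[:, None])
    return np.stack([rel_dist, rel_len], axis=-1)

def temporal_scores(pos_enc, w_embed, w_fc):
    """Non-linear embedding (tanh assumed) followed by an FC layer of output size 1."""
    hidden = np.tanh(pos_enc @ w_embed)   # (N, N, H)
    return (hidden @ w_fc).squeeze(-1)    # (N, N)

def semantic_scores(feats, w_q, w_k):
    """Scaled dot-product similarity between mean-pooled proposal features."""
    q, k = feats @ w_q, feats @ w_k
    return q @ k.T / np.sqrt(q.shape[-1])

def tsrm(segments, feats, params):
    """Fuse both score types by addition and aggregate proposal features."""
    pos_enc = pairwise_position_encoding(segments)
    scores = (temporal_scores(pos_enc, params["w_embed"], params["w_fc"])
              + semantic_scores(feats, params["w_q"], params["w_k"]))
    weights = softmax(scores, axis=-1)    # normalization is an assumption
    return weights @ feats, weights       # relational features: (N, D)
```

Here `feats` stands for the mean-pooled frame-level features of each proposal, and each row of `weights` sums to one, so every relational feature is a convex combination of all proposal features.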
The function of the decoder is to translate the visual representation produced by the encoder into the target modality. Unlike models in traditional image/video captioning, the decoder in dense video captioning outputs a set of sentences instead of a single one. To increase the coherence among sentences, a hierarchical RNN based decoder (a sentence RNN plus a word RNN) is widely used for multi-sentence captioning [11, 4, 1]. The sentence RNN stores multi-modal information of all previous events and guides the generation of the next sentence. To enhance the multi-modal fusion, we propose a cross-modal gating (CMG) block to adaptively balance the visual and linguistic information. Specifically, the cross-modal gating block takes four inputs: 1) the position embedding of the current proposal, 2) the proposal’s feature vector, which is the concatenation of the output of the TSRM module and the mean pooling of the frame-level features within the proposal, 3) the last hidden state of the word RNN for the previous sentence, and 4) the previous hidden state of the sentence RNN. Motivated by Wang et al., we use a gating mechanism to balance the linguistic information and the visual information. The gate is calculated by an FC layer with a sigmoid activation function, and the resulting gate values weight the visual features and the linguistic features, respectively. The word RNN is implemented as an attention-enhanced RNN, which adaptively selects the salient frames within the proposal for word prediction.
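One plausible reading of the gating step is sketched below. Since the exact formulation is not recoverable from the text, everything beyond "an FC layer with sigmoid activation" is an assumption: the projections of the visual and linguistic inputs, the composition of the gate input, the use of the gate and its complement as the two weights, and all parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_gate(pos_emb, prop_feat, word_h, sent_h, params):
    """Blend visual and linguistic information with a learned gate.

    pos_emb:   position embedding of the current proposal
    prop_feat: proposal feature vector (TSRM output + mean-pooled frames)
    word_h:    last word-RNN hidden state of the previous sentence
    sent_h:    previous sentence-RNN hidden state
    """
    # Project each modality to a common size H (assumed design choice).
    visual = np.tanh(np.concatenate([pos_emb, prop_feat]) @ params["w_vis"])
    linguistic = np.tanh(word_h @ params["w_lang"])
    # FC layer with sigmoid activation produces a per-dimension gate.
    gate_in = np.concatenate([visual, linguistic, sent_h])
    gate = sigmoid(gate_in @ params["w_gate"] + params["b_gate"])
    # Assumed weighting: gate for visual, complement for linguistic.
    fused = gate * visual + (1.0 - gate) * linguistic
    return fused, gate
```

Under this reading, a gate value near 1 lets the current proposal's visual content dominate, while a value near 0 favors the linguistic context carried over from the previous sentence; the fused vector could then be passed to the sentence RNN.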
| | Validation set, GT prop. | Validation set, learnt prop. | Test set, learnt prop. |
| + Larger train. set | 14.64 (+0.46) | 10.26 (+0.55) | 9.17 |
3.1 Experimental Setting
Implementation Details. For data processing, we build a vocabulary from the words that occur at least 5 times. Sentences longer than 30 words are truncated. For the event captioning model, we adopt LSTMs as the RNNs in the decoder. The hidden size of the LSTMs and of all FC layers is set to 512.
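The vocabulary construction and truncation rules can be illustrated with a short sketch. Whitespace tokenization, lowercasing, and the special tokens are assumptions added for illustration; only the minimum count of 5 and the maximum length of 30 words come from the text.

```python
from collections import Counter

def build_vocab(sentences, min_count=5,
                specials=("<pad>", "<bos>", "<eos>", "<unk>")):
    """Map every word seen at least `min_count` times to an integer id."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(list(specials) + words)}

def encode(sentence, vocab, max_len=30):
    """Truncate to `max_len` words and replace rare words with <unk>."""
    tokens = sentence.lower().split()[:max_len]
    unk = vocab["<unk>"]
    return [vocab.get(w, unk) for w in tokens]
```

Words below the frequency threshold are mapped to a single unknown token, which keeps the output layer of the word RNN small.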
We first train the event captioning module on ground-truth proposals using the cross-entropy loss. Afterwards, to address the exposure bias problem and boost performance, we continue training the model with self-critical sequence training (SCST) on learnt event sequences. For each video, we sample 24 event sequences from the output of DBG for SCST. The reward is the METEOR score.
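The SCST objective can be summarized by the sketch below. It assumes the baseline is the reward of the greedily decoded caption, as in the original SCST formulation; the text does not state which baseline is used here. Rewards (METEOR scores) are passed in as plain numbers, and in practice the log-probabilities would come from an autograd framework so that the loss can be differentiated.

```python
def scst_loss(sample_logprobs, sample_reward, baseline_reward):
    """Self-critical loss for one sampled caption.

    sample_logprobs: per-token log-probabilities of the sampled caption.
    sample_reward:   reward (e.g. METEOR) of the sampled caption.
    baseline_reward: reward of the baseline caption (greedy decoding assumed).
    """
    advantage = sample_reward - baseline_reward
    # Minimizing this raises the likelihood of samples that beat the baseline.
    return -advantage * sum(sample_logprobs)

def batch_scst_loss(batch):
    """Average loss over (logprobs, sample_reward, baseline_reward) triples."""
    losses = [scst_loss(lp, r, b) for lp, r, b in batch]
    return sum(losses) / len(losses)
```

Because the reward is computed on full sampled captions, the model is trained on its own outputs, which is how SCST addresses exposure bias.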
Dataset. We use the ActivityNet Captions dataset to evaluate the performance of our method. We follow the official split, which assigns 10,009/4,917/5,044 videos for training/validation/testing. In our final submission, we train the event captioning model using SCST with an enlarged training set. Specifically, we randomly select about 80% of the videos in the validation set and add them to the official training set. The modified split contains 13,926/1,000 videos for training/validation.
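The enlarged split can be reproduced in spirit with the sketch below. Holding out exactly 1,000 validation videos is inferred from the reported split sizes (10,009 + 3,917 = 13,926); the function name, the seed, and the use of integer video ids are hypothetical.

```python
import random

def enlarge_training_set(train_ids, val_ids, n_val_kept=1000, seed=0):
    """Move all but `n_val_kept` randomly chosen validation videos into training."""
    rng = random.Random(seed)
    shuffled = list(val_ids)
    rng.shuffle(shuffled)
    kept_val = shuffled[:n_val_kept]
    moved = shuffled[n_val_kept:]
    return list(train_ids) + moved, kept_val

# Sizes follow the official split reported above.
new_train, new_val = enlarge_training_set(range(10009), range(10009, 10009 + 4917))
```

With these sizes, 3,917 of the 4,917 validation videos (roughly 80%) are moved to the training set.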
We use the official evaluation tool to measure the ability of our model in both localizing and captioning events. Specifically, the average precision is computed across tIoU thresholds of 0.3, 0.5, 0.7, and 0.9. The precision of generated captions is measured by traditional evaluation metrics in video captioning: BLEU, METEOR, and CIDEr.
3.2 Performance Evaluation
We show the ablation study for event captioning in Table 1. The first row of the table shows that the lack of event-event interaction leads to poor caption quality. When the sentence RNN or TSRM is incorporated, the generated sentence is aware of previous events, which yields a significant performance improvement. The proposed CMG provides more effective multi-modal fusion, which further boosts the captioning capability of the hierarchical RNN.
We also investigate the effect of training schemes and model ensembling in Table 2. When using SCST after the cross-entropy training, the METEOR score on the validation set (ground-truth/learnt proposals) increases considerably from 11.49/7.65 to 14.18/9.71. Using the enlarged training set improves the score further to 14.64/10.26. Our single model achieves a 9.17 METEOR score on the challenging test set. In our final submission, we use an ensemble of three models with different seeds, which achieves a 9.28 METEOR score.
In this paper, we present a dense video captioning system with two plug-and-play modules, i.e. TSRM and CMG. TSRM aims to enhance the event-level representation by capturing rich relationships between events in terms of both temporal structure and semantic meaning. CMG is designed to effectively fuse the linguistic features and visual features in hierarchical RNN. Experimental results on ActivityNet Captions verify the effectiveness of our model.
This work was supported in part by the National Natural Science Foundation of China under Grant 61976231, Grant U1611461, Grant 61172141, Grant 61673402, and Grant 61573387.
- (2019) ActivityNet 2019 task 3: exploring contexts for dense captioning events in videos. arXiv preprint arXiv:1907.05092. Cited by: §1, §2.2, §2.3.
- (2017) Dense-captioning events in videos. In Proc IEEE Int Conf Comput Vis, pp. 706–715. Cited by: §1.
- (2019) Fast learning of temporal action proposal via dense boundary generator. arXiv preprint arXiv:1911.04127. Cited by: §2.2, §3.1.
- Streamlined dense video captioning. In Proc IEEE Conf Comput Vis Pattern Recognit, pp. 6588–6597. Cited by: §2.2, §2.3.
- Self-critical sequence training for image captioning. In Proc IEEE Conf Comput Vis Pattern Recognit, pp. 7008–7024. Cited by: §3.1.
- (2017) Attention is all you need. In Proc Adv Neural Inf Proc Sys, pp. 5998–6008. Cited by: §2.3.
- (2018) Bidirectional attentive fusion with context gating for dense video captioning. In Proc IEEE Conf Comput Vis Pattern Recognit, pp. 7190–7198. Cited by: §1, §2.3.
- (2020) Event-centric hierarchical representation for dense video captioning. Submitted to IEEE Trans. Circuits Syst. Video Technol. Cited by: §2.3.
- (2016) CUHK & ETHZ & SIAT submission to ActivityNet challenge 2016. arXiv preprint arXiv:1608.00797. Cited by: §2.1.
- (2018) Hierarchical context encoding for events captioning in videos. In Proc IEEE Int Conf Img Process, pp. 1288–1292. Cited by: §1.
- Video paragraph captioning using hierarchical recurrent neural networks. In Proc IEEE Conf Comput Vis Pattern Recognit, pp. 4584–4593. Cited by: §2.3.