Best Vision Technologies Submission to ActivityNet Challenge 2018-Task: Dense-Captioning Events in Videos

06/25/2018, by Yuan Liu, et al.

This note describes the details of our solution to the dense-captioning events in videos task of ActivityNet Challenge 2018. Specifically, we solve this problem in a two-stage manner, i.e., first temporal event proposal and then sentence generation. For temporal event proposal, we directly leverage the three-stage workflow in [13, 16]. For sentence generation, we capitalize on an LSTM-based captioning framework with a temporal attention mechanism (dubbed LSTM-T). Moreover, the input visual sequence to the LSTM-based video captioning model is comprised of RGB and optical flow images. At inference, we adopt a late fusion scheme to fuse the two LSTM-based captioning models for sentence generation.


1 Sentence Generation Model

Inspired by the recent successes of LSTM-based sequence models in image/video captioning [1, 3, 5, 6, 7, 10, 11, 12, 14, 15], we formulate our sentence generation model in an end-to-end fashion based on LSTM, which encodes the input frame/optical flow sequence into a fixed-dimensional vector via a temporal attention mechanism and then decodes it into each target output word. An overview of our sentence generation model is illustrated in Figure 1.

Figure 1: The sentence generation model in our system for dense-captioning events in videos task.
   Model                        B@1     B@2     B@3     B@4     M      R       C
   LSTM-T (frame only)          12.71   7.24    4.01    1.99    8.99   14.67   13.82
   LSTM-T (optical flow only)   12.46   7.08    3.96    1.97    8.72   14.55   13.60
   LSTM-T (late fusion)         13.19   7.75    4.48    2.31    9.26   15.18   14.97
Table 1: Performance on the ActivityNet Captions validation set, where B@N, M, R and C are short for BLEU@N, METEOR, ROUGE-L and CIDEr-D scores. All values are reported as percentages (%).

In particular, given the input video with frame and optical flow sequences, each input frame/optical flow sequence $\{v_i\}_{i=1}^{N}$ is fed into a two-layer LSTM with temporal attention mechanism. At each time step $t$, the attention LSTM decoder firstly collects the maximum contextual information by concatenating the input word $w_t$ with the previous output $h^2_{t-1}$ of the second-layer LSTM unit and the mean-pooled video-level representation $\bar{v}$, which is set as the input of the first-layer LSTM unit. Hence the updating procedure for the first-layer LSTM unit is

    h^1_t = f_1([W_e w_t; h^2_{t-1}; \bar{v}], h^1_{t-1}),    (1)

where $W_e$ is the transformation matrix for the input word $w_t$, $h^1_t$ is the output of the first-layer LSTM unit, and $f_1$ is the updating function within the first-layer LSTM unit. Next, depending on the output $h^1_t$ of the first-layer LSTM unit, a normalized attention distribution over all the frame/optical flow features is generated as:

    a_{t,i} = w_a^T tanh(W_{va} v_i + W_{ha} h^1_t),    \lambda_t = softmax(a_t),    (2)

where $a_{t,i}$ is the $i$-th element of $a_t$, and $W_{va}$, $W_{ha}$ and $w_a$ are transformation matrices. $\lambda_t$ denotes the normalized attention distribution and its $i$-th element $\lambda_{t,i}$ is the attention probability of the frame/optical flow feature $v_i$. Based on the attention distribution, we calculate the attended video-level representation $\hat{v}_t = \sum_{i=1}^{N} \lambda_{t,i} v_i$ by aggregating all the frame/optical flow features weighted with attention. We further concatenate the attended video-level feature $\hat{v}_t$ with $h^1_t$ and feed them into the second-layer LSTM unit, whose updating procedure is thus given as:

    h^2_t = f_2([\hat{v}_t; h^1_t], h^2_{t-1}),    (3)

where $f_2$ is the updating function within the second-layer LSTM unit. The output $h^2_t$ of the second-layer LSTM unit is leveraged to predict the next word $w_{t+1}$ through a softmax layer. Note that the policy gradient optimization method with reinforcement learning [4, 9] is additionally leveraged to boost the sentence generation performance specific to the METEOR metric.
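To make the decoding step concrete, the following is a minimal PyTorch sketch of one step of the two-layer attention LSTM corresponding to Eqs. (1)-(3). The feature, hidden and attention dimensions follow Section 2.1; the word-embedding size and all class/variable names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    # One decoding step of the two-layer attention LSTM (Eqs. 1-3).
    # emb_dim=300 is an assumed word-embedding size (not reported in the note).
    def __init__(self, vocab_size, feat_dim=2048, hid_dim=1000, att_dim=512, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)                    # W_e
        self.lstm1 = nn.LSTMCell(emb_dim + hid_dim + feat_dim, hid_dim)   # f_1
        self.lstm2 = nn.LSTMCell(feat_dim + hid_dim, hid_dim)             # f_2
        self.W_va = nn.Linear(feat_dim, att_dim, bias=False)
        self.W_ha = nn.Linear(hid_dim, att_dim, bias=False)
        self.w_a = nn.Linear(att_dim, 1, bias=False)
        self.classifier = nn.Linear(hid_dim, vocab_size)                  # softmax layer

    def forward(self, word_t, feats, state1, state2):
        # word_t: (B,) previous word ids; feats: (B, N, feat_dim) frame/flow features
        v_bar = feats.mean(dim=1)                                         # mean-pooled video feature
        # Eq. (1): first-layer input is [W_e w_t ; h^2_{t-1} ; v_bar]
        x1 = torch.cat([self.embed(word_t), state2[0], v_bar], dim=1)
        h1, c1 = self.lstm1(x1, state1)
        # Eq. (2): temporal attention over all features, conditioned on h^1_t
        a = self.w_a(torch.tanh(self.W_va(feats) + self.W_ha(h1).unsqueeze(1))).squeeze(2)
        lam = F.softmax(a, dim=1)                                         # (B, N) attention weights
        v_hat = (lam.unsqueeze(2) * feats).sum(dim=1)                     # attended video feature
        # Eq. (3): second-layer input is [v_hat ; h^1_t]
        h2, c2 = self.lstm2(torch.cat([v_hat, h1], dim=1), state2)
        logits = self.classifier(h2)                                      # predicts the next word
        return logits, (h1, c1), (h2, c2)
```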

2 Experiments

2.1 Features and Parameter Settings

Each word in the sentence is represented as a "one-hot" vector (a binary index vector over the vocabulary). For the input video representations, we take the output of the 2048-way layer from P3D ResNet [8] pre-trained on the Kinetics dataset [2] as the frame/optical flow representation. The dimension of the hidden layer in each LSTM is set to 1,000, and the dimension of the hidden layer for measuring the attention distribution is set to 512.
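As a small illustration of these representations, the snippet below lays out the corresponding tensor shapes; the vocabulary size and the number of sampled frames are hypothetical placeholders, since the note does not report them.

```python
import numpy as np

vocab_size, num_frames = 10000, 30                    # hypothetical values
feat_dim, lstm_hidden, att_hidden = 2048, 1000, 512   # as stated in Section 2.1

# "One-hot" word representation: a binary index vector over the vocabulary.
word_id = 42
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[word_id] = 1.0

# P3D ResNet features for the frame (or optical flow) stream of one video.
frame_feats = np.random.randn(num_frames, feat_dim).astype(np.float32)
print(one_hot.shape, frame_feats.shape)               # (10000,) (30, 2048)
```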

2.2 Results

Two slightly different settings of our LSTM-T, trained with only the frame sequence and only the optical flow sequence, are denoted as LSTM-T (frame only) and LSTM-T (optical flow only), respectively. Table 1 shows the performances of our models on the ActivityNet Captions validation set. The results clearly indicate that by utilizing both frame and optical flow sequences in a late fusion manner, our LSTM-T (late fusion) boosts the performance across all metrics.
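The note does not spell out the exact fusion rule, so the following is only a minimal sketch of one plausible late-fusion scheme at inference time: averaging the per-step word distributions of the frame and optical flow captioning models before choosing the next word. The equal-weight average and the function name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def fused_step(logits_frame, logits_flow, alpha=0.5):
    """Fuse the softmax outputs of the two streams for one decoding step."""
    p_frame = F.softmax(logits_frame, dim=-1)
    p_flow = F.softmax(logits_flow, dim=-1)
    p_fused = alpha * p_frame + (1.0 - alpha) * p_flow
    return p_fused.argmax(dim=-1)          # greedy choice of the next word

# Usage: logits_* come from the two LSTM-T decoders at the same time step.
next_word = fused_step(torch.randn(1, 10000), torch.randn(1, 10000))
```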

3 Conclusions

In this challenge, we mainly focus on the dense-captioning events in videos task and present a system that leverages a three-stage workflow for temporal event proposal and an LSTM-based captioning model with temporal attention mechanism for sentence generation. One possible future research direction is to formulate the whole dense-captioning events in videos system in an end-to-end manner.

References