1 Sentence Generation Model
We formulate our sentence generation model in an end-to-end fashion based on LSTM, which encodes the input frame/optical flow sequence into a fixed-dimensional vector via a temporal attention mechanism and then decodes it into each target output word. An overview of our sentence generation model is illustrated in Figure 1.
In particular, given the input video with frame and optical flow sequences, each input frame/optical flow sequence ($\{v_i\}_{i=1}^{K}$) is fed into a two-layer LSTM with an attention mechanism. At each time step $t$, the attention LSTM decoder first collects the maximum contextual information by concatenating the input word $w_t$ with the previous output of the second-layer LSTM unit $h^2_{t-1}$ and the mean-pooled video-level representation $\bar{v} = \frac{1}{K}\sum_{i=1}^{K} v_i$. This concatenation is set as the input of the first-layer LSTM unit. Hence the updating procedure for the first-layer LSTM unit is

$$h^1_t = f_1\left(\left[h^2_{t-1},\, W_s w_t,\, \bar{v}\right]\right),$$

where $W_s$ is the transformation matrix for the input word $w_t$, $h^1_t$ is the output of the first-layer LSTM unit, and $f_1$ is the updating function within the first-layer LSTM unit. Next, depending on the output $h^1_t$ of the first-layer LSTM unit, a normalized attention distribution over all the frame/optical flow features is generated as

$$a_{t,i} = W_a \tanh\left(W_f v_i + W_h h^1_t\right), \qquad \lambda_t = \mathrm{softmax}\left(a_t\right),$$

where $a_{t,i}$ is the $i$-th element of $a_t$, and $W_a$, $W_f$, and $W_h$ are transformation matrices. $\lambda_t$ denotes the normalized attention distribution, and its $i$-th element $\lambda_{t,i}$ is the attention probability of $v_i$. Based on the attention distribution, we calculate the attended video-level representation $\hat{v}_t = \sum_{i=1}^{K} \lambda_{t,i} v_i$ by aggregating all the frame/optical flow features weighted with attention. We further concatenate the attended video-level feature $\hat{v}_t$ with $h^1_t$ and feed them into the second-layer LSTM unit, whose updating procedure is thus given as

$$h^2_t = f_2\left(\left[\hat{v}_t,\, h^1_t\right]\right),$$

where $f_2$ is the updating function within the second-layer LSTM unit. The output of the second-layer LSTM unit $h^2_t$ is leveraged to predict the next word $w_{t+1}$. In addition, policy gradient optimization with reinforcement learning [4, 9] is leveraged to boost the sentence generation performance specifically on the METEOR metric.
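As a rough illustration of the decoding step described above, the following NumPy sketch runs one time step of a two-layer attention LSTM. It is a minimal sketch under stated assumptions, not the actual implementation: all parameter names (`W_s`, `W_f`, `W_h`, `w_a`, `W_lstm1`, `W_lstm2`, `W_out`), the word-embedding size, the additive (tanh) scoring function, the softmax output layer, the toy dimensions, and the random initialization are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def lstm_step(x, h_prev, c_prev, W, b):
    """Standard LSTM cell update; W maps [x, h_prev] to four stacked gates."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def decode_step(word_onehot, feats, state, p):
    """One time step of the two-layer attention LSTM decoder.

    feats has shape (K, D): one D-dimensional feature per frame/optical flow clip.
    """
    h1_prev, c1_prev, h2_prev, c2_prev = state
    v_mean = feats.mean(axis=0)  # mean-pooled video-level representation
    # First layer input: previous second-layer output, embedded word, mean-pooled feature.
    x1 = np.concatenate([h2_prev, p["W_s"] @ word_onehot, v_mean])
    h1, c1 = lstm_step(x1, h1_prev, c1_prev, p["W_lstm1"], p["b_lstm1"])
    # Additive attention over the K features, conditioned on the first-layer output.
    scores = np.tanh(feats @ p["W_f"].T + p["W_h"] @ h1) @ p["w_a"]  # shape (K,)
    attn = softmax(scores)
    v_att = attn @ feats  # attention-weighted sum of the features
    # Second layer input: attended feature concatenated with the first-layer output.
    x2 = np.concatenate([v_att, h1])
    h2, c2 = lstm_step(x2, h2_prev, c2_prev, p["W_lstm2"], p["b_lstm2"])
    word_probs = softmax(p["W_out"] @ h2)  # assumed softmax output layer
    return word_probs, attn, (h1, c1, h2, c2)

def init_params(V, E, D, H, A, seed=0):
    """Random toy parameters: vocabulary V, embedding E, feature D, hidden H, attention A."""
    rng = np.random.default_rng(seed)
    r = lambda *shape: 0.1 * rng.standard_normal(shape)
    return {
        "W_s": r(E, V), "W_f": r(A, D), "W_h": r(A, H), "w_a": r(A),
        "W_lstm1": r(4 * H, (H + E + D) + H), "b_lstm1": np.zeros(4 * H),
        "W_lstm2": r(4 * H, (D + H) + H), "b_lstm2": np.zeros(4 * H),
        "W_out": r(V, H),
    }

# Toy usage with small, hypothetical dimensions.
V, E, D, H, A, K = 12, 8, 16, 10, 6, 5
params = init_params(V, E, D, H, A)
feats = np.random.default_rng(1).standard_normal((K, D))
state = tuple(np.zeros(H) for _ in range(4))
word_probs, attn, state = decode_step(np.eye(V)[3], feats, state, params)
```

In this sketch, `attn` plays the role of the normalized attention distribution over the input features, and `word_probs` is the predicted distribution over the next word.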
2.1 Features and Parameter Settings
Each word in the sentence is represented as a “one-hot” vector (a binary index vector over the vocabulary). For the input video representations, we take the output of the 2048-way layer from P3D ResNet [8] pre-trained on the Kinetics dataset [2] as the frame/optical flow representation. The dimension of the hidden layer in each LSTM is set to 1,000. The dimension of the hidden layer for measuring the attention distribution is set to 512.
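The settings above can be summarized in a short sketch. The three dimensions come from the text; the toy vocabulary, the constant names, and the `one_hot` helper are hypothetical and only illustrate the representation.

```python
import numpy as np

# Dimensions stated in the text.
FEATURE_DIM = 2048      # P3D ResNet output per frame/optical flow input
LSTM_HIDDEN_DIM = 1000  # hidden layer size of each LSTM
ATTENTION_DIM = 512     # hidden layer size used to measure attention

def one_hot(word, vocab):
    """Return a binary index vector with a single 1 at the word's vocabulary index."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[vocab[word]] = 1.0
    return vec

# Toy vocabulary (hypothetical); a real vocabulary is built from the training captions.
vocab = {"<bos>": 0, "<eos>": 1, "a": 2, "man": 3, "runs": 4}
sentence = ["a", "man", "runs"]
encoded = np.stack([one_hot(w, vocab) for w in sentence])  # shape (3, 5)
```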
We also evaluate two slightly different settings of our LSTM-T, which are trained with only the frame sequence and only the optical flow sequence, respectively. Table 1 shows the performance of our models on the ActivityNet Captions validation set. The results clearly indicate that utilizing both frame and optical flow sequences in a late fusion manner improves the performance of our LSTM-T.
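The text does not specify how the late fusion is carried out. One common option, shown here purely as an assumption, is to average the per-step word probability distributions predicted by the frame-based and optical-flow-based models. The function name `late_fuse` and the `frame_weight` parameter are hypothetical.

```python
import numpy as np

def late_fuse(frame_probs, flow_probs, frame_weight=0.5):
    """Weighted average of two word distributions, renormalized to sum to 1."""
    frame_probs = np.asarray(frame_probs, dtype=np.float64)
    flow_probs = np.asarray(flow_probs, dtype=np.float64)
    fused = frame_weight * frame_probs + (1.0 - frame_weight) * flow_probs
    return fused / fused.sum(axis=-1, keepdims=True)

# Toy distributions over a three-word vocabulary (made-up numbers).
frame_probs = [0.7, 0.2, 0.1]
flow_probs = [0.2, 0.5, 0.3]
fused = late_fuse(frame_probs, flow_probs)
next_word_index = int(np.argmax(fused))
```

With equal weights, the fused distribution is the plain average of the two streams, so a word is selected only if it scores well across both models.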
In this challenge, we mainly focus on the dense-captioning events in videos task and present a system that leverages a three-stage workflow for temporal event proposal and an LSTM-based captioning model with a temporal attention mechanism for sentence generation. One possible future research direction is how to formulate the whole dense-captioning events in videos system in an end-to-end manner.
- [1] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
- [2] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
- [3] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei. Jointly localizing and describing events for dense video captioning. In CVPR, 2018.
- [4] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy. Optimization of image description metrics using policy gradient methods. In ICCV, 2017.
- [5] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR, 2016.
- [6] Y. Pan, Z. Qiu, T. Yao, H. Li, and T. Mei. Seeing bot. In SIGIR, 2017.
- [7] Y. Pan, T. Yao, H. Li, and T. Mei. Video captioning with transferred semantic attributes. In CVPR, 2017.
- [8] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, 2017.
- [9] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563, 2016.
- [10] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence-video to text. In ICCV, 2015.
- [11] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
- [12] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In ICCV, 2015.
- [13] T. Yao, Y. Li, Z. Qiu, F. Long, Y. Pan, D. Li, and T. Mei. MSR Asia MSM at ActivityNet Challenge 2017: Trimmed action recognition, temporal action proposals and dense-captioning events in videos. In CVPR ActivityNet Challenge Workshop, 2017.
- [14] T. Yao, Y. Pan, Y. Li, and T. Mei. Incorporating copying mechanism in image captioning for learning novel objects. In CVPR, 2017.
- [15] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei. Boosting image captioning with attributes. In ICCV, 2017.
- [16] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, D. Lin, and X. Tang. Temporal action detection with structured segment networks. arXiv preprint arXiv:1704.06228, 2017.