Towards Diverse Paragraph Captioning for Untrimmed Videos

05/30/2021
by   Yuqing Song, et al.
0

Video paragraph captioning aims to describe multiple events in untrimmed videos with descriptive paragraphs. Existing approaches mainly solve the problem in two steps: event detection and then event captioning. Such two-step manner makes the quality of generated paragraphs highly dependent on the accuracy of event proposal detection which is already a challenging task. In this paper, we propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos. To describe coherent and diverse events, we propose to enhance the conventional temporal attention with dynamic video memories, which progressively exposes new video features and suppresses over-accessed video contents to control visual focuses of the model. In addition, a diversity-driven training strategy is proposed to improve diversity of paragraph on the language perspective. Considering that untrimmed videos generally contain massive but redundant frames, we further augment the video encoder with keyframe awareness to improve efficiency. Experimental results on the ActivityNet and Charades datasets show that our proposed model significantly outperforms the state-of-the-art performance on both accuracy and diversity metrics without using any event boundary annotations. Code will be released at https://github.com/syuqings/video-paragraph.

READ FULL TEXT

page 3

page 8

research
07/11/2019

Activitynet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos

Contextual reasoning is essential to understand events in long untrimmed...
research
06/14/2020

Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning

Detecting meaningful events in an untrimmed video is essential for dense...
research
05/22/2022

GL-RG: Global-Local Representation Granularity for Video Captioning

Video captioning is a challenging task as it needs to accurately transfo...
research
11/09/2015

Detecting events and key actors in multi-person videos

Multi-person event recognition is a challenging task, often with many pe...
research
11/28/2016

Bidirectional Multirate Reconstruction for Temporal Modeling in Videos

Despite the recent success of neural networks in image feature learning,...
research
06/17/2023

A New Perspective for Shuttlecock Hitting Event Detection

This article introduces a novel approach to shuttlecock hitting event de...
research
07/18/2022

Unifying Event Detection and Captioning as Sequence Generation via Pre-Training

Dense video captioning aims to generate corresponding text descriptions ...

Please sign up or login with your details

Forgot password? Click here to reset