Unifying Event Detection and Captioning as Sequence Generation via Pre-Training

07/18/2022
by Qi Zhang, et al.

Dense video captioning aims to generate text descriptions for a series of events in an untrimmed video, and can be divided into two sub-tasks: event detection and event captioning. Unlike previous works that tackle the two sub-tasks separately, recent works have focused on enhancing the inter-task association between them. However, designing inter-task interactions for event detection and captioning is not trivial due to the large differences in their task-specific solutions. Besides, previous event detection methods normally ignore temporal dependencies between events, leading to event redundancy or inconsistency problems. To tackle these two defects, in this paper we define event detection as a sequence generation task and propose a unified pre-training and fine-tuning framework to naturally enhance the inter-task association between event detection and captioning. Since the model predicts each event with previous events as context, the inter-dependency between events is fully exploited, and thus our model can detect more diverse and consistent events in the video. Experiments on the ActivityNet dataset show that our model outperforms the state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data. Code is available at <https://github.com/QiQAng/UEDVC>.
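
To make "predicts each event with previous events as context" concrete, the sketch below shows a generic autoregressive loop over events. It is not the authors' implementation: `generate_events`, `toy_predictor`, the `(start, end)` event representation, and the 10-second/30-second values are all hypothetical and chosen only for illustration. In a real system the predictor would be a learned model conditioned on video features.

```python
from typing import Callable, List, Optional, Tuple

# Assumption: an event is a (start_sec, end_sec) pair.
Event = Tuple[float, float]


def generate_events(
    predict_next: Callable[[List[Event]], Optional[Event]],
    max_events: int = 10,
) -> List[Event]:
    """Generate events one at a time, feeding all previous events back in."""
    events: List[Event] = []
    for _ in range(max_events):
        nxt = predict_next(events)  # the history is the context for this step
        if nxt is None:             # None plays the role of an end-of-sequence signal
            break
        events.append(nxt)
    return events


def toy_predictor(history: List[Event]) -> Optional[Event]:
    """Hypothetical stand-in for a learned model.

    Proposes a 10-second segment starting where the last event ended,
    and stops once 30 seconds have been covered.
    """
    start = history[-1][1] if history else 0.0
    if start >= 30.0:
        return None
    return (start, start + 10.0)


if __name__ == "__main__":
    print(generate_events(toy_predictor))
```

Because each step sees the events already emitted, a predictor of this shape can avoid proposing a segment it has already produced, which is the intuition behind the reduced redundancy claimed in the abstract.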


