End-to-end Dense Video Captioning as Sequence Generation

04/18/2022
by Wanrong Zhu, et al.

Dense video captioning aims to identify the events of interest in an input video and to generate a descriptive caption for each event. Previous approaches usually follow a two-stage generative process, which first proposes a segment for each event and then renders a caption for each identified segment. Recent advances in large-scale sequence generation pretraining have had great success in unifying the formulation of a wide variety of tasks, but so far, more complex tasks such as dense video captioning have not been able to fully utilize this powerful paradigm. In this work, we show how to model the two subtasks of dense video captioning jointly as one sequence generation task, simultaneously predicting the events and their corresponding descriptions. Experiments on YouCook2 and ViTT show encouraging results and indicate that it is feasible to train complex tasks such as end-to-end dense video captioning as part of large-scale pre-trained models.
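To make the idea of "one sequence generation task" concrete, here is a minimal sketch of how event segments and their captions could be flattened into a single target string and recovered again. The abstract does not specify the paper's actual target format, so everything here is an assumption for illustration: the `<tN>` time tokens, the `NUM_BINS` value, and the `quantize`, `serialize`, and `deserialize` helpers are hypothetical names, not the authors' implementation.

```python
import re
from typing import List, Tuple

# Hypothetical event type: (start_sec, end_sec, caption).
Event = Tuple[float, float, str]

# Assumed number of discrete time bins; not taken from the paper.
NUM_BINS = 100


def quantize(t: float, duration: float, num_bins: int = NUM_BINS) -> int:
    """Map a timestamp in seconds to a bin index in [0, num_bins - 1]."""
    frac = min(max(t / duration, 0.0), 1.0)
    return min(int(frac * num_bins), num_bins - 1)


def serialize(events: List[Event], duration: float) -> str:
    """Flatten events into one string: '<tS> <tE> caption <tS> <tE> caption ...'."""
    parts = []
    for start, end, caption in sorted(events, key=lambda ev: ev[0]):
        s = quantize(start, duration)
        e = quantize(end, duration)
        parts.append(f"<t{s}> <t{e}> {caption.strip()}")
    return " ".join(parts)


# A caption runs until the next pair of time tokens or the end of the string.
_EVENT_RE = re.compile(r"<t(\d+)> <t(\d+)> (.*?)(?= <t\d+> <t\d+> |$)")


def deserialize(sequence: str, duration: float,
                num_bins: int = NUM_BINS) -> List[Event]:
    """Recover (start_sec, end_sec, caption) triples from a flattened string."""
    events = []
    for s, e, caption in _EVENT_RE.findall(sequence):
        start = int(s) * duration / num_bins
        end = int(e) * duration / num_bins
        events.append((start, end, caption))
    return events


if __name__ == "__main__":
    demo = [(0.0, 12.0, "crack the eggs"), (12.0, 30.0, "whisk the mixture")]
    target = serialize(demo, duration=60.0)
    print(target)
    print(deserialize(target, duration=60.0))
```

Under this assumed format, a single sequence model could emit segment boundaries and captions in one pass, in contrast to the two-stage pipeline of proposing segments first and captioning them afterwards. Quantizing timestamps into bins loses precision below `duration / NUM_BINS` seconds, which is a design trade-off of this sketch rather than a claim about the paper.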

research
02/28/2018

Joint Event Detection and Description in Continuous Video Streams

As a fine-grained video understanding task, dense video captioning invol...
research
07/18/2022

Unifying Event Detection and Captioning as Sequence Generation via Pre-Training

Dense video captioning aims to generate corresponding text descriptions ...
research
02/27/2023

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

In this work, we introduce Vid2Seq, a multi-modal single-stage dense eve...
research
04/12/2022

Video Captioning: a comparative review of where we are and which could be the route

Video captioning is the process of describing the content of a sequence ...
research
03/04/2023

CapDet: Unifying Dense Captioning and Open-World Detection Pretraining

Benefiting from large-scale vision-language pre-training on image-text p...
research
05/02/2017

Dense-Captioning Events in Videos

Most natural videos contain numerous events. For example, in a video of ...
research
11/10/2020

Multimodal Pretraining for Dense Video Captioning

Learning specific hands-on skills such as cooking, car maintenance, and ...
