End-to-End Dense Video Captioning with Masked Transformer

04/03/2018
by Luowei Zhou, et al.

Dense video captioning aims to generate text descriptions for all events in an untrimmed video, which involves both detecting and describing events. Consequently, all previous methods tackle dense video captioning with two models, an event proposal model and a captioning model, trained either separately or in alternation. This prevents the language description from directly influencing the event proposals, which is important for generating accurate descriptions. To address this problem, we propose an end-to-end transformer model for dense video captioning. The encoder encodes the video into appropriate representations. The proposal decoder decodes from the encoding with different anchors to form video event proposals. The captioning decoder employs a masking network to restrict its attention to the proposed event over the encoded features. This masking network converts each event proposal into a differentiable mask, which ensures consistency between the proposal and the caption during training. In addition, our model employs a self-attention mechanism, which enables the use of an efficient non-recurrent structure during encoding and leads to performance improvements. We demonstrate the effectiveness of this end-to-end model on the ActivityNet Captions and YouCookII datasets, where it achieves METEOR scores of 10.12 and 6.58, respectively.
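The masking network is the component that ties the two decoders together: it maps a proposal's predicted temporal extent to a soft 0-to-1 weighting over the encoded time steps, so gradients from the captioning loss can flow back into the proposal decoder. Below is a minimal PyTorch-style sketch of this idea; the sigmoid-window parameterization, the function name, and the `sharpness` parameter are illustrative assumptions rather than the paper's exact formulation (the paper learns the masking with a small network).

```python
import torch

def differentiable_event_mask(center, length, num_steps, sharpness=10.0):
    """Turn a predicted proposal (center, length, both normalized to [0, 1])
    into a soft, differentiable mask over `num_steps` encoded time steps.

    Values near 1 keep a time step visible to the captioning decoder's
    attention; values near 0 suppress it. Larger `sharpness` makes the
    soft window approximate a hard 0/1 segment more closely.
    (Hypothetical parameterization, for illustration only.)
    """
    t = torch.linspace(0.0, 1.0, num_steps)          # normalized time axis
    start = center - length / 2
    end = center + length / 2
    # The product of two opposing sigmoids forms a soft rectangular
    # window over [start, end]; every operation is differentiable.
    return torch.sigmoid(sharpness * (t - start)) * torch.sigmoid(sharpness * (end - t))

# Example: a proposal centered at 0.5 covering 40% of a 100-step video.
mask = differentiable_event_mask(torch.tensor(0.5), torch.tensor(0.4), 100)
# The captioning decoder could then weight the encoded features by this
# mask before attending, keeping proposal and caption consistent.
```

Because the window is built entirely from differentiable operations, the captioning loss can adjust the proposal's center and length directly, which is exactly the end-to-end coupling the abstract argues for.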

Related research

02/28/2018 · Joint Event Detection and Description in Continuous Video Streams
As a fine-grained video understanding task, dense video captioning invol...

03/31/2018 · Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
Dense video captioning is a newly emerging task that aims at both locali...

01/04/2022 · Variational Stacked Local Attention Networks for Diverse Video Captioning
While describing Spatio-temporal events in natural language, video capti...

11/25/2021 · SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning
The canonical approach to video captioning dictates a caption generation...

08/17/2021 · End-to-End Dense Video Captioning with Parallel Decoding
Dense video captioning aims to generate multiple associated captions wit...

07/18/2022 · Unifying Event Detection and Captioning as Sequence Generation via Pre-Training
Dense video captioning aims to generate corresponding text descriptions ...

01/06/2023 · End-to-End 3D Dense Captioning with Vote2Cap-DETR
3D dense captioning aims to generate multiple captions localized with th...
