DVCFlow: Modeling Information Flow Towards Human-like Video Captioning

11/19/2021
by Xu Yan, et al.

Dense video captioning (DVC) aims to generate multi-sentence descriptions that elucidate the multiple events in a video. The task is challenging and demands visual consistency, discourse coherence, and linguistic diversity. Existing methods mainly generate captions from individual video segments; they lack adaptation to the global visual context and progressive alignment between the fast-evolving visual content and the textual descriptions, which results in redundant and spliced descriptions. In this paper, we introduce the concept of information flow to model the progressive change of information across the video sequence and the captions. By designing a Cross-modal Information Flow Alignment mechanism, we capture and align the visual and textual information flows, which endows the captioning process with richer context and dynamics on event/topic evolution. Based on the Cross-modal Information Flow Alignment module, we further put forward the DVCFlow framework, which consists of a Global-local Visual Encoder that captures both global and local features for each video segment, and a pre-trained Caption Generator that produces the captions. Extensive experiments on the popular ActivityNet Captions and YouCookII datasets demonstrate that our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.
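
The abstract does not specify how the information flows are computed or aligned, so the following is only an illustrative sketch and not the authors' implementation. It assumes that a "flow" can be approximated as the change between consecutive segment (or sentence) feature vectors, that visual and textual features already live in a shared embedding space, and that alignment is scored by cosine similarity. The function names `information_flow`, `cosine`, and `flow_alignment_loss` are hypothetical.

```python
import math


def information_flow(features):
    """Approximate a flow as the difference between consecutive feature vectors.

    Assumption: `features` is a list of equal-length vectors, one per video
    segment (or per caption sentence), in temporal order.
    """
    return [
        [b - a for a, b in zip(prev, curr)]
        for prev, curr in zip(features, features[1:])
    ]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def flow_alignment_loss(visual_feats, text_feats):
    """Mean (1 - cosine) between visual and textual flows, step by step.

    Assumption: both sequences have the same length and dimensionality,
    i.e. one sentence per video segment in a shared embedding space.
    """
    v_flow = information_flow(visual_feats)
    t_flow = information_flow(text_feats)
    if len(v_flow) != len(t_flow):
        raise ValueError("visual and textual sequences must have equal length")
    if not v_flow:
        return 0.0  # fewer than two segments: there is no flow to align
    return sum(1.0 - cosine(v, t) for v, t in zip(v_flow, t_flow)) / len(v_flow)
```

Under these assumptions, the loss is near 0 when the captions change in the same direction as the visual content from one segment to the next, and approaches 2 when they change in opposite directions. A learned alignment module would replace the fixed cosine score in practice.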


