Discriminative Latent Semantic Graph for Video Captioning

08/08/2021
by   Yang Bai, et al.
8

Video captioning aims to automatically generate natural language sentences that can describe the visual contents of a given video. Existing generative models like encoder-decoder frameworks cannot explicitly explore the object-level interactions and frame-level information from complex spatio-temporal data to generate semantic-rich captions. Our main contribution is to identify three key problems in a joint framework for future video summarization tasks. 1) Enhanced Object Proposal: we propose a novel Conditional Graph that can fuse spatio-temporal information into latent object proposal. 2) Visual Knowledge: Latent Proposal Aggregation is proposed to dynamically extract visual words with higher semantic levels. 3) Sentence Validation: A novel Discriminative Language Validator is proposed to verify generated captions so that key semantic concepts can be effectively preserved. Our experiments on two public datasets (MVSD and MSR-VTT) manifest significant improvements over state-of-the-art approaches on all metrics, especially for BLEU-4 and CIDEr. Our code is available at https://github.com/baiyang4/D-LSG-Video-Caption.

READ FULL TEXT

page 3

page 8

research
06/11/2019

Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning

Video captioning aims to automatically generate natural language descrip...
research
08/19/2022

Diverse Video Captioning by Adaptive Spatio-temporal Attention

To generate proper captions for videos, the inference needs to identify ...
research
07/17/2020

Learning to Discretely Compose Reasoning Module Networks for Video Captioning

Generating natural language descriptions for videos, i.e., video caption...
research
06/18/2022

REVECA – Rich Encoder-decoder framework for Video Event CAptioner

We describe an approach used in the Generic Boundary Event Captioning ch...
research
12/21/2016

Top-down Visual Saliency Guided by Captions

Neural image/video captioning models can generate accurate descriptions,...
research
05/06/2022

Dual-Level Decoupled Transformer for Video Captioning

Video captioning aims to understand the spatio-temporal semantic concept...
research
05/22/2022

GL-RG: Global-Local Representation Granularity for Video Captioning

Video captioning is a challenging task as it needs to accurately transfo...

Please sign up or login with your details

Forgot password? Click here to reset