VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning

06/26/2022
by Kashu Yamazaki, et al.

In this paper, we leverage the human perceptual process, which involves vision and language interaction, to generate a coherent paragraph description of untrimmed videos. We propose vision-language (VL) features consisting of two modalities: (i) a vision modality that captures the global visual content of the entire scene, and (ii) a language modality that extracts descriptions of scene elements, covering both human and non-human objects (e.g., animals, vehicles) and both visual and non-visual elements (e.g., relations, activities). Furthermore, we train the proposed VLCap with a contrastive VL loss. Experiments and ablation studies on the ActivityNet Captions and YouCookII datasets show that VLCap outperforms existing state-of-the-art (SOTA) methods on both accuracy and diversity metrics.
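
The abstract does not state the exact form of the contrastive VL loss. As a rough illustration only, the sketch below uses a symmetric InfoNCE-style objective, a common choice for aligning two modalities; it should not be read as the authors' formulation. The function names, the temperature value of 0.07, and the toy feature vectors are all assumptions made for this example.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_vl_loss(vision_feats, language_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed form, not taken from the paper).

    vision_feats[i] and language_feats[i] are treated as a matching pair;
    every other pairing in the batch serves as a negative.
    """
    v = [l2_normalize(x) for x in vision_feats]
    t = [l2_normalize(x) for x in language_feats]
    # Cosine similarity between every vision/language pair, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the vision-to-language and language-to-vision directions.
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits_transposed))

# Toy features (hypothetical): two clips with two-dimensional embeddings.
vision = [[1.0, 0.0], [0.0, 1.0]]
language_aligned = [[1.0, 0.0], [0.0, 1.0]]
language_shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = contrastive_vl_loss(vision, language_aligned)
loss_shuffled = contrastive_vl_loss(vision, language_shuffled)
```

Under this assumed form, correctly paired features yield a loss near zero, while mismatched pairs are penalized heavily, which is the behavior a contrastive objective relies on to pull the two modalities together.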

