Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding

04/04/2022
by Ziyue Wu, et al.

Grounding temporal video segments described by natural language queries effectively and efficiently is a crucial capability in vision-and-language research. In this paper, we address the fast video temporal grounding (FVTG) task, which aims to localize the target segment at high speed with favorable accuracy. Most existing approaches adopt elaborately designed cross-modal interaction modules to improve grounding performance, but these modules create a test-time bottleneck. Although several common-space-based methods enjoy high inference speed, they can hardly capture comprehensive and explicit relations between the visual and textual modalities. To tackle this speed-accuracy tradeoff, we propose a commonsense-aware cross-modal alignment (CCA) framework, which incorporates commonsense-guided visual and text representations into a complementary common space for fast video temporal grounding. Specifically, commonsense concepts are explored and exploited by extracting structural semantic information from a language corpus. A commonsense-aware interaction module then uses the learned concepts to obtain bridged visual and text features. Finally, to preserve the original semantics of the textual queries, a cross-modal complementary common space is optimized to produce matching scores for FVTG. Extensive experiments on two challenging benchmarks show that CCA performs favorably against state-of-the-art methods while running at high speed. Our code is available at https://github.com/ZiyueWu59/CCA.
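The speed advantage of common-space methods comes from the fact that, at test time, grounding reduces to a similarity lookup between pre-projected moment features and the query embedding, rather than running a cross-modal interaction module per query. Below is a minimal PyTorch sketch of that common-space matching idea only; the module names, feature dimensions, and cosine-similarity scoring are illustrative assumptions, not the authors' actual CCA implementation.

```python
# Minimal sketch of common-space matching for fast video temporal grounding.
# All names, dimensions, and the cosine-similarity scoring are illustrative
# assumptions; this is not the authors' released CCA code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceMatcher(nn.Module):
    def __init__(self, video_dim=1024, text_dim=768, common_dim=256):
        super().__init__()
        # Project each modality into a shared (common) embedding space so
        # that test-time matching is a cheap dot product per candidate,
        # instead of an expensive cross-modal interaction per query.
        self.video_proj = nn.Linear(video_dim, common_dim)
        self.text_proj = nn.Linear(text_dim, common_dim)

    def forward(self, moment_feats, query_feat):
        # moment_feats: (num_moments, video_dim) candidate segment features
        # query_feat:   (text_dim,) sentence-level query feature
        v = F.normalize(self.video_proj(moment_feats), dim=-1)
        q = F.normalize(self.text_proj(query_feat), dim=-1)
        # Cosine similarity between the query and every candidate moment;
        # the highest-scoring moment is the grounding prediction.
        return v @ q  # (num_moments,)

matcher = CommonSpaceMatcher()
moments = torch.randn(128, 1024)   # e.g., 128 candidate moments
query = torch.randn(768)
scores = matcher(moments, query)
best = scores.argmax().item()      # index of the predicted segment
```

In a full CCA-style pipeline, the commonsense-aware interaction module described in the abstract would refine `moments` and `query` before this projection step, so the common space receives bridged rather than raw features.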
