G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

07/26/2023
by Hongxiang Li, et al.

Recent video grounding works attempt to introduce vanilla contrastive learning into video grounding. However, we argue that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) alignment of the features of similar samples, and (2) uniformity of the induced distribution of the normalized features on the hypersphere. Video grounding raises two troublesome issues: (1) semantic overlap, where some visual entities co-exist in both the ground-truth moment and other moments, and (2) the sparse annotation dilemma, where only a few moments in a video are annotated. As a result, vanilla contrastive learning cannot model the correlations between temporally distant moments and learns inconsistent video representations, making it unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework based on geodesic distance and game theory. We quantify the correlations among moments using geodesic distance, which guides the model to learn correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose a semantic Shapley interaction based on geodesic-distance sampling to learn fine-grained semantic alignment within similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method.
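The two quantities the abstract builds on have standard textbook forms: the geodesic distance between normalized features is the arc length arccos(u·v) on the unit hypersphere, and the Shapley value measures a player's average marginal contribution over all coalitions. The sketch below (assumptions, not the authors' implementation: the `correlation_weight` mapping and the tiny exact `shapley_value` enumerator are illustrative helpers) shows both:

```python
import math
from itertools import combinations

def geodesic_distance(u, v):
    """Geodesic (arc-length) distance between two unit vectors on the
    hypersphere: arccos of their inner product."""
    dot = sum(a * b for a, b in zip(u, v))
    dot = max(-1.0, min(1.0, dot))  # clamp for numerical safety
    return math.acos(dot)

def correlation_weight(u, v):
    """Map geodesic distance to a [0, 1] correlation weight:
    distance 0 -> weight 1 (identical moments), pi -> weight 0."""
    return 1.0 - geodesic_distance(u, v) / math.pi

def shapley_value(players, value_fn, target):
    """Exact Shapley value of `target` under characteristic function
    `value_fn`; enumerates all coalitions, so only viable for a small
    player set (which is why sampling is used in practice)."""
    others = [p for p in players if p != target]
    n = len(players)
    phi = 0.0
    for k in range(len(others) + 1):
        for subset in combinations(others, k):
            s = set(subset)
            marginal = value_fn(s | {target}) - value_fn(s)
            weight = (math.factorial(k) * math.factorial(n - k - 1)
                      / math.factorial(n))
            phi += weight * marginal
    return phi
```

For an additive game (the value of a coalition is the sum of its members), each player's Shapley value reduces to its own contribution, which makes a convenient sanity check.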

Related research

- 10/21/2022: Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding
  Temporal language grounding (TLG) aims to localize a video segment in an...

- 08/24/2021: Support-Set Based Cross-Supervision for Video Grounding
  Current approaches for video grounding propose kinds of complex architec...

- 09/22/2022: CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
  Video temporal grounding (VTG) targets to localize temporal moments in a...

- 08/11/2022: HERO: HiErarchical spatio-tempoRal reasOning with Contrastive Action Correspondence for End-to-End Video Object Grounding
  Video Object Grounding (VOG) is the problem of associating spatial objec...

- 08/19/2020: Learning Trailer Moments in Full-Length Movies
  A movie's key moments stand out of the screenplay to grab an audience's ...

- 07/21/2022: Correspondence Matters for Video Referring Expression Comprehension
  We investigate the problem of video Referring Expression Comprehension (...

- 04/22/2022: A Multi-level Alignment Training Scheme for Video-and-Language Grounding
  To solve video-and-language grounding tasks, the key is for the network ...
