Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

09/10/2021
by Zhenzhi Wang, et al.

Temporal grounding aims to localize a video moment that is semantically aligned with a given natural language query. Existing methods typically apply a detection or regression pipeline to the fused representation, with research focused on designing complicated prediction heads or fusion strategies. Instead, viewing temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) that directly models the similarity between language queries and video moments in a joint embedding space. This metric-learning framework makes it possible to fully exploit negative samples in two new ways: constructing negative cross-modal pairs in a mutual matching scheme, and mining negative pairs across different videos. These new negative samples can enhance the joint representation learning of the two modalities via cross-modal mutual matching to maximize their mutual information. Experiments show that MMN achieves highly competitive performance compared with state-of-the-art methods on four video grounding benchmarks. Based on MMN, we present a winning solution for the HC-STVG challenge of the 3rd PIC workshop. This suggests that metric learning is still a promising method for temporal grounding, as it captures the essential cross-modal correlation in a joint embedding space. Code is available at https://github.com/MCG-NJU/MMN.
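
As a rough illustration of matching queries and moments in a joint embedding space with negatives drawn in both directions, the sketch below computes a symmetric, InfoNCE-style contrastive loss over a small batch. It is not the MMN implementation from the repository above. The function names, the cosine similarity, the temperature value, and the assumption that `queries[i]` and `moments[i]` form the only positive pair in the batch are all illustrative choices, and the embeddings are taken as already computed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax_at(scores, index):
    """Log-softmax of scores[index], computed stably via the max trick."""
    top = max(scores)
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[index] - log_sum


def mutual_matching_loss(queries, moments, temperature=0.1):
    """Symmetric contrastive loss (hypothetical helper, not the paper's API).

    queries[i] and moments[i] are assumed to be a matched pair; every other
    pairing in the batch, including pairs from other videos, is a negative.
    """
    n = len(queries)
    # Scaled similarity matrix: rows index queries, columns index moments.
    sim = [[cosine(q, m) / temperature for m in moments] for q in queries]
    loss = 0.0
    for i in range(n):
        # Query -> moment: the other moments act as negatives for query i.
        loss -= log_softmax_at(sim[i], i)
        # Moment -> query: the other queries act as negatives for moment i.
        column = [sim[j][i] for j in range(n)]
        loss -= log_softmax_at(column, i)
    return loss / (2 * n)
```

With toy embeddings such as `queries = [[1.0, 0.0], [0.0, 1.0]]`, the loss is close to zero when `moments` is in the same order and large when the two moments are swapped, which is the behavior a matching objective of this kind is meant to reward.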

research
07/29/2022

Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding

Temporal grounding aims to locate a target video moment that semanticall...
research
10/07/2020

Universal Weighting Metric Learning for Cross-Modal Matching

Cross-modal matching has been a highlighted research topic in both visio...
research
11/19/2020

VLG-Net: Video-Language Graph Matching Network for Video Grounding

Grounding language queries in videos aims at identifying the time interv...
research
10/23/2020

Hard Example Generation by Texture Synthesis for Cross-domain Shape Similarity Learning

Image-based 3D shape retrieval (IBSR) aims to find the corresponding 3D ...
research
04/04/2022

Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding

Grounding temporal video segments described in natural language queries ...
research
08/22/2022

Learning Branched Fusion and Orthogonal Projection for Face-Voice Association

Recent years have seen an increased interest in establishing association...
research
08/22/2020

Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space

Both images and music can convey rich semantics and are widely used to i...
