D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation

08/08/2023
by   Hanjun Li, et al.
0

Temporal sentence grounding (TSG) aims to locate a specific moment from an untrimmed video with a given natural language query. Recently, weakly supervised methods still have a large performance gap compared to fully supervised ones, while the latter requires laborious timestamp annotations. In this study, we aim to reduce the annotation cost yet keep competitive performance for TSG task compared to fully supervised ones. To achieve this goal, we investigate a recently proposed glance-supervised temporal sentence grounding task, which requires only single frame annotation (referred to as glance annotation) for each query. Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA). Specifically, SA-GCL samples reliable positive moments from a 2D temporal map via jointly leveraging Gaussian prior and semantic consistency, which contributes to aligning the positive sentence-moment pairs in the joint embedding space. Moreover, to alleviate the annotation bias resulting from glance annotation and model complex queries consisting of multiple events, we propose the DGA module, which adjusts the distribution dynamically to approximate the ground truth of target moments. Extensive experiments on three challenging benchmarks verify the effectiveness of the proposed D3G. It outperforms the state-of-the-art weakly supervised methods by a large margin and narrows the performance gap compared to fully supervised methods. Code is available at https://github.com/solicucu/D3G.

READ FULL TEXT

page 4

page 8

page 13

research
04/20/2022

Video Moment Retrieval from Text Queries via Single Frame Annotation

Video moment retrieval aims at finding the start and end timestamps of a...
research
02/20/2023

Constraint and Union for Partially-Supervised Temporal Sentence Grounding

Temporal sentence grounding aims to detect the event timestamps describe...
research
03/16/2022

Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding

Visual grounding, i.e., localizing objects in images according to natura...
research
01/22/2021

A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics

Despite Temporal Sentence Grounding in Videos (TSGV) has realized impres...
research
08/10/2023

Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization

Video moment localization aims to retrieve the target segment of an untr...
research
03/10/2022

A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach

Temporal Sentence Grounding in Videos (TSGV), which aims to ground a nat...
research
09/23/2021

Self-supervised Learning for Semi-supervised Temporal Language Grounding

Given a text description, Temporal Language Grounding (TLG) aims to loca...

Please sign up or login with your details

Forgot password? Click here to reset