Transform-Equivariant Consistency Learning for Temporal Sentence Grounding

05/06/2023
by Daizong Liu, et al.

This paper addresses temporal sentence grounding (TSG). Although existing methods have made decent progress on this task, they not only rely heavily on abundant video-query paired data for training, but also easily fall prey to dataset distribution bias. To alleviate these limitations, we introduce a novel Equivariant Consistency Regulation Learning (ECRL) framework that learns more discriminative query-related frame-wise representations for each video in a self-supervised manner. Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently under various video-level transformations. Concretely, we first design a series of spatio-temporal augmentations on both foreground and background video segments to generate a set of synthetic video samples. In particular, we devise a self-refine module to enhance the completeness and smoothness of the augmented videos. We then present a novel self-supervised consistency loss (SSCL), applied to the original and augmented videos, that captures their invariant query-related semantics by minimizing the KL-divergence between the sequence similarity of the two videos and a prior Gaussian distribution over timestamp distance. Finally, a shared grounding head predicts the transform-equivariant query-guided segment boundaries for both the original and augmented videos. Extensive experiments on three challenging datasets (ActivityNet, TACoS, and Charades-STA) demonstrate both the effectiveness and efficiency of our proposed ECRL framework.
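To make the SSCL objective concrete, below is a minimal PyTorch sketch, not the authors' implementation: the function name sscl_loss and the hyperparameters sigma and tau are assumptions for illustration. It row-normalizes the frame-to-frame similarity between the original and augmented videos into a distribution over timestamps and matches it, via KL-divergence, to a Gaussian prior over timestamp distance.

```python
import torch
import torch.nn.functional as F

def sscl_loss(feat_orig, feat_aug, sigma=1.0, tau=0.1):
    """Hypothetical sketch of the self-supervised consistency loss (SSCL).

    feat_orig: (T, D) frame-wise features of the original video
    feat_aug:  (T, D) frame-wise features of the augmented video
    sigma:     std of the Gaussian prior over timestamp distance (assumed)
    tau:       softmax temperature (assumed)
    """
    T = feat_orig.size(0)

    # Sequence similarity between the two videos, row-normalized into a
    # distribution over augmented-video timestamps for each original frame.
    sim = feat_orig @ feat_aug.t() / tau                   # (T, T)
    log_p = F.log_softmax(sim, dim=1)

    # Prior Gaussian distribution over timestamp distance |i - j|.
    idx = torch.arange(T, dtype=torch.float32)
    dist = idx.unsqueeze(1) - idx.unsqueeze(0)             # (T, T)
    prior = torch.softmax(-dist.pow(2) / (2 * sigma ** 2), dim=1)

    # KL(prior || p), averaged over the T frames.
    return F.kl_div(log_p, prior, reduction="batchmean")
```

For example, calling sscl_loss(torch.randn(64, 256), torch.randn(64, 256)) returns a scalar loss; in training, the two feature sequences would come from a shared video encoder applied to the original and augmented clips.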

