Embracing Uncertainty: Decoupling and De-bias for Robust Temporal Grounding

by Hao Zhou, et al.

Temporal grounding aims to localize temporal boundaries within untrimmed videos given language queries, but it faces two types of inevitable human uncertainty: query uncertainty and label uncertainty. Both stem from human subjectivity and limit the generalization ability of temporal grounding models. In this work, we propose DeNet (Decoupling and De-bias), a novel approach that embraces human uncertainty. Decoupling: we explicitly disentangle each query into a relation feature and a modified feature. The relation feature, which is mainly based on skeleton-like words (including nouns and verbs), aims to extract basic and consistent information in the presence of query uncertainty. Meanwhile, the modified feature, which is assigned style-like words (including adjectives, adverbs, etc.), represents the subjective information and thus brings personalized predictions. De-bias: we propose a de-bias mechanism that generates diverse predictions, aiming to alleviate the bias caused by single-style annotations in the presence of label uncertainty. Moreover, we put forward new multi-label metrics to diversify the performance evaluation. Extensive experiments show that our approach is more effective and robust than state-of-the-art methods on the Charades-STA and ActivityNet Captions datasets.
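
As a rough illustration of the word grouping behind the decoupling step, the sketch below splits an already POS-tagged query into skeleton-like words (nouns and verbs) and style-like words (adjectives and adverbs). It is not the DeNet implementation: the function name `split_query`, the use of Universal POS tag names, and the hand-tagged example query are all assumptions made for illustration, and the abstract's "etc." suggests the style-like group may be broader than the two tags used here. How the relation and modified features are then computed from these words is not described in the abstract and is not shown.

```python
from typing import List, Tuple

# Assumption: tokens carry Universal POS tags (NOUN, VERB, ADJ, ADV, ...).
# The tag sets below only mirror the categories named in the abstract.
SKELETON_TAGS = {"NOUN", "VERB"}  # skeleton-like words
STYLE_TAGS = {"ADJ", "ADV"}       # style-like words (the abstract adds "etc.")


def split_query(tagged: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split (word, tag) pairs into skeleton-like and style-like word lists.

    Hypothetical helper: words with any other tag are dropped from both lists.
    """
    skeleton = [word for word, tag in tagged if tag in SKELETON_TAGS]
    style = [word for word, tag in tagged if tag in STYLE_TAGS]
    return skeleton, style


# Hand-tagged example query (illustrative, not taken from any dataset).
query = [
    ("a", "DET"), ("person", "NOUN"), ("quickly", "ADV"), ("opens", "VERB"),
    ("the", "DET"), ("wooden", "ADJ"), ("door", "NOUN"),
]
skeleton_words, style_words = split_query(query)
```

In this sketch two annotators who describe the same moment with different modifiers would still share the same skeleton-like words, which is the intuition the abstract gives for extracting consistent information under query uncertainty.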



Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding

Temporal grounding aims to locate a target video moment that semanticall...

A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention

The task of language-guided video temporal grounding is to localize the ...

Relation-aware Video Reading Comprehension for Temporal Language Grounding

Temporal language grounding in videos aims to localize the temporal span...

Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning

Temporal grounding in videos aims to localize one target video segment t...

Impact of individual rater style on deep learning uncertainty in medical imaging segmentation

While multiple studies have explored the relation between inter-rater va...

A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach

Temporal Sentence Grounding in Videos (TSGV), which aims to ground a nat...

Probabilistic Modeling of Semantic Ambiguity for Scene Graph Generation

To generate "accurate" scene graphs, almost all existing methods predict...