Deconfounded Video Moment Retrieval with Causal Intervention

06/03/2021
by Xun Yang, et al.

We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query. Existing methods primarily model the matching relationship between query and moment through complex cross-modal interactions. Despite their effectiveness, current models mostly exploit dataset biases while ignoring the video content, leading to poor generalizability. We argue that this issue is caused by a hidden confounder in VMR, i.e., the temporal location of moments, which spuriously correlates the model input and prediction. How to design matching models that are robust to temporal location biases is crucial but, to the best of our knowledge, has not yet been studied for VMR. To fill this research gap, we propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of the query and video content on the prediction. Specifically, we develop a Deconfounded Cross-modal Matching (DCM) method to remove the confounding effect of moment location. It first disentangles the moment representation to infer the core feature of the visual content, and then applies causal intervention on the disentangled multimodal input based on backdoor adjustment, which forces the model to fairly take each possible location of the target into consideration. Extensive experiments show that our approach achieves significant improvements over state-of-the-art methods in terms of both accuracy and generalization. (Code: https://github.com/Xun-Yang/Causal_Video_Moment_Retrieval)
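
For readers who want the formula behind the intervention described above, the deconfounding step corresponds to the standard backdoor adjustment. The symbols below (Q for the query, V for the video content, L for the temporal location of the moment, and Y for the matching prediction) are our own shorthand for illustration, not necessarily the paper's notation:

P(Y \mid do(Q, V)) = \sum_{l} P(Y \mid Q, V, L = l) \, P(L = l)

Instead of conditioning on the locations observed during training, the prediction is averaged over every possible target location weighted by the location prior, which is what "fairly incorporate each possible location of the target" refers to.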

Related research

- 07/29/2022: Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding. "Temporal grounding aims to locate a target video moment that semanticall..."
- 10/23/2022: Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval. "Video corpus moment retrieval (VCMR) is the task to retrieve the most re..."
- 09/01/2020: Uncovering Hidden Challenges in Query-Based Video Moment Retrieval. "The query-based moment retrieval is a problem of localising a specific c..."
- 09/07/2021: Learning to Combine the Modalities of Language and Video for Temporal Moment Localization. "Temporal moment localization aims to retrieve the best video segment mat..."
- 03/10/2022: A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach. "Temporal Sentence Grounding in Videos (TSGV), which aims to ground a nat..."
- 03/24/2023: Query-Dependent Video Representation for Moment Retrieval and Highlight Detection. "Recently, video moment retrieval and highlight detection (MR/HD) are bei..."
- 01/22/2021: A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics. "Despite Temporal Sentence Grounding in Videos (TSGV) has realized impres..."
