Weakly-Supervised Video Object Grounding via Causal Intervention

12/01/2021
by Wei Wang, et al.

We target the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning. The task aims to localize objects described in the sentence to visual regions in the video, which is a fundamental capability needed in pattern analysis and machine learning. Despite recent progress, existing methods all suffer from the severe problem of spurious association, which harms grounding performance. In this paper, we start from the definition of WSVOG and pinpoint the spurious association from two aspects: (1) the association itself is not object-relevant but extremely ambiguous due to weak supervision, and (2) the association is unavoidably confounded by observational bias when existing methods adopt a statistics-based matching strategy. With this in mind, we design a unified causal framework to learn the deconfounded object-relevant association for more accurate and robust video object grounding. Specifically, we learn the object-relevant association by causal intervention from the perspective of the video data generation process. To overcome the lack of fine-grained supervision for this intervention, we propose a novel spatial-temporal adversarial contrastive learning paradigm. To further remove the accompanying confounding effect within the object-relevant association, we pursue the true causality by conducting causal intervention via backdoor adjustment. Finally, the deconfounded object-relevant association is learned and optimized under a unified causal framework in an end-to-end manner. Extensive experiments on both IID and OOD testing sets of three benchmarks demonstrate accurate and robust grounding performance compared with state-of-the-art methods.
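The abstract names backdoor adjustment but does not spell it out here. As a rough illustration only, the sketch below applies the textbook form of backdoor adjustment, P(Y | do(X)) = sum over z of P(Y | X, z) P(z), to a toy problem with one binary confounder. It is not the paper's model: the variables, the probability tables, and the function names are all hypothetical, and the confounder Z is simply assumed to be observed and to block every backdoor path between X and Y.

```python
# Toy illustration of backdoor adjustment with a single binary confounder Z.
# All numbers and names are hypothetical; none of them come from the paper.
# Assumed reading: Z = scene context, X = a word occurs in the sentence,
# Y = a region is matched to that word.

P_Z = {0: 0.5, 1: 0.5}               # P(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}      # P(X = 1 | Z = z)
P_Y1_GIVEN_X1_Z = {0: 0.3, 1: 0.9}   # P(Y = 1 | X = 1, Z = z)


def observational(p_z, p_x1_given_z, p_y1_given_x1_z):
    """P(Y=1 | X=1) = sum_z P(Y=1 | X=1, z) * P(z | X=1)."""
    # Marginal P(X=1), needed to turn P(X=1 | z) P(z) into P(z | X=1).
    p_x1 = sum(p_x1_given_z[z] * p_z[z] for z in p_z)
    return sum(
        p_y1_given_x1_z[z] * p_x1_given_z[z] * p_z[z] / p_x1 for z in p_z
    )


def backdoor_adjusted(p_z, p_y1_given_x1_z):
    """P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, z) * P(z)."""
    # The confounder is weighted by its prior P(z), not by P(z | X=1).
    return sum(p_y1_given_x1_z[z] * p_z[z] for z in p_z)


if __name__ == "__main__":
    print("observational:", observational(P_Z, P_X1_GIVEN_Z, P_Y1_GIVEN_X1_Z))
    print("adjusted:     ", backdoor_adjusted(P_Z, P_Y1_GIVEN_X1_Z))
```

In this toy setting the observational estimate is larger than the adjusted one because Z makes both X and Y more likely. That inflation is the kind of observational bias that a purely statistics-based matching strategy would absorb.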


