Relation-aware Instance Refinement for Weakly Supervised Visual Grounding

03/24/2021
by Yongfei Liu, et al.

Visual grounding, which aims to build a correspondence between visual objects and their language entities, plays a key role in cross-modal scene understanding. One promising and scalable strategy for learning visual grounding is to use weak supervision from image-caption pairs alone. Previous methods typically match query phrases directly against a precomputed, fixed pool of object candidates, which leads to inaccurate localization and ambiguous matching due to the lack of semantic relation constraints. In this paper, we propose a novel context-aware weakly supervised learning method that incorporates coarse-to-fine object refinement and entity relation modeling into a two-stage deep network, capable of producing more accurate object representations and matching. To train our network effectively, we introduce a self-taught regression loss for the proposal locations and a classification loss based on parsed entity relations. Extensive experiments on two public benchmarks, Flickr30K Entities and ReferItGame, demonstrate the efficacy of our weakly supervised grounding framework. The results show that we outperform previous methods by a considerable margin, achieving 59.27% top-1 accuracy on Flickr30K Entities and 37.68% on ReferItGame. Code is available at https://github.com/youngfly11/ReIR-WeaklyGrounding.pytorch.git.
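For readers unfamiliar with how a top-1 grounding accuracy of this kind is usually computed, the following is a minimal sketch, not the authors' evaluation code. It assumes the common convention that a phrase counts as correctly grounded when the single highest-scoring predicted box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5; the abstract does not state that threshold. The function names `iou` and `top1_accuracy` and the `(x1, y1, x2, y2)` box layout are hypothetical choices made for this illustration.

```python
from typing import Sequence

# Assumed box layout for this sketch: (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1.
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes contribute no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def top1_accuracy(
    predicted: Sequence[Box],
    ground_truth: Sequence[Box],
    threshold: float = 0.5,  # assumed convention, not taken from the abstract
) -> float:
    """Fraction of phrases whose top-ranked box has IoU >= threshold.

    predicted[i] is the single best box chosen for phrase i;
    ground_truth[i] is the annotated box for the same phrase.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return hits / len(predicted)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(top1_accuracy(preds, gts))
```

In this toy call the first phrase is matched exactly and the second not at all, so half of the phrases count as correctly grounded.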


