Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching

01/18/2022
by Hengcan Shi, et al.

Referring expression grounding is an important and challenging task in computer vision. To avoid the laborious annotation required by conventional referring grounding, unpaired referring grounding has been introduced, where the training data contains only a set of images and queries without correspondences between them. The few existing solutions to unpaired referring grounding remain preliminary, owing to the difficulty of learning image-text matching and the lack of top-down guidance with unpaired data. In this paper, we propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges. In particular, we design a query-aware attention map (QAM) module that introduces a top-down perspective by generating query-specific visual attention maps. A cross-modal object matching (COM) module is further introduced, which exploits the recently emerged image-text matching pretrained model, CLIP, to predict the target objects from a bottom-up perspective. The top-down and bottom-up predictions are then integrated via a similarity fusion (SF) module. We also propose a knowledge adaptation matching (KAM) module that leverages unpaired training data to adapt the pretrained knowledge to the target dataset and task. Experiments show that our framework outperforms previous works by 6.55 and 9.94 points on two popular grounding datasets.
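The abstract describes the COM module only at a high level; the snippet below is a minimal, hypothetical sketch of how a pretrained CLIP model can score candidate regions against a referring expression from a bottom-up perspective. It assumes region proposals come from an off-the-shelf detector, uses the public clip package with a ViT-B/32 backbone, and the function name score_proposals and its parameters are illustrative only; it does not reproduce the paper's COM, SF, or KAM components.

```python
import torch
import clip
from PIL import Image

# Load a pretrained CLIP model (ViT-B/32 chosen purely for illustration).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def score_proposals(image_path, boxes, query):
    """Score candidate boxes against a referring expression with CLIP.

    boxes: list of (x0, y0, x1, y1) region proposals in pixel coordinates,
           assumed to come from an off-the-shelf detector.
    Returns the index of the best-matching box and all similarity scores.
    """
    image = Image.open(image_path).convert("RGB")

    # Encode each cropped region with the CLIP image encoder.
    crops = torch.stack([preprocess(image.crop(b)) for b in boxes]).to(device)
    # Encode the referring expression with the CLIP text encoder.
    tokens = clip.tokenize([query]).to(device)

    with torch.no_grad():
        region_feats = model.encode_image(crops)        # (N, D)
        text_feats = model.encode_text(tokens)          # (1, D)

    # Cosine similarity between every region and the query.
    region_feats = region_feats / region_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    scores = (region_feats @ text_feats.T).squeeze(-1)  # (N,)

    return scores.argmax().item(), scores

# Example usage:
# best, scores = score_proposals("kitchen.jpg",
#                                [(10, 20, 200, 300), (150, 40, 400, 360)],
#                                "the man holding a red cup")
```

In the full BiCM framework, such bottom-up region scores would not be used directly; they would be fused with the top-down QAM attention scores via the SF module, with KAM adapting the pretrained knowledge to the target dataset.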



