MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

06/06/2020
by Amar Shrestha, et al.

Grounding free-form textual queries requires understanding the textual phrases and their relation to visual cues in order to reason reliably about the described locations. Spatial attention networks are known to learn this relationship and to focus their gaze on salient objects in the image. We therefore propose to use spatial attention networks for image-level visual-textual fusion that preserves local (word) and global (phrase) information, refining region proposals with an in-network Region Proposal Network (RPN) and detecting single or multiple regions for a phrase query. We focus only on the phrase query and ground-truth pair (referring expression), so that the model is independent of dataset-specific constraints such as additional attributes or context. On one such referring expression dataset, ReferIt game, our Multi-region Attention-assisted Grounding network (MAGNet) achieves over 12% improvement over the state of the art. On Flickr30k Entities, without the context from image captions and attribute information, we still achieve competitive results compared to the state of the art.
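To make the idea of spatial attention over an image concrete, the following is a minimal sketch of a generic dot-product attention between a phrase embedding and a grid of visual features. It is not the MAGNet architecture: the function name `spatial_attention`, the toy feature values, and the choice of a plain dot product with a softmax over locations are all assumptions made for illustration.

```python
import math


def spatial_attention(feature_map, phrase_vec):
    """Attend over an H x W grid of C-dimensional features.

    Hypothetical helper, not taken from the paper. Each location is scored
    by the dot product of its feature vector with the phrase embedding, and
    the scores are normalized with a softmax over all locations.
    """
    scores = [[sum(f * p for f, p in zip(cell, phrase_vec)) for cell in row]
              for row in feature_map]
    peak = max(max(row) for row in scores)  # subtracted for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    attention = [[e / total for e in row] for row in exps]
    # Re-weight every local feature by its attention value.
    attended = [[[a * f for f in cell] for cell, a in zip(row, att_row)]
                for row, att_row in zip(feature_map, attention)]
    return attention, attended


# Toy 2 x 2 feature map with 2 channels; the numbers are made up.
feature_map = [[[1.0, 0.0], [0.0, 1.0]],
               [[0.5, 0.5], [2.0, 0.0]]]
phrase_vec = [1.0, 0.0]  # assumed phrase embedding
attention, attended = spatial_attention(feature_map, phrase_vec)
```

In this toy setup the bottom-right location aligns best with the phrase vector and so receives the largest attention weight. In a full grounding model, a map of this kind would typically feed a later stage such as a region proposal network.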

research
12/07/2018

PIRC Net : Using Proposal Indexing, Relationships and Context for Phrase Grounding

Phrase Grounding aims to detect and localize objects in images that are ...
research
03/19/2020

Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding

We propose a new spatial memory module and a spatial reasoner for the Vi...
research
11/12/2015

Grounding of Textual Phrases in Images by Reconstruction

Grounding (i.e. localizing) arbitrary, free-form textual phrases in visu...
research
08/04/2017

Query-guided Regression Network with Context Policy for Phrase Grounding

Given a textual description of an image, phrase grounding localizes obje...
research
08/11/2022

PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding

Panoptic Narrative Grounding (PNG) is an emerging task whose goal is to ...
research
05/15/2020

ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language

Person search by natural language aims at retrieving a specific person i...
research
08/20/2019

Zero-Shot Grounding of Objects from Natural Language Queries

A phrase grounding system localizes a particular object in an image refe...
