Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts

03/29/2018
by Raymond A. Yeh, et al.

Textual grounding is an important but challenging task for human-computer interaction, robotics, and knowledge mining. Existing algorithms generally formulate the task as selection from a set of bounding box proposals obtained from a deep-net-based first stage. In this work, we demonstrate that textual grounding can be cast into a unified framework that permits an efficient search over all possible bounding boxes. The method can therefore consider significantly more candidates and does not depend on a first stage successfully hypothesizing bounding box proposals. Beyond that, we demonstrate that the trained parameters of our model can be used as word embeddings that capture spatial-image relationships and provide interpretability. Lastly, at the time of submission, our approach outperformed the state-of-the-art methods on the Flickr 30k Entities and ReferItGame datasets, by 3.08 points on the former.
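The claim of "efficient search over all possible bounding boxes" can be made concrete with a summed-area table (integral image): once per-pixel scores for a query phrase are computed, any axis-aligned box can be scored in O(1), which makes exhaustive or branch-and-bound search over all boxes tractable. Below is a minimal, hypothetical sketch (the function name, brute-force loops, and toy score map are illustrative only, not the paper's actual solver, which uses a more efficient search strategy):

```python
import numpy as np

def best_box(score_map):
    """Return the axis-aligned box maximizing the sum of per-pixel scores.

    Uses a summed-area table so each candidate box is scored in O(1);
    the loops below enumerate every box exhaustively for clarity.
    """
    H, W = score_map.shape
    # Integral image with a zero border: S[i, j] = sum of score_map[:i, :j]
    S = np.zeros((H + 1, W + 1))
    S[1:, 1:] = np.cumsum(np.cumsum(score_map, axis=0), axis=1)

    best_val, best = -np.inf, None
    for y0 in range(H):
        for y1 in range(y0 + 1, H + 1):
            for x0 in range(W):
                for x1 in range(x0 + 1, W + 1):
                    # Box sum in O(1) via inclusion-exclusion on the table
                    v = S[y1, x1] - S[y0, x1] - S[y1, x0] + S[y0, x0]
                    if v > best_val:
                        best_val, best = v, (y0, y1, x0, x1)
    return best, best_val

# Toy example: one strongly positive pixel surrounded by negatives,
# so the optimal box is the single cell containing the 5.
scores = np.array([[-1., -1., -1.],
                   [-1.,  5., -1.],
                   [-1., -1., -1.]])
box, val = best_box(scores)  # box = (1, 2, 1, 2), val = 5.0
```

Because negative per-pixel scores shrink the optimum toward tight boxes, this formulation rewards boxes that cover query-relevant pixels and exclude irrelevant ones, which is what lets the search consider every box rather than a proposal shortlist.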


Related research:

- Unsupervised Textual Grounding: Linking Words to Image Concepts (03/29/2018). "Textual grounding, i.e., linking words to objects in images, is a challe..."
- OptiBox: Breaking the Limits of Proposals for Visual Grounding (11/29/2019). "The problem of language grounding has attracted much attention in recent..."
- A Better Loss for Visual-Textual Grounding (08/11/2021). "Given a textual phrase and an image, the visual grounding problem is def..."
- Weakly-Supervised Visual-Textual Grounding with Semantic Prior Refinement (05/18/2023). "Using only image-sentence pairs, weakly-supervised visual-textual ground..."
- Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding (09/03/2020). "The prevailing framework for solving referring expression grounding is b..."
- Three Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding (09/08/2023). "3D visual grounding is the task of localizing the object in a 3D scene w..."
- SeqTR: A Simple yet Universal Network for Visual Grounding (03/30/2022). "In this paper, we propose a simple yet universal network termed SeqTR fo..."
