Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries

04/12/2017
by   Yuting Zhang, et al.
0

Associating image regions with text queries has been recently explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but achieving somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate the proposed DBNet significantly outperforms previous state-of-the-art methods both for localization on single images and for detection on multiple images. We we also establish an evaluation protocol for natural-language visual detection.

READ FULL TEXT

page 17

page 18

page 20

page 22

page 25

page 30

page 31

page 33

research
11/24/2015

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

We introduce the dense captioning task, which requires a computer vision...
research
03/31/2022

FindIt: Generalized Localization with Natural Language Queries

We propose FindIt, a simple and versatile framework that unifies a varie...
research
07/02/2017

Where to Play: Retrieval of Video Segments using Natural-Language Queries

In this paper, we propose a new approach for retrieval of video segments...
research
09/26/2016

Learning Language-Visual Embedding for Movie Understanding with Natural-Language

Learning a joint language-visual embedding has a number of very appealin...
research
11/21/2016

Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues

This paper presents a framework for localization or grounding of phrases...
research
04/18/2018

Object Ordering with Bidirectional Matchings for Visual Reasoning

Visual reasoning with compositional natural language instructions, e.g.,...
research
12/13/2022

CREPE: Can Vision-Language Foundation Models Reason Compositionally?

A fundamental characteristic common to both human vision and natural lan...

Please sign up or login with your details

Forgot password? Click here to reset