FindIt: Generalized Localization with Natural Language Queries

03/31/2022
by Weicheng Kuo, et al.

We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements across the tasks. In addition, we discover that a standard object detector is surprisingly effective in unifying these tasks without a need for task-specific design, losses, or pre-computed detections. Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization or detection queries for zero, one, or multiple objects. Jointly trained on these tasks, FindIt outperforms the state of the art on both referring expression and text-based localization, and shows competitive performance on object detection. Finally, FindIt generalizes better to out-of-distribution data and novel categories compared to strong single-task baselines. All of these are accomplished by a single, unified and efficient model. The code will be released.

research
06/12/2022

GLIPv2: Unifying Localization and Vision-Language Understanding

We present GLIPv2, a grounded VL understanding model, that serves both l...
research
09/16/2015

DenseBox: Unifying Landmark Localization with End to End Object Detection

How can a single fully convolutional neural network (FCN) perform on obj...
research
04/12/2017

Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries

Associating image regions with text queries has been recently explored a...
research
07/11/2021

Towards Accurate Localization by Instance Search

Visual object localization is the key step in a series of object detecti...
research
06/11/2021

Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization

Entities Object Localization (EOL) aims to evaluate how grounded or fait...
research
03/02/2023

Task-Specific Context Decoupling for Object Detection

Classification and localization are two main sub-tasks in object detecti...
research
05/26/2018

Using Syntax to Ground Referring Expressions in Natural Images

We introduce GroundNet, a neural network for referring expression recogn...
