Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention

05/05/2021
by Wei Suo, et al.

Referring Expression Comprehension (REC) has become one of the most important tasks in visual reasoning, since it is an essential step for many vision-and-language tasks such as visual question answering. However, it has not been widely adopted in downstream tasks, for two reasons: 1) two-stage methods incur heavy computation cost and inevitable error accumulation, and 2) one-stage methods depend on many hyper-parameters (such as anchors) to generate bounding boxes. In this paper, we present a proposal-free one-stage (PFOS) model that regresses the region of interest from the image, based on a textual query, in an end-to-end manner. Instead of following the dominant anchor-proposal approach, we take the dense grid of an image directly as input to a cross-attention transformer that learns grid-word correspondences. The final bounding box is predicted directly from the image, without the time-consuming anchor selection process that previous methods suffer from. Our model achieves state-of-the-art performance on four referring expression datasets with higher efficiency, compared to the previous best one-stage and two-stage methods.
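The grid-word idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes single-head attention with no learned projections, a residual fusion, mean pooling over grid cells, and a sigmoid-activated linear head producing a normalized (cx, cy, w, h) box. The names `cross_attention`, `regress_box`, and `w_out`, and the toy sizes, are hypothetical and chosen only for illustration.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(grid_feats, word_feats):
    # Each grid cell acts as a query over all words (keys and values).
    # Assumption: grid and word features share the same dimension d.
    d = len(word_feats[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for g in grid_feats:
        scores = [scale * sum(gi * wi for gi, wi in zip(g, w)) for w in word_feats]
        attn = softmax(scores)
        ctx = [sum(a * w[k] for a, w in zip(attn, word_feats)) for k in range(d)]
        fused.append([gi + ci for gi, ci in zip(g, ctx)])  # residual fusion
        weights.append(attn)
    return fused, weights

def regress_box(fused, w_out):
    # Assumed head: mean-pool the fused grid, then a linear layer + sigmoid,
    # giving (cx, cy, w, h) normalized to (0, 1). No anchors are involved.
    n, d = len(fused), len(fused[0])
    pooled = [sum(f[k] for f in fused) / n for k in range(d)]
    logits = [sum(p * w for p, w in zip(pooled, row)) for row in w_out]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

# Toy usage: a 4x4 grid (16 cells), a 5-word query, feature size 8.
rng = random.Random(0)
d = 8
grid = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(16)]
words = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]
w_out = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(4)]

fused, weights = cross_attention(grid, words)
box = regress_box(fused, w_out)
```

Each row of `weights` is a distribution over the query words for one grid cell, which is one plausible reading of "grid-word correspondences"; the box comes from a single regression step rather than from scoring a set of anchors.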


research
02/12/2019

You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding

Visual Grounding (VG) aims to locate the most relevant region in an imag...
research
09/16/2019

A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension

Referring expression comprehension aims to localize the object instance ...
research
09/13/2021

DAFNe: A One-Stage Anchor-Free Deep Model for Oriented Object Detection

Object detection is a fundamental task in computer vision. While approac...
research
06/06/2021

Referring Transformer: A One-step Approach to Multi-task Visual Grounding

As an important step towards visual reasoning, visual grounding (e.g., p...
research
04/03/2023

Monocular 3D Object Detection with Bounding Box Denoising in 3D by Perceiver

The main challenge of monocular 3D object detection is the accurate loca...
research
10/24/2018

Resolving Referring Expressions in Images With Labeled Elements

Images may have elements containing text and a bounding box associated w...
research
08/07/2023

Keyword Spotting Simplified: A Segmentation-Free Approach using Character Counting and CTC re-scoring

Recent advances in segmentation-free keyword spotting treat this problem...
