YORO – Lightweight End to End Visual Grounding

11/15/2022
by Chih-Hui Ho, et al.

We present YORO, a multi-modal, encoder-only transformer architecture for the Visual Grounding (VG) task. This task involves localizing, in an image, an object referred to via natural language. Unlike the recent trend in the literature of using multi-stage approaches that sacrifice speed for accuracy, YORO seeks a better trade-off between speed and accuracy by embracing a single-stage design without a CNN backbone. YORO consumes natural language queries, image patches, and learnable detection tokens, and predicts the coordinates of the referred object using a single transformer encoder. To assist the alignment between text and visual objects, a novel patch-text alignment loss is proposed. Extensive experiments are conducted on five different datasets, with ablations on architecture design choices. YORO is shown to support real-time inference and to outperform all approaches in its class (single-stage methods) by large margins. It is also the fastest VG model and achieves the best speed/accuracy trade-off in the literature.
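To make the input description concrete, the following is a minimal sketch of how a single encoder sequence could be laid out from the three kinds of tokens the abstract names (text tokens, image patches, detection tokens). It is not the authors' implementation. The token ordering, the whitespace tokenizer, the image size, the patch size, the number of detection tokens, and all names (`SequenceLayout`, `count_patches`, `build_layout`) are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SequenceLayout:
    """Token counts for one encoder input (hypothetical layout)."""
    num_text: int      # tokens from the language query
    num_patches: int   # non-overlapping image patches
    num_det: int       # learnable detection tokens

    @property
    def total(self) -> int:
        # Length of the single sequence fed to the encoder.
        return self.num_text + self.num_patches + self.num_det

    def det_slice(self) -> slice:
        # Assumed ordering: [text | patches | detection tokens].
        start = self.num_text + self.num_patches
        return slice(start, start + self.num_det)


def count_patches(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    return (height // patch_size) * (width // patch_size)


def build_layout(query_tokens: List[str], height: int, width: int,
                 patch_size: int, num_det: int) -> SequenceLayout:
    """Combine the three token groups into one sequence layout."""
    return SequenceLayout(
        num_text=len(query_tokens),
        num_patches=count_patches(height, width, patch_size),
        num_det=num_det,
    )


if __name__ == "__main__":
    # Illustrative values only; none of them come from the paper.
    layout = build_layout("the dog on the left".split(), 384, 384, 32, 5)
    print(layout.total, layout.det_slice())
```

Under this assumed layout, box coordinates would be read from the encoder outputs at the positions given by `det_slice()`, which is one way a single encoder can serve as both the fusion module and the prediction head.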


