Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding

07/31/2021
by Heng Zhao, et al.

Current one-stage methods for visual grounding encode the language query as a single holistic sentence embedding before fusing it with the visual features. Such a formulation does not treat each word of the query sentence equally when modeling language-to-vision attention, and is therefore prone to neglecting words that contribute little to the sentence embedding but are critical for visual grounding. In this paper we propose Word2Pix: a one-stage visual grounding network based on an encoder-decoder transformer architecture that learns textual-to-visual feature correspondence via word-to-pixel attention. The embedding of each word in the query sentence attends to the visual pixels individually, rather than being collapsed into a single holistic sentence embedding. In this way, every word is given an equal opportunity to steer the language-to-vision attention towards the referent target through multiple stacked transformer decoder layers. We conduct experiments on the RefCOCO, RefCOCO+ and RefCOCOg datasets, and the proposed Word2Pix outperforms existing one-stage methods by a notable margin. The results also show that Word2Pix surpasses two-stage visual grounding models, while keeping the merits of the one-stage paradigm intact, namely end-to-end training and real-time inference speed.
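To make the word-to-pixel cross attention concrete, below is a minimal PyTorch sketch (not the authors' implementation): each word embedding serves as a decoder query attending over the flattened grid of pixel features through stacked transformer decoder layers. The class name, dimensions, layer counts, and the pooled box-regression head are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of word-to-pixel cross attention: each word embedding acts
# as a decoder query that attends to the grid of visual pixel features,
# instead of collapsing the sentence into one holistic vector.
# All dimensions, layer counts, and the box head are illustrative assumptions.

class Word2PixDecoderSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead,
                                           dim_feedforward=2048)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.bbox_head = nn.Linear(d_model, 4)  # hypothetical box regressor

    def forward(self, word_embeds, pixel_feats):
        # word_embeds: (num_words, batch, d_model) -- one query per word
        # pixel_feats: (H*W, batch, d_model)       -- flattened feature map
        hs = self.decoder(tgt=word_embeds, memory=pixel_feats)
        # Pool the refined per-word states and regress a box
        # (one possible prediction head, not necessarily the paper's).
        return self.bbox_head(hs.mean(dim=0))

# Usage: a 10-word query, batch of 2, attending over a 20x20 feature grid.
words = torch.randn(10, 2, 256)
pixels = torch.randn(400, 2, 256)
print(Word2PixDecoderSketch()(words, pixels).shape)  # torch.Size([2, 4])
```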


