TransVG: End-to-End Visual Grounding with Transformers

04/17/2021
by   Jiajun Deng, et al.

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region in an image. The state-of-the-art methods, both two-stage and one-stage, rely on complex modules with manually designed mechanisms to perform query reasoning and multi-modal fusion. However, involving specific mechanisms in the fusion module design, such as query decomposition or image scene graphs, makes the models prone to overfitting on datasets with particular scenarios and limits full interaction between the visual and linguistic contexts. To avoid this pitfall, we propose to establish the multi-modal correspondence by leveraging transformers, and empirically show that the complex fusion modules (e.g., modular attention networks, dynamic graphs, and multi-modal trees) can be replaced by a simple stack of transformer encoder layers with higher performance. Moreover, we reformulate visual grounding as a direct coordinate regression problem, avoiding predictions drawn from a set of candidates (i.e., region proposals or anchor boxes). Extensive experiments are conducted on five widely used datasets, and a series of state-of-the-art records are set by our TransVG. We establish a benchmark for transformer-based visual grounding and will make our code available to the public.
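The two ideas in the abstract — fusing visual and linguistic tokens with a plain stack of transformer encoder layers, and regressing box coordinates directly from a learnable token — can be illustrated with a toy numpy sketch. This is not the paper's architecture or trained model: the token counts, embedding size, layer count, [REG] token name, and all weights are illustrative assumptions with random (untrained) parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    # single-head self-attention with random weights, for illustration only
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

d = 32                                          # embedding dimension (assumed)
visual_tokens = rng.standard_normal((49, d))    # e.g. a flattened 7x7 feature map
lang_tokens = rng.standard_normal((12, d))      # e.g. embeddings of the query words
reg_token = rng.standard_normal((1, d))         # learnable regression token

# fuse both modalities in one sequence and run a simple encoder stack
x = np.concatenate([reg_token, visual_tokens, lang_tokens], axis=0)
for _ in range(2):                              # "a simple stack of encoder layers"
    x = x + self_attention(x, d)                # residual connection

# direct coordinate regression: sigmoid keeps (cx, cy, w, h) in (0, 1),
# so no region proposals or anchor boxes are needed
W_head = rng.standard_normal((d, 4)) / np.sqrt(d)
box = 1 / (1 + np.exp(-(x[0] @ W_head)))
print(box.shape)                                # (4,)
```

The key design point this sketch mirrors is that the fusion step is generic self-attention over the concatenated token sequence; the grounding-specific part is only the small regression head reading out the first token.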

Related research

research
06/14/2022

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

In this work, we explore neat yet effective Transformer-based frameworks...
research
08/18/2023

Language-Guided Diffusion Model for Visual Grounding

Visual grounding (VG) tasks involve explicit cross-modal alignment, as s...
research
11/15/2022

YORO – Lightweight End to End Visual Grounding

We present YORO - a multi-modal transformer encoder-only architecture fo...
research
09/13/2021

On Pursuit of Designing Multi-modal Transformer for Video Grounding

Video grounding aims to localize the temporal segment corresponding to a...
research
08/11/2021

A Better Loss for Visual-Textual Grounding

Given a textual phrase and an image, the visual grounding problem is def...
research
10/14/2019

Dynamic Attention Networks for Task Oriented Grounding

In order to successfully perform tasks specified by natural language ins...
research
10/22/2022

HAM: Hierarchical Attention Model with High Performance for 3D Visual Grounding

This paper tackles an emerging and challenging vision-language task, 3D ...
