TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

06/14/2022
by Jiajun Deng, et al.

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models prone to overfitting specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences with Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers while achieving higher performance. However, the core fusion Transformer in TransVG stands apart from the uni-modal encoders, and thus must be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To address this, we further introduce TransVG++, which makes two improvements. First, we upgrade our framework to a purely Transformer-based one by leveraging a Vision Transformer (ViT) for vision feature encoding. Second, we devise a Language Conditioned Vision Transformer that removes the external fusion modules and reuses the uni-modal ViT for vision-language fusion at its intermediate layers. We conduct extensive experiments on five prevalent datasets and report a series of state-of-the-art results.
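The abstract describes the two designs only at a high level. As a rough illustration, the following is a minimal PyTorch sketch of both ideas; it is not the authors' implementation, and all class names, dimensions, and the exact token-injection scheme are assumptions made for the example.

```python
# Minimal sketch of the two ideas named in the abstract (assumed shapes and
# hyper-parameters; NOT the authors' released code).
import torch
import torch.nn as nn


class TransVGFusion(nn.Module):
    """TransVG-style fusion: a learnable [REG] token is concatenated with the
    visual and language tokens, the sequence is passed through a plain stack
    of Transformer encoder layers, and box coordinates are regressed directly
    from the output [REG] token."""

    def __init__(self, dim=256, num_layers=6, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.bbox_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                       nn.Linear(dim, 4))

    def forward(self, visual_tokens, language_tokens):
        # visual_tokens: (B, Nv, dim), language_tokens: (B, Nl, dim)
        reg = self.reg_token.expand(visual_tokens.size(0), -1, -1)
        tokens = torch.cat([reg, visual_tokens, language_tokens], dim=1)
        fused = self.encoder(tokens)
        return self.bbox_head(fused[:, 0]).sigmoid()  # normalized (cx, cy, w, h)


class LanguageConditionedBlock(nn.Module):
    """One plausible reading of a language-conditioned ViT layer: visual
    tokens attend over the union of visual and language tokens, so
    vision-language fusion happens inside the uni-modal ViT rather than in an
    external module. The injection scheme in TransVG++ itself may differ."""

    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, visual_tokens, language_tokens):
        q = self.norm1(visual_tokens)
        kv = torch.cat([q, language_tokens], dim=1)  # keys/values see language
        attn_out, _ = self.attn(q, kv, kv)
        x = visual_tokens + attn_out
        return x + self.mlp(self.norm2(x))


# Example usage with made-up sizes: 400 visual tokens, 20 language tokens.
v, l = torch.randn(2, 400, 256), torch.randn(2, 20, 256)
print(TransVGFusion()(v, l).shape)  # torch.Size([2, 4])
```

The point of the second block is that fusion adds no stand-alone module on top of the ViT layer itself, which is what lets TransVG++ drop the external fusion Transformer that otherwise had to be trained from scratch on limited grounding data.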

Related research:

04/17/2021 · TransVG: End-to-End Visual Grounding with Transformers
07/27/2022 · SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding
12/04/2021 · LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
09/21/2022 · Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos
07/18/2023 · R-Cut: Enhancing Explainability in Vision Transformers with Relationship Weighted Out and Cut
07/07/2023 · All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment
11/15/2022 · YORO – Lightweight End to End Visual Grounding
