Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

09/28/2022
by Fengyuan Shi, et al.

Multimodal transformers exhibit high capacity and flexibility in aligning image and text for visual grounding. However, encoder-only grounding frameworks (e.g., TransVG) suffer from heavy computation due to the self-attention operation, whose time complexity is quadratic. To address this issue, we present a new multimodal transformer architecture, coined Dynamic MDETR, which decouples the whole grounding process into encoding and decoding phases. The key observation is that images exhibit high spatial redundancy. Thus, we devise a new dynamic multimodal transformer decoder that exploits this sparsity prior to speed up the visual grounding process. Specifically, our dynamic decoder is composed of a 2D adaptive sampling module and a text-guided decoding module. The sampling module selects informative patches by predicting offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross attention between image features and text features. These two modules are stacked alternately to gradually bridge the modality gap and iteratively refine the reference point of the grounded object, eventually realizing the objective of visual grounding. Extensive experiments on five benchmarks demonstrate that our proposed Dynamic MDETR achieves competitive trade-offs between computation and accuracy. Notably, using only 9% feature points in the decoder, we can reduce ~44% GFLOPs of the multimodal transformer while still achieving higher accuracy than the encoder-only counterpart. In addition, to verify its generalization ability and to scale up Dynamic MDETR, we build the first one-stage CLIP-empowered visual grounding framework and achieve state-of-the-art performance on these benchmarks.
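The decoder described above alternates between a 2D adaptive sampling step and a text-guided cross-attention step while iteratively refining a reference point. The following is a minimal NumPy sketch of that loop under heavy simplifying assumptions: random untrained weights, nearest-neighbor instead of bilinear sampling, and single-head attention. All names (`sample_features`, `cross_attention`, `W_off`, `W_ref`) are illustrative, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sample_features(feat_map, ref, offsets):
    """Nearest-neighbor sampling of an (H, W, C) feature map at ref + offsets,
    with coordinates normalized to [0, 1]; a crude stand-in for bilinear sampling."""
    H, W, C = feat_map.shape
    pts = np.clip(ref + offsets, 0.0, 1.0)            # (num_points, 2) as (x, y)
    ys = np.clip((pts[:, 1] * (H - 1)).round().astype(int), 0, H - 1)
    xs = np.clip((pts[:, 0] * (W - 1)).round().astype(int), 0, W - 1)
    return feat_map[ys, xs]                            # (num_points, C)

def cross_attention(query, keys):
    """Single-head cross attention: the text query attends to sampled visual keys."""
    d = query.shape[-1]
    attn = softmax(query @ keys.T / np.sqrt(d))        # (1, num_points)
    return attn @ keys                                 # (1, C)

H, W, C = 16, 16, 32
feat_map = rng.standard_normal((H, W, C))              # encoded image features
text_query = rng.standard_normal((1, C))               # pooled text feature
ref = np.array([0.5, 0.5])                             # initial reference point
num_points, num_layers = 9, 3                          # sample few of the 256 points

for _ in range(num_layers):
    # hypothetical offset head: sampling offsets predicted from the text query
    W_off = rng.standard_normal((C, num_points * 2)) * 0.01
    offsets = (text_query @ W_off).reshape(num_points, 2)
    sampled = sample_features(feat_map, ref, offsets)
    grounded = cross_attention(text_query, sampled)
    # hypothetical refinement head: nudge the reference point toward the object
    W_ref = rng.standard_normal((C, 2)) * 0.01
    ref = np.clip(ref + (grounded @ W_ref).ravel(), 0.0, 1.0)
```

Because attention is computed only over the handful of sampled points rather than all H*W patches, each decoder layer's cost scales with `num_points` instead of the full grid, which is the source of the computation savings claimed above.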


Related research

- Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation (09/17/2021): Visual dialogue is a challenging task since it needs to answer a series ...
- TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding (08/05/2021): Recently proposed fine-grained 3D visual grounding is an essential and c...
- YORO – Lightweight End to End Visual Grounding (11/15/2022): We present YORO - a multi-modal transformer encoder-only architecture fo...
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance (11/16/2021): Scene text recognition (STR) is an important bridge between images and t...
- Fast-Slow Transformer for Visually Grounding Speech (09/16/2021): We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-...
- ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance (03/29/2023): Understanding 3D scenes from multi-view inputs has been proven to allevi...
- Visual Grounding with Transformers (05/10/2021): In this paper, we propose a transformer based approach for visual ground...
