SeqTR: A Simple yet Universal Network for Visual Grounding

03/30/2022
by   Chaoyang Zhu, et al.
4

In this paper, we propose a simple yet universal network termed SeqTR for visual grounding tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation (RES). The canonical paradigms for visual grounding often require substantial expertise in designing network architectures and loss functions, making them hard to generalize across tasks. To simplify and unify the modeling, we cast visual grounding as a point prediction problem conditioned on image and text inputs, where either the bounding box or binary mask is represented as a sequence of discrete coordinate tokens. Under this paradigm, visual grounding tasks are unified in our SeqTR network without task-specific branches or heads, e.g., the convolutional mask decoder for RES, which greatly reduces the complexity of multi-task modeling. In addition, SeqTR also shares the same optimization objective for all tasks with a simple cross-entropy loss, further reducing the complexity of deploying hand-crafted loss functions. Experiments on five benchmark datasets demonstrate that the proposed SeqTR outperforms (or is on par with) the existing state-of-the-arts, proving that a simple yet universal approach for visual grounding is indeed feasible.

READ FULL TEXT

page 2

page 7

page 14

page 22

research
06/06/2021

Referring Transformer: A One-step Approach to Multi-task Visual Grounding

As an important step towards visual reasoning, visual grounding (e.g., p...
research
08/27/2023

Towards Unified Token Learning for Vision-Language Tracking

In this paper, we present a simple, flexible and effective vision-langua...
research
09/13/2020

Cosine meets Softmax: A tough-to-beat baseline for visual grounding

In this paper, we present a simple baseline for visual grounding for aut...
research
03/13/2023

Parallel Vertex Diffusion for Unified Visual Grounding

Unified visual grounding pursues a simple and generic technical route to...
research
08/23/2023

A Unified Framework for 3D Point Cloud Visual Grounding

3D point cloud visual grounding plays a critical role in 3D scene compre...
research
11/29/2019

OptiBox: Breaking the Limits of Proposals for Visual Grounding

The problem of language grounding has attracted much attention in recent...
research
03/29/2018

Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts

Textual grounding is an important but challenging task for human-compute...

Please sign up or login with your details

Forgot password? Click here to reset