Language Adaptive Weight Generation for Multi-task Visual Grounding

06/06/2023
by   Wei Su, et al.
0

Although the impressive performance in visual grounding, the prevailing approaches usually exploit the visual backbone in a passive way, i.e., the visual backbone extracts features with fixed weights without expression-related hints. The passive perception may lead to mismatches (e.g., redundant and missing), limiting further performance improvement. Ideally, the visual backbone should actively extract visual features since the expressions already provide the blueprint of desired visual features. The active perception can take expressions as priors to extract relevant visual features, which can effectively alleviate the mismatches. Inspired by this, we propose an active perception Visual Grounding framework based on Language Adaptive Weights, called VG-LAW. The visual backbone serves as an expression-specific feature extractor through dynamic weights generated for various expressions. Benefiting from the specific and relevant visual features extracted from the language-aware visual backbone, VG-LAW does not require additional modules for cross-modal interaction. Along with a neat multi-task head, VG-LAW can be competent in referring expression comprehension and segmentation jointly. Extensive experiments on four representative datasets, i.e., RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, validate the effectiveness of the proposed framework and demonstrate state-of-the-art performance.

READ FULL TEXT

page 1

page 2

page 3

page 4

page 7

page 8

research
03/29/2022

Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding

Visual grounding focuses on establishing fine-grained alignment between ...
research
06/06/2023

Referring Expression Comprehension Using Language Adaptive Inference

Different from universal object detection, referring expression comprehe...
research
03/03/2019

Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing

Referring expression grounding aims at locating certain objects or perso...
research
06/06/2021

Referring Transformer: A One-step Approach to Multi-task Visual Grounding

As an important step towards visual reasoning, visual grounding (e.g., p...
research
07/08/2019

Variational Context: Exploiting Visual and Textual Context for Grounding Referring Expressions

We focus on grounding (i.e., localizing or linking) referring expression...
research
04/27/2023

Direct Visual Servoing Based on Discrete Orthogonal Moments

This paper proposes a new approach to achieve direct visual servoing (DV...
research
03/19/2020

Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation

Referring expression comprehension (REC) and segmentation (RES) are two ...

Please sign up or login with your details

Forgot password? Click here to reset