Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision

07/23/2023
by Menghao Li, et al.

Visual Grounding (VG) aims to localize target objects in an image based on given expressions, and it has made significant progress with the development of detection and vision transformers. However, existing VG methods tend to generate false-alarm objects when presented with inaccurate or irrelevant descriptions, which commonly occur in practical applications. Moreover, existing methods fail to capture fine-grained features, localize accurately, and sufficiently comprehend context from the whole image and the textual descriptions. To address both issues, we propose an Iterative Robust Visual Grounding (IR-VG) framework with Masked Reference based Centerpoint Supervision (MRCS). The framework introduces iterative multi-level vision-language fusion (IMVF) for better alignment. We use MRCS to achieve more accurate localization with point-wise feature supervision. Then, to improve the robustness of VG, we also present a multi-stage false-alarm sensitive decoder (MFSD) to prevent the generation of false-alarm objects when presented with inaccurate expressions. The proposed framework is evaluated on five regular VG datasets and two newly constructed robust VG datasets. Extensive experiments demonstrate that IR-VG achieves new state-of-the-art (SOTA) results, with improvements of 25% and 10% over existing SOTA approaches on the two newly proposed robust VG datasets. Moreover, the proposed framework is also shown to be effective on the five regular VG datasets. Code and models will be made publicly available at https://github.com/cv516Buaa/IR-VG.
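To make the notion of a false alarm concrete, the following is a minimal sketch of how a robust grounding prediction could be scored. It is not taken from the paper or its repository: the names `iou`, `is_correct`, and `robust_accuracy`, the use of `None` for "no object", and the 0.5 IoU threshold (the usual convention on regular VG benchmarks) are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Optional[Box], target: Optional[Box], thr: float = 0.5) -> bool:
    """Assumed scoring rule: a target of None marks an inaccurate or
    irrelevant expression, so any predicted box is a false alarm."""
    if target is None:
        return pred is None
    if pred is None:
        return False
    return iou(pred, target) >= thr


def robust_accuracy(
    preds: Sequence[Optional[Box]],
    targets: Sequence[Optional[Box]],
    thr: float = 0.5,
) -> float:
    """Fraction of samples scored correct under the rule above."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(is_correct(p, t, thr) for p, t in zip(preds, targets))
    return hits / len(preds)
```

Under this assumed rule, a model that always outputs a box is penalized on every sample whose expression does not match the image, which is the failure mode the abstract attributes to existing VG methods.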

research
09/29/2022

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning

3D visual grounding aims to find the objects within point clouds mention...
research
07/21/2023

Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

Visual grounding (VG) aims to establish fine-grained alignment between v...
research
09/27/2022

Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

Spatio-Temporal video grounding (STVG) focuses on retrieving the spatio-...
research
05/23/2023

Cross3DVG: Baseline and Dataset for Cross-Dataset 3D Visual Grounding on Different RGB-D Scans

We present Cross3DVG, a novel task for cross-dataset visual grounding in...
research
03/23/2023

ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding

Aiming to link natural language descriptions to specific regions in a 3D...
research
06/29/2023

Detect Any Deepfakes: Segment Anything Meets Face Forgery Detection and Localization

The rapid advancements in computer vision have stimulated remarkable pro...
research
11/08/2022

Detecting Euphemisms with Literal Descriptions and Visual Imagery

This paper describes our two-stage system for the Euphemism Detection sh...
