Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

07/21/2023
by Zhihong Chen, et al.

Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it should serve as a testbed for vision-and-language models, evaluating how well they understand images and texts and reason over their joint space. However, most existing VG datasets are constructed from simple description texts that do not require much reasoning over the images and texts. This was demonstrated in a recent study <cit.>, where a simple LSTM-based text encoder without pretraining achieved state-of-the-art performance on mainstream VG datasets. In this paper, we therefore propose a new benchmark, Scene Knowledge-guided Visual Grounding (SK-VG), in which the image content and the referring expression alone are not sufficient to ground the target objects, forcing models to reason over long-form scene knowledge. To perform this task, we propose two approaches that accept the triple-type input (image, scene knowledge, and query): the first embeds the knowledge into the image features before the image-query interaction, while the second leverages linguistic structure to assist in computing the image-text matching. We conduct extensive experiments to analyze these methods and show that they achieve promising results but still leave room for improvement in both performance and interpretability. The dataset and code are available at <https://github.com/zhjohnchan/SK-VG>.
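The abstract describes the two approaches only at a high level. As a rough illustration of the first one (scene knowledge fused into the image features before the image-query interaction), the sketch below shows one way such a pipeline could be wired up in PyTorch. It is a minimal sketch under assumed module names, dimensions, and a simple box-regression head; it is not the SK-VG reference implementation, which is available in the linked repository.

# Minimal sketch (not the authors' implementation) of knowledge-then-query fusion.
# Class name, feature dimensions, and the box head are illustrative assumptions.
import torch
import torch.nn as nn


class KnowledgeConditionedGrounder(nn.Module):  # hypothetical name
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Step 1: inject long-form scene knowledge into the image features.
        self.knowledge_fusion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Step 2: let the knowledge-aware image features interact with the query.
        self.query_interaction = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Step 3: predict a normalized box (cx, cy, w, h) from a pooled representation.
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, image_feats, knowledge_feats, query_feats):
        # image_feats:     [B, N, D] patch/region features
        # knowledge_feats: [B, K, D] encoded scene-knowledge tokens
        # query_feats:     [B, Q, D] encoded referring-expression tokens
        fused, _ = self.knowledge_fusion(image_feats, knowledge_feats, knowledge_feats)
        grounded, _ = self.query_interaction(query_feats, fused, fused)
        pooled = grounded.mean(dim=1)            # [B, D]
        return self.box_head(pooled).sigmoid()   # normalized box coordinates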


Related research

09/06/2023 · A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Key to tasks that require reasoning about natural language in visual con...

07/21/2022 · Grounding Visual Representations with Texts for Domain Generalization
Reducing the representational discrepancy between source and target doma...

05/03/2023 · A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text
Pretrained Vision-Language Models (VLMs) have achieved remarkable perfor...

05/04/2020 · Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions
Visual referring expression recognition is a challenging task that requi...

07/23/2023 · Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision
Visual Grounding (VG) aims at localizing target objects from an image ba...

12/31/2021 · Deconfounded Visual Grounding
We focus on the confounding bias between language and location in the vi...

08/07/2019 · SpatialSense: An Adversarially Crowdsourced Benchmark for Spatial Relation Recognition
Understanding the spatial relations between objects in images is a surpr...
