Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

08/18/2023
by   Navid Rajabi, et al.

With the advances in large-scale vision-and-language models (VLMs), it is of interest to assess their performance on various visual reasoning tasks such as counting, referring expressions, and general visual question answering. The focus of this work is to study the ability of these models to understand spatial relations. Previously, this has been tackled using image-text matching (Liu, Emerson, and Collier 2022) or visual question answering tasks, both showing poor performance and a large gap compared to human performance. To better understand this gap, we present fine-grained compositional grounding of spatial relationships and propose a bottom-up approach for ranking spatial clauses and evaluating performance on the spatial relationship reasoning task. We propose to combine the evidence from grounding the noun phrases corresponding to objects and their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative vision-language models (Tan and Bansal 2019; Gupta et al. 2022; Kamath et al. 2021) and compare and highlight their abilities to reason about spatial relationships.
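The bottom-up ranking idea described in the abstract can be illustrated with a minimal, hypothetical sketch: each spatial clause (subject, relation, object) is scored by combining the grounding confidences of its noun phrases (e.g. from a phrase-grounding model such as MDETR) with a simple geometric check on the grounded bounding boxes. All function names, the clause dictionary format, and the scoring heuristic are illustrative assumptions, not the paper's actual implementation.

```python
def center(box):
    """Center (x, y) of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def relation_score(relation, subj_box, obj_box):
    """Crude geometric plausibility of a spatial relation (0 or 1).

    Illustrative heuristic only; a real system would learn or calibrate
    these checks rather than hard-code them.
    """
    (sx, sy), (ox, oy) = center(subj_box), center(obj_box)
    checks = {
        "left of":  sx < ox,
        "right of": sx > ox,
        "above":    sy < oy,  # image coordinates: y grows downward
        "below":    sy > oy,
    }
    return 1.0 if checks.get(relation, False) else 0.0

def rank_clauses(clauses):
    """Rank (subject, relation, object) clauses by combined evidence.

    Each clause carries grounded boxes and the grounding confidences of
    its two noun phrases; the final score multiplies the confidences by
    the geometric plausibility of the relation.
    """
    def score(c):
        return (c["subj_conf"] * c["obj_conf"]
                * relation_score(c["relation"], c["subj_box"], c["obj_box"]))
    return sorted(clauses, key=score, reverse=True)

# Two candidate clauses over the same grounded boxes: only one of them
# is geometrically consistent, so it should rank first.
clauses = [
    {"relation": "left of", "subj_box": (10, 20, 50, 80),
     "obj_box": (120, 20, 180, 80), "subj_conf": 0.9, "obj_conf": 0.8},
    {"relation": "right of", "subj_box": (10, 20, 50, 80),
     "obj_box": (120, 20, 180, 80), "subj_conf": 0.9, "obj_conf": 0.8},
]
ranked = rank_clauses(clauses)
print(ranked[0]["relation"])  # → left of
```

The multiplicative combination is one simple choice for fusing evidence; any monotone aggregation of phrase-grounding and relation scores would fit the same bottom-up scheme.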


