CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes

12/22/2021
by   Xu Yan, et al.
0

3D scene understanding is a relatively emerging research field. In this paper, we introduce the Visual Question Answering task in 3D real-world scenes (VQA-3D), which aims to answer all possible questions given a 3D scene. To tackle this problem, the first VQA-3D dataset, namely CLEVR3D, is proposed, which contains 60K questions in 1,129 real-world scenes. Specifically, we develop a question engine leveraging 3D scene graph structures to generate diverse reasoning questions, covering the questions of objects' attributes (i.e., size, color, and material) and their spatial relationships. Built upon this dataset, we further design the first VQA-3D baseline model, TransVQA3D. The TransVQA3D model adopts well-designed Transformer architectures to achieve superior VQA-3D performance, compared with the pure language baseline and previous 3D reasoning methods directly applied to 3D scenarios. Experimental results verify that taking VQA-3D as an auxiliary task can boost the performance of 3D scene understanding, including scene graph analysis for the node-wise classification and whole-graph recognition.

READ FULL TEXT

page 1

page 3

page 5

page 6

research
05/31/2019

OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge

Visual Question Answering (VQA) in its ideal form lets us study reasonin...
research
08/27/2020

Visual Question Answering on Image Sets

We introduce the task of Image-Set Visual Question Answering (ISVQA), wh...
research
03/15/2022

Can you even tell left from right? Presenting a new challenge for VQA

Visual Question Answering (VQA) needs a means of evaluating the strength...
research
11/11/2021

Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture

Previous studies such as VizWiz find that Visual Question Answering (VQA...
research
05/24/2019

Deep Reason: A Strong Baseline for Real-World Visual Reasoning

This paper presents a strong baseline for real-world visual reasoning (G...
research
08/08/2019

From Two Graphs to N Questions: A VQA Dataset for Compositional Reasoning on Vision and Commonsense

Visual Question Answering (VQA) is a challenging task for evaluating the...
research
07/23/2019

Graph Reasoning Networks for Visual Question Answering

The interaction between language and visual information has been emphasi...

Please sign up or login with your details

Forgot password? Click here to reset