Visual Question Answering on Image Sets

08/27/2020
by Ankan Bansal, et al.

We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings. Given a natural language question and a set of images as input, the goal is to answer the question based on the content of the images. The questions can be about objects and relationships in one or more images, or about the entire scene depicted by the image set. To enable research on this new topic, we introduce two ISVQA datasets, one of indoor scenes and one of outdoor scenes. They simulate the real-world scenarios of indoor image collections and multiple car-mounted cameras, respectively. The indoor-scene dataset contains 91,479 human-annotated questions for 48,138 image sets, and the outdoor-scene dataset has 49,617 questions for 12,746 image sets. We analyze the properties of the two datasets, including question-and-answer distributions, types of questions, dataset biases, and question-image dependencies. We also build new baseline models to investigate new research challenges in ISVQA.
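As a rough illustration of the task's input and output, the sketch below shows how one ISVQA example might be represented, and works out the average number of questions per image set from the counts quoted above. The class name `ISVQASample`, its field names, the helper `questions_per_image_set`, and the example file names, question, and answer are all hypothetical; they are not taken from the paper or from any released dataset format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ISVQASample:
    """Hypothetical container for one ISVQA example (not the authors' format)."""
    image_paths: List[str]  # the set of images the question refers to
    question: str           # natural language question about the image set
    answer: str             # annotated answer


def questions_per_image_set(num_questions: int, num_image_sets: int) -> float:
    """Average number of annotated questions per image set."""
    return num_questions / num_image_sets


# Made-up example: the answer may depend on more than one image in the set.
sample = ISVQASample(
    image_paths=["view_0.jpg", "view_1.jpg", "view_2.jpg"],
    question="How many chairs are in the room?",
    answer="four",
)

# Question and image-set counts as stated in the abstract.
indoor_ratio = questions_per_image_set(91_479, 48_138)
outdoor_ratio = questions_per_image_set(49_617, 12_746)

print(f"indoor:  {indoor_ratio:.2f} questions per image set")
print(f"outdoor: {outdoor_ratio:.2f} questions per image set")
```

By these counts, the outdoor-scene dataset has roughly twice as many questions per image set as the indoor-scene dataset.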

research
12/22/2021

CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes

3D scene understanding is a relatively emerging research field. In this ...
research
04/12/2017

What's in a Question: Using Visual Questions as a Form of Supervision

Collecting fully annotated image datasets is challenging and expensive. ...
research
12/20/2021

ScanQA: 3D Question Answering for Spatial Scene Understanding

We propose a new 3D spatial understanding task of 3D Question Answering ...
research
10/01/2014

A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input

We propose a method for automatically answering questions about images b...
research
10/04/2016

Tutorial on Answering Questions about Images with Deep Learning

Together with the development of more accurate methods in Computer Visio...
research
11/10/2021

3D modelling of survey scene from images enhanced with a multi-exposure fusion

In current practice, scene survey is carried out by workers using total ...
research
01/09/2017

Information Pursuit: A Bayesian Framework for Sequential Scene Parsing

Despite enormous progress in object detection and classification, the pr...
