IQA: Visual Question Answering in Interactive Environments

12/09/2017
by Daniel Gordon, et al.

We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, such as "Are there any apples in the fridge?" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g., open refrigerators), and plan a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), which consists of a factorized set of controllers. This factorization allows the system to operate at multiple levels of temporal abstraction, reduces the diversity of the action space available to each controller, and enables an easier training paradigm. We introduce IQADATA, a new Interactive Question Answering dataset built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive objects. IQADATA has 75,000 questions, each paired with a unique scene configuration. Our experiments show that the proposed model outperforms popular single-controller methods on IQADATA.
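
The idea of a factorized set of controllers can be illustrated with a toy sketch. This is not the HIMN implementation: the controller names, action names, hand-written policies, and dictionary-based state below are all hypothetical, chosen only to show how a high-level planner might hand control to low-level controllers that each see a small action set and may run for several steps per call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]


@dataclass
class Controller:
    """A low-level controller restricted to its own small action set."""
    name: str
    actions: List[str]                      # the only actions it may emit
    policy: Callable[[State], List[str]]    # returns a multi-step action sequence

    def run(self, state: State) -> List[str]:
        steps = self.policy(state)
        illegal = [a for a in steps if a not in self.actions]
        if illegal:
            raise ValueError(f"{self.name} cannot emit {illegal}")
        return steps


# Hypothetical controllers for the question "Are there any apples in the fridge?"
CONTROLLERS = {
    "navigator": Controller(
        "navigator", ["move_ahead", "rotate_left", "rotate_right"],
        lambda s: ["move_ahead"] * s["distance"]),
    "manipulator": Controller(
        "manipulator", ["open", "close"],
        lambda s: ["open"]),
    "answerer": Controller(
        "answerer", ["answer_yes", "answer_no"],
        lambda s: ["answer_yes" if s["apple_in_fridge"] else "answer_no"]),
}


def plan(state: State) -> str:
    """Toy high-level planner: chooses which controller to invoke next."""
    if state["distance"] > 0:
        return "navigator"
    if not state["fridge_open"]:
        return "manipulator"
    return "answerer"


def apply_action(state: State, action: str) -> None:
    """Toy environment dynamics (an assumption, not AI2-THOR)."""
    if action == "move_ahead":
        state["distance"] -= 1
    elif action == "open":
        state["fridge_open"] = True
    elif action in ("answer_yes", "answer_no"):
        state["answer"] = action.split("_", 1)[1]


def run_episode(state: State, max_calls: int = 10) -> Tuple[Any, List[Tuple[str, str]]]:
    """Alternate between planner decisions and low-level rollouts."""
    state = dict(state)  # work on a copy so the caller's state is untouched
    trace: List[Tuple[str, str]] = []
    for _ in range(max_calls):
        controller = CONTROLLERS[plan(state)]
        for action in controller.run(state):
            apply_action(state, action)
            trace.append((controller.name, action))
        if state["answer"] is not None:
            break
    return state["answer"], trace


if __name__ == "__main__":
    start = {"distance": 2, "fridge_open": False,
             "apple_in_fridge": True, "answer": None}
    answer, trace = run_episode(start)
    print(answer)
    for name, action in trace:
        print(name, action)
```

In this sketch the planner makes only three decisions, while the navigator alone emits two primitive actions in a single call. That is the sense in which each level works at a different temporal scale, and no controller ever has to choose among the full set of eight actions.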


research
08/28/2019

Interactive Language Learning by Question Answering

Humans observe and interact with the world to acquire knowledge. However...
research
12/14/2017

AI2-THOR: An Interactive 3D Environment for Visual AI

We introduce The House Of inteRactions (THOR), a framework for visual AI...
research
09/11/2018

Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions

In-depth scene descriptions and question answering tasks have greatly in...
research
06/01/2022

SAMPLE-HD: Simultaneous Action and Motion Planning Learning Environment

Humans exhibit incredibly high levels of multi-modal understanding - com...
research
05/10/2021

PEARL: Parallelized Expert-Assisted Reinforcement Learning for Scene Rearrangement Planning

Scene Rearrangement Planning (SRP) is an interior task proposed recently...
research
05/31/2019

Visual Understanding and Narration: A Deeper Understanding and Explanation of Visual Scenes

We describe the task of Visual Understanding and Narration, in which a r...
research
10/06/2022

Embodied Referring Expression for Manipulation Question Answering in Interactive Environment

Embodied agents are expected to perform more complicated tasks in an int...
