Visual7W: Grounded Question Answering in Images

11/11/2015
by Yuke Zhu, et al.

We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fail to match humans in high-level vision tasks because they lack the capacity for deeper reasoning. Recently, the new task of visual question answering (QA) has been proposed to evaluate a model's capacity for deep image understanding. Previous work has established only a loose, global association between QA sentences and images. In practice, however, many questions and answers relate to local regions in the images. We establish a semantic link between textual descriptions and image regions through object-level grounding. This enables a new type of QA with visual answers, in addition to the textual answers used in previous work. We study the visual QA tasks in a grounded setting with a large collection of 7W (what, where, when, who, why, how, and which) multiple-choice QA pairs. Furthermore, we evaluate human performance and several baseline models on the QA tasks. Finally, we propose a novel LSTM model with spatial attention to tackle the 7W QA tasks.
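To make the idea of spatial attention concrete, here is a minimal sketch of generic soft attention over image regions. It is not the authors' model: the function names, the dot-product scoring, and the use of a plain query vector in place of an LSTM hidden state are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(region_features, query):
    """Hypothetical soft attention over image regions.

    region_features: one feature vector (list of floats) per image region.
    query: a vector of the same length, standing in for the question state
           (for example, an LSTM hidden state; an assumption here).
    Returns (weights, attended), where weights sum to 1 across regions and
    attended is the weighted average of the region features.
    """
    # Score each region by its dot product with the query (assumed scoring rule).
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    # Weighted sum of region features, one dimension at a time.
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    question_state = [2.0, 0.0]
    weights, attended = spatial_attention(regions, question_state)
    print("attention weights:", weights)
    print("attended feature:", attended)
```

In this sketch, the region whose features align best with the query receives the largest weight, which is the sense in which attention can point to a local image region rather than the whole image.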
