Towards VQA Models that can Read

04/18/2019
by Amanpreet Singh, et al.

Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models cannot read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.
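The answer mechanism the abstract describes, choosing between a fixed answer vocabulary and copying a string read from the image, can be made concrete with a short sketch. The following is a minimal PyTorch approximation, not the authors' released implementation: the class name, layer shapes, the FastText-style 300-d OCR embeddings, and the vocabulary size are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRRAStyleAnswerHead(nn.Module):
    """Illustrative sketch of a LoRRA-style answer module.

    Scores (a) a fixed answer vocabulary and (b) the OCR tokens
    detected in the image, so an answer can be either a known word
    or a string copied from the image. Names and sizes here are
    assumptions for illustration, not the paper's exact code.
    """

    def __init__(self, fused_dim=2048, ocr_dim=300, vocab_size=3996):
        super().__init__()
        # Classifier over the fixed answer vocabulary.
        self.vocab_scores = nn.Linear(fused_dim, vocab_size)
        # Projects OCR token features so they can be scored against the
        # fused question+image representation (a copy mechanism).
        self.ocr_proj = nn.Linear(ocr_dim, fused_dim)

    def forward(self, fused, ocr_feats, ocr_mask):
        # fused:     (batch, fused_dim)            question+image representation
        # ocr_feats: (batch, n_ocr, ocr_dim)       e.g. FastText vectors of OCR tokens
        # ocr_mask:  (batch, n_ocr)                1 for real tokens, 0 for padding
        vocab_logits = self.vocab_scores(fused)
        # Dot-product score between the fused vector and each OCR token.
        ocr_logits = torch.bmm(self.ocr_proj(ocr_feats), fused.unsqueeze(2)).squeeze(2)
        ocr_logits = ocr_logits.masked_fill(ocr_mask == 0, float("-inf"))
        # Concatenate: the answer space is "vocabulary words + OCR copies".
        return torch.cat([vocab_logits, ocr_logits], dim=1)
```

Taking the argmax over the concatenated logits then selects either a fixed-vocabulary answer or one of the detected OCR strings to copy, which is how a model of this shape can produce answers it has never seen during training.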


Related research

02/15/2019 · Cycle-Consistency for Robust Visual Question Answering
Despite significant progress in Visual Question Answering over the years...

08/01/2023 · Making the V in Text-VQA Matter
Text-based VQA aims at answering questions by reading the text present i...

10/06/2020 · Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering
Image text carries essential information to understand the scene and per...

06/30/2019 · ICDAR 2019 Competition on Scene Text Visual Question Answering
This paper presents final results of ICDAR 2019 Scene Text Visual Questi...

01/20/2020 · SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions
Existing VQA datasets contain questions with varying levels of complexit...

11/11/2021 · Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture
Previous studies such as VizWiz find that Visual Question Answering (VQA...

10/04/2016 · Tutorial on Answering Questions about Images with Deep Learning
Together with the development of more accurate methods in Computer Visio...
