Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

10/06/2020
by Wei Han, et al.

Text in images carries essential information for understanding a scene and reasoning about it. The text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. The positional information of text is underused, and the generated answer lacks supporting evidence. To address this challenge, this paper proposes a localization-aware answer prediction network (LaAP-Net). Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence for the generated answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.
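To make the idea of an answer paired with localized evidence concrete, the following is a toy sketch only and not the LaAP-Net architecture. It assumes a model has already produced one score per fixed-vocabulary word and one score per OCR token; the names `OcrToken`, `Prediction`, `predict_answer`, and `iou` are hypothetical. The sketch picks the highest-scoring candidate and, when that candidate is an OCR token, returns the token's own box as evidence. That is a simplification: the abstract says LaAP-Net predicts the bounding box itself. The `iou` helper shows one common way such a box could be compared with a reference box.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A box is (x_min, y_min, x_max, y_max) in pixel coordinates (assumed convention).
Box = Tuple[float, float, float, float]


@dataclass
class OcrToken:
    text: str
    box: Box


@dataclass
class Prediction:
    answer: str
    # None when the answer comes from the fixed vocabulary, which has no location.
    evidence_box: Optional[Box]


def predict_answer(
    vocab: List[str],
    vocab_scores: List[float],
    ocr_tokens: List[OcrToken],
    ocr_scores: List[float],
) -> Prediction:
    """Pick the best candidate across the fixed vocabulary and the OCR tokens."""
    if len(vocab) != len(vocab_scores) or len(ocr_tokens) != len(ocr_scores):
        raise ValueError("each candidate needs exactly one score")
    # Candidates are (score, answer text, evidence box or None).
    candidates = [(s, w, None) for w, s in zip(vocab, vocab_scores)]
    candidates += [(s, t.text, t.box) for t, s in zip(ocr_tokens, ocr_scores)]
    if not candidates:
        raise ValueError("no candidates to choose from")
    _, answer, box = max(candidates, key=lambda c: c[0])
    return Prediction(answer=answer, evidence_box=box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, calling `predict_answer(["yes", "no"], [0.1, 0.2], [OcrToken("STOP", (10, 10, 50, 30))], [0.9])` selects the OCR token and returns its box as the evidence, while an answer taken from the fixed vocabulary carries no box.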


