Multimodal grid features and cell pointers for Scene Text Visual Question Answering

06/01/2020
by Lluis Gómez, et al.

This paper presents a new model for the task of scene text visual question answering, in which questions about a given image can only be answered by reading and understanding the scene text present in it. The proposed model is based on an attention mechanism that attends to multi-modal features conditioned on the question, allowing it to reason jointly about the textual and visual modalities in the scene. The output weights of this attention module over the grid of multi-modal spatial features are interpreted as the probability that a certain spatial location of the image contains the answer text to the given question. Our experiments demonstrate competitive performance on two standard datasets. Furthermore, this paper provides a novel analysis of the ST-VQA dataset based on a human performance study.
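The idea of reading attention weights over a feature grid as a per-cell answer probability can be illustrated with a small NumPy sketch. This is not the authors' implementation: the additive (tanh) scoring form, the function name `cell_pointer_attention`, the projection matrices, and all dimensions are assumptions chosen only for illustration, and the weights are random rather than learned.

```python
import numpy as np


def cell_pointer_attention(grid_features, question, w_feat, w_question, w_score):
    """Return an (H, W) map of probabilities over grid cells.

    Hypothetical sketch: the additive scoring form and all parameter
    names are assumptions, not taken from the paper.
    """
    h, w, d = grid_features.shape
    cells = grid_features.reshape(h * w, d)          # one row per grid cell

    # Project the multi-modal cell features and the question embedding
    # into a shared space; the question vector broadcasts over all cells.
    hidden = np.tanh(cells @ w_feat + question @ w_question)   # (h*w, k)

    # One scalar score per cell, then a numerically stable softmax
    # taken over all spatial locations at once.
    scores = hidden @ w_score                        # (h*w,)
    scores = scores - scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs.reshape(h, w)


# Toy inputs: a 4x4 grid of 8-dim multi-modal features and a 6-dim question.
rng = np.random.default_rng(0)
H, W, D, Q, K = 4, 4, 8, 6, 5
grid = rng.normal(size=(H, W, D))
question = rng.normal(size=Q)

prob_map = cell_pointer_attention(
    grid,
    question,
    rng.normal(size=(D, K)),   # assumed feature projection
    rng.normal(size=(Q, K)),   # assumed question projection
    rng.normal(size=K),        # assumed scoring vector
)

# The most probable cell is where the answer text would be "pointed" to.
best_cell = np.unravel_index(prob_map.argmax(), prob_map.shape)
```

Because the softmax is taken jointly over every cell, `prob_map` sums to one and can be read as a distribution over spatial locations; `best_cell` is the (row, column) index of the location with the highest weight.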


research
05/11/2018

Reciprocal Attention Fusion for Visual Question Answering

Existing attention mechanisms either attend to local image grid or objec...
research
03/31/2020

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

Answering questions that require reading texts in an image is challengin...
research
04/04/2023

Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA

In this paper, we propose a novel multi-modal framework for Scene Text V...
research
02/28/2023

VQA with Cascade of Self- and Co-Attention Blocks

The use of complex attention modules has improved the performance of the...
research
08/10/2022

CLEVR-Math: A Dataset for Compositional Language, Visual and Mathematical Reasoning

We introduce CLEVR-Math, a multi-modal math word problems dataset consis...
research
05/22/2019

AttentionRNN: A Structured Spatial Attention Mechanism

Visual attention mechanisms have proven to be integrally important const...
research
06/01/2020

Structured Multimodal Attentions for TextVQA

Text based Visual Question Answering (TextVQA) is a recently raised chal...
