Spatially Aware Multimodal Transformers for TextVQA

07/23/2020
by Yash Kant, et al.

Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer such that each visual entity only looks at neighboring entities defined by a spatial graph. Further, each head in our multi-head self-attention layer focuses on a different subset of relations. Our approach has two advantages: (1) each head considers local context instead of dispersing the attention amongst all visual entities; (2) we avoid learning redundant features. We show that our model improves the absolute accuracy of current state-of-the-art methods on TextVQA by 2.2% overall, and by 4.62% on questions that involve spatial reasoning and can be answered correctly using OCR tokens. Similarly, on ST-VQA, we improve the absolute accuracy by 4.2%. We further show that spatially aware self-attention improves visual grounding.
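The core mechanism described above — each head attending only to entity pairs connected by its assigned spatial relation — can be sketched as masked multi-head self-attention. The snippet below is a minimal illustration, not the authors' implementation: the projection matrices, mask layout, and head-to-relation assignment are simplified assumptions for clarity.

```python
import numpy as np

def spatially_aware_attention(x, relation_masks, num_heads):
    """Sketch of spatially aware self-attention: each head attends
    only to entity pairs connected under its assigned spatial
    relation, given as a 0/1 adjacency mask (n x n)."""
    n, d = x.shape
    dh = d // num_heads
    out = np.zeros_like(x)
    for h in range(num_heads):
        # Fixed random projections stand in for learned Wq, Wk, Wv.
        rng = np.random.default_rng(h)
        Wq, Wk, Wv = (rng.standard_normal((d, dh)) / np.sqrt(d)
                      for _ in range(3))
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = (q @ k.T) / np.sqrt(dh)
        # Each head is assigned one spatial relation; pairs not
        # related under it are masked out before the softmax.
        mask = relation_masks[h % len(relation_masks)]
        scores = np.where(mask > 0, scores, -1e9)
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        out[:, h * dh:(h + 1) * dh] = attn @ v
    return out
```

With an identity adjacency (each entity related only to itself), every output row depends only on the corresponding input row — illustrating how the mask localizes each head's context instead of dispersing attention over all entities.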

