LaTr: Layout-Aware Transformer for Scene-Text VQA

12/23/2021
by Ali Furkan Biten, et al.

We propose a novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to reason over different modalities. Thus, we first investigate the impact of each modality and reveal the importance of the language module, especially when enriched with layout information. Accounting for this, we propose a single-objective pre-training scheme that requires only text and spatial cues. We show that applying this pre-training scheme on scanned documents has certain advantages over using natural images, despite the domain gap. Scanned documents are easy to procure, text-dense, and varied in layout, helping the model learn various spatial cues (e.g., left-of, below) by tying together language and layout information. Compared to existing approaches, our method performs vocabulary-free decoding and, as shown, generalizes well beyond the training vocabulary. We further demonstrate that LaTr improves robustness to OCR errors, a common cause of failure in STVQA. In addition, by leveraging a vision transformer, we eliminate the need for an external object detector. LaTr outperforms state-of-the-art STVQA methods on multiple datasets, in particular by +7.6% on TextVQA and +4.0% on OCR-VQA in absolute accuracy.
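
Concretely, a layout-aware language module of this kind is often built by adding learned 2-D position embeddings for each OCR token's bounding box on top of its word embedding, so that pre-training over text and boxes alone can tie language to layout. The PyTorch sketch below illustrates that embedding step; the class name, dimensions, and coordinate grid are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    """Fuse word embeddings with learned 2-D layout embeddings.

    Illustrative sketch only: names, sizes, and the coordinate
    bucketing are assumptions, not LaTr's exact code.
    """

    def __init__(self, vocab_size: int = 32128, hidden: int = 768,
                 max_coord: int = 1000):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        # One lookup table per box coordinate axis, plus width and
        # height, so the model can learn cues such as "left-of" or "below".
        self.x_emb = nn.Embedding(max_coord + 1, hidden)
        self.y_emb = nn.Embedding(max_coord + 1, hidden)
        self.w_emb = nn.Embedding(max_coord + 1, hidden)
        self.h_emb = nn.Embedding(max_coord + 1, hidden)

    def forward(self, token_ids: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq) OCR token ids.
        # boxes: (batch, seq, 4) integer (x0, y0, x1, y1) coordinates,
        # pre-normalized to the [0, max_coord] grid.
        x0, y0, x1, y1 = boxes.unbind(dim=-1)
        return (self.tok(token_ids)
                + self.x_emb(x0) + self.y_emb(y0)
                + self.x_emb(x1) + self.y_emb(y1)
                + self.w_emb(x1 - x0) + self.h_emb(y1 - y0))

# Example: embed two OCR tokens with their bounding boxes.
tokens = torch.tensor([[17, 42]])
boxes = torch.tensor([[[10, 20, 110, 45], [120, 20, 200, 45]]])
emb = LayoutAwareEmbedding()(tokens, boxes)  # shape: (1, 2, 768)
```

Because the fusion is purely additive, the same module can embed scanned-document text during pre-training and OCR tokens from natural images at fine-tuning time, which is one reason document-only pre-training can transfer despite the domain gap.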

Related research

- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption (12/08/2020). In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and...
- Pre-training image-language transformers for open-vocabulary tasks (09/09/2022). We present a pre-training approach for vision and language transformer m...
- PreSTU: Pre-Training for Scene-Text Understanding (09/12/2022). The ability to read and reason about texts in an image is often lacking ...
- Separate and Locate: Rethink the Text in Text-based Visual Question Answering (08/31/2023). Text-based Visual Question Answering (TextVQA) aims at answering questio...
- DocFormerv2: Local Features for Document Understanding (06/02/2023). We propose DocFormerv2, a multi-modal transformer for Visual Document Un...
- Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer (02/18/2021). We address the challenging problem of Natural Language Comprehension bey...
- LAMBERT: Layout-Aware language Modeling using BERT for information extraction (02/19/2020). In this paper we introduce a novel approach to the problem of understand...
