DocFormer: End-to-End Transformer for Document Understanding

06/22/2021
by   Srikar Appalaraju, et al.
1

We present DocFormer – a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in no. of parameters).

READ FULL TEXT

page 8

page 16

page 17

page 18

page 19

page 20

page 21

research
05/17/2022

MATrIX – Modality-Aware Transformer for Information eXtraction

We present MATrIX - a Modality-Aware Transformer for Information eXtract...
research
06/02/2023

DocFormerv2: Local Features for Document Understanding

We propose DocFormerv2, a multi-modal transformer for Visual Document Un...
research
05/19/2023

Fast-StrucTexT: An Efficient Hourglass Transformer with Modality-guided Dynamic Token Merge for Document Understanding

Transformers achieve promising performance in document understanding bec...
research
05/11/2023

EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification

In the recent past, complex deep neural networks have received huge inte...
research
05/31/2021

Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model

Inspired by biological evolution, we explain the rationality of Vision T...
research
09/20/2021

Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions

Personality computing has become an emerging topic in computer vision, d...
research
02/18/2021

Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

We address the challenging problem of Natural Language Comprehension bey...

Please sign up or login with your details

Forgot password? Click here to reset