DocFormerv2: Local Features for Document Understanding

06/02/2023
by   Srikar Appalaraju, et al.
0

We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents (beyond mere OCR predictions) e.g., extracting information from a form, VQA for documents and other tasks. VDU is challenging as it needs a model to make sense of multiple modalities (visual, language and spatial) to make a prediction. Our approach, termed DocFormerv2 is an encoder-decoder transformer which takes as input - vision, language and spatial features. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically i.e., two novel document tasks on encoder and one on the auto-regressive decoder. The unsupervised tasks have been carefully designed to ensure that the pre-training encourages local-feature alignment between multiple modalities. DocFormerv2 when evaluated on nine datasets shows state-of-the-art performance over strong baselines e.g. TabFact (4.3 capabilities, on three VQA tasks involving scene-text, Doc- Formerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLi and Flamingo) on some tasks. Extensive ablations show that due to its pre-training, DocFormerv2 understands multiple modalities better than prior-art in VDU.

READ FULL TEXT
research
06/22/2021

DocFormer: End-to-End Transformer for Document Understanding

We present DocFormer – a multi-modal transformer based architecture for ...
research
05/17/2022

MATrIX – Modality-Aware Transformer for Information eXtraction

We present MATrIX - a Modality-Aware Transformer for Information eXtract...
research
02/18/2021

Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer

We address the challenging problem of Natural Language Comprehension bey...
research
09/03/2023

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

We propose a novel end-to-end document understanding model called SeRum ...
research
06/15/2021

End-to-End Learning of Keypoint Representations for Continuous Control from Images

In many control problems that include vision, optimal controls can be in...
research
05/16/2023

Sequence-to-Sequence Pre-training with Unified Modality Masking for Visual Document Understanding

This paper presents GenDoc, a general sequence-to-sequence document unde...
research
12/23/2021

LaTr: Layout-Aware Transformer for Scene-Text VQA

We propose a novel multimodal architecture for Scene Text Visual Questio...

Please sign up or login with your details

Forgot password? Click here to reset