Sequence-to-Sequence Pre-training with Unified Modality Masking for Visual Document Understanding

05/16/2023
by Shuwei Feng, et al.

This paper presents GenDoc, a general sequence-to-sequence document understanding model pre-trained with unified masking across three modalities: text, image, and layout. The proposed model uses an encoder-decoder architecture, which makes it more adaptable to a wide range of downstream tasks with diverse output formats, in contrast to the encoder-only models commonly employed in document understanding. In addition to the traditional text infilling task used in previous encoder-decoder models, our pre-training extends to masked image token prediction and masked layout prediction. We also design modality-specific instructions and adopt both disentangled attention and a mixture-of-modality-experts strategy to effectively capture the information carried by each modality. Extensive experiments on several downstream document understanding tasks show that the proposed model achieves superior or competitive performance compared to state-of-the-art approaches. Our analysis further suggests that GenDoc is more robust than encoder-only models when OCR quality is imperfect.
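To make the mixture-of-modality-experts idea more concrete, the sketch below routes each token to a modality-specific feed-forward expert inside an otherwise shared transformer block. This is a minimal PyTorch sketch under our own assumptions: the class name MoMEFeedForward, the modality-id convention, and all dimensions are hypothetical and not taken from the GenDoc implementation.

```python
import torch
import torch.nn as nn


class MoMEFeedForward(nn.Module):
    """Modality-specific feed-forward experts inside one shared layer.

    Hypothetical sketch: names and shapes are illustrative, not GenDoc's API.
    """

    def __init__(self, d_model: int = 768, d_ff: int = 3072, n_modalities: int = 3):
        super().__init__()
        # One expert FFN per modality: 0 = text, 1 = image, 2 = layout.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.GELU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(n_modalities)
        ])

    def forward(self, hidden: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # hidden:       (batch, seq_len, d_model) token representations
        # modality_ids: (batch, seq_len) integer modality tags
        out = torch.zeros_like(hidden)
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m              # tokens belonging to modality m
            if mask.any():
                out[mask] = expert(hidden[mask])  # route them to their own expert
        return out


# Toy usage: 4 text tokens, 2 image tokens, 2 layout tokens in one sequence.
x = torch.randn(1, 8, 768)
modalities = torch.tensor([[0, 0, 0, 0, 1, 1, 2, 2]])
print(MoMEFeedForward()(x, modalities).shape)  # torch.Size([1, 8, 768])
```

In a full encoder-decoder model, a block like this would typically replace the standard feed-forward sublayer, while the attention sublayer (the paper's disentangled attention) remains shared across modalities.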

research · 03/01/2023
StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training
In this paper, we present StrucTexTv2, an effective document image pre-t...

research · 12/31/2019
LayoutLM: Pre-training of Text and Layout for Document Image Understanding
Pre-training techniques have been verified successfully in a variety of ...

research · 06/14/2022
LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling
Unified vision-language frameworks have greatly advanced in recent years...

research · 05/24/2021
StructuralLM: Structural Pre-training for Form Understanding
Large pre-trained language models achieve state-of-the-art results when ...

research · 06/02/2023
DocFormerv2: Local Features for Document Understanding
We propose DocFormerv2, a multi-modal transformer for Visual Document Un...

research · 09/03/2023
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
We propose a novel end-to-end document understanding model called SeRum ...

research · 08/17/2022
UniLayout: Taming Unified Sequence-to-Sequence Transformers for Graphic Layout Generation
To satisfy various user needs, different subtasks of graphic layout gene...
