LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

by Yang Xu et al.

Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks, thanks to its effective model architecture and the availability of large-scale unlabeled scanned and digital-born documents. In this paper, we present LayoutLMv2, which pre-trains text, layout, and image in a multi-modal framework, leveraging new model architectures and pre-training tasks. Specifically, LayoutLMv2 uses not only the existing masked visual-language modeling task but also new text-image alignment and text-image matching tasks in the pre-training stage, so that cross-modality interaction is better learned. It also integrates a spatial-aware self-attention mechanism into the Transformer architecture, enabling the model to fully capture the relative positional relationships among different text blocks. Experimental results show that LayoutLMv2 outperforms strong baselines and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 -> 0.8420), CORD (0.9493 -> 0.9601), SROIE (0.9524 -> 0.9781), Kleister-NDA (0.834 -> 0.852), RVL-CDIP (0.9443 -> 0.9564), and DocVQA (0.7295 -> 0.8672).
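The spatial-aware self-attention mechanism mentioned above can be illustrated with a minimal PyTorch sketch: standard scaled dot-product attention whose scores receive learnable biases for relative 1D token distance and relative 2D (x, y) bounding-box distance. The class name, bucket sizes, and dimensions below are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import math
import torch
import torch.nn as nn

class SpatialAwareSelfAttention(nn.Module):
    """Sketch of spatial-aware self-attention: dot-product attention
    plus per-head learnable biases indexed by clipped relative 1D token
    distance and relative 2D box-coordinate distance (illustrative only)."""

    def __init__(self, dim, num_heads=8, max_rel_1d=128, max_rel_2d=64):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        # one bias per head per relative-distance bucket
        self.rel_1d = nn.Embedding(2 * max_rel_1d + 1, num_heads)
        self.rel_x = nn.Embedding(2 * max_rel_2d + 1, num_heads)
        self.rel_y = nn.Embedding(2 * max_rel_2d + 1, num_heads)
        self.max_rel_1d = max_rel_1d
        self.max_rel_2d = max_rel_2d

    def forward(self, hidden, pos_1d, box_xy):
        # hidden: (B, T, dim); pos_1d: (B, T) token indices (long);
        # box_xy: (B, T, 2) quantized box coordinates (long)
        B, T, _ = hidden.shape
        q, k, v = self.qkv(hidden).chunk(3, dim=-1)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-1, -2) / math.sqrt(self.head_dim)

        def rel_bias(pos, emb, max_rel):
            # clip pairwise distances into buckets, look up per-head bias
            d = (pos[:, :, None] - pos[:, None, :]).clamp(-max_rel, max_rel)
            return emb(d + max_rel).permute(0, 3, 1, 2)  # (B, heads, T, T)

        scores = scores + rel_bias(pos_1d, self.rel_1d, self.max_rel_1d)
        scores = scores + rel_bias(box_xy[..., 0], self.rel_x, self.max_rel_2d)
        scores = scores + rel_bias(box_xy[..., 1], self.rel_y, self.max_rel_2d)
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(out)
```

Because the biases depend only on relative distances, the same layout-distance bucket contributes the same score adjustment anywhere in the document, which is what lets the model reason about nearby versus distant text blocks.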







Related papers

MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding
Multimodal pre-training with text, layout, and image has made significan...

DocFormer: End-to-End Transformer for Document Understanding
We present DocFormer, a multi-modal transformer based architecture for ...

Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation Learning
In this paper, we propose a multi-task learning-based framework that uti...

StrucTexT: Structured Text Understanding with Multi-Modal Transformers
Structured text understanding on Visually Rich Documents (VRDs) is a cru...

MLIM: Vision-and-Language Model Pre-training with Masked Language and Image Modeling
Vision-and-Language Pre-training (VLP) improves model performance for do...

A Span Extraction Approach for Information Extraction on Visually-Rich Documents
Information extraction (IE) from visually-rich documents (VRDs) has achi...

Value Retrieval with Arbitrary Queries for Form-like Documents
We propose value retrieval with arbitrary queries for form-like document...