LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

by Yiheng Xu, et al.

Multimodal pre-training with text, layout, and image has recently achieved state-of-the-art (SOTA) performance on visually-rich document understanding tasks, demonstrating the great potential of joint learning across modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding that aims to bridge language barriers in visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce XFUN, a multilingual form understanding benchmark dataset with manually labeled key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). Experimental results show that LayoutXLM significantly outperforms existing SOTA cross-lingual pre-trained models on the XFUN dataset. The pre-trained LayoutXLM model and the XFUN dataset will be made publicly available.
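To make the text-plus-layout idea concrete, the sketch below shows one common way LayoutLM-family models (of which LayoutXLM is a member) fuse the two modalities: each token embedding is summed with embeddings of its bounding-box coordinates, normalized to a 0-1000 grid. This is an illustrative toy with random weights and made-up sizes, not the actual LayoutXLM implementation; the function and variable names are assumptions for the example.

```python
# Illustrative sketch (not the real LayoutXLM code): combine a token
# embedding with 2D layout embeddings of its bounding box, with pixel
# coordinates normalized to a 0-1000 grid as in the LayoutLM family.
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE, COORD_BINS, HIDDEN = 1000, 1001, 64  # toy sizes

token_emb = rng.normal(size=(VOCAB_SIZE, HIDDEN)) * 0.02
x_emb = rng.normal(size=(COORD_BINS, HIDDEN)) * 0.02
y_emb = rng.normal(size=(COORD_BINS, HIDDEN)) * 0.02

def embed(token_ids, boxes, page_w, page_h):
    """Sum token and layout embeddings for one token sequence.

    boxes: (seq_len, 4) pixel boxes [x0, y0, x1, y1] on the page.
    """
    boxes = np.asarray(boxes, dtype=float)
    # Normalize pixel coordinates into integer bins on a 0-1000 grid.
    norm = np.empty(boxes.shape, dtype=int)
    norm[:, [0, 2]] = (boxes[:, [0, 2]] / page_w * 1000).astype(int)
    norm[:, [1, 3]] = (boxes[:, [1, 3]] / page_h * 1000).astype(int)
    # Text embedding plus x-coordinate and y-coordinate embeddings.
    e = token_emb[np.asarray(token_ids)]
    e = e + x_emb[norm[:, 0]] + x_emb[norm[:, 2]]
    e = e + y_emb[norm[:, 1]] + y_emb[norm[:, 3]]
    return e

seq = embed(
    [5, 42, 7],
    [[10, 20, 110, 40], [120, 20, 200, 40], [10, 50, 90, 70]],
    page_w=612, page_h=792,
)
print(seq.shape)  # (3, 64)
```

Because the layout signal enters as an additive embedding, the same Transformer body can be shared across languages: the coordinate grid is language-independent, which is part of what makes cross-lingual transfer on form-like documents plausible.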






