LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

by Yiheng Xu, et al.

Multimodal pre-training with text, layout, and image has recently achieved state-of-the-art (SOTA) performance on visually-rich document understanding tasks, demonstrating the great potential of joint learning across modalities. In this paper, we present LayoutXLM, a multimodal pre-trained model for multilingual document understanding that aims to bridge language barriers in visually-rich document understanding. To accurately evaluate LayoutXLM, we also introduce XFUN, a multilingual form understanding benchmark dataset with manually labeled key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). Experimental results show that LayoutXLM significantly outperforms existing SOTA cross-lingual pre-trained models on the XFUN dataset. The pre-trained LayoutXLM model and the XFUN dataset will be made publicly available.
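To make the text-plus-layout idea concrete, the sketch below shows one common way LayoutLM-family models (of which LayoutXLM is a member) fuse the two modalities: each token embedding is summed with embeddings of its bounding-box coordinates, normalized to a 0-1000 grid. This is an illustrative toy with random weights and made-up sizes, not the actual LayoutXLM implementation; the function and variable names are assumptions for the example.

```python
# Illustrative sketch (not the real LayoutXLM code): combine a token
# embedding with 2D layout embeddings of its bounding box, with pixel
# coordinates normalized to a 0-1000 grid as in the LayoutLM family.
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE, COORD_BINS, HIDDEN = 1000, 1001, 64  # toy sizes

token_emb = rng.normal(size=(VOCAB_SIZE, HIDDEN)) * 0.02
x_emb = rng.normal(size=(COORD_BINS, HIDDEN)) * 0.02
y_emb = rng.normal(size=(COORD_BINS, HIDDEN)) * 0.02

def embed(token_ids, boxes, page_w, page_h):
    """Sum token and layout embeddings for one token sequence.

    boxes: (seq_len, 4) pixel boxes [x0, y0, x1, y1] on the page.
    """
    boxes = np.asarray(boxes, dtype=float)
    # Normalize pixel coordinates into integer bins on a 0-1000 grid.
    norm = np.empty(boxes.shape, dtype=int)
    norm[:, [0, 2]] = (boxes[:, [0, 2]] / page_w * 1000).astype(int)
    norm[:, [1, 3]] = (boxes[:, [1, 3]] / page_h * 1000).astype(int)
    # Text embedding plus x-coordinate and y-coordinate embeddings.
    e = token_emb[np.asarray(token_ids)]
    e = e + x_emb[norm[:, 0]] + x_emb[norm[:, 2]]
    e = e + y_emb[norm[:, 1]] + y_emb[norm[:, 3]]
    return e

seq = embed(
    [5, 42, 7],
    [[10, 20, 110, 40], [120, 20, 200, 40], [10, 50, 90, 70]],
    page_w=612, page_h=792,
)
print(seq.shape)  # (3, 64)
```

Because the layout signal enters as an additive embedding, the same Transformer body can be shared across languages: the coordinate grid is language-independent, which is part of what makes cross-lingual transfer on form-like documents plausible.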






