Knowing Where and What: Unified Word Block Pretraining for Document Understanding

07/28/2022
by Song Tao, et al.

Due to the complex layouts of documents, it is challenging to extract information from them. Most previous studies develop multimodal pre-trained models in a self-supervised way. In this paper, we focus on the embedding learning of word blocks containing text and layout information, and propose UTel, a language model with Unified TExt and Layout pre-training. Specifically, we propose two pre-training tasks: Surrounding Word Prediction (SWP) for layout learning, and Contrastive learning of Word Embeddings (CWE) for distinguishing different word blocks. Moreover, we replace the commonly used 1D position embedding with a 1D clipped relative position embedding. In this way, the joint training of Masked Layout-Language Modeling (MLLM) and the two newly proposed tasks enables the interaction between semantic and spatial features in a unified way. Additionally, because the 1D position embedding is removed, the proposed UTel can process sequences of arbitrary length while maintaining competitive performance. Extensive experimental results show that UTel learns better joint representations and outperforms previous methods on various downstream tasks, while requiring no image modality. Code is available at <https://github.com/taosong2019/UTel>.
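
The abstract does not spell out how the 1D clipped relative position embedding is implemented, so the following is only a minimal sketch of the general idea, assuming a Shaw-style relative position bias in PyTorch: pairwise token distances are clipped to a fixed window before indexing a learned table, which is what allows sequences of arbitrary length to reuse the same parameters. The class name `ClippedRelativePositionBias` and the parameter `max_dist` are illustrative choices, not identifiers from the UTel codebase.

```python
# Minimal sketch (not the authors' implementation) of a 1D clipped
# relative position embedding: the relative distance between query
# position i and key position j is clipped to [-max_dist, max_dist]
# before indexing a learned per-head bias table.
import torch
import torch.nn as nn


class ClippedRelativePositionBias(nn.Module):
    def __init__(self, num_heads: int, max_dist: int = 128):
        super().__init__()
        self.max_dist = max_dist
        # One learned bias per head for each clipped relative distance.
        self.bias = nn.Embedding(2 * max_dist + 1, num_heads)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        # Relative distance j - i for every (query, key) pair, clipped so
        # that sequences of arbitrary length share the same table.
        rel = pos[None, :] - pos[:, None]
        rel = rel.clamp(-self.max_dist, self.max_dist) + self.max_dist
        # Shape: (num_heads, seq_len, seq_len), added to attention logits.
        return self.bias(rel).permute(2, 0, 1)


# Example: a bias that can be added to attention scores of shape
# (batch, num_heads, seq_len, seq_len).
bias = ClippedRelativePositionBias(num_heads=12, max_dist=128)(seq_len=512)
```

Because only clipped distances are embedded, the table size is fixed at 2 * max_dist + 1 regardless of sequence length, which is consistent with the abstract's claim that removing the absolute 1D position embedding enables arbitrary-length inputs.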
