Enhancing Visually-Rich Document Understanding via Layout Structure Modeling

08/15/2023
by   Qiwei Li, et al.
0

In recent years, the use of multi-modal pre-trained Transformers has led to significant advancements in visually-rich document understanding. However, existing models have mainly focused on features such as text and vision while neglecting the importance of layout relationship between text nodes. In this paper, we propose GraphLayoutLM, a novel document understanding model that leverages the modeling of layout structure graph to inject document layout knowledge into the model. GraphLayoutLM utilizes a graph reordering algorithm to adjust the text sequence based on the graph structure. Additionally, our model uses a layout-aware multi-head self-attention layer to learn document layout knowledge. The proposed model enables the understanding of the spatial arrangement of text elements, improving document comprehension. We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results among these datasets. Our experimental results demonstrate that our proposed method provides a significant improvement over existing approaches and showcases the importance of incorporating layout information into document understanding models. We also conduct an ablation study to investigate the contribution of each component of our model. The results show that both the graph reordering algorithm and the layout-aware multi-head self-attention layer play a crucial role in achieving the best performance.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/30/2023

LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

Visually-rich Document Understanding (VrDU) has attracted much research ...
research
12/29/2020

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Pre-training of text and layout has proved effective in a variety of vis...
research
04/16/2021

LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding

Document layout comprises both structural and visual (eg. font-sizes) in...
research
01/16/2023

Diverse Multimedia Layout Generation with Multi Choice Learning

Designing visually appealing layouts for multimedia documents containing...
research
03/14/2022

XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding

Recently, various multimodal networks for Visually-Rich Document Underst...
research
03/25/2022

Multimodal Pre-training Based on Graph Attention Network for Document Understanding

Document intelligence as a relatively new research topic supports many b...
research
04/07/2020

Is Graph Structure Necessary for Multi-hop Reasoning?

Recently, many works attempt to model texts as graph structure and intro...

Please sign up or login with your details

Forgot password? Click here to reset