Multimodal Pre-training Based on Graph Attention Network for Document Understanding

03/25/2022
by   Zhenrong Zhang, et al.

Document intelligence is a relatively new research topic that supports many business applications. Its main task is to automatically read, understand, and analyze documents. However, the diversity of document formats (invoices, reports, forms, etc.) and layouts makes it difficult for machines to understand documents. In this paper, we present GraphDoc, a multimodal graph attention-based model for various document understanding tasks. GraphDoc is pre-trained in a multimodal framework that uses text, layout, and image information simultaneously. In a document, a text block relies heavily on its surrounding context, so we inject the graph structure into the attention mechanism to form a graph attention layer in which each input node attends only to its neighborhood. The input nodes of each graph attention layer are composed of textual, visual, and positional features from semantically meaningful regions in a document image. The multimodal features of each node are fused by a gate fusion layer, and the contextualization between nodes is modeled by the graph attention layer. GraphDoc learns a generic representation from only 320k unlabeled documents via a Masked Sentence Modeling task. Extensive experiments on publicly available datasets show that GraphDoc achieves state-of-the-art performance, demonstrating the effectiveness of the proposed method.
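The two mechanisms named in the abstract, neighborhood-restricted attention and gate fusion, can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the authors' implementation: the function names, weight shapes, and single-head formulation are assumptions made for clarity. Restricting attention to graph neighbors is implemented here as standard masked self-attention, and the gate fusion uses a sigmoid gate over concatenated textual and visual features.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(nodes, adjacency, Wq, Wk, Wv):
    """Single-head attention in which each node attends only to its
    graph neighbors: non-neighbor scores are masked to -inf before softmax."""
    q, k, v = nodes @ Wq, nodes @ Wk, nodes @ Wv
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    scores = np.where(adjacency > 0, scores, -1e9)  # mask non-neighbors
    return softmax(scores, axis=-1) @ v

def gate_fusion(text_feat, vis_feat, Wg, bg):
    """Gate fusion sketch: a sigmoid gate, computed from the concatenated
    modalities, mixes textual and visual features per dimension."""
    z = np.concatenate([text_feat, vis_feat], axis=-1) @ Wg + bg
    g = 1.0 / (1.0 + np.exp(-z))
    return g * text_feat + (1.0 - g) * vis_feat

# Toy usage: 4 text blocks with 8-dim features, chained as a line graph.
rng = np.random.default_rng(0)
n, d = 4, 8
nodes = rng.normal(size=(n, d))
adj = np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
contextualized = graph_attention(nodes, adj, Wq, Wk, Wv)

Wg, bg = rng.normal(size=(2 * d, d)), np.zeros(d)
fused = gate_fusion(nodes, rng.normal(size=(n, d)), Wg, bg)
```

In GraphDoc the mask would come from the document layout (e.g. spatial neighbors of each text block) rather than the toy chain used here.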



Related research

- Skim-Attention: Learning to Focus via Document Layout (09/02/2021)
  Transformer-based pre-training techniques of text and layout have proven...

- StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training (03/01/2023)
  In this paper, we present StrucTexTv2, an effective document image pre-t...

- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking (04/18/2022)
  Self-supervised pre-training techniques have achieved remarkable progres...

- Enhancing Visually-Rich Document Understanding via Layout Structure Modeling (08/15/2023)
  In recent years, the use of multi-modal pre-trained Transformers has led...

- ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents (05/25/2021)
  Recent grid-based document representations like BERTgrid allow the simul...

- Unsupervised Discovery of Multimodal Links in Multi-Image, Multi-Sentence Documents (04/16/2019)
  Images and text co-occur everywhere on the web, but explicit links betwe...

- A Framework for Comparing Groups of Documents (08/24/2015)
  We present a general framework for comparing multiple groups of document...
