ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents

05/25/2021
by Weihong Lin, et al.

Recent grid-based document representations such as BERTgrid encode the textual and layout information of a document simultaneously in a 2D feature map, so that state-of-the-art image segmentation and/or object detection models can be applied directly to extract key information from documents. However, such methods have not yet achieved performance comparable to state-of-the-art sequence- and graph-based methods such as LayoutLM and PICK. In this paper, we propose a new multi-modal backbone network that concatenates a BERTgrid to an intermediate layer of a CNN model, where the input of the CNN is a document image and the BERTgrid is a grid of word embeddings, to generate a more powerful grid-based document representation, named ViBERTgrid. Unlike BERTgrid, the parameters of BERT and the CNN in our multi-modal backbone network are trained jointly. Our experimental results demonstrate that this joint training strategy significantly improves the representation ability of ViBERTgrid. Consequently, our ViBERTgrid-based key information extraction approach achieves state-of-the-art performance on real-world datasets.
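
The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `build_embedding_grid` and `fuse`, the stride of 8, the box format `(x0, y0, x1, y1)` in pixels, and the tensor shapes are all assumptions chosen for illustration. Plain arrays stand in for the BERT word embeddings and the intermediate CNN feature map, and the joint training of BERT and the CNN is not shown.

```python
import numpy as np


def build_embedding_grid(boxes, embeddings, image_hw, stride):
    """Write each word's embedding into every grid cell its box covers.

    boxes: list of (x0, y0, x1, y1) integer pixel boxes (assumed format).
    embeddings: array of shape (num_words, dim), one row per word.
    image_hw: (height, width) of the document image in pixels.
    stride: assumed downsampling factor of the CNN layer being fused with.
    """
    height, width = image_hw
    grid_h, grid_w = height // stride, width // stride
    dim = embeddings.shape[1]
    grid = np.zeros((dim, grid_h, grid_w), dtype=np.float32)
    for (x0, y0, x1, y1), emb in zip(boxes, embeddings):
        # Map pixel coordinates to grid cells; keep at least one cell per word.
        gx0, gy0 = x0 // stride, y0 // stride
        gx1 = max(gx0 + 1, -(-x1 // stride))  # ceiling division
        gy1 = max(gy0 + 1, -(-y1 // stride))
        grid[:, gy0:gy1, gx0:gx1] = emb[:, None, None]
    return grid


def fuse(feature_map, grid):
    """Concatenate a CNN feature map and an embedding grid along channels."""
    return np.concatenate([feature_map, grid], axis=0)


# Two hypothetical words on a 32x64 image with 4-dimensional embeddings.
boxes = [(0, 0, 16, 8), (32, 16, 48, 24)]
embeddings = np.array([[1, 1, 1, 1], [2, 2, 2, 2]], dtype=np.float32)
grid = build_embedding_grid(boxes, embeddings, image_hw=(32, 64), stride=8)

# Stand-in for an intermediate CNN feature map with 3 channels.
feature_map = np.zeros((3, 4, 8), dtype=np.float32)
fused = fuse(feature_map, grid)
```

Cells not covered by any word stay zero, so the fused map carries visual features everywhere and textual features only where words occur. In the jointly trained setting the abstract describes, gradients would flow through the concatenation into both the embedding model and the CNN, which this array-only sketch does not model.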

research
10/07/2020

VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach

We introduce a novel approach for scanned document representation to per...
research
07/11/2022

GMN: Generative Multi-modal Network for Practical Document Information Extraction

Document Information Extraction (DIE) has attracted increasing attention...
research
06/01/2022

HYCEDIS: HYbrid Confidence Engine for Deep Document Intelligence System

Measuring the confidence of AI models is critical for safely deploying A...
research
02/05/2021

Metaknowledge Extraction Based on Multi-Modal Documents

The triple-based knowledge in large-scale knowledge bases is most likely...
research
09/24/2018

Chargrid: Towards Understanding 2D Documents

We introduce a novel type of text representation that preserves the 2D l...
research
03/25/2022

Multimodal Pre-training Based on Graph Attention Network for Document Understanding

Document intelligence as a relatively new research topic supports many b...
research
09/11/2019

BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding

For understanding generic documents, information like font sizes, column...
