Document Entity Retrieval with Massive and Noisy Pre-training

06/15/2023
by Lijun Yu, et al.

Visually-Rich Document Entity Retrieval (VDER) is a machine learning task that recovers, for each entity of interest, the corresponding text spans in a document. VDER has gained significant attention in recent years thanks to its broad applications in enterprise AI. Unfortunately, because document images often contain personally identifiable information (PII), publicly available data have been scarce, owing both to privacy constraints and to the cost of acquiring annotations. To make things worse, each dataset often defines its own set of entities, and the non-overlapping entity spaces between datasets make it difficult to transfer knowledge between documents. In this paper, we propose a method to collect massive-scale, noisy, and weakly labeled data from the web to benefit the training of VDER models. The method generates a large amount of document image data to compensate for the lack of training data in many VDER settings. Moreover, the collected dataset, named DocuNet, does not depend on specific document types or entity sets, making it universally applicable to all VDER tasks. Empowered by DocuNet, we present a lightweight multimodal architecture named UniFormer, which can learn a unified representation from text, layout, and image crops without needing extra visual pre-training. We experiment with our methods on popular VDER models in various settings and show the improvements obtained when this massive dataset is combined with UniFormer, in both classic entity retrieval and few-shot learning settings.

research
10/12/2022

ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding

Recent years have witnessed the rise and success of pre-training techniq...
research
12/31/2019

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Pre-training techniques have been verified successfully in a variety of ...
research
04/18/2021

LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

Multimodal pre-training with text, layout, and image has achieved SOTA p...
research
02/12/2020

Joint Embedding in Named Entity Linking on Sentence Level

Named entity linking is to map an ambiguous mention in documents to an e...
research
09/09/2021

Tiny CNN for feature point description for document analysis: approach and dataset

In this paper, we study the problem of feature points description in the...
research
05/17/2022

Detection Masking for Improved OCR on Noisy Documents

Optical Character Recognition (OCR), the task of extracting textual info...
research
06/06/2019

One-shot Information Extraction from Document Images using Neuro-Deductive Program Synthesis

Our interest in this paper is in meeting a rapidly growing industrial de...
