Data-Efficient Information Extraction from Form-Like Documents

01/07/2022
by Beliz Gunel, et al.

Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution to this problem requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data efficiency and (2) the ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled documents for training (about 50), a straightforward transfer learning approach from a larger labeled corpus that is considerably different in structure yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, which is currently in production use, and show that this yields up to a further 8 F1 point improvement. We make the case that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document types, and that learning good representations is critical to accomplishing this.
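The improvements above are reported in F1 points, that is, differences in F1 on a 0 to 100 scale. As a minimal sketch of that arithmetic, the snippet below computes F1 from extraction counts. The function name `f1_score` and all counts are hypothetical and chosen only for illustration; they are not results or code from the paper.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1 on a 0-100 scale, or 0.0 when there is nothing to score."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall, which simplifies to
    # 2*TP / (2*TP + FP + FN).
    return 100.0 * 2 * true_positives / denominator


# Hypothetical counts (not from the paper): a model trained only on the small
# target corpus versus one that first transfers from a larger corpus.
baseline_f1 = f1_score(true_positives=50, false_positives=50, false_negatives=50)
transfer_f1 = f1_score(true_positives=77, false_positives=23, false_negatives=23)

# The gap between the two scores is the improvement in F1 points.
improvement = transfer_f1 - baseline_f1
print(f"baseline={baseline_f1:.1f} transfer={transfer_f1:.1f} gain={improvement:.1f}")
```

With these made-up counts the gap is 27 F1 points, which shows the scale of the reported gain without reproducing any actual measurement.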

