Business Document Information Extraction: Towards Practical Benchmarks

06/20/2022
by Matyáš Skalický, et al.

Information extraction from semi-structured documents is crucial for frictionless business-to-business (B2B) communication. While machine learning problems related to Document Information Extraction (IE) have been studied for decades, many common problem definitions and benchmarks do not reflect domain-specific aspects and practical needs for automating B2B document communication. We review the landscape of Document IE problems, datasets and benchmarks. We highlight the practical aspects missing in the common definitions and define the Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) problems. There is a lack of relevant datasets and benchmarks for Document IE on semi-structured business documents as their content is typically legally protected or sensitive. We discuss potential sources of available documents including synthetic data.

