Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts

by Tomasz Stanislawek, et al.

Key Information Extraction (KIE) is increasingly important in natural language processing, but there are still only a few well-defined problems that serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets, Kleister NDA and Kleister Charity. They involve a mix of scanned and born-digital long formal English-language documents. In these datasets, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister Charity dataset consists of 2,788 annual financial reports of charity organizations, with 61,643 unique pages and 21,612 entities to extract. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract. We provide several state-of-the-art baseline systems from the KIE domain (Flair, BERT, RoBERTa, LayoutLM, LAMBERT), which show that our datasets pose a strong challenge to existing models. The best model achieved F1-scores of 81.77% and 83.57% on the Kleister NDA and Kleister Charity datasets, respectively. We share the datasets to encourage progress on more in-depth and complex information extraction tasks.
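As a rough illustration of how an F1-score for key information extraction can be computed, the sketch below scores predicted (document, key, value) triples against gold triples using micro-averaging. This is an assumption for illustration only: the abstract does not describe the exact evaluation procedure, and the function name `micro_f1`, the entity keys, and the example values are hypothetical rather than taken from the datasets.

```python
from collections import Counter


def micro_f1(gold, predicted):
    """Micro-averaged F1 over (document_id, key, value) triples.

    Assumption: a prediction counts as correct only if the whole
    triple matches a gold triple exactly.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection gives the number of exact matches.
    true_positives = sum((gold_counts & pred_counts).values())
    if true_positives == 0:
        return 0.0
    precision = true_positives / sum(pred_counts.values())
    recall = true_positives / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold annotations and system output for two documents.
gold = [
    ("doc1", "party", "Acme_Inc."),
    ("doc1", "jurisdiction", "New_York"),
    ("doc2", "party", "Globex_LLC"),
    ("doc2", "term", "2_years"),
]
predicted = [
    ("doc1", "party", "Acme_Inc."),
    ("doc1", "jurisdiction", "Delaware"),
    ("doc2", "party", "Globex_LLC"),
]

score = micro_f1(gold, predicted)
print(f"{score:.4f}")
```

Here two of three predictions are correct (precision 2/3) and two of four gold entities are found (recall 1/2), so the wrong jurisdiction and the missing term both lower the score.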




