Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout

03/04/2020
by Filip Graliński, et al.

State-of-the-art solutions for Natural Language Processing (NLP) are able to capture a broad range of contexts, such as sentence-level context or document-level context for short documents. However, these solutions still struggle with longer real-world documents in which information is encoded in the spatial structure of the document: in elements such as tables, forms, headers, openings, or footers, or in the complex layout of a page or of multiple pages. To encourage progress on deeper and more complex information extraction, we present a new task (named Kleister) with two new datasets. Using both textual and structural layout features, an NLP system must find the most important information about various types of entities in long formal documents. These entities are not only classes from standard named entity recognition (NER) systems (e.g. location, date, or amount) but also the roles of the entities in the whole document (e.g. company town address, report date, income amount).

