CED: Catalog Extraction from Documents

04/28/2023
by Tong Zhu, et al.

Sentence-by-sentence information extraction from long documents is an exhausting and error-prone task. As indicators of a document's skeleton, catalogs naturally chunk documents into segments and provide informative cascade semantics, which helps to reduce the search space. Despite their usefulness, catalogs are hard to extract without assistance from external knowledge. For documents that adhere to a specific template, regular expressions are a practical way to extract catalogs. However, handcrafted heuristics are not applicable when processing documents from different sources with diverse formats. To address this problem, we build a large manually annotated corpus, which is the first dataset for the Catalog Extraction from Documents (CED) task. Based on this corpus, we propose a transition-based framework for parsing documents into catalog trees. The experimental results demonstrate that our proposed method outperforms baseline systems and shows good transfer ability. We believe the CED task could fill the gap between raw text segments and information extraction tasks on extremely long documents. Data and code are available at <https://github.com/Spico197/CatalogExtraction>.
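To make the template-based case concrete, here is a minimal sketch of regular-expression catalog extraction. It is not the transition-based method proposed in the paper. It assumes a hypothetical template in which headings are numbered like "1", "1.1", "1.2.3"; the names `HEADING`, `Node`, and `build_catalog` are illustrative and do not come from the authors' code.

```python
import re
from dataclasses import dataclass, field

# Assumed (hypothetical) template: a heading is a dotted number, whitespace, then a title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


@dataclass
class Node:
    title: str
    depth: int
    children: list = field(default_factory=list)  # sub-headings
    text: list = field(default_factory=list)      # body lines under this heading


def build_catalog(lines):
    """Parse numbered headings into a catalog tree rooted at a dummy ROOT node."""
    root = Node("ROOT", 0)
    stack = [root]  # path from the root to the most recently opened heading
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            # "1" has depth 1, "1.1" has depth 2, and so on.
            depth = match.group(1).count(".") + 1
            # Close headings that are at the same level or deeper.
            while stack[-1].depth >= depth:
                stack.pop()
            node = Node(match.group(2), depth)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Plain text attaches to the innermost open heading.
            stack[-1].text.append(line)
    return root
```

Under this assumed pattern, a body sentence that happens to start with a number followed by a space (for example "2023 revenue grew") would be treated as a heading, which illustrates why such handcrafted rules break down on documents with diverse formats.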


