A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents

03/17/2023
by Norman Meuschke, et al.

Extracting information from academic PDF documents is crucial for numerous indexing, retrieval, and analysis use cases. Choosing the best tool to extract specific content elements is difficult because many technically diverse tools are available, but recent performance benchmarks are rare. Moreover, such benchmarks typically cover only a few content elements, such as header metadata or bibliographic references, and use smaller datasets from specific academic disciplines. We provide a large and diverse evaluation framework that supports more extraction tasks than most related datasets. Our framework builds upon DocBank, a multi-domain dataset of 1.5M annotated content elements extracted from 500K pages of research papers on arXiv. Using the new framework, we benchmark ten freely available tools in extracting document metadata, bibliographic references, tables, and other content elements from academic PDF documents. GROBID achieves the best metadata and reference extraction results, followed by CERMINE and Science Parse. For table extraction, Adobe Extract outperforms the other tools, although its performance is much lower than for other content elements. All tools struggle to extract lists, footers, and equations. We conclude that more research on improving and combining tools is necessary to achieve satisfactory extraction quality for most content elements. Evaluation datasets and frameworks like the one we present support this line of research. We make our data and code publicly available to contribute toward this goal.


