New Methods for Metadata Extraction from Scientific Literature

10/27/2017
by   Dominika Tkaczyk, et al.
0

Within the past few decades we have witnessed digital revolution, which moved scholarly communication to electronic media and also resulted in a substantial increase in its volume. Nowadays keeping track with the latest scientific achievements poses a major challenge for the researchers. Scientific information overload is a severe problem that slows down scholarly communication and knowledge propagation across the academia. Modern research infrastructures facilitate studying scientific literature by providing intelligent search tools, proposing similar and related documents, visualizing citation and author networks, assessing the quality and impact of the articles, and so on. In order to provide such high quality services the system requires the access not only to the text content of stored documents, but also to their machine-readable metadata. Since in practice good quality metadata is not always available, there is a strong demand for a reliable automatic method of extracting machine-readable metadata directly from source documents. This research addresses these problems by proposing an automatic, accurate and flexible algorithm for extracting wide range of metadata directly from scientific articles in born-digital form. Extracted information includes basic document metadata, structured full text and bibliography section. Designed as a universal solution, proposed algorithm is able to handle a vast variety of publication layouts with high precision and thus is well-suited for analyzing heterogeneous document collections. This was achieved by employing supervised and unsupervised machine-learning algorithms trained on large, diverse datasets. The evaluation we conducted showed good performance of proposed metadata extraction algorithm. The comparison with other similar solutions also proved our algorithm performs better than competition for most metadata types.

READ FULL TEXT
research
07/24/2018

Rule Based Metadata Extraction Framework from Academic Articles

Metadata of scientific articles such as title, abstract, keywords or ind...
research
07/01/2021

Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations

Electronic Theses and Dissertations (ETDs) contain domain knowledge that...
research
10/31/2022

User Manual of Automatic Data Curation Tool(ADCT): A bulk data curator software in Library and Information Science

In library and information science, document storage and user-specific d...
research
12/16/2019

Pipelines for Procedural Information Extraction from Scientific Literature: Towards Recipes using Machine Learning and Data Science

This paper describes a machine learning and data science pipeline for st...
research
09/19/2023

Semi-automatic staging area for high-quality structured data extraction from scientific literature

In this study, we propose a staging area for ingesting new superconducto...
research
11/10/2021

Multimodal Approach for Metadata Extraction from German Scientific Publications

Nowadays, metadata information is often given by the authors themselves ...
research
06/20/2019

Cleaning Noisy and Heterogeneous Metadata for Record Linking Across Scholarly Big Datasets

Automatically extracted metadata from scholarly documents in PDF formats...

Please sign up or login with your details

Forgot password? Click here to reset