Rule Based Metadata Extraction Framework from Academic Articles

07/24/2018
by Jahongir Azimjonov, et al.

Metadata of scientific articles, such as the title, abstract, keywords or index terms, body text, conclusion, and references, play a decisive role in collecting, managing, and storing academic data in scientific databases, academic journals, and digital libraries. Accurate extraction of these fields from scientific papers is crucial for organizing and retrieving scientific information, for researchers as well as librarians. Research social network systems and academic digital library systems provide services for extracting, organizing, and retrieving academic data. Most of these services are neither free nor open source. They also suffer from performance problems and limit the number of PDF (Portable Document Format) files that can be uploaded for extraction. In this paper, a completely free and open source, Java-based, high-performance metadata extraction framework is proposed. The framework's extraction speed is 9-10 times faster than that of existing metadata extraction systems. It is also flexible in that it allows an unlimited number of PDF files to be uploaded. In this approach, paper titles are extracted using layout features together with the font and size characteristics of the text. Other metadata fields, such as abstracts, body text, keywords, conclusions, and references, are extracted from PDF files using fixed rule sets. Extracted metadata are stored in both an Oracle database and an XML (Extensible Markup Language) file. The framework can be used to build scientific collections in digital libraries, online journals, online and offline scientific databases, government research agencies, and research centers.
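
The abstract does not list the actual rules, so the following is only a minimal Java sketch of the two ideas it names: picking the title by font size and splitting the remaining fields with fixed heading rules. It assumes a PDF parser has already produced text lines with their font sizes. The class `RuleBasedExtractorSketch`, the `TextLine` type, the heading keyword list, and both method names are hypothetical and are not taken from the framework itself.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleBasedExtractorSketch {

    /** One text line plus the font size a PDF parser reported for it (hypothetical type). */
    static final class TextLine {
        final String text;
        final float fontSize;

        TextLine(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    // Assumed rule: an optional section number, then a known heading keyword alone on the line.
    private static final Pattern HEADING = Pattern.compile(
            "^\\s*(?:\\d+\\.?\\s*)?(abstract|keywords|index terms|introduction|conclusions?|references)\\s*[:.]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    /** Title heuristic: join the first run of consecutive first-page lines set in the largest font. */
    static String extractTitle(List<TextLine> firstPage) {
        float max = 0f;
        for (TextLine line : firstPage) {
            max = Math.max(max, line.fontSize);
        }
        StringBuilder title = new StringBuilder();
        boolean started = false;
        for (TextLine line : firstPage) {
            boolean isLargest = Math.abs(line.fontSize - max) < 0.1f;
            if (isLargest) {
                if (title.length() > 0) {
                    title.append(' ');
                }
                title.append(line.text.trim());
                started = true;
            } else if (started) {
                break; // the run of largest-font lines has ended
            }
        }
        return title.toString();
    }

    /** Groups lines under the most recent heading; text before the first heading is ignored. */
    static Map<String, String> extractSections(List<String> lines) {
        Map<String, String> sections = new LinkedHashMap<>();
        String current = null;
        StringBuilder body = new StringBuilder();
        for (String line : lines) {
            Matcher m = HEADING.matcher(line);
            if (m.matches()) {
                if (current != null) {
                    sections.put(current, body.toString().trim());
                }
                current = m.group(1).toLowerCase(Locale.ROOT);
                body.setLength(0);
            } else if (current != null) {
                body.append(line.trim()).append(' ');
            }
        }
        if (current != null) {
            sections.put(current, body.toString().trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        List<TextLine> page = Arrays.asList(
                new TextLine("Preprint", 9f),
                new TextLine("Rule Based Metadata", 18f),
                new TextLine("Extraction Framework", 18f),
                new TextLine("A. Author", 11f));
        System.out.println(extractTitle(page));
        System.out.println(extractSections(Arrays.asList(
                "Abstract", "We propose a framework.", "References", "[1] A. Author.")));
    }
}
```

Fixed rules of this kind are cheap to evaluate, which fits the speed claim in the abstract, but they depend on papers using the expected headings. A real rule set would need more heading variants and layout checks than this sketch includes.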
