LAME: Layout Aware Metadata Extraction Approach for Research Articles

12/23/2021
by Jongyun Choi, et al.
The volume of academic literature, such as conference papers and journal articles, has increased rapidly worldwide, and research on metadata extraction is ongoing. However, high-performing metadata extraction remains challenging because layout formats differ from one journal publisher to another. To accommodate this diversity of layouts, we propose a novel LAyout-aware Metadata Extraction (LAME) framework with three characteristics: the design of an automatic layout analysis, the construction of a large metadata training set, and the construction of Layout-MetaBERT. We designed the automatic layout analysis using PDFMiner. Based on the layout analysis, a large volume of metadata-separated training data, covering the title, abstract, author names, author affiliations, and keywords, was automatically extracted. Moreover, we constructed Layout-MetaBERT to extract metadata from academic journals with varying layout formats. In experiments, Layout-MetaBERT exhibited robust performance (Macro-F1, 93.27%) in metadata extraction for unseen journals with different layout formats.
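The abstract reports its result as a Macro-F1 score, which averages the per-class F1 over the metadata classes so that rare classes such as keywords weigh as much as frequent ones. The sketch below illustrates the metric only; it is not the authors' evaluation code. The function name `macro_f1`, the label strings, and the toy predictions are all made up for illustration.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p where it did not belong
            fn[g] += 1  # missed an instance of the true class g

    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never matches.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for six text blocks of one article (illustrative data only).
gold = ["title", "author", "abstract", "keywords", "affiliation", "author"]
pred = ["title", "author", "abstract", "abstract", "affiliation", "affiliation"]
score = macro_f1(gold, pred)
```

In this toy case the missed `keywords` block contributes an F1 of zero for its class, which pulls the average down far more than it would under plain accuracy. That sensitivity is why Macro-F1 suits a task whose classes are unevenly sized.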
