Extracting Body Text from Academic PDF Documents for Text Mining

10/23/2020
by Changfeng Yu, et al.

Accurate extraction of body text from PDF-formatted academic documents is essential in text-mining applications that aim for deeper semantic understanding. The objective is to extract the complete sentences of the body text into a txt file while preserving the original sentence flow and paragraph boundaries. Existing tools for extracting text from PDF documents often mix body and nonbody text. We devise and implement a system called PDFBoT that detects multiple-column layouts using a line-sweeping technique, removes nonbody text using computed text features and syntactic tagging in a backward traversal, and aligns the remaining text back into sentences and paragraphs. We show that PDFBoT is highly accurate, with average F1 scores of 0.99 on extracting sentences, 0.96 on extracting paragraphs, and 0.98 on removing text in tables, figures, and charts, over a corpus of PDF documents randomly selected from arXiv.org across multiple academic disciplines.
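
To illustrate the general idea behind line sweeping for column detection, here is a minimal Python sketch. It is not PDFBoT's implementation, which the abstract does not describe in detail. It assumes each text line has already been reduced to a horizontal extent (x0, x1) in page units; the function name find_column_gaps and the min_gap threshold are hypothetical. A vertical line sweeps from left to right, and any sufficiently wide interval that no text line overlaps is reported as a candidate column gutter.

```python
from typing import List, Tuple

def find_column_gaps(
    boxes: List[Tuple[float, float]],
    min_gap: float = 10.0,  # hypothetical threshold, in page units
) -> List[Tuple[float, float]]:
    """Return x-intervals between text lines that no box overlaps.

    boxes holds the (x0, x1) horizontal extent of each text line.
    Page margins (left of the first box, right of the last) are ignored.
    """
    events = []
    for x0, x1 in boxes:
        events.append((x0, 1))   # sweep line enters a box
        events.append((x1, -1))  # sweep line leaves a box
    # At equal x, handle entries before exits so touching boxes leave no gap.
    events.sort(key=lambda e: (e[0], -e[1]))

    gaps = []
    active = 0        # number of boxes currently under the sweep line
    gap_start = None  # x where coverage last dropped to zero
    for x, delta in events:
        if active == 0 and gap_start is not None and x - gap_start >= min_gap:
            gaps.append((gap_start, x))
        active += delta
        if active == 0:
            gap_start = x
    return gaps
```

Under these assumptions, a page whose lines span roughly x = 50..290 and x = 320..560 yields one gutter between 290 and 320, suggesting a two-column layout, while a page whose lines all overlap horizontally yields none. A real system would also need to handle full-width elements such as titles and wide figures, which this sketch does not.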


