
Extracting Body Text from Academic PDF Documents for Text Mining

by Changfeng Yu, et al.

Accurate extraction of body text from PDF-formatted academic documents is essential in text-mining applications that aim for deeper semantic understanding. The objective is to extract complete sentences of the body text into a plain-text (txt) file, preserving the original sentence flow and paragraph boundaries. Existing tools for extracting text from PDF documents often mix body and nonbody text. We devise and implement a system called PDFBoT that detects multiple-column layouts using a line-sweeping technique, removes nonbody text using computed text features and syntactic tagging in a backward traversal, and aligns the remaining text back into sentences and paragraphs. We show that PDFBoT is highly accurate, with average F1 scores of 0.99 on extracting sentences, 0.96 on extracting paragraphs, and 0.98 on removing text in tables, figures, and charts, over a corpus of PDF documents randomly selected from multiple academic disciplines.
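
To make the line-sweeping idea concrete, the following is a minimal sketch of one common way to find column gutters by sweeping a vertical line across text-block bounding boxes. It is not PDFBoT's implementation, which the abstract does not detail. The function names `find_gutters` and `assign_columns`, the `(x0, y0, x1, y1)` box format, and the `min_gap` threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


def find_gutters(boxes: List[Box], min_gap: float = 10.0) -> List[Tuple[float, float]]:
    """Sweep a vertical line left to right and return the x-intervals
    between boxes that no box covers and that are at least min_gap wide."""
    events = []
    for x0, _y0, x1, _y1 in boxes:
        events.append((x0, 1))    # sweep line enters a box
        events.append((x1, -1))   # sweep line leaves a box
    # At equal x, process "enter" before "leave" so touching boxes
    # do not produce a zero-width gap.
    events.sort(key=lambda e: (e[0], -e[1]))

    gutters = []
    active = 0        # number of boxes currently crossed by the sweep line
    gap_start = None  # x where coverage last dropped to zero
    for x, delta in events:
        if delta == 1 and active == 0 and gap_start is not None:
            if x - gap_start >= min_gap:
                gutters.append((gap_start, x))
        active += delta
        if active == 0:
            gap_start = x
    return gutters


def assign_columns(boxes: List[Box], gutters: List[Tuple[float, float]]) -> List[int]:
    """Column index of a box = number of gutter midpoints left of its left edge."""
    mids = [(a + b) / 2 for a, b in gutters]
    return [sum(1 for m in mids if m < x0) for x0, _y0, _x1, _y1 in boxes]


# Two text lines in each of two columns on a hypothetical page.
boxes = [
    (50, 100, 290, 120), (50, 130, 290, 150),
    (320, 100, 560, 120), (320, 130, 560, 150),
]
gutters = find_gutters(boxes)
columns = assign_columns(boxes, gutters)
```

In this sketch a full-width element, such as a title spanning both columns, would cover the gutter and hide it, so a practical system would likely need to sweep within horizontal bands of the page or exclude such blocks first.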



