pdfPapers: shell-script utilities for frequency-based multi-word phrase extraction from PDF documents

01/26/2021
by Pavel Loskot, et al.

Biomedical research relies heavily on processing information from previously published papers. Over the past decade, this has motivated considerable effort to provide tools for text mining and information extraction from PDF documents. The *nix (Unix/Linux) operating systems offer many tools for working with text files; however, very few such tools are available for processing the contents of PDF files. This paper reports our effort to develop shell-script utilities for *nix systems whose core functionality is focused on viewing and searching multiple PDF documents by combining logical and regular expressions, and on enabling more reliable text extraction from PDF documents with subsequent manipulation of the resulting blocks of text. Furthermore, a procedure for extracting the most frequently occurring multi-word phrases was devised and then demonstrated on several scientific papers in the life sciences. Our experiments revealed that the procedure is surprisingly robust both to deficiencies in text extraction and to the choice of scoring function used to rank the phrases by importance or relevance. Keyword relevance is strongly context dependent, word stemming did not provide any recognizable advantage, and stop-words should only be removed from the beginning and the end of phrases. In addition, the developed utilities were used to convert the list of acronyms and the index of a PDF e-book into a large list of biochemical terms that can be exploited in other text mining tasks. All shell scripts and data files are available in a public repository on GitHub. The key lesson learned in this work is that semi-automated methods, which combine the power of algorithms with the capabilities gained from research experience, are the most promising for improving research efficiency.
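To make the phrase-extraction idea concrete, the following is a minimal sketch of frequency-based counting of multi-word phrases in plain text. It is not one of the pdfPapers scripts: the function name `count_phrases`, the short stop-word list, and the simplification of discarding any n-word phrase that begins or ends with a stop-word (rather than trimming it) are all assumptions made for illustration. Stop-words inside a phrase are kept, and sentence boundaries are ignored.

```shell
# Hypothetical helper (not part of pdfPapers): count n-word phrases on stdin.
# Usage: count_phrases N < text.txt
# Prints "count phrase" lines, most frequent first.
count_phrases() {
  n="${1:-2}"
  # Lowercase the text, then put one alphabetic word on each line.
  tr 'A-Z' 'a-z' |
  tr -cs 'a-z' '\n' |
  awk -v n="$n" '
    BEGIN {
      # Tiny illustrative stop-word list (an assumption, not the list used in the paper).
      split("a an and the of in on to for is are", s, " ")
      for (i in s) stop[s[i]] = 1
    }
    NF == 0 { next }                 # skip an empty first line, if any
    {
      w[++k] = $1                    # remember every word in order
      if (k < n) next                # not enough words for a phrase yet
      first = w[k - n + 1]
      last = w[k]
      # Skip phrases that start or end with a stop-word;
      # stop-words in the middle of a phrase are allowed.
      if ((first in stop) || (last in stop)) next
      p = first
      for (i = k - n + 2; i <= k; i++) p = p " " w[i]
      c[p]++
    }
    END { for (p in c) print c[p], p }
  ' |
  sort -k1,1nr -k2                   # by count (descending), then alphabetically
}
```

Assuming the text has first been extracted with a tool such as `pdftotext` from poppler-utils (an assumption here, since the abstract does not name the extraction tool), a typical call might look like `pdftotext paper.pdf - | count_phrases 3 | head`. Ranking by raw frequency is only one possible scoring function; the abstract reports that the ranking is fairly insensitive to this choice.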


