Reducing a Set of Regular Expressions and Analyzing Differences of Domain-specific Statistic Reporting

11/24/2022
by Tobias Kalmbach, et al.

So many scientific publications appear each day that it is impossible to review each one manually. Automatic extraction of key information is therefore desirable. In this paper, we examine STEREO, a tool that uses regular expressions to extract statistics from scientific papers. By adapting an existing regular expression inclusion algorithm to our use case, we reduce the number of regular expressions used in STEREO by about 33.8%. From the condensed rule set, we derive common patterns that can guide the creation of new rules. We also apply STEREO, which was previously trained on the life-sciences and medical domain, to a new scientific domain, Human-Computer Interaction (HCI), and re-evaluate it. Our results indicate that statistics in the HCI domain are similar to those in the medical domain, although a higher percentage of APA-conformant statistics was found in the HCI domain. Additionally, we compare extraction from PDF and LaTeX source files and find LaTeX to be the more reliable source for extraction.
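
To make the idea of regex-based statistic extraction concrete, the following is a minimal sketch in Python. The pattern `APA_T_TEST` and the helper `extract_t_tests` are hypothetical names introduced only for illustration. They are not taken from STEREO's rule set, and the pattern covers only one simple APA-style form of a t-test report.

```python
import re

# Hypothetical pattern for an APA-style t-test report such as
# "t(25) = 2.31, p = .03". This is an illustrative assumption,
# not one of STEREO's actual rules.
APA_T_TEST = re.compile(
    r"\bt\s*\(\s*(?P<df>\d+(?:\.\d+)?)\s*\)"   # degrees of freedom: t(25)
    r"\s*=\s*(?P<t>-?\d*\.?\d+)"               # test statistic: = 2.31
    r"\s*,\s*p\s*(?P<rel>[<>=])"               # relation for the p-value
    r"\s*(?P<p>0?\.\d+)"                       # p-value, with or without leading 0
)


def extract_t_tests(text):
    """Return one dict per matched t-test report found in the text."""
    results = []
    for match in APA_T_TEST.finditer(text):
        results.append({
            "df": float(match.group("df")),
            "t": float(match.group("t")),
            "rel": match.group("rel"),
            "p": float(match.group("p")),
        })
    return results


sample = "Participants were faster with the new layout, t(25) = 2.31, p = .03."
print(extract_t_tests(sample))
```

A rule set of this kind grows quickly as reporting variants accumulate, which is why removing rules whose matches are already covered by a more general rule, as the abstract describes, can pay off.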

