Learning from Uncurated Regular Expressions

06/14/2022
by   Michael J. Mior, et al.

Significant work has been done on learning regular expressions from a set of data values. Depending on the domain, this approach can be very successful. However, learning these expressions takes significant time, and in the presence of dirty data the resulting expressions can become either very complex or inaccurate. Manually writing regular expressions is the usual alternative, but it becomes unattractive when a large number of values must be matched. Instead, we propose learning from a large corpus of manually authored but uncurated regular expressions mined from a public repository. The advantage of this approach is that we can extract salient features from a set of strings with limited feature engineering overhead. Since the set of regular expressions covers a wide range of application domains, we expect them to be widely applicable. To demonstrate the potential effectiveness of our approach, we train a model using the extracted corpus of regular expressions for the task of semantic type classification. While our approach generally yields results that are inferior to the state of the art, our training data is much smaller and simpler, and a closer analysis of the performance results suggests this approach holds significant promise. We also demonstrate the possibility of using uncurated regular expressions for unsupervised learning.
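
To make the idea of using a regular expression corpus as a feature extractor concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how features are built, so the three patterns, the names `compile_corpus` and `featurize`, and the choice of "fraction of values fully matched" as the feature are all assumptions made for illustration.

```python
import re
from typing import List, Sequence

# Hypothetical stand-in for a mined corpus; these patterns are illustrative only.
PATTERNS = [
    r"\d{5}(-\d{4})?",        # resembles a US ZIP code
    r"[^@\s]+@[^@\s]+\.\w+",  # resembles an email address
    r"\d{4}-\d{2}-\d{2}",     # resembles an ISO 8601 date
]


def compile_corpus(patterns: Sequence[str]) -> List[re.Pattern]:
    """Compile each pattern, skipping any that are not valid regexes.

    An uncurated corpus may contain broken patterns, so failures are
    dropped instead of raised.
    """
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error:
            continue
    return compiled


def featurize(values: Sequence[str], regexes: Sequence[re.Pattern]) -> List[float]:
    """Return one feature per regex: the fraction of values it fully matches."""
    if not values:
        return [0.0] * len(regexes)
    return [
        sum(1 for value in values if regex.fullmatch(value)) / len(values)
        for regex in regexes
    ]


if __name__ == "__main__":
    regexes = compile_corpus(PATTERNS)
    column = ["12345", "90210-1234", "abc"]
    print(featurize(column, regexes))
```

Under these assumptions, each column of values becomes a fixed-length vector with one entry per regular expression, which could then be passed to any standard classifier for semantic type classification or to a clustering method in the unsupervised setting.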


