An Effective Algorithm for Learning Single Occurrence Regular Expressions with Interleaving

06/05/2019
by Yeting Li, et al.

A schema offers numerous advantages for processing XML documents. In practice, however, many XML documents are not accompanied by a (valid) schema, which makes schema inference an attractive research problem. The fundamental task in XML schema learning is inferring restricted subclasses of regular expressions. Most previous work either lacks support for interleaving or supports it only in limited form. In this paper, we first propose a new subclass, Single Occurrence Regular Expressions with Interleaving (SOIRE), which has unrestricted support for interleaving. Then, based on single occurrence automata and maximum independent sets, we propose an algorithm, iSOIRE, to infer SOIREs. Finally, we conduct a series of experiments on real datasets to evaluate the effectiveness of our work, comparing it with both current learning algorithms from academia and industrial tools used in practice. The results demonstrate the practicability of SOIRE and the effectiveness of iSOIRE, and show that the inferred expressions are both precise and concise.
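To make two of the notions named in the abstract concrete, the following is a minimal Python sketch, not the iSOIRE algorithm itself. It shows the interleaving (shuffle) of two words, and a simple way to collect the edges of a single occurrence automaton from sample words, assuming the usual construction in which each symbol labels exactly one state and an edge is added for every pair of adjacent symbols. The function names `interleave` and `build_soa_edges` and the markers `SRC` and `SNK` are hypothetical and do not come from the paper.

```python
from typing import Iterable, Set, Tuple

# Hypothetical markers for the automaton's unique source and sink states.
SRC, SNK = "<src>", "<snk>"


def interleave(u: str, v: str) -> Set[str]:
    """Return every word that merges u and v while keeping the
    relative order of symbols within u and within v."""
    if not u:
        return {v}
    if not v:
        return {u}
    # Either the next symbol comes from u, or it comes from v.
    from_u = {u[0] + w for w in interleave(u[1:], v)}
    from_v = {v[0] + w for w in interleave(u, v[1:])}
    return from_u | from_v


def build_soa_edges(samples: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect edges of a single occurrence automaton: one state per
    symbol, one edge per pair of adjacent symbols seen in a sample."""
    edges: Set[Tuple[str, str]] = set()
    for word in samples:
        path = [SRC, *word, SNK]
        edges.update(zip(path, path[1:]))
    return edges


if __name__ == "__main__":
    print(sorted(interleave("ab", "c")))
    print(sorted(build_soa_edges(["abc", "acb"])))
```

For the samples `abc` and `acb`, the collected edges contain both `b -> c` and `c -> b`. A learner without interleaving would have to generalize this cycle with repetition, whereas an expression such as `a(b&c)`, where `&` denotes interleaving, accepts exactly these two words.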


