Practical Study of Deterministic Regular Expressions from Large-scale XML and Schema Data

05/31/2018
by Yeting Li, et al.

Regular expressions are a fundamental concept in computer science and are widely used in applications. This paper focuses on deterministic regular expressions (DREs). Because earlier studies lacked large data sets as evidence, we first harvested a large corpus of real data from the Web and then conducted a practical study of how DREs are used. A distinguishing feature of our work is that the data set, obtained with several data collection strategies we propose, is large relative to the data sets used in previous work. The results show that more than 98% of expressions in Relax NG and more than 56% of expressions from RegExLib are DREs, even though neither Relax NG nor RegExLib imposes a determinism constraint. These observations indicate that DREs are commonly used in practice. As far as we know, we are the first to analyze determinism and the subclasses of DREs for Relax NG and RegExLib and to report these results. We also discuss the data set and its applications. From the original data we derive a DRE data set, which is useful in practice and valuable in its own right. We find that current research on new subclasses of DREs is insufficient, so further study is necessary. Finally, we analyze the referencing relationships among XSDs and define SchemaRank, which can be used in XML Schema design.
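To make the notion of determinism concrete, the following is a minimal sketch of a determinism check. It is not taken from the paper and is not the authors' tooling. It relies on the standard characterization by Brüggemann-Klein and Wood: an expression is deterministic (one-unambiguous) exactly when its Glushkov position automaton is deterministic. The tuple-based syntax tree and the function names `glushkov` and `is_deterministic` are assumptions made for illustration, and only concatenation, alternation, `*`, and `?` are covered (no counting or interleaving).

```python
from itertools import count


def glushkov(node, counter, follow, symbol_of):
    """Return (nullable, first, last) for node; fill follow and symbol_of.

    node is a hypothetical tuple AST: ("sym", "a"), ("cat", l, r),
    ("alt", l, r), ("star", e) or ("opt", e).
    """
    kind = node[0]
    if kind == "sym":
        pos = next(counter)          # each symbol occurrence gets its own position
        symbol_of[pos] = node[1]
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind in ("cat", "alt"):
        n1, f1, l1 = glushkov(node[1], counter, follow, symbol_of)
        n2, f2, l2 = glushkov(node[2], counter, follow, symbol_of)
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for p in l1:                 # last positions of the left part reach the right part
            follow[p] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last
    if kind in ("star", "opt"):
        _, f, l = glushkov(node[1], counter, follow, symbol_of)
        if kind == "star":
            for p in l:              # iteration loops back to the first positions
                follow[p] |= f
        return True, f, l
    raise ValueError(f"unknown node kind: {kind!r}")


def is_deterministic(node):
    """True if no first/follow set holds two positions with the same symbol."""
    follow, symbol_of = {}, {}
    _, first, _ = glushkov(node, count(1), follow, symbol_of)
    for positions in [first, *follow.values()]:
        seen = set()
        for p in positions:
            if symbol_of[p] in seen:
                return False
            seen.add(symbol_of[p])
    return True


a, b = ("sym", "a"), ("sym", "b")
# (a|b)*a : the classic non-deterministic example
nondet = ("cat", ("star", ("alt", a, b)), a)
# b*a(b*a)* : an equivalent deterministic expression
det = ("cat", ("cat", ("star", b), a), ("star", ("cat", ("star", b), a)))
```

Under these assumptions, `is_deterministic(nondet)` is false because two distinct occurrences of `a` can both start a match, while `is_deterministic(det)` is true. A study of the kind described above would additionally need parsers for Relax NG and RegExLib syntax, which this sketch omits.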

