I Introduction
Modern enterprises store their data in a wide range of different systems, including transactional DBMSs, data warehouses, data lakes, spreadsheets, and flat files. Data analysts often need to combine data from these diverse data sets, frequently incorporating external data from even more sources. A key challenge in this setting is finding related data sets that can be combined to answer some question of interest.
As an example, analysts at Merck—a pharmaceutical company—often need to join tables that contain chemical compounds. Unfortunately, there are at least three identifier formats (InChI, InChIKey, and SMILES, shown in 1) used internally in Merck, not to mention additional formats that may be used in external data sources. Because of this diversity of ID formats, a simple text search is not sufficient to find relevant tables—attribute names are different. Indeed, they cannot even perform an approximate search to find similar content as these identifiers are not comparable. Manually building a mapping between the identifiers in the different formats and creating a lookup table is an expensive option.
A better option would be to label the relevant attributes with useful metadata, assign a chemical identifier label to all identifier columns in the table that represent. Unfortunately, manual labeling is also infeasible in an company with large volumes of data: it requires a great deal of time and is error prone as it may miss many tables that contain relevant information, especially when considering external data.
To address this problem, we observe that many relevant attributes in enterprise databases are highly structured, they follow simple syntactical patterns. For example, in 1 the InChi number always starts with the pattern InChI= and the InChiKey is a 14character followed by a hyphen, followed by a 10character followed by another hyphen and an additional character [1]. More common examples of structured attributes are dates, product identifiers, phone numbers, enumerated types (gender, etc), and so on. Often these columns are stored as strings in the database, but if they could be labeled with richer structural information about the format of values, indexing, searching and comparing values, and finding exceptional or outliers values, could be done much more efficiently.
In this paper, we introduce Xsystem, a method to learn and represent syntactic patterns in datasets as data structures called Xtructures. Once Xsystem learns a collection of patterns, analysts can use them to conduct several commonly performed tasks, including: automatic label assignment, where data items are assigned a class by comparing them to a library of known classes (written as regexs or Xtructures); finding syntactically similar content, where learned Xtructures are compared to see if they are similar, and outlier detection, where a learned Xtructure for a single item is compared to other Xtructures to check that its structure is different. These applications share two common requirements: (i) Xtructures must be quickly synthesizable and (ii) Xtructures must be comparable to each other and to regular expressions.
In addition to supporting these requirements, Xsystem must: i) be able to work without human intervention, as neither semiautomatic nor interactive tools scale for large amounts of data; ii) learn syntactic patterns fast, which calls for both an asymptotically efficient model as well as a parallelizable implementation; and iii) be quickly synthesizable and manipulatable given only raw datasets, since this is all that many analysts may be able to initially access. Speed of learning is crucial for real world scenarios, as not all data analysis tasks can cope with stale data.
Other Methods A natural question to ask is why not to use existing methods for information extraction that different communities have been improving over the years? We discuss methods in detail in the related work section, but in general we find that these systems either lack speed, lack autonomy, or make assumptions about the problem setting that are too strong for realistic use. One general method that seems relevant for structured data is inference of regular expressions [2] to represent the structure of each column. This is unfortunately NPhard, due to the expressiveness of the model [3]. We argue, and show through many examples, that regexes’ expressive power is not necessary for syntacticbased discovery applications in databases. This is the key observation that we use to design a lessexpensive, much fastertolearn structure which can express a subset of regular expressions that satisfies most applications. We show that existing regex learning algorithms are impractically slow, taking thousands of seconds to learn a regex over just a few hundred values. Other general methods to extract patterns grow out of the need for identifying important entities within unstructured or semistructured data, such as NLPbased techniques to identify patterns from text or wrapper induction techniques to do the same on webpages built on HTML. None of those techniques are designed for the case of structured data, which has become a big bottleneck in the era of big data, with people placing a variety of structured data into data lakes.
XsystemOur approach learns syntax from examples incrementally. For each example, it exploits the existence of delimiters in known entities to split the problem of extracting the pattern into learning the syntax of each of the tokens separated by those delimiters. The underlying data structure used to learn each token is a branching linear distribution sequence that is equivalent to a Deterministic Acyclic Finite State Automaton (DAFSA), which is asymptotically simpler to learn than minimal Deterministic Finite Automata (DFA), often used in regular expression learning. The learning procedure relies on a branch and merge strategy that allows us to incrementally adapt a prior to new observed examples. This permits us to capture different syntactical structures that appear in the same column. This branch and merge strategy is also at the center of the parallelization approach used in Xsystem.
We evaluate Xsystem on the three applications mentioned above on real datasets ranging from our university’s data warehouse, open government data and a public chemical database. We find that Xsystem can form a syntactic representation of given data much faster than automatic DFA learners, and that we can use it effectively for our target applications.
Before describing the details of our implementation, we now formalize the Xsystem problem statement and requirements.
Ii Motivation and Requirements
In this section, we first restate and elaborate on the applications that motivated us to build Xsystem, and then use those applications to derive a set of requirements, which we use to derive the features of our new system.
Iia Application Scenarios
We focus on three applications of Xsystem that we have identified while working with collaborators in pharmaceutical, telecommunications and data integration applications.
1. Automatic Label Assignment. Automatic label assignment attaches a semantic type (e.g. “chemical compound ID”, “phone number”) to columns in a data set, so that users can understand the content of columns and perform semantic search for similar types of columns. A key observation is that many different semantic types are already available in regex libraries [4], and for important semantic types inside an organization, writing such a regex is relatively straightforward. For example, for the examples of 1.
Given such a table of (regex, semantic label) pairs, the idea in automatic label assignment is to learn a Xtructure for each attribute in the database, and then perform a search for similar regexes in this table, to assign a semantic label to each column in the database. This introduces two key requirements for Xtructures: 1) they must be fast to learn, since we need to infer them for every column in the database, and 2) they must be comparable to regular expressions.
2. Summarization and Attribute Comparison. Once some interesting attributes are identified, data analysts often wish to find other similar attributes across datasets (e.g., to obtain candidates for joining two datasets together.) One way to achieve this would be to compare every pair of attributes in each dataset to each other. However, a naive quadratic implementation would be prohibitively expensive, which has inspired many approaches based on setsimilarity joins and approximate methods (e.g., [5, 6]).
A complementary approach that we advocate is to learn a more compact representation, e.g., a Xtructure for each attribute in a database. We can then compare these representations instead of the raw data, which offers several benefits. First, the Xtructure is interpretable by humans, helping to identify content quickly, while requiring only a fraction of the space used by the original data. For example, in the case of the data of 2 it is possible to represent the CHEMBL id with a single pattern that entirely captures the structures. Second, if the two Xtructures can be compared directly, then that similar content can be found without performing I/O to read data from the source database. Last, the cost of learning the Xtructure is paid only once, and can be reused subsequently for other applications as we are describing in this section.
To be useful for summarization and comparison, Xsystem must learn humanreadable Xtructures, similar to regexes in common programing languages, and Xtructures must be comparable to one another, to permit finding similar content.
3. SyntaxBased Outlier Detection. One longstanding problem in data management is concerned with data quality; in particular, errors occur frequently, whether due to data entry or anomalous values or readings [7].
We observe that by learning the syntactic pattern of an attribute, we can detect many types of errors, particularly those that are syntactic outliers, elements that do not closely the match a learned Xtructure. Consider the example of 2, with real ZIP codes from Boston. The 4th cell value an erroneous ZIP code. In this case, it is possible to detect that it has a different length than other records in the same attribute and does not fit the general syntactic pattern of the column.
To be able to detect syntaxbased outliers, Xtructures must support the concept of a scoring fit, a numeric score capturing how well a value fits a learned Xtructure. Also, Xsystem to be used as an outlier detector, it must not overfit the Xtructure to all the values, or it will not detect outliers. Instead, it must represent the general syntactical pattern and not capture the content of a few outliers.
IiB Summary of Requirements
Based on the previous applications, we can summarize Xsystem’s requirement for Xtructures as follows:
Comparable to Each Other. We need to be able to compare Xtructures to each other. Intuitively, we want the distance between two Xtructures and to approximate some sort of “average distance” between the domains of possible values that range over. This is useful in our summarization and comparison application.
Comparable to DFAs/Regular Expressions. To support label assignment, we need to be able to compare a Xtructures to a regular expression.
Able to Quantify Fit. Not only should Xtructures be comparable, but we should be able to efficiently compute a numeric score that captures the goodness of fit of two Xtructures, or of a Xtructure to a regular expression.
Quick to Learn. Xsystem should be able to learn Xtructures efficiently and in parallel, to support applications that need to compare large numbers of Xtructures to each other or to regular expressions.
IiC Motivation for a New Approach
It may seem that the above requirements could be trivially satisfied by learning regular expressions (DFAs) over each column. However, as described in more detail in our related work (Section VI), regular expression learning [8, 9, 10, 2, 11, 12, 13, 14, 15] is in general an NPcomplete problem, and in practice solutions for finding regular expressions are extremely inefficient.
How inefficient is to learn regex? To build intuition about this inefficiency, we used a stateoftheart regex inference algorithm [14] to learn a regex over a few hundred tuples and found that it took around an hour to complete. Figure 3 shows the speed of learning a Xtructure from data using Xsystem with the state of the art algorithm [14]. Here we show the time to learn a regular expression or a Xtructure
over a column, as the length of the column (in tuples) grows. The genetic algorithm based method is infeasible for our target applications because it takes thousands of seconds to learn a regular expression for a single column, making it impractical to use in even a moderate collection of databases with a few hundred columns. In contrast, the performance of
Xsystem with Xtructures grows sublinearly with the number of tuples, as we will show in subsequent sections.If regular expressions were available, we could use them to solve the application scenarios we showed above. However, because regular expression learning algorithms solve a more complex problem than what is needed for the applications we have identified at a high computational cost, we sought a simpler language that is both efficient to learn and that is sufficient to capture the structure of many database columns.
The Opportunity Fortunately, we have observed that real data in databases is often quite simple, and does not require the full expressivity of DFAs/regular expressions. In particular, most attributes in database have the following properties:
Simple structure. Through the wildcard “*” and “+” operators, regexes allow infinite variability of structure within a domain. In practice, on the MassData dataset (open data from Massachusetts), we found that around 20% of columns are fixed length, over half have only 3 distinct column lengths, more than 85% have average length less than 10, and 99% have average length less than 50. This makes sense because databases are designed to be easy to manipulate and process, and constraining the data formats into wellstructured values helps achieve this goal. Further, many regular expression learning papers focus on learning a minimal regular expressions, but since database columns are already simple, minimality is not a primary concern, especially if it comes at the cost of efficiency.
Consistent structure. The optionality operator in regular expressions allows one to construct concise expressions such as “AB(C)DE.” Instead, the equivalent “”, which separates each pattern into a different branch is simpler to learn. We found 40% attributes of data.gov can be represented by at most 2 global branches, and nearly 100% by at most 8. We show the number of branches required to represent the data of 3 real datasets in 4, confirming the same trend. Again, regular expressions favor expressivity over efficiency, which isn’t necessary.
I removed the previous paragraph where we were pseudoformally introducing Xtructure repr.
In short, regexes are neither necessary (too expressive) nor sufficient (they are too slow) for solving the problem of structure learning addressed in this paper. Instead, as we show, less complex Xtructures can be learned more efficiently while still capturing the structure of real databases.
Iii Xsystem Implementation
address small comments from R2.D3
In this section we introduce the Xtructure model to learn syntactical patterns from structured data.
Iiia The Xtructure Model
The goal of Xsystem is to learn a Xtructure from examples incrementally (tuple by tuple). To do this, we design an architecture that allows us to probabilistically model each example, and thus at any point output the “current” representation. The architecture of Xtructure ( 5) has several layers with distributions at the foundation; tuples are fit into the model by passing them through this layered structure in a welldefined way. The layers are organized hierarchically, with each one taking care of a different aspect of the learning process. We explain each layer’s role next.
The bottom layer in the hierarchy is the symbol layer, which holds a distribution over the ASCII characters that occur at a given position in the input tuples. This permits us to represent a position in a tuple as a character class, an orstatement, a single character, or a wildcard (“.”) based on the distribution. For example, if a series of mm/dd/(yy)yy dates are fed to a Xtructure, the first character will hold a distribution containing only the values or , (since months
) of the year. The second will eventually converge to a uniform distribution over
and thus it will be represented with the character class digit, D. We explain how we decide each representation from later in the paper.The token layer represents sequences of characters from the original tuples, or tokens, obtained by splitting the original tuple according to a set of delimiters, , /, #. The intuition is that delimiters often capture substructure of tuples. Consider the “10/1/2017” date as an example: here the three tokens are separated by /. Each token is represented in a token representation, which is simply a linked list of symbols (from the symbol layer). For dates, the “months” in the date will be a token in the token layer, eventually represented as (0—1)D. When no delimiters are available in the data, the entire string is represented as a single token.
Tokens of different lengths cannot be represented with a single token layer. The next layer in the hierarchy, called branch layer, deals with variablelength data. A branch layer consists of a list of token layers, and can represent an entire tuple. In particular, a branch layer represents a list of words – represented by token structures – interleaved with delimiters. In our running example, we may find dates with two different formats for the year, a 4 and a 2digit one. These two variations will be represented with two different branches in a branch representation.
Each Xtructure has several branch representations to represent attributes with different syntactical patterns, for example, tuples with different lengths. It is common to find dates with many different formats, due to data quality issues, as well as IDs, capitalization typos, etc.
IiiB Learning a Xtructure
Xtructures are adapted after each input tuple is consumed. When it gets a new tuple, Xsystem chooses an existing branch for the tuple, if one exists, or creates a new branch and seeds it with the input. This decision is made based on a measure of scoring fit. The branch representation then segments the input into tokens, , based on a set of delimiters, and splits each token, , into characters, updating the token and symbol layers.
The following sections describe: (1) how we compute scoring fit in Section IIIB1, (2) the branchandmerge algorithm to support multiple branches in Section IIIB2, (3) the approach to tokenizing input tuples and feeding characters to the individual layers in Sections IIIB3 and IIIB4, and (4) an optimization to speed up learning in Section IIIB5.
IiiB1 Fitting Tuples: Scoring Fit
While learning a Xtructure , we must understand how well a tuple, , “fits” into the structure defined by . We introduce a scoring fit measure for this. More formally, given a tuple and a Xtructure, , we define an operation that indicates how far deviates from the pattern represented by . This function is useful to fit new examples, as well as to compare Xtructure, both to itself and to representations learned with other methods.
To build this function, instead of comparing each character in to a corresponding “character” in the representation (which is illdefined, since our model holds a distribution over characters rather than a single character), we look at the characters a symbol layer represents, and assign score for how close each matches the representation in symbol layer . To define , we use GetAsciiClass(c) as the UNIX class (e.g., alphanumeric, white space, etc) of a character , and as the character class of a symbol layer (referred to as max_class in Algorithm 2). We also define to be a boolean indicating whether the layer’s representation is its character class. This decision is based on a test. If its pvalue cannot be represented as an “OR” operation over characters, then,
where are the characters represented in the symbol layer , and the parameter is used to determine how much exact character matches are prioritized compared to matches in class only (i.e two characters). In practice, we set up as a reasonable value for this relative weighting.
We use to propagate the symbol layer scoring fit through a Xtructure’s layers, leading to a general scoring fit of a tuple with respect to the model. In particular, for token representations , branch representations and a modeled representation , the distance of a tuple to is defined by:
that is, the minimum distance of the tuple with one of the branch representations, , of , which is in turn defined as:
where are the tokens of the input tuple, , that are compared with the token structures, , of as follows:
Note the extra term in the last equation used to pad with null characters whichever is shorter between the token structure,
in and the token in the tuple . This ensures Xsystem does not incorrectly penalize smaller valid instances of the underlying finite language, while still creating a new branch in the structure for them. For example, a column that contains several instances of “123” and one instance of “1”, the latter would be padded with 2 null characters.IiiB2 Representing Multiple Branches
In practice, data from the same attribute may contain values with different syntactical patterns. For example, an ID might be a 10digit number, or simply “N/A”. This phenomena inspires Xtructure’s multiple branch representations (that is, why we allow to be dimensional). However, we have no way of knowing a priori how many different patterns are in a set of examples, a Xtructure must somehow manage multiple branches, updating and representing them appropriately.
Given a new input tuple, Xsystem must decide whether to fit it into an existing Xtructure branch, or create a new branch capture the tuple’s syntactical structure. For this, we use the scoring fit. For each input , Xsystem finds the “best matching” branch by doing ; if is below a branching threshold
, the tuple is fit into that branch, otherwise a new “empty” branch is created. The existence of this branching threshold introduces the challenge of how to tune it. To avoid manually tuning such hyperparameter, we introduce an adaptive
branchandmerge technique.BranchandMerge algorithm The algorithm works as follows. We hide the unintuitive and datadependent hyperparameter, and instead expose a maximum branches parameter, that indicates the maximum number of structures that are meant to be represented by a Xtructure (this is in the formal definition). This parameter can be set up based on domain knowledge, or user preference, if an analyst knows there are 3 ways of representing a business entity, he or she can choose 3 as the number of branches, as no more than those are expected to appear in the data.
Given a fixed branching threshold and the maximum number of branches desired by users, Xsystem proceeds as follows: if the number of branches ever exceeds the specified maximum, then we compute a pairwise distance between branches. The two closest branches and are merged – by fitting generated tuples by the subsumed branch into the one subsuming – and the new “branching threshold” is set to .
This adaptive mechanism allows Xsystem to correct for undershot initial thresholds, but not overshot ones, so in practice, the initial branching threshold is set to a small . The entire algorithm, including both picking the best branch and branchandmerge, is shown in further detail in Algorithm 1.
Parallel Learning One advantage of the branchandmerge algorithm is that it facilitates parallel learning. When fitting a model, we can use multiple workers, each one reading disjoint sets of tuples and fitting them independently. This has the benefit of exploiting the parallelism readily available in modern architectures, but leads to more than one representation per attribute. At this point, we can use the branchandmerge algorithm to merge the branches of the different built models, leading to a representation equivalent to the one that a single worker would have learned.
IiiB3 Tokenization and Character Fitting
To update the token layers, the input tuple is split into tokens and then each token is fed to the layers of its corresponding token structure. The tokenizer uses special characters (delimiters) as reference for alignment. The positioninig of these characters on a string is often an indicator of data type. For example, IPv4 addresses blocks are separated by “.”, while dates are usually “/” or “” delimited.
IiiB4 Modeled Representation
During modeling, after receiving a new example and determining the token structures, each token is fed to the layers of its token structure. A symbol layer, as introduced before, holds a distribution of the characters it has seen, and represents them with their character class when is statistically significant (see Algorithm 2). Each layer is modeled as a sampling problem, under the hypothesis that every character within the majority character class is equally likely. A test of independence is then performed, confirming or rejecting this hypothesis; if confirmed, the layer represents itself by its character class (lines 1112 in Algorithm 2
). If the null hypothesis is rejected, then there exists significant bias in the data source that should be captured in the representation, so the layer instead enumerates all fit tuples in an orstatement, in order of decreasing frequency, until a specified “capture percentage” of the distribution is captured. This corresponds to lines 1420 of Algorithm
2. Running this process whenever a new example is encountered ensures we always model a valid Xtructure.IiiB5 Early Stopping
When learning from structured data, it is common for much of the computation time to go to fitting tuples that do not contribute to the final Xtructure. Consider, for example, a long list of wellformatted dates. After a few tuples, the representation we are modeling will reflect the pattern, and will not change as additional tuples are processed.
We can stop
learning when the model has converged and does not change after some number of new tuples are consumed. To do this, we track how much the fit of new tuples changes during fitting. Initially the scores are expected to change a lot, they will decrease and become steady over time – especially when the data is regular. The process of early stopping is inspired by the same application of the Central Limit Theorem as in the representation comparison. The early stopping process is shown in Algorithm
3.This technique allows us to skip large amounts of data while still finding good approximate representations. The method fails when the attribute has many different branches that are seen only later. For this reason, the technique is disabled by default, and should be enabled when the user knows the data is highly regular or randomly shuffled, or if the user only desires a quick insight.
IiiC Tuple Generation and Human Readability
A Xtructure needs to generate tuples that, though not necessarily part of the given examples, conform to the domain of the examples (, formally). This is necessary for comparison, as we see in the next section. Here, we explain how to generate tuples from a Xtructure (IIIC1). Related to the generation of tuples is a string representation of Xtructure which is readable by humans, a useful property to provide an overview of the data to humans which we describe in IIIC2.
IiiC1 Generating Tuples from a Xtructure
To generate a tuple, Xsystem traverses the layers of the Xtructure bottomup. It generates characters through its symbol layers. These are concatenated into tokens by the token layer, which also takes care of interleaving the delimiters as necessary. Finally, tokens are concatenated into branches, and the generator selects randomly the branch that would be chosen to generate output a tuple.
To make sure each symbol layer generates characters leading to tuples that represent the structure well, instead of returning the representation of its character distribution, each symbol layer draws randomly from its corresponding character distribution, producing a string from the symbol layer. For convenience, the compress_layer function of Algorithm 2 returns such representation.
IiiC2 Making a Xtructure Readable
We want to serialize a Xtructure in a way that is easy to read, akin to how regexes map the underlying DFA they represent to a string. The algorithm to achieve this is similar to our tuple generation algorithm, but instead of specific tuples, we want to output the general string representation that is represented by Xtructure.
When traversing a Xtructure’s layers bottomup, we propagate partial representations along the way. First, the symbol layers return either an individual character, a character class (eg. #, \w, etc.), or a group of characters depending on the result of the chisquared test described in the previous section. Then, all the symbol layer representations of a token representation are appended, leading to a token, meaning that for a token representation with layers through , where in the following represents the concatenation operator:
These token representations are then interleaved with the appropriate delimiters (kept during the learning process) to form branch representations, given that:
Finally, this is propagated upwards again, and the representation of an entire Xtructure is simply an of all of its branches:
IiiD Complexity and Expressiveness Analysis
can we relate the results here to the points of 2.3?
We analyze next the complexity of learning a Xtructure, performing branchandmerge, serializing the Xtructure as well as matching new strings. Note that by analyzing our approach’s complexity and expressiveness we can better understand how it fits into the larger picture of techniques for information summarization and extraction.
Complexity: Earlier, we showed that a Xtructure is a DAG where each node represents a character distribution, internally implemented as a set of linked lists of character symbols. This representation supports fitness, comparison, and generation algorithms. Table II shows the time complexity of these algorithms in Xsystem– all algorithms in Xsystem are polynomial in the input size. There are three main algorithms: “Scoring”, which assigns a fit score to a candidate word as a function of how well it fits into an existing Xtructure, “Branch and Merge”, which samples a data column and decides how to contract or split the Xtructure when a new sample is introduced, and “Serialization”, which converts a Xtructure to a humanreadable and regex compatible notation. The “Scoring” algorithm is used for both building a Xtructure, as well as matching a tuple against an existing Xtructure, e.g., for outlier detection.
Expressiveness: The “Scoring” and “Branch and Merge” algorithms combined yield a data structure with the same expressiveness as that of DAFSA. Below we provide proofs of expressiveness Xtructure w.r.t. to regular languages.
Lemma III.1.
A Xtructure can be converted in polynomial time to a DAFSA.
Proof.
Since a Xtructure is a DAG where each node represents a character distribution, a DAFSA that accepts all instances accepted by Xtructure can be trivially built in polynomial time via a BFS traversal of the Xtructure. Nodes either accept a single character, or any character from a “characterclass”. Edges in the Xtructure are transitions in the resulting DAFSA. Also note that this conversion to DAFSA can be done in polynomial time because Xtructure itself is deterministic, e.g. the same string never occupies more than one branch in the Xtructure. ∎
Lemma III.2.
A DAFSA can be converted in polynomial time to a Xtructure.
Proof.
As in above proof: a Xtructure can be trivially built in polynomial time via a BFS traversal of the DAFSA. ∎
Theorem III.3.
Xtructure expressiveness is equivalent to the set of regular languages that can be represented by DAFSA.
Proof.
From the lemmas above, it follows that for every Xtructure there is at least one equivalent DAFSA, and for every DAFSA there is at least one equivalent Xtructure. ∎
Theorem III.4.
Xtructure is equally as expressive as the finite regular languages, and is thus less expressive than DFA.
Proof.
Since Xtructure is equivalent to DAFSA, and DAFSA is less expressive than DFA, it follows that Xtructure is necessarily less expressive than DFA. Specifically, Xtructure cannot minimally represent regular languages that contains cycles. ∎
Note that we are not interested in learning minimal DFA in Xsystem. Indeed, even if we had chosen a data structure that has the same expressiveness as that of DFA (e.g., it allows cycles), there is no polynomial time algorithm guaranteed to produce a DFA of size at most polynomially larger than the smallest consistent DFA using only positive samples [16].
In practice, we also do not need to learn minimal DFAs here because our positive samples are drawn from highly structured data, and instances of each language are finite, e.g., emails, telephone numbers, and chemical identifiers. The expressiveness of DAFSA alone is quite powerful and covers all of our finite languages use cases, while also doing a good job at situations where a dataset attribute is not finite and the user only cares about tuples up to a certain size. For example, assuming a dataset attribute is captured by a small cyclic DFA, but we are only interested in instances of length at most , a DAFSA that represents this finite subset of the original language, and that is at most states larger than the DFA, can be obtained in polynomial time.
Symbol  Definition 

Data column width (max tuple length).  
Number of items sampled from data column.  
Number of branches in the Xtructure.  
Number of nodes in the Xtructure. 
Algorithm  Time Complexity 

Scoring  
Branch and Merge  
Serialization 
Iv Comparing Xtructures
While there exist polynomial time algorithms for checking equivalence of DFAs [17] – which implies that Xtructure can be checked for equality against other DFAs in polynomial time – we still require a distance measure. We want to compare Xtructures, so that we can identify columns with syntactically similar values. We also wish to compare them with regexes, so that we can propagate information associated to the regexes to the columns represented by Xtructure. We explain how we achieve this in this section. We also introduce a technique in IVB to reduce the comparison complexity and allow Xsystem to be used in settings with large numbers of attributes.
Iva Measuring Similarity for Comparison
The comparison operation of Xsystem relies primarily on the scoring fit defined in the previous section and the Central Limit Theorem, as we explain below.
IvA1 Comparing with other Xtructures
We want a syntactic distance function between the structure represented by different Xtructure, such as , that returns a pair of scores between 0 and 1 representing how well the structure of each Xtructure “fits” into the other. Previously, in Section IIIB1, we discussed a scoring fit obtained when fitting a tuple to a Xtructure. We define now the fit of a Xtructure, into as the average scoring fit of the set of tuples represented by that fit .
In general, it is infeasible to generate all possible tuples represented by a Xtructure. Instead, we model
as a distribution for which we want to estimate the mean fit with a certain degree of confidence. This reduces the problem from one of generating all tuples, to one of generating a subset of tuples that will allow us to estimate the mean fit in a statistically significant manner. However, to reliably estimate the mean fit we would need the underlying distribution of the data, which we do not know. We also do not want to make assumptions about this distribution: it will be multimodal at best, and completely irregular at worst.
To address this, we use the central limit theorem. Our idea is to sample the distribution in groups, taking sample averages. These sample averages will approximate a normal distribution around , the desired mean. Thus, in order to compute the fit of in , we generate groups of tuples from and calculate their mean fit into
, as well as the standard deviation of the approximately normal distribution. We use a confidence interval of 95% to estimate the sample size.
IvA2 Comparing with regexes
When comparing the structure represented by a Xtructure, with one represented by a regex, , we also want to obtain a tuple of scoring fits: how well fits and the other way around. As with the comparison process between Xtructures, our approach involves generating tuples from the Xtructure (or regex), and then measuring how well the generated tuples fit the regex (or Xtructure). The difference from the approach in the previous section is that tuple values are binary, either fits or it does not (and vice versa).
To compute the similarity between a Xtructure
a regex, we can calculate the probability of the structure held in a
Xtructure, , fitting a regex, , as follows:where the function generates a tuple from and the function returns 1 if a tuple fits and 0 if it does not. The total number of draws, is chosen through standard application of the CLT, which allows us to treat this as estimation of
, a Bernoulli random variable, and therefore get an approximation within a certain range and confidence interval.
To compute the fitness of with respect to , we use existing libraries that produce strings from existing regular expressions, commonly known as xeger. Using one of these xegerlike tools, we generate tuples from the regular expression and then we apply the same technique in the opposite direction.
We use this approach to compare Xtructure with already existing regular expressions for our automatic label assignment application. We obtain good results that we present in the evaluation section. However, it is worth noting a few limitations of the approach with respect to the XtructureXtructure comparison method.
First, if a regex is too specific the similarity with a nearby structure may be counterintuitively low. For example, the structure of a regex that represents exactly ”ABCD” will have a low similarity to a Xtructure’s structure that represents ”ABCE”, while this would not be the case if the two structures to be compared would be represented by Xtructures.
Second, due to our need to generate tuples from the regular expression, the regex must be finite, so wildcard characters are not allowed. Although seemingly limiting, this is not a great disadvantage, as highly structured tuples will tend to lack wildcard characters – which indicates a lack of structure.
Why not compare regexes with the original data directly? A natural question is why do we compare a Xtructure with a regex instead of comparing the original data directly to the regex. There are three key advantages to our approach. First, it is easy to sample from a Xtructure, as it already represents the branches in the underlying data. The alternative would be to perform expensive random sampling on the data directly, which is difficult if we want to sample from all the possible syntactical variations. Second, sampling from Xtructure involves generating tuples inmemory and feeding them in streaming to the Xtructure, as opposed to accessing and reading data from a data source. This is especially beneficial because we need to repeate this operation every time a new regex is added to the library, which happens often when multiple analysts participate in the process. Last, it is more convenient to compare the Xtructure learned by Xsystem to the regexes as comparisons can naturally be parallelized, and once the Xtructure is learned, it is readily available to be used with other applications.
IvB Efficient Large Scale Comparison
Recall that one of our applications is to find which attributes are syntactically similar. Naively, this entails performing an allpairs comparison of Xtructure, an operation that becomes prohibitively expensive in settings with large numbers of attributes. We rely on an approximate technique based on localitysensitive hashing (LSH) [18] with minhash signatures.
A primer on LSH and minhashTo reduce the complexity of allpairs comparison from to we can model the problem as one of approximate nearest neighbors (ANN). LSH solves ANN by using a family of hash functions that bucket items – Xtructure of attributes in our case – into the same bucket when these are similar according to some similarity metric. When the using Jaccard similarity, an effective method is minhash. [19]. With the minhash method, the elements of a set are hashed with different hash functions, and the minimum hash value for each function is chosen, leading to the signature of the element. The core insight of minhash is that the probability of two minhashes being similar equals the Jaccard similarity.
Adapting LSH with minhash to Xtructure is challenging because we do not have sets of elements, but Xtructures that can generate them. The Xtructure, however, does not generate sets of tuples deterministically, and the space of tuples it represents can be very large, making it difficult to generate good minhash signatures. In addition, instead of estimating the syntactic similarity of Xtructure we would be just estimating the similarity of the sets of tuples they generate, which is not what we want. For this method to work, we need a way of generating minhash signatures from Xtructures deterministically and in a way that captures the syntactic features learned during the building process.
We solve this by generating triples of the form (character, last_hinge, index). The first element represents the character or character class, the second one is used to determine the token of which the character is part, and the last one is the position of the character within the token. Codifying all this information in triples preserves the structural information in a way that allows us to still employ minhash. For example, for the string AB;CD, we would generate the set (A,0,0),(B,0,1),(C,1,0),(D,1,1)). With the set available, we then use minhash to obtain a signature.
In our evaluation we show that this method greatly reduces the comparison runtime, with a minor reduction in accuracy.
V Evaluation
R1.D2.3  About XtructureXtructure/XtructureRegex similarity measurements, the number of generated tuples and the time cost of each XtructureTuple similarity meansurement are not provided
R2.D4  lots of other small things here
In this section, we look at the performance of Xsystem and study how it helps address our motivating applications. Using a range of real datasets and workloads we (1) study how Xsystem can propagate labels from annotated regexes to columns in the datasets (VA); (2) use Xsystem to learn Xtructures on columns of a dataset, and use these Xtructures to identify syntactically similar content (VB); and (3) use Xsystem to detect syntactical outliers from real data (VC). We also conduct a series of microbenchmarks to understand the performance of Xtructure (VD).
Datasets and setup We use the following datasets: i) university data warehouse (DHW) which consists of 161 tables and 1690 attributes with information about departments within the university; ii) ChEMBL (CHE), a public chemical database [20] with 70 tables and 461 attributes; iii) data.gov (GOV), US open government data, consisting of 2250 CSV files; and iv) MassData (MAS), the open government data from Massachusetts, from which we use 10 attributes for our outlierdetection experiment. For the outlierdetection experiments we also use the KDDCUP99 and Forest Cover datasets [21], which are standard datasets used in outlier detection. Generally, we found that the results were robust to variation in the configuration parameters of Xtructure; unless otherwise specified we set maximum branches to 3, the branching threshold to 0.1, and the capture threshold to 85%. For all the singlethreaded experiments (all except as indicated), We use a computer with a 1.7GHz Intel Core i7 and 8GB RAM.
Va Automatic Label Assignment
To automatically label columns, we need (regex, label) pairs that associate meaningful labels to the syntactic patterns described by the regexes. Given such a library (which works as well as ground truth), we can use Xsystem to learn syntactic patterns for each column in the database and compare these patterns with the regexes in the library. Then, when we find a match, we assign the label to the column. The quality of this application depends on the quality of our comparison technique, which we evaluate here.
To obtain the library of (regex, labels), we manually assigned (regex, label) pairs to more than 4,000 attributes from DWH, CHEM and GOV. The specific number of attributes with assigned labels is shown in the “# total attrs.” column of table III. The regexes are drawn from regexlib.com [4], which has a collection of generic patterns. We also added domainspecific regexes for chemical datasets. In both cases we choose the most specific regex possible. For example, for an attribute containing even numbers, we would use “\d\d(02468)” rather than “\d\d\d”.
We used Xsystem to learn the Xtructures for the 4000+ attributes and searched for the nearest regex in the regex library, using the algorithm of section VB. We compared this nearest regex to the ground truth regex we manually associated with each attribute. Table III shows that we find over 94% of correct matches for the three datasets we use. This means that we can automatically assign labels to 94% of the data, which vastly reduces the human effort that would otherwise be necessary.
Dataset  total attrs.  correct matches  % matches 

DWH  1504  1417  94% 
CHE  307  294  95.5% 
GOV  2476  2355  94.9% 
Not all matches are equally useful. For example, we find matches of columns to both InchI numbers and keys as well as to SMILES, and both of are annotated with chemical id. This vastly improves the discoverability of these attributes, helping analysts with their tasks. In other cases, the match is with a low specificity regex such as “strings” or “numbers”, which although correct is not insightful. This is an artifact of the quality of the (regex, label) pairs we had available. In the enterprise scenario, we expect registries of regexes built by domain experts to be of high quality, therefore leading to good quality label annotation of the data.
In summary, this experiment shows that Xsystem is able to propagate labels to attributes for a wide range of attribute formats when a library of (regex, label) is available.
VB Summarization and Comparison
In this experiment, our goal is to use Xsystem to find pairs of similar attributes in a large dataset; such columns often represent duplicates, or possible identifiers that can be used in joins. For this application, we obtained groundtruth data consisting of pairs of similar attributes from CHE. We collect all attributes within the dataset whose name contains “id” (eg. “tid,” “cell_id,” “tax_id”); attributes were removed and tuples shuffled in random order; a volunteer labeled pairs of these shuffled nameless attributes as syntactically similar, or different. We obtained labels for about 1000 pairs of attributes. We learn Xtructures for each attribute.
We evaluate the effectiveness of Xsystem at finding similar pairs. First we perform an allpairs comparison between the learned Xtructures using the method described in IVA1. This is
but is an intuitive method, useful when the number of attributes is small, or when we want to quickly find all IDs in a database that are syntactically similar to one preselected column ID. In this experiment, the method labels a pair of columns as similar when their similarity is above a given threshold, and then we measure precision and recall of the results, which we show as the “All Pairs” line in
6. The figure shows a good accuracy, with the method reaching an F1 score of around 0.82, and maintaining constant high precision until a recall of about 0.8.Fast Comparison As we have discussed, when the number of attributes is in the thousands—which is often the case in large enterprises—an allpairs comparison becomes too expensive. To resolve this, we implemented a variation of our method which uses the “set signature” approach described in Section IVB. When using this approach, Xtructures are clustered based on the approximate Jaccard distance between their signature sets. These clusters were then translated into pair labellings, giving an time algorithm. The precision and recall results for the same experiment using this method is shown on the “MinHash LSH” line of 6. The figure shows that the quality of this alternative method is in fact similar to the allpairs one, with the curve shapes looking similar. The slight irregularities in the curve (lack of smoothness) at high recalls are due to the clusterbased nature of the LSH method, rather than direct comparison of each pair of attributes. This makes sense because since we must pregenerate strings to generate the MinHash signature, we make sure the strings uniformly represent the underlying data, therefore increasing the signature quality. We further explore the details of the performance tradeoff of these two methods in the microbenchmark in section VD.
Qualitative Analysis When the techniques yield errors, we found them to be quite intuitive. For example, one common error we found was due to irregularities in the data, such as two similar attributes not being detected because one used “nan” to denote missing data, while the other used “1”. Another common kind of error came from attributes with diverse representations and implicit semantic meaning. For example, a human may label two attributes containing variable length decimal numbers as different if the mean or standard deviation of the numbers is different, which, in some cases Xsystem is not able to detect, yielding false positives.
VC SyntaxBased Outlier Detection
Next, we evaluate Xsystem’s ability to detect outliers, or anomalies, within single attributes in a dataset. For this application, we use both the MAS dataset (for which ground truth was manually collected through volunteers), and several “benchmark” datasets in the field of outlier detection, namely the KDDCUP 1999 intrusion detection dataset and the Forest Cover dataset, both obtained from the UCI Dataset Repository [21]. For quantitative analysis, we use three of the outlier types from KDDCUP, as well as the Forest Cover dataset; we then utilize the manually labeled MAS for qualitative discussion.
We learn a Xtructure per attribute from a subset of the tuples (with outliers present). We then freeze the Xtructure and feed it new tuples, obtaining a fitness score which we use to find outliers. The score is transformed into an outlier score using a weighted combination of the scoring fits (section IIIB1) of each branch. We mark outlier scores that exceed an outlier threshold. We present the resulting PR curves in 7.
The figure shows the precision and recall for different values of the outlier threshold. The results show nearperfect performance on all four of the large datasets. Since Xsystem excels with large quantities of structured data, we next perform a qualitative analysis of outlier detection using MAS, a smaller but more complex real dataset where outlier marking can actually be quite subjective; this allows us to identify the areas where Xsystem has the most difficulty.
Qualitative Analysis on MAS dataset Many of the errors made by Xsystem are ambiguous to humans; in particular, the majority of the mistakes made were in an attribute representing street address suffixes (ST, BL, AV, etc). The source of ambiguity is lowerfrequency, but still valid street suffixes, such as “BL” for boulevard, or “PL” for place, and whether or not Xsystem marked these as outlieries is simply a function of the aforementioned outlier threshold. On the positive side, the system found outliers that were indeed errors in the data, such as a ZIP code in Boston with more than 8 digits (the standard is 5 digits), or the tuple MIDNIGHT among tuples representing hours as digits.
VD Microbenchmarks


Learning Speed. Properties such as number of tuples, number of delimiters per tuple and pattern heterogeneity affect the performance of Xsystem. To measure these effects, we generated synthetic data with varying properties and then we ran Xsystem on the data. 8(a)(a, b and c) shows the running times averaged 10 times, with median, 95th and 99th percentile.
In the first experiment (a), we use a fixed length data types (country currency codes) and vary the number of tuples in the input dataset. In the second experiment (b), we measure the impact of the number of hinges, which has an effect on the number of tokens that Xsystem must maintain. Here, we create input datasets with variable number of hinges by concatenating YYYYMMDD formatted dates, and fixed the number of tuples to 1000. Finally, in the third experiment (c), we vary attribute value length using datasets with 1000 tuples and mixed data types, which has an effect on the total number of branches that per Xtructure that Xsystem maintains.
Xsystem’s performance in all 3 experiments grows linearly with the variable of interest. Although absolute numbers are higher in the third experiment, the overall runtimes are still small, with 99th percentile learning times below ten seconds in all cases, and median generally times less than a second.
Parallel Scalability. To understand the parallel scaling of Xsystem, we generated data (about 20000 alphanumeric identifiers) and learned a Xtructure using a different number of cores. We use Xsystem with set to 7 and measure the time the learning process takes. Since the data is highly regular and Xsystem only consumes a few samples before acheiving 95% confidence and stopping early (as described in section in IIIB5), we disable early stopping in order to accurately demonstrate the effects of parallelization.
8(a) (d) shows our results. As expected, adding parallelism reduces the total runtime up to the maximum number of hardware cores available in the experimental machine. The system does not scale perfectly linearly after 8 cores due to overheads during the merging stage of our algorithm, which could further be reduced through optimizations, including hierarchical parallelization of the merging operation itself.
AllPairs vs MinHash Comparison. Next we seek to understand the difference in running time between the allpairs and MinHash LSH methods for Xtructure comparison, as was mentioned in attribute comparison experiment in VB. The experiment shows that in terms of results the two methods perform very similarly in terms of accuracy on a real dataset; in terms of runtime, we hypothesized that allpairs method would be quadratic in the number of attributes being compared pairwise, whereas MinHash LSH should be linear. However, we also expected there to be lower overhead in the allpairs method for lower numbers of attributes. To verify this, we generate datasets of uniform column length, but with varying numbers of columns. Xtructures are learned for each column (untimed), and then every pair of columns is compared using both the allpairs method and the MinHash LSH method. As 7(a) depicts, for up to about 10 columns (which corresponds to on the order of 100 comparisons), the allpairs method outperforms MinHash LSH due to its low overhead, however for larger numbers of attributes, the MinHash LSH method is superior in that it scales linearly.
Invariance to Tuple Ordering. Finally, we wish to measure the invariance of Xtructure to the random shuffling of tuples. To do this, we take three different synthetic attributes of various complexities, representing IP Address, Title, and Latin Word. Each attribute contains 1000 tuples, and these are shuffled in 20 distinct ways. Table IV indicates the variation across the unique shufflings, both in fitting time, and in the “fitness score” against the source column. The results show robustness against bad orderings of tuples within an attribute.
IP Addresses  Titles  Latin Words  

Example  159.112.55.237  Dr.  cupiditate 
Mean (ms/line)  0.27  0.23  1.0 
Stdev (ms/line)  0.07  0.05  0.3 
Mean score  0.18  0.25  0.41 
Stdev score  0.02  0.06  0.12 
Vi Related Work
In this section we discuss our contributions in the context of several techniques and research areas related to Xsystem.
Information Extraction Xsystem is related to the vast literature of information extraction (IE) in that it extracts a structural representation from data. Most IE techniques have been proposed to extract structured from totally unstructured data, such as text, or semistructured data, such as XML and HTML. In addition, most of those techniques require variable amount of human input. Our approach differs in that it must work automatically and it operates on structured data, producing one succint pattern that represents the syntactic structured of a collection of input strings. We think of Xsystem then, as a technique related to IE that complements the existing techniques and achieves good performance in important applications to large enterprises.
Regular expression inference This is the set of techniques most directly related to Xsystem; they can be used to extract syntactical patterns from collections of strings. The most recent work uses multiobjective optimization and aggressive space pruning to reduce the running time [13, 14] of the inference process. Performance is still an issue for the method to be used in enterprise settings as their evaluation shows—more than 40 min for learning a dataset with 500 entries with 32 threads. In contrast, we have advocated and demonstrated the better fit of Xsystem, which reduces the unnecessary expressiveness of regular expressions to gain in performance. Xtructure learns patterns over a comparable data set on a singlecore, singlethread in significantly less time.
Other methods can be divided into whether negative examples are required or not. Those that require negative examples are rarely suitable in enterprise settings. Out of systems that only require positive examples, [15], [8] and [2] are the most relevant. With [15] we share our treatment of input characters as their character class (referred to as token class in their case) to produce a higher level abstraction of input data. Their method learns a cyclic DFA, while we have demonstrated that this expressiveness is not necessary, and propose a more efficient learning algorithm. Lastly, ReLIE [8] requires example regular expressions, that are then further refined. We differ in that we operate without human input. Unlike all this work, we focus on: i) designing Xsystem to capture syntactical patterns in databases, and not to solve the general – and more complex – problem of learning regular expressions for infinite languages; ii) support efficient comparison of the learned patterns, which we have shown helps in identifying syntactically similar content, automatically labelling data, and identifying syntactic outliers.
Program SynthesisProgram synthesis based methods have seen a surge in popularity [12, 22, 23]. Unlike Xsystem, their goal is typically to operate and transform data, for example for data cleaning. This means that the complexity of the structured they need to build and maintain internally is higher than that of Xsystem. For example, BlinkFill [24] must build an InputDataGraph to then transform the data that is more expensive to build than Xsystem, and unnecessary for our goal. Other techniques, such as [12, 22] require negative examples and differ again from our automatic technique.
Other approaches Last, approaches such as SystemT [25] assume a humanintheloop, infeasible in scenarios that require unattended operation. Potter’s Wheel [26] relies on dictionaries of predefined structures to identify data errors and perform data transformations. Although we have demonstrated that Xsystem can be efficiently used to detect syntactic outliers, we do not share goals and our methods to capture such patterns are different: both techniques are orthogonal.
Vii Conclusions
We presented Xsystem, a method for learning syntactic structure of data as Xtructure. While patterns in Xsystem are less expressive than regexes, they are orders of magnitude faster to learn, making it possible to learn patterns for thousands of attributes which commonly occur in large databases. Furthermore, Xsystem performs well at a number of database tasks, including automatic label assignment, column summarization and comparison as well as outlier detection.
References
 [1] S. Heller, A. McNaught et al., “InChI  the worldwide chemical structure identifier standard,” Journal of Cheminformatics, 2013.
 [2] H. Fernau, “Algorithms for Learning Regular Expressions from Positive Data,” Information and Computation, 2009.
 [3] E. M. Gold, “Complexity of automaton identification from given data,” Information and Control, 1978.
 [4] regexlib, “Regular expression library,” http://www.regexlib.com, 2016, [accessed 30 Dec 2016].
 [5] A. Arasu et al., “Efficient Exact Setsimilarity Joins,” in VLDB, 2006.
 [6] V. Satuluri and S. Parthasarathy, “Bayesian Locality Sensitive Hashing for Fast Similarity Search,” VLDB, 2012.
 [7] Z. Abedjan, X. Chu et al., “Detecting Data Errors: Where Are We and What Needs to Be Done?” VLDB, 2016.
 [8] Y. Li, R. Krishnamurthy et al., “Regular Expression Learning for Information Extraction,” in EMNLP, 2008.
 [9] D. Angluin, “Learning regular sets from queries and counterexamples,” Inf. Comput., 1987.

[10]
B. D. Dunay, F. E. Petry et al.
, “Regular language induction with genetic programming,” in
WCCI, 1994.  [11] A. Brāzma, “Efficient Identification of Regular Expressions from Representative Examples,” in COLT, 1993.
 [12] M. Lee, S. So et al., “Synthesizing Regular Expressions from Examples for Introductory Automata Assignments,” in GPCE, 2016.
 [13] A. Bartoli, G. Davanzo et al., “Automatic Synthesis of Regular Expressions from Examples,” IEEE Computer Society Press, 2014.
 [14] A. Bartoli, A. De Lorenzo et al., “Inference of Regular Expressions for Text Extraction from Examples,” IEEE TKDE, 2016.
 [15] F. Brauer, R. Rieger et al., “Enabling Information Extraction by Inference of Regular Expressions from Sample Entities,” in CIKM, 2011.
 [16] L. Pitt et al., “The Minimum Consistent DFA Problem Cannot Be Approximated Within Any Polynomial,” J. ACM, 1993.

[17]
M. Sipser,
Introduction to the Theory of Computation
, 1st ed. International Thomson Publishing, 1996.  [18] A. Gionis, P. Indyk et al., “Similarity Search in High Dimensions via Hashing,” in VLDB, 1999.
 [19] A. Broder, “On the Resemblance and Containment of Documents,” in SEQUENCES, 1997.
 [20] A. Gaulton, L. J. Bellis et al., “ChEMBL: a largescale bioactivity database for drug discovery,” Nucleic Acids Research, 2012.

[21]
M. Lichman, “UCI machine learning repository,” 2013.
 [22] J. K. Feser, S. Chaudhuri et al., “Synthesizing Data Structure Transformations from Inputoutput Examples,” PLDI, 2015.
 [23] V. Le and S. Gulwani, “FlashExtract: A Framework for Data Extraction by Examples,” in PLDI, 2014.
 [24] R. Singh, “Blinkfill: Semisupervised programming by example for syntactic string transformations,” PVLDB, 2016.
 [25] R. Krishnamurthy et al., “Systemt: A system for declarative information extraction,” SIGMOD Rec., Mar. 2009.
 [26] V. Raman and J. M. Hellerstein, “Potter’s Wheel: An Interactive Data Cleaning System,” in VLDB, 2001.