1. Introduction
Complex data pipelines and data flows are increasingly common in today’s BI, ETL and ML systems (e.g., in Tableau (tableauflow), Power BI (powerbiflow), Amazon Glue (amazonworkflow), Informatica (Informatica), Azure ML (azuremlpipelines), etc.). These pipelines typically recur regularly (e.g., daily), as BI reports need to be refreshed regularly (dayal2009data; schelter2018automating), data warehouses need to be updated frequently (vassiliadis2009near), and ML models need to be retrained continuously (to prevent model degradation) (breck2019data; polyzotis2017data).
However, it is widely recognized (e.g., (breck2019data; hynes2017data; polyzotis2017data; schelter2018automating; schelter2019unit; schelter2019differential)) that in recurring production data pipelines, upstream data feeds can often change over time in unexpected ways. For instance, data columns can be added to or removed from upstream data, creating schema-drift (breck2019data; polyzotis2017data; schelter2018automating). Similarly, data-drift (polyzotis2017data) is also common, as the formatting standard of data values can change silently (e.g., from “en-us” to “en-US”, as reported in (polyzotis2017data)); and invalid values can also creep in (e.g., “en-99” for an unknown locale).
Such schema-drift and data-drift can lead to quality issues in downstream applications that are both hard to detect (e.g., modest model degradation due to unseen data (polyzotis2017data)) and hard to debug (finding root-causes in complex pipelines can be difficult). These silent failures require substantial human effort to resolve, increasing the cost of running production data pipelines and decreasing the effectiveness of data-driven systems (breck2019data; polyzotis2017data; schelter2018automating; schelter2019unit).
Data Validation by Declarative Constraints.
Large tech companies operating large numbers of complex pipelines were the first to recognize the need to catch data quality issues early in the pipelines. In response, they pioneered a number of “data validation” tools, including Google’s TensorFlow Data Validation (TFDV) (breck2019data; TFDV) and Amazon’s Deequ (Deequ; schelter2018automating), etc. These tools provide easy-to-use domain-specific languages (DSLs) for developers and data engineers to write declarative data constraints that describe how “normal” data should look in the pipelines, so that any unexpected deviation introduced over time can be caught early as it arises. Figure 1 shows an example code snippet (from https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) from Amazon Deequ (schelter2018automating) that specifies data validation constraints on a table. In this case, it is declared that values in the column “review_id” need to be unique, column “marketplace” is complete (with no NULLs), values in “marketplace” need to come from a fixed dictionary {“US”, “UK”, “DE”, “JP”, “FR”}, etc. Such constraints are used to validate data arriving in the future, raising alerts if violations are detected. Google’s TFDV uses a similar declarative approach with a different syntax.
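To make the declarative style concrete, the Figure 1 constraints can be sketched in plain Python. Deequ’s actual DSL is in Scala; the `Check` class and method names below are our own illustration of the style, not Deequ’s or TFDV’s real API:

```python
class Check:
    """A hypothetical declarative constraint builder, mimicking the chained
    style of Deequ's Scala DSL (class and method names are illustrative)."""
    def __init__(self):
        self.assertions = []  # list of (column name, predicate over its values)

    def is_unique(self, col):
        self.assertions.append((col, lambda vals: len(vals) == len(set(vals))))
        return self

    def is_complete(self, col):
        self.assertions.append((col, lambda vals: all(v is not None for v in vals)))
        return self

    def is_contained_in(self, col, allowed):
        allowed = set(allowed)
        self.assertions.append((col, lambda vals: all(v in allowed for v in vals)))
        return self

    def run(self, table):
        """Validate a dict-of-columns table; return the column name for each
        violated assertion (a name may repeat if several checks fail)."""
        return [col for col, ok in self.assertions if not ok(table[col])]

# Constraints from Figure 1: review_id unique, marketplace complete and
# drawn from a fixed dictionary.
check = (Check()
         .is_unique("review_id")
         .is_complete("marketplace")
         .is_contained_in("marketplace", {"US", "UK", "DE", "JP", "FR"}))
```

Run against a future batch containing a drifted value such as marketplace “XX”, `check.run(...)` would flag the “marketplace” column, which is exactly the alerting behavior described above.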
These approaches allow users to provide high-level specifications of data constraints, and are a significant improvement over low-level assertions used in ad-hoc validation scripts (which are hard to program and maintain) (breck2019data; TFDV; Deequ; schelter2018automating; swami2020data).
Automate Data Validation using Inferred Constraints. While declarative data validation clearly improves over ad-hoc scripts, and is beneficial as reported by both Google and Amazon (breck2019data; schelter2018automating), the need for data engineers to manually write data constraints one-column-at-a-time (e.g., in Figure 1) still makes it time-consuming and hard to scale. This is especially true for complex production data, where a single data feed can have hundreds of columns requiring substantial manual work. As a result, experts today can only afford to write validation rules for part of their data (e.g., important columns). Automating the generation of data validation constraints has become an important problem.
The main technical challenge here is that validation rules have to be inferred using data values currently observed in a pipeline, but the inferred rules have to apply to future data that cannot yet be observed (analogous to training-data/testing-data in ML settings, where test-data cannot be observed at training time). Ensuring that validation rules generalize to unseen future data is thus critical.
We should note that existing solutions already make efforts in this direction – both Google TFDV and Amazon Deequ can scan an existing data feed and automatically suggest validation constraints for human engineers to inspect and approve. Our analysis suggests that the capabilities of these existing solutions are still rudimentary, especially on string-valued data columns. For instance, for the example in Figure 2(a) with date strings in Mar 2019, TFDV would infer a validation rule requiring all future values in this column to come from the fixed dictionary of existing values: {“Mar 01 2019”, … “Mar 30 2019”}. (This was initially tested on TFDV version 0.15.0, the latest version available in early 2020; we tested it again at the time of publication in March 2021 using the latest TFDV version 0.28.0, and observed the same behavior for this test case.) This is too restrictive for data validation, as it can trigger false-alarms on future data (e.g., values like “Apr 01 2019”). Accordingly, the TFDV documentation does not recommend that suggested rules be used directly, and strongly advises developers to “review the inferred rule and refine it as needed” (https://www.tensorflow.org/tfx/data_validation/get_started).
When evaluated on production data (without human intervention), we find TFDV produces false-alarms on over 90% of string-valued columns. Amazon’s Deequ considers distributions of values but still falls short, triggering false-alarms on over 20% of columns. Given that millions of data columns are processed daily in production systems, these false-positive rates translate to millions of false-alarms, which is not adequate for automated applications.
AutoValidate string-valued data using patterns. In this work, we aim to infer accurate data-validation rules for string-valued data, using regex-like patterns. We perform an in-depth analysis of production data in real pipelines from a large MapReduce-like system used at Microsoft (patel2019big; zhou2012scope), which hosts over 100K production jobs each day and powers popular products like Bing Search and Office. We crawl a sample of 7M columns from the data lake hosting these pipelines and find string-valued columns to be prevalent, accounting for 75% of the columns sampled.
Figure 3 shows 7 example string-valued columns from the crawl. A key characteristic common to many such columns is that they have homogeneous and machine-generated data values, which exhibit distinct “patterns” that encode specific semantic meanings. In our examples, these include knowledge-base entity-id (BingEntity) (used by Bing Search), online-ads delivery status (used by search ads), timestamps in proprietary formats (used by search ads), etc. We note that these are not one-off patterns occurring just once or twice – each pattern in Figure 3 shows up in at least 5000 data columns in our crawl, suggesting that these are widely-used concepts. Similar data are likely common in other industries and domains (e.g., for pharmaceutical or financial companies (stonebraker2018data)).
In this work, we focus on these string-valued columns that have homogeneous, machine-generated data values. Specifically, if we could infer suitable patterns to describe the underlying “domain” of the data (defined as the space of all valid values), we could use such patterns as validation rules against data arriving in the future. For instance, the pattern “<letter>{3} <digit>{2} <digit>{4}” can be used to validate the column in Figure 2(a), and “<digit>+/<digit>{2}/<digit>{4} <digit>+:<digit>{2}:<digit>{2} <letter>{2}” the column in Figure 2(b).
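Validation with such a pattern amounts to matching each incoming value against the pattern, compiled into a regular expression. A minimal sketch (the `pattern_to_regex` translation below covers only the tokens used in this example and is our own, not part of any tool):

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Translate <letter>/<digit> tokens into an anchored regex
    (handles only the token types used in this example)."""
    body = (pattern
            .replace("<letter>", "[A-Za-z]")
            .replace("<digit>", "[0-9]"))
    return "^" + body + "$"

# "<letter>{3} <digit>{2} <digit>{4}" validates the date strings of Figure 2(a)
date_re = re.compile(pattern_to_regex("<letter>{3} <digit>{2} <digit>{4}"))
```

Observed values like “Mar 01 2019” match, unseen-but-valid values like “Apr 01 2019” also match, while drifted formats like “2019-03-01” are flagged, which is precisely the generalization behavior a validation rule needs.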
Our analysis on a random sample of 1000 columns from our production data lake suggests that around 67% of string-valued columns have homogeneous, machine-generated values (like those shown in Figure 3), whose underlying data-domains are amenable to pattern-based representations (the focus of this work). The remaining 33% of columns have natural-language (NL) content (e.g., company names, department names, etc.), for which pattern-based approaches would be less well-suited. We should emphasize that in large data lakes and production data pipelines, machine-generated data will likely dominate (machines are, after all, more productive at churning out large volumes of data than humans producing NL data), making our proposed technique widely applicable.
While on the surface this looks similar to prior work on pattern-profiling (e.g., Potter’s Wheel (raman2001potter) and PADS (fisher2005pads)), which also learn patterns from values, we highlight the key differences in their objectives that make the two entirely different problems.
Data Profiling vs. Data Validation.
In pattern-based data profiling, the main objective is to “summarize” a large number of values in a column C, so that users can quickly understand what content is in the column without needing to scroll/inspect every value. As such, the key criterion is to select patterns that can succinctly describe the given data values in C only, without needing to consider values not present in C.
For example, existing pattern-profiling methods like Potter’s Wheel (raman2001potter) would correctly generate a desirable pattern “Mar <digit>{2} 2019” for the column in Figure 2(a), which is valuable from a pattern-profiling perspective as it succinctly describes the values in the column. However, this pattern is not suitable for data-validation, as it is overly-restrictive and would trigger false-alarms for values like “Apr 01 2019” arriving in the future. A more appropriate data-validation pattern would instead be “<letter>{3} <digit>{2} <digit>{4}”. Similarly, for the column in Figure 2(b), the figure shows that a pattern ideal for profiling is again very different from one suitable for data-validation.
Conceptually, there are many ways to “generalize” a column of data into patterns. For a simple column of date-time strings like the one in Figure 5, and using a standard generalization hierarchy (Figure 4), one could generate over 3 billion patterns. For example, the first part (the digit “9” for month) alone can be generalized in 7 different ways, listed below and also shown in the top-right box of Figure 5.

(1) a constant “9” (Const(‘‘9’’)),

(2) one single digit (<digit>{1}),

(3) at least one digit (<digit>+),

(4) any number including floatingpoints (<num>),

(5) one alphanumeric character (<alphanum>),

(6) at least one alphanumeric character (<alphanum>+),

(7) root of the hierarchy (<all>)
The cross-product of these options at each position creates a large space of patterns (about 3.3 billion) for this simple column.
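As a sanity check on the size of this space, the 7 options above can be enumerated mechanically, and the per-position choices multiply. The helper below is illustrative (token spellings follow the list above):

```python
from math import prod

def generalizations(ch: str) -> list[str]:
    """The 7 ways a single digit such as "9" can generalize under the
    hierarchy of Figure 4 (token spellings follow the list above)."""
    assert len(ch) == 1 and ch.isdigit()
    return [
        f"Const('{ch}')",  # (1) the constant itself
        "<digit>{1}",      # (2) one single digit
        "<digit>+",        # (3) at least one digit
        "<num>",           # (4) any number, incl. floating-point
        "<alphanum>",      # (5) one alphanumeric character
        "<alphanum>+",     # (6) at least one alphanumeric character
        "<all>",           # (7) root of the hierarchy
    ]

# With ~7 choices at each generalizable position, a value with a dozen or so
# positions yields billions of candidates (7^11 is already ~2 billion), on
# the order of the ~3.3 billion patterns quoted for the column in Figure 5.
space_lower_bound = prod([7] * 11)
```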
The challenge for data-validation is to select suitable patterns from this large space given only the observed values in a column, in anticipation of valid values from the same “domain” that may arrive in the future. The key is to not use overly restrictive patterns (as in data-profiling), which would yield many false-positive alerts. Conversely, overly general patterns (like the trivial “.*”) should also not be used, as they would fail to detect any data quality issues. In effect, we want to find patterns that balance two conflicting goals: (1) reduce false-alarms (or improve detection precision); and (2) catch as many issues as possible (or improve detection recall).
A corpus-driven approach to pattern-inference. Intuitively, the pattern-inference problem is hard if we only look at one input column – for the date-time examples in Figure 2, we as humans know what patterns are suitable; but for the examples in Figure 3 drawn from proprietary domains, even humans may find it hard to determine suitable validation patterns, and would need to seek “additional evidence”, such as other “similar-looking columns”.
Following this intuition, we formulate pattern-inference as an optimization problem, leveraging a large corpus of related tables T (e.g., data produced by production pipelines and dumped into the same enterprise data lake). Our inference algorithms leverage columns in T that are “similar” to the input column, to reason about patterns that may be ideal for data-validation.
Large-scale evaluations on production data suggest that AutoValidate produces constraints that are substantially more accurate than existing methods. Part of this technology ships as an AutoTag feature in Microsoft Azure Purview (purview).
2. AutoValidate (Basic Version)
We start by describing a basic version of AutoValidate for ease of illustration, before going into more involved algorithm variants that use vertical-cuts (Section 3) and horizontal-cuts (Section 4).
2.1. Preliminary: Pattern Language
Since we propose to validate data by patterns, we first briefly describe the pattern language used. We note that this is fairly standard (e.g., similar to ones used in (raman2001potter)) and not the focus of this work.
Figure 4 shows a generalization hierarchy used, where leaf-nodes represent the English alphabet, and intermediate nodes (e.g., <digit>, <letter>) represent tokens that values can generalize into. A pattern is simply a sequence of (leaf or intermediate) tokens, and for a given value v, this hierarchy induces a space of all patterns consistent with v, denoted by P(v). For instance, for a value “9:07”, we can generate P(v) = {“<digit>:<digit>{2}”, “<digit>+:<digit>{2}”, “<digit>:<digit>+”, “<num>:<digit>+”, “9:<digit>{2}”, …}, among many other options.
Given a column C for which patterns need to be generated, we define the space of hypothesis-patterns, denoted by H(C), as the set of patterns consistent with all values v ∈ C, or H(C) = ∩_{v ∈ C} P(v) (note that we exclude the trivial “.*” pattern, as it is trivially consistent with all values). Here we make an implicit assumption that values in C are homogeneous and drawn from the same domain, which is generally true for machine-generated data from production pipelines (like those shown in Figure 3). (Our empirical sample in an enterprise data lake suggests that 87.9% of columns are homogeneous, defined as drawn from the same underlying domain, and 67.6% are homogeneous with machine-generated patterns.) Note that the assumption is used only to simplify our discussion of the basic version of AutoValidate (this section), and will later be relaxed in more advanced variants (Section 4).
We call these hypothesis-patterns because, as we will see, each h ∈ H(C) will be tested like a “hypothesis” to determine whether it is a good validation pattern. We use pattern and hypothesis-pattern interchangeably when the context is clear.
Example 1.
We use an in-house implementation to produce patterns based on the hierarchy, which can be considered a variant of well-known pattern-profiling techniques like (raman2001potter). Briefly, our pattern-generation works in two steps. In the first step, it scans each cell value and emits coarse-grained tokens (e.g., <num> and <letter>+), without specializing into fine-grained tokens (e.g., <digit>{1} or <letter>{2}). For each value in Figure 5, for instance, this produces the same “<num>/<num>/<num> <num>:<num>:<num> <letter>+”. Each coarse-grained pattern is then checked for coverage, so that only patterns meeting a coverage threshold are retained. In the second step, each coarse-grained pattern is specialized into fine-grained patterns (examples are shown in Figure 5), as long as the given threshold is still met. Pseudocode of the procedure can be found in Algorithm 1 below.
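The two steps can be sketched as follows. This is a simplified re-implementation under our own tokenization; the real system’s lexer, hierarchy, and thresholds may differ:

```python
import re
from collections import Counter

TOKEN = re.compile(r"[0-9]+|[A-Za-z]+|[^0-9A-Za-z]")

def coarse_pattern(value: str) -> str:
    """Step 1: emit coarse-grained tokens only (<num>, <letter>+, symbols)."""
    out = []
    for run in TOKEN.findall(value):
        if run[0].isdigit():
            out.append("<num>")
        elif run[0].isalpha():
            out.append("<letter>+")
        else:
            out.append(run)  # symbols are kept verbatim
    return "".join(out)

def fine_pattern(value: str) -> str:
    """Step 2 (one specialization of many): pin token classes to fixed lengths."""
    out = []
    for run in TOKEN.findall(value):
        if run[0].isdigit():
            out.append("<digit>{%d}" % len(run))
        elif run[0].isalpha():
            out.append("<letter>{%d}" % len(run))
        else:
            out.append(run)
    return "".join(out)

def retained_coarse_patterns(column, threshold=0.9):
    """Keep only coarse patterns covering at least `threshold` of the values."""
    counts = Counter(coarse_pattern(v) for v in column)
    return [p for p, c in counts.items() if c / len(column) >= threshold]
```

On “9/12/2019 9:07:32 AM”, step 1 yields “<num>/<num>/<num> <num>:<num>:<num> <letter>+”, matching the coarse pattern quoted above; step 2 then specializes the same value into fixed-length tokens.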
We emphasize that the proposed AutoValidate framework is not tied to specific choices of hierarchy/pattern-languages – in fact, it is entirely orthogonal to such choices, as other variants of languages are equally applicable in this framework.
2.2. Intuition: Goodness of Patterns
The core challenge in AutoValidate is to determine whether a pattern h ∈ H(C) is suitable for data-validation. Our idea is to leverage a corpus of related tables T (e.g., drawn from the same enterprise data lake, which stores input/output data of all production pipelines). Intuitively, h is a “good” validation pattern for C if it accurately describes the underlying “domain” of C, defined as the space of all valid data values. Conversely, h is a “bad” validation pattern if:

(1) h does not capture all valid values in the underlying “domain” of C;

(2) h does not produce sufficient matches in T.
Both are intuitive requirements – (1) is desirable because, as we discussed, the danger in data-validation is to produce patterns that do not cover all valid values from the same domain, which leads to false-alarms. Requirement (2) is also sensible, because we want to see sufficient evidence in T before we can reliably declare a pattern to be a common “domain” of interest (seeing a pattern once or twice is not sufficient).
We note that both requirements can be reliably tested using T alone, without involving humans. Testing (2) is straightforward using standard pattern-matches. Testing (1) is also feasible, under the assumption that most data columns in T contain homogeneous values drawn from the same underlying domain, which is generally true for “machine-generated” data from production pipelines. (Homogeneity is less likely to hold for other types of data, e.g., data manually typed in by human operators.)
We highlight the fact that we are now dealing with two types of columns, which can be a bit confusing – the first is the input column for which we need to generate validation patterns, and the second is the set of columns in T from which we draw evidence to determine the goodness of a pattern. To avoid confusion, we will refer to the first as the query column, denoted by C, and to the second as data columns, denoted by C′ ∈ T.
We now show how to test requirement (1) using T below.
Example 2.
The left of Figure 6 shows a query column C for which validation patterns need to be generated. A few hypothesis-patterns in H(C) are listed in the middle; call them h1, h2, and h3. In this example, we intuitively know that h1 and h2 are “bad” patterns, since they are too “narrow” and would lead to false-alarms, while h3 is a “good” validation pattern that suitably generalizes the underlying domain.
We show that such determinations can be inferred using T. Specifically, h1 is a “bad” pattern based on T, because we can find many columns like the one shown on the right of Figure 6 that are “impure” – these columns contain values that match h1, as well as values that do not (e.g., “9/12/2019 12:01:32 PM”, where the “PM” part does not match h1). A large number of “impure” columns like this indicates that h1 is not a good pattern to describe the underlying domain – if it were, we should not see many impure columns, given homogeneity.
Similarly, we can show that h2 is not a good validation pattern, for using h2 to describe the domain would again make many columns like the one in Figure 6 “impure” (values like “10:02:20 AM” are inconsistent with h2, since they have two-digit hours whereas h2 uses a single <digit> for hours), suggesting that h2 is also too narrow.
Using the ideal pattern h3 to describe the domain, on the other hand, would yield few “impure” columns in T.
Intuitively, we can use the impurity of a pattern h on data columns C′ ∈ T to infer whether h is a good validation pattern for the domain, defined as follows.
Definition 3.
The impurity of a data column C′, relative to a given hypothesis pattern h, is defined as:

(1) Imp_h(C′) = |{v ∈ C′ : v does not match h}| / |C′|

This definition is intuitive – we measure impurity as the fraction of values in C′ not matching h.
Example 4.
In Figure 6, Imp_{h1}(C′) can be calculated as 2/12, since the last 2 values (with “PM”) out of 12 do not match h1.
Similarly, Imp_{h2}(C′) can be calculated as 8/12, since the last 8 values in C′ (with two-digit hours) do not match h2.
Finally, Imp_{h3}(C′) is 0, since all values in C′ match h3.
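Definition 3 is straightforward to compute. A minimal sketch, with regular expressions standing in for patterns and an illustrative 12-value column mirroring the Figure 6 example:

```python
import re

def impurity(column, pattern_regex: str) -> float:
    """Imp_h(C'): the fraction of values in C' not matching pattern h
    (Definition 3); a regex stands in for the pattern h here."""
    h = re.compile(pattern_regex)
    non_matching = sum(1 for v in column if not h.fullmatch(v))
    return non_matching / len(column)

# An illustrative data column C' of 12 values, the last 2 carrying an AM/PM
# suffix that the narrower pattern h1 fails to match.
h1 = r"\d+/\d{2}/\d{4} \d+:\d{2}:\d{2}"
h3 = r"\d+/\d{2}/\d{4} \d+:\d{2}:\d{2}( [AP]M)?"
c_prime = ["9/12/2019 9:07:32"] * 10 + \
          ["9/12/2019 12:01:32 PM", "9/13/2019 11:59:59 PM"]
```

Here `impurity(c_prime, h1)` is 2/12, as in the example above, while the more general `h3` has impurity 0 on the same column.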
We note that if h is used to validate a data column C′ (in the same domain as C) arriving in the future, then Imp_h(C′) directly corresponds to the expected false-positive-rate (FPR):
Definition 5.
The expected false-positive-rate (FPR) of using pattern h to validate a data column C′ drawn from the same domain as C, denoted by FPR_h(C′), is defined as:

(2) FPR_h(C′) = FP / (TN + FP)

where FP and TN are the numbers of false-positive and true-negative detections of h on C′, respectively.
Note that since C′ is from the same domain as C, every value in C′ is valid, ensuring that TN + FP = |C′|; FPR_h(C′) can thus be rewritten as:

(3) FPR_h(C′) = Imp_h(C′)

thus allowing us to estimate FPR_h(C′) using Imp_h(C′).
Example 6.
Continuing with Example 4, it can be verified that the expected FPR of using a pattern to validate C′ directly corresponds to its impurity – e.g., using h1 to validate C′ has FPR_{h1}(C′) = Imp_{h1}(C′) = 2/12, while using h3 to validate C′ has FPR_{h3}(C′) = 0, etc.
Based on FPR defined for one column C′, we can in turn define the estimated FPR on the entire corpus T, using all columns C′ ∈ T in which some value matches h:
Definition 7.
Given a corpus T, we estimate the FPR of pattern h on T, denoted by FPR_h(T), as:

(4) FPR_h(T) = avg_{C′ ∈ T, C′ contains a value matching h} FPR_h(C′)
Example 8.
Continuing with Example 6, FPR_{h1}(T) is estimated by averaging FPR_{h1}(C′) over all columns C′ ∈ T with at least one value matching h1; the many “impure” columns like the one in Figure 6 drive this estimate up, while the corresponding estimate for h3 stays close to 0.
We emphasize that a low FPR is critical for data-validation, because given the large number of columns in production systems, even a 1% FPR translates into many false-alarms that require human attention to resolve. FPR_h(T) is thus a key objective we minimize.
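Under homogeneity, Equation (3) lets us estimate Definition 7 by averaging impurity over the corpus columns that a pattern touches. A sketch (regexes again stand in for patterns, and the toy corpus is illustrative):

```python
import re

def fpr_on_corpus(pattern_regex: str, corpus):
    """FPR_h(T) per Definition 7: average Imp_h(C') over all corpus columns
    C' containing at least one value matching h; None if h matches nowhere."""
    h = re.compile(pattern_regex)
    scores = []
    for column in corpus:
        matches = sum(1 for v in column if h.fullmatch(v))
        if matches == 0:
            continue  # column carries no evidence about h
        scores.append((len(column) - matches) / len(column))
    return sum(scores) / len(scores) if scores else None

# Toy corpus: one pure column, one half-impure column, one unrelated column.
toy_corpus = [["9:07", "12:01"], ["9:07", "n/a"], ["foo", "bar"]]
```

For the pattern `\d+:\d{2}` above, the first column contributes impurity 0, the second 0.5, and the third is skipped, giving an estimated FPR of 0.25.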
2.3. Problem Statement
We now describe the basic version of AutoValidate as follows. Given an input query column C and a background corpus T, we need to produce a validation pattern h that is expected to have a low FPR while still catching many quality issues. We formulate this as an optimization problem. Specifically, we introduce an FPR-Minimizing version of Data-Validation (FMDV), defined as:

(5) min_{h ∈ H(C)} FPR_h(T)
(6) s.t. FPR_h(T) ≤ r
(7) Cov_h(T) ≥ m

where FPR_h(T) is the expected FPR of h estimated on T, which has to be lower than a given target threshold r (Equation (6)); and Cov_h(T) is the coverage of h, or the number of columns in T that match h, which has to be greater than a given threshold m (Equation (7)). Note that these two constraints directly map to the two “goodness” criteria in Section 2.2.
The validation pattern we produce for C is then the minimizer of FMDV from the hypothesis space H(C) (Equation (5)). We note that this can be interpreted as a conservative approach that aims to find a “safe” pattern with minimum FPR.
Example 9.
Suppose we run FMDV with a target FPR threshold r (say, 0.1%) and a required coverage threshold m (say, 100). Given the 3 example hypothesis-patterns in Figure 6, patterns h1 and h2 are not feasible solutions because of their high estimated FPR. Pattern h3 has an estimated FPR of 0.04% and a coverage of 5000, and is thus a feasible solution to FMDV.
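With the FPR and coverage estimates in hand, solving basic FMDV is a filter-and-argmin over the hypothesis space. A sketch (the numbers mirror Example 9 and are illustrative):

```python
def fmdv(hypotheses, fpr, cov, r, m):
    """Basic FMDV (Equations (5)-(7)): among hypothesis-patterns with
    estimated FPR <= r and coverage >= m, return the FPR-minimizer
    (None when no pattern is safe enough)."""
    feasible = [h for h in hypotheses if fpr[h] <= r and cov[h] >= m]
    return min(feasible, key=lambda h: fpr[h]) if feasible else None

# Estimates mirroring Example 9: h1/h2 are too impure, h3 is safe.
fpr_est = {"h1": 2 / 12, "h2": 8 / 12, "h3": 0.0004}
cov_est = {"h1": 5000, "h2": 5000, "h3": 5000}
best = fmdv(["h1", "h2", "h3"], fpr_est, cov_est, r=0.001, m=100)  # picks h3
```

Returning None when nothing is feasible reflects the conservative stance above: rather than emit an unsafe pattern, no validation rule is suggested.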
Lemma 10.
Consider a simplified scenario where each column in T has values drawn randomly from exactly one underlying domain, and each domain in turn corresponds to one ground-truth pattern. Given a query column C for which a validation pattern needs to be generated, and a corpus T with a large number N of columns generated from the same domain as C, with high probability (at least 1 − (1/2)^N), the optimal solution of FMDV for C (with r = 0) converges to the ground-truth pattern of C.
Proof Sketch: We show that any candidate pattern that “under-generalizes” the domain will, with high probability, be pruned away due to FPR violations (given r = 0). To see why this is the case, consider a pattern h′ produced for C that under-generalizes the ground-truth pattern h. Note that h′ is picked over h only if not a single value, across all N columns in the same domain as C (drawn from the space of values defined by h), falls outside h′ (since r is set to 0). Because these columns are drawn randomly from the domain of h and each column has at least one value, the probability of this happening is at most (1/2)^N ((1/2) is used because there is at least one branch in h that is not in h′), making the overall probability of success at least 1 − (1/2)^N. ∎
We also explored alternative formulations, such as a coverage-minimizing version of data-validation (CMDV), where we minimize coverage in the objective function in place of FPR in FMDV. We omit details in the interest of space, but report that the conservative FMDV is more effective in practice.
A dual version of FMDV can be used for automated data-tagging, where we want to find the most restrictive (smallest-coverage) pattern to describe the underlying domain (which can be used to “tag” related columns of the same type), under the constraint of some target false-negative rate. We describe this dual version in (autotagtr).
2.4. System Architecture
While FMDV is simple overall, a naive implementation would require a full scan of T to compute FPR_h(T) and Cov_h(T) for each hypothesis-pattern h. This is both costly and slow, considering that T is typically large in enterprise data lakes (e.g., in terabytes), where a full scan can take hours. This is especially problematic for user-in-the-loop validation like in Google’s TFDV, where users are expected to review/modify suggested validation rules and interactive response time is critical.
We thus introduce an offline index with summary statistics of T, so that at online time, given a new query column C, we can evaluate hypothesis-patterns efficiently using only the index, without needing to scan T again.
The architecture of our system can be seen in Figure 7.
Offline stage. In the offline stage, we perform one full scan of T, enumerating all possible patterns for each column C′ ∈ T as P(C′) = ∪_{v ∈ C′} P(v), or the union of patterns for all values v ∈ C′ (where P(v) is induced by the generalization hierarchy in Section 2.1).
Note that the space P(C′) can be quite large, especially for “wide” columns with many tokens, as discussed in Figure 5. To keep P(C′) tractable, in the basic version we only enumerate patterns for columns with at most τ tokens, where the token count of a column is defined via the number of consecutive sequences of letters, digits, or symbols in its values, and τ is some fixed constant (e.g., 8 or 13). We will describe how columns wider than τ can be safely skipped in offline indexing without affecting result quality, using a vertical-cut technique (Section 3).
For each pattern p ∈ P(C′), we precompute the local impurity score Imp_p(C′), since p may be hypothesized as a domain pattern for some query column in the future, for which this provides a piece of local evidence.
Let P(T) = ∪_{C′ ∈ T} P(C′) be the space of possible patterns in T. We precompute FPR_p(T) for each p ∈ P(T) on the entire T, by aggregating the local impurity scores using Equations (3) and (4). The coverage scores Cov_p(T) of all p ∈ P(T) can be precomputed similarly.
The result of the offline step is an index that maps each possible p ∈ P(T) to its precomputed FPR_p(T) and Cov_p(T) values. We note that this index is many orders of magnitude smaller than the original T – e.g., a 1TB corpus yields an index of less than 1GB in our experiments.
Online stage. At online query time, for a given query column C, we enumerate the hypothesis-patterns h ∈ H(C), and use the offline index to look up the FPR_h(T) and Cov_h(T) scores needed by FMDV, without scanning the full T again. This indexing approach reduces our latency per query-column from hours to tens of milliseconds, which makes it possible to do interactive, human-in-the-loop verification of suggested rules, as we will report in experiments.
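The offline/online split can be sketched as follows, where `patterns_with_impurity` is a stand-in for enumerating P(C′) with local impurity scores (the toy enumerator below is purely illustrative):

```python
def build_index(corpus, patterns_with_impurity):
    """Offline: one scan of T; for every pattern p emitted for a column C',
    accumulate its local impurity Imp_p(C') and a coverage count, then
    aggregate into FPR_p(T) (Equation (4)) and Cov_p(T)."""
    acc = {}  # pattern -> [impurity sum, covering-column count]
    for column in corpus:
        for p, imp in patterns_with_impurity(column):
            s = acc.setdefault(p, [0.0, 0])
            s[0] += imp
            s[1] += 1
    return {p: (s[0] / s[1], s[1]) for p, s in acc.items()}

def lookup(index, hypothesis):
    """Online: O(1) retrieval of (FPR_h(T), Cov_h(T)); no corpus scan."""
    return index.get(hypothesis, (None, 0))

# A toy enumerator: each column emits a single (pattern, local-impurity) pair.
def toy_enum(column):
    return [("<letter>+", 0.0)] if column[0].isalpha() else [("<num>", 0.1)]

index = build_index([["abc"], ["123"], ["456"]], toy_enum)
```

The expensive aggregation happens once offline; each online hypothesis test is then a single dictionary lookup.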
3. AutoValidate (Vertical Cuts)
So far we have only discussed the basic version of AutoValidate. We now describe formulations that handle more complex, real situations: (1) composite structures in query columns (this section); and (2) “dirty” values in query columns that violate the homogeneity assumption (Section 4).
Figure 8 shows a real column with composite structures that are concatenated from atomic domains. Intuitively, we can see that it consists of at least 4 “sub-domains”: a floating-point number (“0.1”), followed by two timestamps, followed by a status message (“OnBooking”). We observe that such columns are common in machine-generated data, where many are complex with over 10 sub-domains.
Such complex columns pose both quality and scalability challenges. In terms of quality, if we were to directly test complex patterns against T as in FMDV, we may not see many exact matches in T, because the complex patterns in H(C) may be concatenated in ad-hoc ways and are sparsely represented in T. As a result, we may fail to find a feasible solution to FMDV because of the coverage requirement in Equation (7).
The second challenge is scalability. As columns become “wider” with more tokens/segments, the space of possible patterns that we have to enumerate grows exponentially in the number of tokens. This is an issue both at online query time for the query column C, and at offline indexing time when enumerating P(C′) for a data column C′ – recall that the date-time example in Figure 5 already produces over 3 billion patterns; for the column in Figure 8 the space of patterns is simply impractically large to enumerate. In the offline indexing step in Section 2.4, we discussed that we intentionally omit wide columns with more than τ tokens to make pattern enumeration tractable. We will show here that our vertical-cut algorithm can “compensate” for such omission without affecting result quality.
A natural approach to a complex query column C is to vertically “cut” it into sub-columns (as shown in Figure 8). Patterns for each sub-column can then be generated in turn using FMDV. The benefit of this divide-and-conquer approach is that each sub-domain is likely well-represented in T, allowing it to be reliably inferred from T. Furthermore, the cost of pattern-enumeration in offline-indexing becomes significantly smaller, as each sub-domain can be enumerated separately.
Specifically, we first use a lexer to tokenize each value v ∈ C into coarse-grained token-classes (<symbol>, <num>, <letter>), by scanning each v from left to right and “growing” each token until a character of a different class is encountered. In the example of Figure 8, for each row we would generate an identical token-sequence “[<num><num>/<num>/<num> <num>:<num>:<num> …<letter>]”.
We then perform multi-sequence alignment (MSA) (carrillo1988multiple) on all token sequences before actually performing vertical-cuts. Recall that the goal of MSA is to find an optimal alignment across all input sequences (using objective functions such as sum-of-pair scores (just2001computational)). Since MSA is NP-hard in general (just2001computational), we follow a standard approach that greedily aligns one additional sequence at a time. We note that for homogeneous machine-generated data, this often solves MSA optimally.
Example 1.
For the column in Figure 8, each value has an identical token-sequence with 29 tokens: “[<num><num>/<num>/<num> <num>:<num>:<num> …<letter>]”, and the aligned sequence using MSA can be produced trivially as just this sequence of 29 tokens (no gaps).
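The lexer step can be sketched as below; for homogeneous rows the coarse token-class sequences come out identical, which is what makes the greedy alignment trivial (the two sample rows are illustrative stand-ins for Figure 8):

```python
import re

def token_classes(value: str):
    """Coarse lexer: grow runs of digits or letters into <num>/<letter>
    tokens; each symbol character is a token of its own."""
    classes = []
    for run in re.findall(r"[0-9]+|[A-Za-z]+|[^0-9A-Za-z]", value):
        if run[0].isdigit():
            classes.append("<num>")
        elif run[0].isalpha():
            classes.append("<letter>")
        else:
            classes.append(run)
    return classes

# Illustrative rows shaped like Figure 8 (float, two timestamps, status).
rows = [
    "0.1|02/18/2015 00:00:00|02/18/2015 00:00:00|OnBooking",
    "0.2|02/18/2015 00:00:00|02/19/2015 08:30:00|OnBooking",
]
aligned = {tuple(token_classes(r)) for r in rows}
```

Both rows here produce one shared 29-token class sequence, consistent with the count in Example 1; heterogeneous rows would instead produce differing sequences that the greedy MSA must align.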
After alignment, the characters in each value v ∈ C map onto positions of the aligned token sequence (e.g., “0.1” maps to the first <num> in the aligned token sequence, “02” maps to the second <num>, etc.). Recall that our goal is to vertically split the tokens into segments, so that the values mapped to each segment correspond to a sub-domain as determined by FMDV. We define an n-segmentation of C as follows.
Definition 2.
Given a column C with g aligned tokens, define C[s, e] as a segment of C that starts from token position s and ends at position e, with 1 ≤ s ≤ e ≤ g. An n-segmentation of C is a list of n such segments that collectively cover C, defined as ⟨C[s_1, e_1], C[s_2, e_2], …, C[s_n, e_n]⟩, where s_1 = 1, e_n = g, and s_{i+1} = e_i + 1 for all i ∈ [n−1]. We write C[s_i, e_i] as S_i for short, and an n-segmentation as ⟨S_1, S_2, …, S_n⟩.
Note that each S_i can be seen just like a regular column “cut” from the original C, on which we can invoke FMDV. Our overall objective is to jointly find: (1) an n-segmentation ⟨S_1, …, S_n⟩ from the space of all possible segmentations; and (2) appropriate patterns h_i ∈ H(S_i) for each segment S_i, such that the entire C can be validated with a low expected FPR. Also note that because we limit pattern enumeration in offline-indexing to columns of at most τ tokens, we would not find candidate patterns longer than τ in the offline index, and can thus naturally limit the span of each segment to at most τ (e_i − s_i < τ, ∀i ∈ [n]). We can write this as an optimization problem FMDV-V, defined as follows:

(8) min_{⟨S_1, …, S_n⟩, h_i ∈ H(S_i)} Σ_{i ∈ [n]} FPR_{h_i}(T)
(9) s.t. FPR_{h_i}(T) ≤ r, ∀i ∈ [n]
(10) Cov_{h_i}(T) ≥ m, ∀i ∈ [n]
In FMDV-V, we optimize over all possible segmentations ⟨S_1, …, S_n⟩, and for each segment S_i over all possible hypothesis-patterns h_i ∈ H(S_i). Our objective in Equation (8) is to minimize the sum of FPR_{h_i}(T) over all segments. We should note that this corresponds to a “pessimistic” approach that assumes the non-conforming rows of different segments to be disjoint at a row-level, thus needing to sum them up in the objective function. (An alternative could “optimistically” assume non-conforming values in different segments to overlap at a row-level, in which case we could use max in place of the sum. We find this to be less effective and omit details in the interest of space.)
We can show that the minimum FPR scores of segmentations optimized in Equation (8) have optimal substructure (cormen2009introduction), which makes the problem amenable to dynamic programming (without exhaustive enumeration). Specifically, we can show that the following holds:
(11)  minFPR(T[s, e]) = min { FMDV(T[s, e]),  min_{s ≤ p < e} ( minFPR(T[s, p]) + minFPR(T[p+1, e]) ) }
Here, minFPR(T[s, e]) is the minimum FPR score with vertical splits as optimized in Equation (8), for the sub-column corresponding to the segment from token position s to e. Note that T[1, n] corresponds to the original query column, and minFPR(T[1, n]) is the score we want to optimize. Equation (11) shows how this can be computed incrementally. Specifically, minFPR(T[s, e]) is the minimum of: (1) FMDV(T[s, e]), the FPR score of treating T[s, e] as one column (with no splits), solvable using FMDV; and (2) min_{s ≤ p < e} (minFPR(T[s, p]) + minFPR(T[p+1, e])), the best score among all possible two-way splits, which can be solved optimally in polynomial time using bottom-up dynamic programming.
Example 3.
Continue with Example 1. The aligned sequence of the column in Figure 8 has 29 tokens. A sub-sequence of these tokens maps to the identical value “02/18/2015 00:00:00” in all rows, which can be seen as a “sub-column” split from the original column.
Using Equation (11), the minFPR of this sub-column can be computed as the minimum of: (1) not splitting the sub-column, whose score is computed directly using FMDV; and (2) further splitting it into two parts, where the best score over all split points can in turn be computed recursively. Suppose we compute (1) to have an FPR of 0.004%, and the best of (2) to have an FPR of 0.01% (say, splitting into “02/18/2015” and “00:00:00”). Overall, not splitting is the best option.
Each such minFPR score can be computed bottom-up, until reaching the top level. The resulting split minimizes the overall FPR and is the solution to FMDV-V.
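As a concrete illustration of the recurrence in Equation (11), the following sketch computes the best achievable FPR with memoization. Here `fmdv` is a hypothetical stand-in for invoking FMDV on a sub-range of tokens; it returns the FPR of the best feasible pattern, or infinity when no feasible pattern exists.

```python
from functools import lru_cache

def min_fpr(num_tokens, fmdv, max_span=None):
    """Memoized computation of the recurrence in Equation (11).

    `fmdv(s, e)` is a hypothetical stand-in for invoking FMDV on the
    sub-column spanned by token positions s..e (0-based, inclusive); it
    should return the FPR of the best feasible pattern, or infinity if
    none exists (e.g., when the span exceeds the token-length limit
    used during offline indexing).
    """
    @lru_cache(maxsize=None)
    def solve(s, e):
        # Option (1): keep tokens s..e as one segment (if not too wide).
        if max_span is None or e - s + 1 <= max_span:
            best = fmdv(s, e)
        else:
            best = float("inf")
        # Option (2): best two-way split, each side solved recursively.
        for p in range(s, e):
            best = min(best, solve(s, p) + solve(p + 1, e))
        return best

    return solve(0, num_tokens - 1)

# Toy example mirroring Example 3: keeping a 2-token range whole
# (FPR 0.005) beats the best split (0.01 + 0.02):
costs = {(0, 0): 0.01, (1, 1): 0.02, (0, 1): 0.005}
best = min_fpr(2, lambda s, e: costs.get((s, e), float("inf")))
# best == 0.005
```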
We would like to highlight that by using vertical cuts, we can safely ignore wide columns with more than τ tokens during offline indexing (Section 2.4), without affecting result quality. As a concrete example, when we encounter a column like the one in Figure 8, brute-force enumeration would generate an impractically large number of candidate patterns, which are thus omitted given a smaller token-length limit (e.g., 13) in offline indexing. However, when we encounter this column in Figure 8 as a query column, we can still generate valid domain patterns with the help of vertical cuts (as shown in Example 3), as long as each segment has at most τ tokens. Vertical cuts thus greatly speed up offline indexing as well as online inference, without affecting result quality.
4. AutoValidate (Horizontal Cuts)
So far we have assumed the query column to be “clean”, with homogeneous values drawn from the same domain (since they are generated by the same underlying program). While this is generally true, we observe that some columns can have ad-hoc special values (e.g., denoting null/empty values) that do not follow the domain pattern of the column, as shown in Figure 9.
We note that examples like these are not uncommon – even for machine-generated data, there can be a branch of logic (think of try-except) that handles special cases (e.g., nulls) and produces ad-hoc values not following the pattern of “normal” values. We henceforth refer to these as non-conforming values.
Recall that in FMDV, by assuming that the query column C has homogeneous values, we select patterns from ∩_{v ∈ C} P(v), the intersection of the candidate patterns of all values v ∈ C. Such an assumption does not hold for columns like those in Figure 9, yielding an empty intersection and no feasible solution to FMDV.
We thus consider a variant of AutoValidate with “horizontal cuts”, meaning that we can “cut off” a small fraction of non-conforming values in the column (which is equivalent to making patterns tolerate a small fraction of non-conforming values). We use a tolerance parameter θ to control the maximum allowed fraction of non-conforming values that can be cut (e.g., 1%, 5%, etc.), which allows for a conscious trade-off between precision and recall.
This optimization problem, termed FMDV-H, is defined as:
(12)  min_h  FPR_D(h)
(13)  s.t.  h ∈ ∪_{v ∈ C} P(v)
(14)  FPR_D(h) ≤ r
(15)  Cov_D(h) ≥ m
(16)  |{v : v ∈ C, h matches v}| ≥ (1 − θ) |C|
Like before, our objective in Equation (12) is to minimize the FPR. Equation (13) shows that the hypothesis pattern h is drawn from the union of the candidate patterns of all values v ∈ C. The FPR and coverage requirements in Equation (14) and Equation (15) are the same as in the basic FMDV. Finally, Equation (16) requires that the selected h match at least a (1 − θ) fraction of values in C, where θ is the tolerance parameter above (the remaining values, at most a θ fraction, are non-conforming).
Example 1.
Consider the leftmost column in Figure 9. Using FMDV-H, we determine the pattern “<digit>+,<digit>+,<digit>+,<digit>+,<digit>+” to be a valid solution, as it meets the FPR and coverage requirements based on the corpus. Furthermore, it is consistent with 99% of values in the column (the remaining 1% is the non-conforming value marked in a red box), thus also satisfying Equation (16).
For an arbitrary tolerance θ and generalization hierarchy, we can show that deciding whether a feasible solution to FMDV-H exists is NP-hard (let alone minimizing FPR), using a reduction from independent set (karp1985fast).
Theorem 2.
The decision version of FMDV-H is NP-hard.
While FMDV-H is hard in general, in practice the patterns of non-conforming values often do not intersect with those of “normal” values (as in Example 1), which creates easier instances of FMDV-H. Leveraging this observation, in this work we optimize FMDV-H greedily, by discarding values whose candidate patterns do not intersect with those of most other values in the column. We can then find the optimal pattern for the remaining conforming values using FMDV.
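The greedy step can be sketched as follows. The representation of per-value candidate patterns as sets, and the majority criterion used here, are illustrative assumptions rather than the paper's exact implementation.

```python
from collections import Counter

def greedy_conforming_split(patterns_per_value, theta):
    """Greedy heuristic described above for FMDV-H. Each element of
    `patterns_per_value` is the set of candidate patterns P(v) for one
    value v of the query column (a hypothetical representation). Values
    whose candidate patterns are shared by most other values are kept
    as conforming; the rest are cut, provided at most a theta fraction
    is cut. Returns (conforming_indices, nonconforming_indices), or
    None if the tolerance theta would be exceeded.
    """
    n = len(patterns_per_value)
    support = Counter()  # how many values each candidate pattern covers
    for pats in patterns_per_value:
        for p in pats:
            support[p] += 1
    conforming, nonconforming = [], []
    for i, pats in enumerate(patterns_per_value):
        # a value conforms if one of its patterns covers a majority
        if any(support[p] > n // 2 for p in pats):
            conforming.append(i)
        else:
            nonconforming.append(i)
    if len(nonconforming) > theta * n:
        return None
    return conforming, nonconforming

# Values 0-2 share pattern "a"; value 3 (e.g., an ad-hoc null) does not:
split = greedy_conforming_split([{"a", "b"}, {"a"}, {"a", "c"}, {"z"}],
                                theta=0.3)
# split == ([0, 1, 2], [3])
```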
Distributional test of non-conforming values. Given a pattern h inferred from the “training data” C_train using FMDV-H, and given data C_test arriving in the future, our last task is to determine whether the fraction of non-conforming values in C_test has changed significantly from that in C_train.
Specifically, at “training” time, we calculate the fraction of values in C_train not conforming to h, denoted θ_train. At “testing” time, we also compute the fraction of values in C_test not matching h, denoted θ_test.
A naive approach to validation is to trigger alarms whenever the non-conforming ratio on testing data exceeds that on training data. This, however, is prone to false positives. Imagine a scenario where we compute the non-conforming ratio on training data to be 0.1%. Suppose on testing data we find the ratio to be 0.11%. Raising alarms here would likely be a false positive. However, if the testing ratio is substantially higher, say at 5%, intuitively it becomes an issue that we should report. (The special case where no value in the testing data matches the pattern yields the extreme ratio of 100%.)
To reason about this formally, we model the process of drawing a conforming vs. a non-conforming value in the training data and the testing data as sampling from two binomial distributions. We then perform a form of statistical hypothesis test called a two-sample homogeneity test, to determine whether the fraction of non-conforming values has changed significantly, using the two observed ratios and their respective sample sizes. Here, the null hypothesis states that the two underlying binomial distributions are the same, whereas the alternative hypothesis states that they are different. We use Fisher's exact test (agresti1992survey) and Pearson's chi-squared test (kanji2006) with Yates correction for this purpose, both of which are suitable tests in this setting. We report an issue if the divergence is so strong that the null hypothesis can be rejected (e.g., with ratios of 0.1% vs. 5%). We omit details of these two statistical tests in the interest of space. In practice, we find both to perform well, with little difference in terms of validation quality.
Time Complexity. Overall, the time complexity of the offline step comes from scanning each column in the corpus and enumerating its possible patterns. Recall that because we use vertical cuts to handle composite domains, the number of patterns enumerated per column can be bounded using the chosen constant τ that upper-limits the token count considered in the indexing step (described in Section 2.4), without affecting result quality, because at online inference time we compose domains using vertical cuts (Section 3).
In the online step, the complexity of the most involved FMDV-VH variant is polynomial in the number of tokens in the query column. Empirically, we find the overall latency to be less than 100ms on average, as we will report in our experiments.
5. Experiments
We implement our offline indexing algorithm in a MapReduce-like system used internally at Microsoft (chaiken2008scope; zhou2012scope). The offline job processes 7M columns with over 1TB of data in under 3 hours.
5.1. Benchmark Evaluation
We build benchmarks using real data to evaluate the quality of data-validation rules generated by different algorithms.
Data set. We evaluate algorithms using two real corpora:
Enterprise: We crawled data produced by production pipelines from an enterprise data lake at Microsoft. Over 100K daily production pipelines (for Bing, Office, etc.) read/write data in the lake (zhou2012scope).
Government: We crawled data files in the health domain from NationalArchives.gov.uk, following a procedure suggested in (bogatu2020dataset) (e.g., using queries like “hospitals”). We used the crawler from the authors of (bogatu2020dataset) (code is available from (crawler)). These data files correspond to data in a different domain (government). This government benchmark is released at https://github.com/jiesongk/autovalidate.
We refer to these two corpora as T_E (Enterprise) and T_G (Government), respectively. The statistics of the two corpora can be found in Table 1.
Corpus  # files  # columns  avg # values per column (stddev)  avg # distinct values per column (stddev)
Enterprise (T_E)  507K  7.2M  8945 (17778)  1543 (7219)
Government (T_G)  29K  628K  305 (331)  46 (119)
Evaluation methodology. We randomly sample 1000 columns from the Enterprise and Government corpora, respectively, to produce two benchmark sets of query columns, denoted by B_E and B_G. (We use the first 1000 values of each column in B_E and the first 100 values of each column in B_G to control for column-size variations.)
Given a benchmark B with 1000 columns, we programmatically evaluate the precision and recall of the data-validation rules generated on B as follows. For each column C ∈ B, we use the values of C that arrive first as the “training data”, denoted C_train, from which validation rules need to be inferred. The remaining values are used as the “testing data” C_test, which arrive in the future but will be tested against the inferred validation rules.
Each algorithm can observe C_train and “learn” a data-validation rule h. The inferred rule is then “tested” on two groups of “testing data”: (1) C_test; and (2) the testing data of other columns in the benchmark.
For (1), we know that when h is used to validate C_test, no errors should be detected, because C_train and C_test were drawn from the same column C. If h incorrectly reports any error on C_test, it is likely a false positive.
For (2), when h is used to validate the testing data of another column C′ ≠ C, we know that C′ is likely from a different domain than C, and should thus be flagged by h. (This simulates schema-drift and schema-alignment errors that are common in practice, because upstream data can add or remove columns.) This allows us to evaluate the “recall” of each h.
More formally, we define the precision of algorithm A on test case C, denoted P(A, C), as 1 if no value in C_test is incorrectly flagged as an error by the rule that A infers, and 0 otherwise. The overall precision on benchmark B is the average across all cases: P(A) = (1/|B|) Σ_{C ∈ B} P(A, C).
The recall of algorithm A on case C, denoted R(A, C), is defined as the fraction of the other cases in B that the rule inferred on C can correctly report as errors:
(17)  R(A, C) = |{C′ : C′ ∈ B, C′ ≠ C, C′ flagged as error}| / (|B| − 1)
Since high precision is critical for data validation, if algorithm A produces false alarms on case C, we squash its recall R(A, C) to 0. The overall recall across all cases in B is then: R(A) = (1/|B|) Σ_{C ∈ B} R(A, C).
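The evaluation protocol above can be sketched as follows, with hypothetical shapes for the learned rules and benchmark data (a rule is modeled as a predicate that returns True when a value conforms).

```python
def evaluate(rules, benchmark):
    """Sketch of the programmatic evaluation described above:
    `benchmark` maps a column id to its held-out test values C_test,
    and `rules` maps a column id to the learned predicate
    h(value) -> True iff the value conforms. Both shapes are
    illustrative assumptions, not the paper's actual interfaces.
    """
    precisions, recalls = [], []
    for cid, h in rules.items():
        # precision: no value of the column's own test data is flagged
        own_ok = all(h(v) for v in benchmark[cid])
        precisions.append(1.0 if own_ok else 0.0)
        # recall: fraction of *other* columns' test data flagged as
        # errors, squashed to 0 if the rule false-alarms on its own column
        others = [o for o in benchmark if o != cid]
        flagged = sum(not all(h(v) for v in benchmark[o]) for o in others)
        recalls.append(flagged / len(others) if own_ok else 0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n

# A digit rule and a letter rule flag each other's columns and never
# their own:
benchmark = {"qty": ["1", "2"], "code": ["x", "y"]}
rules = {"qty": str.isdigit, "code": str.isalpha}
scores = evaluate(rules, benchmark)
# scores == (1.0, 1.0)
```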
In addition to the programmatic evaluation, we also manually labeled each case in the benchmark with an ideal ground-truth validation pattern. We use these ground-truth labels both to accurately report precision/recall and to confirm the validity of our programmatic evaluation methodology.
5.2. Methods Compared
We compare the following algorithms on each benchmark, reporting precision and recall for each.
AutoValidate. This is our proposed approach. To understand the behavior of different variants of FMDV, we report FMDV, FMDV-V, FMDV-H, as well as a variant FMDV-VH that combines vertical and horizontal cuts. For FMDV-H and FMDV-VH, we report numbers using the two-tailed Fisher's exact test with a significance level of 0.01.
Deequ (schelter2019unit). Deequ is a pioneering library from Amazon for declarative data validation. It is capable of suggesting data-validation rules based on training data. We compare with the two relevant rules in Deequ for string-valued data (version 1.0.2): the CategoricalRangeRule (referred to as Deequ-Cat) and the FractionalCategoricalRangeRule (referred to as Deequ-Fra) (Deequ), which learn fixed dictionaries and require future test data to be in the dictionaries, either completely (Deequ-Cat) or partially (Deequ-Fra).
TensorFlow Data Validation (TFDV) (TFDV). TFDV is another pioneering data-validation library for machine-learning pipelines in TensorFlow. It can also infer dictionary-based validation rules (similar to Deequ-Cat). We install TFDV via Python pip (version 0.15.0) and invoke it directly from Python.
Potter’s Wheel (PWheel) (raman2001potter). This is an influential pattern-profiling method, which finds the best pattern for a data column based on the minimum description length (MDL) principle.
SQL Server Integration Services (SSIS) (ssisprofiling). SQL Server has a data-profiling feature in SSIS. We invoke it programmatically to produce regex patterns for each column.
XSystem (ilyas2018extracting). This recent approach develops a flexible branch-and-merge strategy for pattern profiling. We use the authors’ implementation on GitHub (xsystemcode) to produce patterns.
FlashProfile (padhi2018flashprofile). FlashProfile is a recent approach to pattern profiling, which clusters similar values by a distance score. We use the authors’ implementation in NuGet (FlashProfileCode) to produce regex patterns for each column.
Grok Patterns (Grok) (grok). Grok has a collection of 60+ manually-curated regex patterns, widely used in log parsing and by vendors like AWS Glue ETL (Glue), to recognize common data types (e.g., timestamp, ip-address, etc.). For data validation, we use all values in the training data to determine whether they match a known Grok pattern (e.g., ip-address). This approach is likely high-precision but low-recall, because only common data types are curated.
Schema-Matching (rahm2001survey). Because AutoValidate leverages related tables in the corpus as additional evidence to derive validation patterns, we also compare with vanilla schema-matching, which “broadens” the training examples using related tables in the corpus. Specifically, we compare with two instance-based techniques, SchemaMatchingInstance1 (SMI1) and SchemaMatchingInstance10 (SMI10), which use any corpus column that overlaps with more than 1 or 10 instances of the query column, respectively, as additional training examples. We also compare with two pattern-based techniques, SchemaMatchingPatternMajority (SMPM) and SchemaMatchingPatternPlurality (SMPP), which use as training data the corpus columns whose majority-patterns and plurality-patterns match those of the query column, respectively. We invoke PWheel on the resulting data, since it is the best-performing pattern-profiling technique in our experiments.
Functional-dependency upper-bound (FDUB). Functional dependency (FD) is an orthogonal approach that leverages multi-column dependencies for data quality (rahm2000data). While FDs inferred from individual table instances often may not hold in a semantic sense (berti2018discovery; papenbrock2016hybrid), we nevertheless evaluate the fraction of benchmark columns that are part of any FD in their original tables, which would be a recall upper-bound for FD-based approaches. For simplicity, in this analysis we assume perfect precision for FD-based methods.
Auto-Detect upper-bound (ADUB) (huang2018auto). Auto-Detect is a recent approach that detects incompatibility through pairs of common patterns that rarely co-occur. For a pair of values to be recognized as incompatible in Auto-Detect, both values need to correspond to common patterns, which limits its coverage. As with FDUB, we evaluate the recall upper-bound of Auto-Detect (assuming perfect precision).
5.3. Experimental Results
Accuracy. Figure 10(a) shows the average precision/recall of all methods on the enterprise benchmark with 1000 randomly sampled test cases. Since no pattern-based method (FMDV, PWheel, SSIS, XSystem, etc.) can generate meaningful patterns (beyond the trivial “.*”) on 429 out of the 1000 cases (e.g., because they contain substantial natural-language content), we report results on the remaining 571 cases in Figure 10(a), where pattern-based methods are applicable.
It can be seen from Figure 10(a) that variants of FMDV are substantially better than the other methods in both precision and recall, with FMDV-VH being the best at 0.96 precision and 0.88 recall on average. FMDV-VH is better than FMDV-H, which is in turn better than FMDV-V and FMDV, showing the respective benefits of using vertical cuts and horizontal cuts.
Among all the baselines, we find PWheel and SMI1 (which uses schema-matching with 1-instance overlap) to be the most competitive, indicating that patterns are indeed applicable to validating string-valued data, but need to be carefully selected to be effective.
Our experiments on FDUB confirm its orthogonal nature to the single-column constraints considered by AutoValidate – the upper-bound recall of FDUB only “covers” around 25% of the cases handled by AutoValidate (assuming discovered FDs have perfect precision).
The two data-validation methods TFDV and Deequ do not perform well on string-valued data, partly because their current focus is on numeric data, and both use relatively simple logic for string data (e.g., dictionaries), which often leads to false positives.
Similar results can be observed on the government benchmark, shown in Figure 10(b). (To avoid clutter, we omit methods that are not competitive in this figure.) This benchmark is more challenging because we have a smaller corpus and less clean data (e.g., many are manually-edited CSV files), which leads to lower precision/recall for all methods. Nevertheless, FMDV methods still produce the best quality, showing the robustness of the proposed approaches on challenging data corpora.
Evaluation Method  precision  recall
Programmatic evaluation  0.961  0.880
Hand-curated ground-truth  0.963  0.915
For all comparisons in Figure 10 discussed above, we perform statistical tests using the F-scores between FMDV-VH and all other methods on 1000 columns (with the null hypothesis being that the F-score of FMDV-VH is no better than that of a given baseline). We report that the p-values of all comparisons range from 0.001 to 0.007 (substantially smaller than the significance level), indicating that we should reject the null hypothesis and that the observed advantages are likely significant.
Manually-labeled ground-truth. While our programmatic evaluation provides a good proxy of the ground-truth without requiring any labeling effort, we also perform a manual evaluation to verify the validity of the programmatic evaluation.
Specifically, we manually label the ground-truth validation patterns for the 1000 test cases in the enterprise benchmark, and perform two types of adjustments: (1) To accurately report precision, we manually remove values in the test set of each column that should not belong to the column (e.g., occasionally column headers are parsed incorrectly as data values), to ensure that we do not unfairly punish methods that identify correct patterns; and (2) To accurately report recall, we identify the ground-truth pattern of each test query column, so that in the recall evaluation, if another column is drawn from the same domain with an identical pattern, we do not count it as a recall loss.
We note that both adjustments improve precision/recall, because our programmatic evaluation underestimates the true precision/recall. Table 2 compares the quality results using the programmatic evaluation vs. the manual ground-truth, which confirms the validity of the programmatic evaluation.
Figure 11 shows a case-by-case analysis of F1 results of the competitive methods, using 100 randomly sampled cases, sorted by their results on FMDV-VH for ease of comparison. We can see that FMDV dominates the other methods. An error analysis of the failed cases shows that they are mainly attributable to advanced pattern constructs, such as flexibly-formatted URLs and unions of distinct patterns, which are not supported by our current profiler.
Sensitivity. Figure 12 shows a detailed sensitivity analysis of all FMDV variants using average precision/recall. Figure 12(a) shows the results with a varying FPR target r, from the most strict 0 to a more lax 0.1. As we can see, the value of r directly translates into a precision/recall trade-off, and r is an effective knob for different precision targets. The performance of the proposed FMDV-VH variant is not sensitive to the choice of r.
Figure 12(b) shows the effect of varying the coverage target m from 0 to 100. We see that the precision/recall of our algorithms is not sensitive to m in most cases, because the randomly sampled test columns are likely to have popular patterns matching thousands of columns. We recommend using a large m (e.g., at least 100) to ensure confident domain inference.
Figure 12(c) shows the impact of varying τ, the maximum number of tokens in a column for it to be indexed (Section 2.4). As can be seen from the figure, algorithms using vertical cuts (FMDV-V and FMDV-VH) are insensitive to smaller τ, while algorithms without vertical cuts (FMDV and FMDV-H) suffer substantial recall loss with a small τ, which shows the benefit of using vertical cuts. We recommend using FMDV-VH with a small τ (e.g., 8), which is efficient and inexpensive to run.
Figure 12(d) shows the sensitivity to the tolerance parameter θ. We can see that the algorithm is not sensitive to θ, as long as it is not too small.
Pattern analysis. Because our offline index enumerates all possible patterns that can be generated from the corpus, it provides a unique opportunity to understand: (1) all common domains in the corpus, independent of query columns, defined as patterns with high coverage and low FPR; and (2) the characteristics of all candidate patterns generated from the corpus.
For (1), we inspect the “head” patterns in the index file, e.g., patterns matching over 10K columns and with low FPR. This indeed reveals many common “domains” of interest in this data lake, some examples of which are shown in Figure 3. We note that identifying data domains in proprietary formats has applications beyond data validation (e.g., in enterprise data search), and is an interesting area for future research.
For (2), we show an analysis of all candidate patterns in the index in Figure 13. Figure 13(a) shows the frequency of patterns by token length (each <num>, <letter>, etc. is a token). We can see that the patterns are fairly evenly distributed, with patterns of 5-7 tokens being the most common. Figure 13(b) shows the frequency of all distinct candidate patterns. We observe a power-law-like distribution – while the “head” patterns are likely useful domain patterns, the vast majority have low coverage and are either generalized too narrowly or not common domains.
Efficiency. Figure 14 shows the average latency (in milliseconds) to process one query column in our benchmark. We emphasize that low latency is critical for human-in-the-loop scenarios (e.g., TFDV), where users are expected to verify suggested constraints.
On the right of the figure, we can see the average latency for PWheel, FlashProfile, and XSystem, respectively (we use the authors’ original implementations for the last two methods (FlashProfileCode; xsystemcode)). It can be seen that these existing pattern-profiling techniques all require on average 6-7 seconds per column. Given that tables often have tens of columns, the end-to-end latency can be slow.
In comparison, we observe that despite using a more involved algorithm and a large corpus, all FMDV variants are two orders of magnitude faster, with the most expensive FMDV-VH taking only 0.082 seconds per column. This demonstrates the benefit of the offline indexing step in Section 2.4, which distills the corpus (terabytes of data) down to a small index with summary statistics (less than one gigabyte), and pushes expensive reasoning offline, allowing for fast online response times. As an additional reference point, without offline indexes, the “FMDV (no-index)” method has to scan the corpus for each query and is many orders of magnitude slower.
For the offline indexing step, we report that the end-to-end latency of our job (on a cluster with 10 virtual nodes) ranges from around 1 hour (with a small token-length limit τ) to around 3 hours (with a larger τ). We believe this shows that our algorithm is viable even on small-scale clusters, despite using a large number of patterns.
User study. To evaluate the human benefit of suggesting data-validation patterns, we perform a user study by recruiting 5 programmers (all with at least 5 years of programming experience) to write data-validation patterns for 20 sampled test columns in the enterprise benchmark. This corresponds to the scenario of developers manually writing patterns without the help of the proposed algorithm.
Programmer  avgtime (sec)  avgprecision  avgrecall 
#1  145  0.65  0.638 
#2  123  0.45  0.431 
#3  84  0.3  0.266 
FMDV-VH  0.08  1.0  0.978
We report that 2 out of the 5 users failed completely on the task (their regexes are either ill-formed or fail to match the given examples). Table 3 shows the results for the remaining 3 users. We observe that on average, they spend 117 seconds to write a regex for one test column (as it often requires many trials and errors). This is in sharp contrast to the 0.08 seconds required by our algorithm. The programmers’ average precision is 0.47, also substantially lower than the algorithm’s precision of 1.0 on hold-out test data.
Case studies using Kaggle. In order to assess the benefit of data validation against schema-drift on publicly available data sets, we conduct a case study using 11 tasks sampled from Kaggle (kaggle), where each task has at least 2 string-valued categorical attributes. These 11 tasks include 7 classification tasks: Titanic, AirBnb, BNPParibas, RedHat, SFCrime, WestNile, and WalmartTrips; and 4 regression tasks: HousePrice, HomeDepot, Caterpillar, and WalmartSales.
To simulate schema-drift (breck2019data; polyzotis2017data; schelter2018automating), for each Kaggle task we keep the training data unchanged, but swap the positions of two categorical attributes in the testing data (e.g., if a data set has two attributes date and time at column positions 1 and 2, respectively, then after the simulated schema-drift the two attributes will be at column positions 2 and 1, respectively). This creates a small schema mismatch between training/testing data that is hard to detect but can be common in practice, as explained in (breck2019data; polyzotis2017data).
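The simulation above can be sketched as follows; representing rows as plain dicts is an illustrative assumption, not the actual harness used in the study.

```python
def simulate_schema_drift(test_rows, col_a, col_b):
    """Sketch of the schema-drift simulation described above: swap the
    values of two attributes in every row of the testing data, while
    the training data and the column names stay unchanged. Rows are
    plain dicts here purely for illustration.
    """
    return [{**row, col_a: row[col_b], col_b: row[col_a]}
            for row in test_rows]

test_rows = [{"date": "02/18/2015", "time": "00:00:00"}]
drifted = simulate_schema_drift(test_rows, "date", "time")
# drifted[0] == {"date": "00:00:00", "time": "02/18/2015"}
```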
Figure 15 shows the quality results with and without validation on these Kaggle tasks. For all tasks, we use a popular GBDT method called XGBoost (xgboost) with default parameters. We report R2 for regression tasks and average precision for classification tasks. The left (blue) bar of each task, labeled NoSchemaDrift, shows the prediction-quality scores on the original data sets, without schema-drift; we normalize these scores to 100%. The middle (green) bar, SchemaDrift-without-Validation, shows the quality scores with schema-drift, measured relative to the original quality scores; we observe a drop of up to 78% (WalmartTrips) in normalized scores. Lastly, the right (red) bar, SchemaDrift-with-Validation, shows the quality when data validation is used, which correctly detects schema-drift in 8 out of 11 tasks (all except WestNile, HomeDepot and WalmartTrips), with no false positives. Addressing such validation alerts would significantly boost the resulting quality scores. While it is generally known that schema-drift hurts ML quality (breck2019data; TFDV), our analysis quantifies its impact on publicly available data and confirms the importance of data validation.
6. Related Works
Data Validation. Notable recent efforts on data validation include Google’s TensorFlow Data Validation (TFDV) (breck2019data; TFDV) and Amazon’s Deequ (Deequ; schelter2018automating). These offerings allow developers to write declarative data-quality constraints that describe what “normal” data should look like, as described in detail in the introduction.
Error Detection. There is an influential line of work on error detection, including methods that focus on multi-column dependencies, such as FDs and denial constraints (e.g., (berti2018discovery; chiang2008discovering; chu2013; dasu2002mining; golab2010data; ilyas2004cords; kivinen1995approximate; yancoded)), where most techniques require some amount of human supervision (e.g., manually specified/verified constraints). AutoValidate is unsupervised and focuses on single-column constraints, which naturally complements existing work.
Recent approaches propose learning-based error-detection methods (e.g., (heidari2019holodetect; huang2018auto; liu2020picket; mahdavi2019raha; qahtan2019anmat; yan2020scoded; wang2019uni)), some of which are unsupervised (huang2018auto; liu2020picket; wang2019uni) and similar in spirit to AutoValidate.
Pattern Profiling. There is also a long line of work on pattern profiling, including pioneering methods such as Potter’s Wheel (raman2001potter) that leverages the MDL principle, and other related techniques (ilyas2018extracting; fisher2005pads; fisher2008dirt; naumann2014data). While both data validation and profiling produce patterns, data profiling only aims to produce patterns that “cover” the given values in a column, without considering valid values from the same domain that are absent from the column, and is thus prone to false positives when used for data validation (on data that arrive in the future).
Other forms of validation. While pattern-based validation is a natural fit for machine-generated data, alternative forms of validation logic can be better suited for other types of data. For example, for natural-language data drawn from a fixed vocabulary (e.g., countries or airport codes), dictionary-based validation learned from examples (e.g., set expansion (he2011seisa; pantel2009web; wang2007language)) is applicable. For complex data types, semantic-type validation (hulsebos2019sherlock; yan2018synthesizing; zhang2019sato) is also a suitable choice.
7. Conclusions and Future Work
Observing the need to automate data validation in production pipelines, we propose a corpus-driven approach to inferring single-column constraints that can be used to auto-validate string-valued data. Possible directions for future work include extending beyond “machine-generated data” to consider natural-language-like data, and extending the same validation principle to numeric data.