Error detection is a natural first step in every data analysis pipeline (IlyasC15, ; osborne2013best, ). Data inconsistencies due to incorrect or missing data values can have a severe negative impact on the quality of downstream analytical results. However, identifying errors in a noisy dataset can be a challenging problem. Errors are often heterogeneous and exist due to a diverse set of reasons (e.g., typos, integration of stale data values, or misalignment), and in many cases can be rare. This makes manual error detection prohibitively time consuming.
Several error detection methods have been proposed in the literature to automate error detection (Rahm00, ; IlyasC15, ; Fan2012, ; HalevyBook, ). Most prior work leverages the side effects of data errors to solve error detection. For instance, many of the proposed methods rely on violations of integrity constraints (IlyasC15, ), value-patterns (wrangler, ), duplicate detection (Elmagarmid07, ; 2010Naumann, ), or outlier detection (Dasu2012, ; Wu:2013, ; 2015combining, ) to identify erroneous records. While effective in many cases, these methods are tailored to specific types of side effects of erroneous data. As a result, their recall is limited to errors that exhibit those specific side effects (e.g., constraint violations, duplicates, or attribute/tuple distributional shifts) (AbedjanCDFIOPST16, ).
One approach to address the heterogeneity of errors and their side effects is to combine different detection methods in an ensemble (AbedjanCDFIOPST16, ). For example, given access to different error detection methods, one can apply them sequentially or can use voting-based ensembles to combine the outputs of different methods. Despite the simplicity of ensemble methods, their performance can be sensitive to how different error detectors are combined (AbedjanCDFIOPST16, ). This can be either with respect to the order in which different methods are used or the confidence-level associated with each method. Unfortunately, appropriate tools for tuning such ensembles are limited, and the burden of tuning these tools is on the end-user.
A different way to address heterogeneity is to cast error detection as a machine learning (ML) problem, i.e., a binary classification problem: given a dataset, classify its entries as erroneous or correct. One can then train an ML model to discriminate between erroneous and correct data. Beyond automation, a suitably expressive ML model should be able to capture the inherent heterogeneity of errors and their side effects and will not be limited to low recall. However, the end-user is now burdened with the collection of enough labeled examples to train such an expressive ML model.
1.1. Approach and Technical Challenges
We propose a few-shot learning framework for error detection based on weak supervision (Ratner:2017:SRT:3173074.3173077, ; DBLP:conf/kdd/Re18, ), which exploits noisier or higher-level signals to supervise ML systems. We start from this premise and show that data augmentation (Perez2017TheEO, ; 45820, ), a form of weak supervision, enables us to train high-quality ML-based error detection models with minimal human involvement.
Our approach exhibits significant improvements over a comprehensive collection of error detection methods: we show that our approach is able to detect errors with an average precision of ~94% and an average recall of ~93%, obtaining an average improvement of 20 points against competing error detection methods. At the same time, our weakly supervised methods require access to 3× fewer labeled examples compared to other ML approaches. Our ML-approach also needs to address multiple technical challenges:
[Model] The heterogeneity of errors and their side effects makes it challenging to identify the appropriate statistical and integrity properties of the data that should be captured by a model in order to discriminate between erroneous and correct cells. These properties correspond to attribute-level, tuple-level, and dataset-level features that describe the distribution governing a dataset. Hence, we need an appropriately expressive model for error detection that captures all these properties (features) to maximize recall.
[Imbalance] Often, errors in a dataset are limited. ML algorithms tend to produce unsatisfactory classifiers when faced with imbalanced datasets. The features of the minority class are treated as noise and are often ignored. Thus, there is a high probability of misclassification of the minority class as compared to the majority class. To deal with imbalance, one needs to develop strategies to balance classes in the training data. Standard methods to deal with the imbalance problem such as resampling can be ineffective due to error heterogeneity as we empirically show in our experimental evaluation.
[Heterogeneity] Heterogeneity amplifies the imbalance problem as certain errors and their side effects can be underrepresented in the training data. Resampling the training data does not ensure that errors with different properties are revealed to the ML model during training. While active learning can help counteract this problem in cases of moderate imbalance (chawla2004special, ; Ertekin:2007:ALC:1277741.1277927, ), it tends to fail in the case of extreme imbalance (He:2013:ILF:2559492, ) (as in the case of error detection). This is because the lack of labels prevents the selection scheme of active learning from identifying informative instances for labeling (He:2013:ILF:2559492, ). Different methods that are robust to extreme imbalance are needed.
A solution that addresses the aforementioned challenges needs to: (1) introduce an expressive model for error detection, while avoiding explicit feature engineering; and (2) propose novel ways to handle the extreme imbalance and heterogeneity of data in a unified manner.
1.2. Contributions and Organization
To obviate the need for feature engineering we design a representation learning framework for error detection. To address the heterogeneity and imbalance challenges we introduce a data augmentation methodology for error detection. We summarize the main contributions as follows:
We introduce a template ML-model to learn a representation that captures attribute-, tuple-, and dataset-level features that describe a dataset. We demonstrate that representation learning obviates the need for feature engineering. Finally, we show via ablation studies that all granularities need to be captured by error detection models to obtain high-quality results.
We show how to use data augmentation to address data imbalance. Data augmentation proceeds as follows: Given a small set of labeled data, it allows us to generate synthetic examples of errors by transforming correct examples in the available training data. This approach minimizes the number of manually labeled examples required. We show that in most cases a small number of labeled examples is enough to train high-quality error detection models.
We present a weakly supervised method to learn data transformations and data augmentation policies (i.e., the distribution over those data transformations) directly from the noisy input dataset. The use of different transformations during augmentation provides us with examples that correspond to different types of errors, which enables us to address the aforementioned heterogeneity challenge.
The remainder of the paper proceeds as follows: In Section 2 we review background concepts. Section 3 provides an overview of our weak supervision framework. In Section 4, we introduce our representation learning approach to error detection. In Section 5, we establish a data augmentation methodology for error detection, and in Section 6, we evaluate our proposed solutions. We discuss related work in Section 7 and summarize key points of the paper in Section 8.
2. Background
We review basic background material for the problems and techniques discussed in this paper.
2.1. Error Detection
The goal of error detection is to identify incorrect entries in a dataset. Existing error detection methods can be categorized in three main groups: (1) Rule-based methods (holistic, ; dallachiesa2013nadeef, ) rely on integrity constraints such as functional dependencies and denial constraints, and suggest errors based on the violations of these rules. Denial Constraints (DCs) are first order logic formulas that subsume several types of integrity constraints (Chomicki:2005:MIM:1709465.1709573, ). Given a set of operators $B = \{=, <, >, \neq, \leq, \approx\}$, with $\approx$ denoting similarity, DCs take the form $\forall t_i, t_j \in D: \neg(P_1 \wedge \dots \wedge P_k \wedge \dots \wedge P_K)$ where $D$ is a dataset with attributes $A = \{A_1, A_2, \dots, A_N\}$, $t_i$ and $t_j$ are tuples, and each predicate $P_k$ is of the form $(t_i[A_n] \; op \; t_j[A_m])$ or $(t_i[A_n] \; op \; \alpha)$ where $A_n, A_m \in A$, $\alpha$ is a constant and $op \in B$. (2) Pattern-driven methods leverage normative syntactic patterns and identify erroneous entries as those that do not conform with these patterns (wrangler, ). (3) Quantitative error detection focuses on outliers in the data and declares those to be errors (hellerstein2008quantitative, ). A problem related to error detection is record linkage (Elmagarmid07, ; 2010Naumann, ; HalevyBook, ), which tackles the problem of identifying if multiple records refer to the same real-world entity. While it can also be viewed as a classification problem, it does not detect errors in the data and is not the focus of this paper.
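To make the rule-based category concrete, the following is a minimal sketch of checking a single denial constraint over a toy table. The attribute names (`Zip`, `City`), the predicate encoding, and the helper `violations` are illustrative assumptions and not part of any system cited above. The constraint encodes the functional dependency Zip → City, and because both of its predicates are symmetric, unordered tuple pairs are sufficient.

```python
import operator
from itertools import combinations

# Hypothetical encoding: a predicate is (attribute of t_i, operator, attribute of t_j).
# The constraint below states: for all t_i, t_j,
#   not (t_i[Zip] = t_j[Zip] and t_i[City] != t_j[City])
ZIP_DETERMINES_CITY = [("Zip", operator.eq, "Zip"), ("City", operator.ne, "City")]

def violations(rows, constraint):
    """Return index pairs (i, j), i < j, whose tuples satisfy every predicate.

    A pair that satisfies all predicates violates the denial constraint.
    Only unordered pairs are checked, which assumes symmetric predicates.
    """
    found = []
    for (i, t_i), (j, t_j) in combinations(enumerate(rows), 2):
        if all(op(t_i[a], t_j[b]) for a, op, b in constraint):
            found.append((i, j))
    return found

rows = [
    {"Zip": "60612", "City": "Chicago"},
    {"Zip": "60612", "City": "Chicago"},
    {"Zip": "60612", "City": "Cicago"},   # typo in City
    {"Zip": "60614", "City": "Chicago"},
]
print(violations(rows, ZIP_DETERMINES_CITY))  # [(0, 2), (1, 2)]
```

Note that the violation only indicates that at least one cell in each pair is wrong; it does not say which one.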
2.2. Data Augmentation
Data augmentation is a form of weak supervision (DBLP:conf/kdd/Re18, ) and refers to a family of techniques that aim to extend a dataset with additional data points. Data augmentation is typically applied to training data as a way to reduce overfitting of models (45820, ). Data augmentation methods typically consist of two components: (1) a set of data transformations that take a data point as input and generate an altered version of it, and (2) an augmentation policy that determines how different transformations should be applied, i.e., a distribution over different transformations. Transformations are typically specified by domain experts while policies can be either pre-specified (Perez2017TheEO, ) or learned via reinforcement learning or random search methods (cubuk2018autoaugment, ; DBLP:conf/nips/RatnerEHDR17, ). In contrast to prior work, we show that for error detection both transformations and policies can be learned directly from the data.
2.3. Representation Learning
The goal of representation learning is to find an appropriate representation of data (i.e., a set of features) to perform a machine learning task (Bengio:2013:RLR:2498740.2498889, ). In our error detection model we build upon three standard representation learning techniques:
Neural Networks: Representation learning is closely related to neural networks (GoodBengCour16, ). The most basic neural network takes as input a vector $x$ and performs an affine transformation of the input $wx + b$. It also applies a non-linear activation function $\sigma$ (e.g., a sigmoid) to produce the output $\sigma(wx + b)$. Multiple layers can be stacked together to create more complex networks. In a neural network, each hidden layer maps its input data to an internal representation that tends to capture a higher level of abstraction.
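As a small illustration of the layer just described, the sketch below computes an affine transformation followed by an element-wise sigmoid in plain Python. The function names and the toy weights are assumptions made for this example only.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(x, weights, bias):
    """Affine map W x + b followed by an element-wise sigmoid.

    weights is a list of rows (one row per output unit); bias has one
    entry per output unit.
    """
    return [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]

# Stacking two layers: the output of the first is the input of the second.
hidden = dense_layer([1.0, -2.0], [[0.5, 0.1], [-0.3, 0.8]], [0.0, 0.1])
output = dense_layer(hidden, [[1.0, -1.0]], [0.0])
```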
Highway Neural Networks: Highway Networks adapt the idea of having “shortcut” gates that allow unimpeded information to flow across non-consecutive layers (srivastava2015highway, ). Highway Networks are used to improve performance in many domains such as speech recognition (zhang2016highway, ) and language modeling (kim2016character, ), and their variants called Residual networks have been useful for many computer vision problems (he2016deep, ).
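The gating idea can be sketched with the standard highway layer formulation $y = T(x) \cdot H(x) + (1 - T(x)) \cdot x$, where $H$ is a non-linear transform and $T$ is a sigmoid gate. The code below is a minimal plain-Python sketch under that assumption; the parameter names and the choice of tanh for $H$ are illustrative and not taken from this paper.

```python
import math

def _affine(x, weights, bias):
    """Compute W x + b for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x.

    H(x) is a tanh transform and T(x) a sigmoid gate; input and output
    must have the same dimension so that x can be carried through.
    """
    h = [math.tanh(z) for z in _affine(x, w_h, b_h)]
    t = [1.0 / (1.0 + math.exp(-z)) for z in _affine(x, w_t, b_t)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [0.3, -0.7]
# A strongly negative gate bias closes the gate, so the input is carried unchanged.
carried = highway_layer(x, identity, [0.0, 0.0], zeros, [-50.0, -50.0])
```

With a strongly positive gate bias the layer instead behaves like an ordinary non-linear layer.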
Distributed Representations: Distributed representations of symbolic data (Hinton:1986:DR:104279.104287, ) were first used in the context of statistical language modeling (Bengio:2003:NPL:944919.944966, ). The goal here is to learn a mapping of a token (e.g., a word) to a vector of real numbers, called a word embedding. Methods to generate these mappings include neural networks (word2vec, ), dimensionality reduction techniques such as PCA (lebret-collobert:2014:EACL, ), and other probabilistic techniques (Globerson:2007:EEC:1314498.1314572, ).
3. Framework Overview
We formalize the problem of error detection and provide an overview of our solution to error detection.
3.1. Problem Statement
The goal of our framework is to identify erroneous entries in a relational dataset $D$. We denote by $A = \{A_1, A_2, \dots, A_N\}$ the attributes of $D$. We follow set semantics and consider $D$ to be a set of tuples. Each tuple $t \in D$ is a collection of cells $C_t = \{t[A_1], t[A_2], \dots, t[A_N]\}$ where $t[A_i]$ denotes the value of attribute $A_i$ for tuple $t$. We use $C_D$ to denote the set of cells contained in $D$. The input dataset $D$ can also be accompanied by a set of integrity constraints $\Sigma$, such as Denial Constraints as described in Section 2.1.
We assume that errors in $D$ appear due to inaccurate cell assignments. More formally, for a cell $c$ in $C_D$ we denote by $v_c^*$ its unknown true value and by $v_c$ its observed value. We define an error in $D$ to be each cell $c$ with $v_c \neq v_c^*$. We define a training dataset $T$ to be a set of tuples $T = \{(c, v_c, v_c^*)\}_{c \in C_T}$ where $C_T \subset C_D$. $T$ provides labels (i.e., correct or erroneous) for a subset of cells in $D$. We also define a variable $E_c$ for each cell $c \in C_D$ with $E_c = -1$ indicating that the cell is erroneous and with $E_c = 1$ indicating that the cell is correct. For each $E_c$ we denote by $e_c^*$ its unknown true assignment.
Our goal is stated as follows: given a dataset $D$ and a training dataset $T$ find the most probable assignment $\hat{e}_c$ to each variable $E_c$ with $c \in C_D \setminus C_T$. We say that a cell is correctly classified as erroneous or correct when $\hat{e}_c = e_c^*$.
3.2. Model Overview
Prior models for error detection focus on specific side effects of data errors. For example, they aim to detect errors by using only the violations of integrity constraints or aim to identify outliers with respect to the data distribution that are introduced due to errors. Error detectors that focus on specific side effects, such as the aforementioned ones, are not enough to detect errors with a high recall in heterogeneous datasets (Abedjan_2016, ). This is because many errors may not lead to violations of integrity constraints, nor appear as outliers in the data. We propose a different approach: we model the process by which the entries in a dataset are generated, i.e., we model the distribution of both correct and erroneous data. This approach enables us to discriminate better between these two types of data.
We build upon our recent Probabilistic Unclean Databases (PUDs) framework that introduces a probabilistic framework for managing noisy relational data (puds, ). We follow the abstract generative model for noisy data from that work, and introduce an instantiation of that model to represent the distribution of correct and erroneous cells in a dataset.
We consider a noisy channel model for databases that proceeds in two steps: First, a clean database is sampled from a probability distribution $I^*$. Distribution $I^*$ captures how values within an attribute and across attributes are distributed and also captures the compatibility of different tuples (i.e., it ensures that integrity constraints are satisfied). To this end, distribution $I^*$ is defined over attribute-, tuple-, and dataset-level features of a dataset. Second, given a clean database sampled by $I^*$, errors are introduced via a noisy channel that is described by a conditional probability distribution $H^*$. Given this model, $I^*$ characterizes the probability $P(v_c^*)$ of the unknown true value of a cell and $H^*$ characterizes the conditional probability $P(v_c \mid v_c^*)$ of its observed value. Distribution $I^*$ is such that errors in dataset $D$ lead to low probability instances. For example, $I^*$ assigns zero probability to datasets with entries that lead to constraint violations.
The goal is to learn a representation that captures the distribution of the correct cells ($I^*$) and how errors are introduced ($H^*$). Our approach relies on learning two models:
(1) Representation Model: We learn a representation model $Q$ that approximates distribution $I^*$ on the attribute, record, and dataset level. We require that $Q$ is such that the likelihood of correct cells given $Q$ will be high, while the likelihood of erroneous cells given $Q$ is low. This property is necessary for a classifier $M$ to discriminate between correct and erroneous cells when using representation $Q$. We rely on representation learning techniques to learn $Q$ jointly with $M$.
(2) Noisy Channel: We learn a generative model $H$ that approximates distribution $H^*$. This model consists of a set of transformations $\Phi$ and a policy $\Pi$. Each transformation $\phi \in \Phi$ corresponds to a function that takes as input a cell $c$ and transforms its original value $v_c$ to a new value $v_c'$, i.e., $\phi(v_c) = v_c'$. Policy $\Pi$ is defined as a conditional distribution $P(\Phi \mid v_c)$. As we describe next, we use this model to generate training data, via data augmentation, for learning $Q$ and $M$.
We now present the architecture of our framework. The modules described next are used to learn the noisy channel $H$, perform data augmentation by using $H$, and learn the representation model $Q$ jointly with a classifier $M$ that is used to detect errors in the input dataset.
3.3. Framework Overview
Our framework takes as input a noisy dataset $D$, a training dataset $T$, and (optionally) a set of denial constraints $\Sigma$. To learn $H$, $Q$, and $M$ from this input we use three core modules:
Module 1: Data Augmentation This module learns the noisy channel $H$ and uses it to generate additional training examples by transforming some of the labeled examples in $T$. The output of this module is a set of additional examples $T_H$. The operations performed by this module are:
(1) Transformation and Policy Learning: The goal here is to learn the set of transformations $\Phi$ and the policy $\Pi$ that follow the data distribution in $D$. We introduce a weakly supervised algorithm to learn $\Phi$ and $\Pi$. This algorithm is presented in Section 5.
(2) Example Generation: Given transformations $\Phi$ and policy $\Pi$, we generate a set of new training examples $T_H$ that is combined with $T$ to train the error detection model. To ensure high-quality training data, this part augments only cells that are marked correct in $T$. Using this approach, we obtain a balanced training set where examples of errors follow the distribution of errors in $D$. This is because transformations are chosen with respect to policy $\Pi$ which is learned from $D$.
Module 2: Representation This module combines different representation models to form model $Q$. Representation $Q$ maps a cell value $v_c$ to a fixed-dimension real-valued vector $f_c \in \mathbb{R}^d$. To obtain $f_c$ we concatenate the output of different representation models, each of which targets a specific context (i.e., attribute, tuple, or dataset context).
We allow a representation model to be learned during training, and thus, the output of a representation model can correspond to a vector of variables (see Section 4). For example, the output of a representation model can be an embedding obtained by a neural network that is learned during training or may be fixed to the number of constraint violations value $v_c$ participates in.
Module 3: Model Training and Classification This module is responsible for training a classifier $M$ that given the representation of a cell value determines if it is correct or erroneous, i.e., $M: \mathbb{R}^d \to \{\text{correct}, \text{erroneous}\}$. During training, the classifier is learned by using both the initial training data $T$ and the augmentation data $T_H$. At prediction time, the classifier $M$ takes as input the cell value representation for all cells in $D$ and assigns them a label from $\{\text{correct}, \text{erroneous}\}$ (see Section 4).
An overview of how the different modules are connected is shown in Figure 1. First, Module 1 learns transformations $\Phi$ and policy $\Pi$. Then, Module 2 grounds the representation model $Q$ of our error detection model. Subsequently, $Q$ is connected with the classifier model $M$ in Module 3 and trained jointly. The combined model is used for error detection.
4. Representations of Dirty Data
We describe how to construct the representation model $Q$ (see Section 3.2). We also introduce the classifier model $M$, and describe how we train $Q$ and $M$.
4.1. Representation Models
To approximate the data generating distribution $I^*$, the model $Q$ needs to capture statistical characteristics of cells with respect to attribute-level, tuple-level, and dataset-level contexts. An overview of model $Q$ is shown in Figure 2(A). As shown, $Q$ is formed by concatenating the outputs of different models. Next, we review the representation models we use for each of the three contexts. The models introduced next correspond to a bare-bone set that captures all aforementioned contexts, and is currently implemented in our prototype. More details on our implementation are provided in Appendix A.1. Our architecture can trivially accommodate additional models or more complex variants of the current models.
Attribute-level Representation: Models for this context capture the distributions governing the values and format for an attribute. Separate models are used for each attribute $A_i$ in dataset $D$. We consider three types of models: (1) Character and token sequence models that capture the probability distribution over sequences of characters and tokens in cell values. These models correspond to learnable representation layers. Figure 2(B) shows the deep learning architecture we used for learnable layers. (2) Format models that capture the probability distribution governing the format of the attribute. In our implementation, we consider an n-gram model that captures the format sequence over the cell value. Each n-gram is associated with a probability that is learned directly from dataset $D$. The probabilities are aggregated to a fixed-dimension representation by taking the probabilities associated with the $k$ least-probable n-grams. (3) Empirical distribution models that capture the empirical distribution of the attribute associated with a cell. These can be learned directly from the input dataset $D$. The representation here is a scalar that is the empirical probability of the cell value.
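The format model in item (2) can be sketched as follows. The mapping of characters to coarse classes (`A` for letters, `D` for digits), the boundary padding symbols, and the defaults `n=3` and `k=2` are assumptions made for this illustration; the text above does not specify them.

```python
from collections import Counter

def to_format(value):
    """Map each character to a coarse class: A = letter, D = digit, else itself."""
    return "".join("A" if ch.isalpha() else "D" if ch.isdigit() else ch for ch in value)

def ngrams(text, n):
    """Character n-grams of text, padded with start (^) and end ($) markers."""
    padded = "^" * (n - 1) + text + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def fit_format_model(column, n=3):
    """Empirical probability of every format n-gram observed in one attribute."""
    counts = Counter(g for v in column for g in ngrams(to_format(v), n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def format_features(value, model, n=3, k=2):
    """Fixed-size representation: probabilities of the k least-probable n-grams."""
    probs = sorted(model.get(g, 0.0) for g in ngrams(to_format(value), n))
    probs += [1.0] * (k - len(probs))  # pad very short values
    return probs[:k]

model = fit_format_model(["60612", "60614", "60657", "60601"])
clean = format_features("60612", model)
dirty = format_features("6O612", model)  # letter O instead of digit 0
```

A value whose format contains n-grams never seen in the column receives zero-probability entries, which is the signal a downstream classifier can pick up.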
Tuple-level Representation: Models for this context capture the joint distribution of different attributes. We consider two types of models: (1) Co-occurrence models that capture the empirical joint distribution over pairs of attributes. (2) A learnable tuple representation, which captures the joint distribution across attributes given the observed cell value. Here, we first obtain an embedding of the tuple by following standard techniques based on word-embedding models (Q17-1010, ). These embeddings are passed through a learnable representation layer (i.e., a deep network) that corresponds to an additional non-linear transform (see Figure 2(B)). For co-occurrence, we learn a single representation for all attributes. For tuple embeddings, we learn a separate model per attribute.
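One possible reading of the co-occurrence model is sketched below: for a target cell, the feature vector holds the empirical conditional probability of its value given the value of every other attribute in the same tuple. Using conditional rather than joint probabilities, as well as all function and attribute names, is an assumption of this sketch.

```python
from collections import Counter

def fit_cooccurrence(rows, attributes):
    """Count single values and value pairs across all attribute pairs."""
    pair_counts, value_counts = Counter(), Counter()
    for row in rows:
        for a in attributes:
            value_counts[(a, row[a])] += 1
            for b in attributes:
                if a != b:
                    pair_counts[(a, row[a], b, row[b])] += 1
    return pair_counts, value_counts

def cooccurrence_features(row, target, attributes, pair_counts, value_counts):
    """Empirical P(row[target] | row[b]) for every other attribute b."""
    feats = []
    for b in attributes:
        if b == target:
            continue
        denom = value_counts[(b, row[b])]
        joint = pair_counts[(target, row[target], b, row[b])]
        feats.append(joint / denom if denom else 0.0)
    return feats

attrs = ["Zip", "City"]
rows = [
    {"Zip": "60612", "City": "Chicago"},
    {"Zip": "60612", "City": "Chicago"},
    {"Zip": "60612", "City": "Chicago"},
    {"Zip": "60612", "City": "Cicago"},
]
pair_counts, value_counts = fit_cooccurrence(rows, attrs)
print(cooccurrence_features(rows[3], "City", attrs, pair_counts, value_counts))  # [0.25]
```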
Dataset-level Representation: Models for this context capture a distribution that governs the compatibility of tuples and values in the dataset $D$. We consider two types of models: (1) Constraint-based models that leverage the integrity constraints in $\Sigma$ (if given) to construct a representation model for this context. Specifically, for each constraint $\sigma \in \Sigma$ we compute the number of violations associated with the tuple of the input cell. (2) A neighborhood-based representation of each cell value that is informed by a dataset-level embedding of $D$ transformed via a learnable layer. Here, we train a standard word-embedding model where each tuple in $D$ is considered to be a document. To ensure that the embeddings are not affected by the sequence of values across attributes we extend the context considered by word-embeddings to be the entire tuple and treat the tuple as a bag-of-words. These embeddings are given as input to a learnable representation layer that follows the architecture in Figure 2(B).
The outputs of all models are concatenated into a single vector that is given as input to Classifier $M$. Learnable layers are trained jointly with $M$. To achieve high-quality error detection, features from all contexts need to be combined to form model $Q$. In Section 6, we present an ablation study which demonstrates that all features from all types of contexts are necessary to achieve high-quality results.
4.2. Error Classification
The classifier $M$ of our framework corresponds to a two-layer fully-connected neural network, with a ReLU activation layer, followed by a Softmax layer. The architecture of $M$ is shown in Figure 2(C). Given the modular design of our architecture, Classifier $M$ can be easily replaced with other models. Classifier $M$ is jointly trained with the representation model $Q$ by using the training data in $T$ and the data augmentation output $T_H$. We use ADAM (journals/corr/KingmaB14, ) to train our end-to-end model.
More importantly, we calibrate the confidence of the predictions of $M$ using Platt Scaling (DBLP:conf/icml/GuoPSW17, ; PlattProbabilisticOutputs1999, ) on a holdout-set from the training data (i.e., we keep a subset of $T$ for calibration). Platt Scaling proceeds as follows: Let $z_i$ be the score for class $i$ output by $M$. This score corresponds to a non-probabilistic prediction. To convert it to a calibrated probability, Platt Scaling learns scalar parameters $a, b \in \mathbb{R}$ and outputs $\hat{q}_i = \sigma(a z_i + b)$ as the calibrated probability for prediction $z_i$. Here, $\sigma$ denotes the sigmoid function. Parameters $a$ and $b$ are learned by optimizing the negative log-likelihood loss over the holdout-set. It is important to note that the parameters of $M$ and $Q$ are fixed at this stage.
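The calibration step can be illustrated with a minimal sketch for the binary case, fitting the two scalars by plain gradient descent on the negative log-likelihood. The optimizer, learning rate, number of epochs, and toy holdout scores are assumptions of this example; the paper does not state how the two parameters are optimized.

```python
import math

def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit scalars a, b so that sigmoid(a * z + b) matches the binary labels.

    Minimizes the average negative log-likelihood with batch gradient
    descent; the uncalibrated scores themselves are left untouched.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for z, y in zip(scores, labels):
            err = sigmoid(a * z + b) - y  # derivative of the NLL w.r.t. the logit
            grad_a += err * z
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def calibrate(z, a, b):
    """Calibrated probability for an uncalibrated score z."""
    return sigmoid(a * z + b)

# Hypothetical holdout scores with labels (1 = erroneous, 0 = correct).
holdout_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
holdout_labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(holdout_scores, holdout_labels)
```

Only `a` and `b` are updated here, mirroring the statement that the classifier and representation parameters stay fixed during calibration.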
5. Data Augmentation Learning
Having established a representation model $Q$ for the data generating distribution $I^*$, we now move to modeling the noisy channel distribution $H^*$. We assume the noisy channel can be specified by a set of transformation functions $\Phi$ and a policy $\Pi$ (i.e., a conditional distribution over $\Phi$ given a cell value). Our goal is to learn $\Phi$ and $\Pi$ from few example errors and use them to generate training examples to learn model $Q$.
5.1. Noisy Channel Model
We aim to limit the number of manually labeled data required for error detection. Hence, we consider a simple noisy channel model that can be learned from few and potentially noisy training data. Our noisy channel model treats cell values as strings and introduces errors to a clean cell value $v^*$ by applying a transformation $\phi$ to obtain a new value $v = \phi(v^*)$. We consider that each function $\phi \in \Phi$ belongs to one of the following three templates, where $s$ and $s'$ denote non-empty strings and $\emptyset$ the empty string:
Add characters: $\emptyset \mapsto s$
Remove characters: $s \mapsto \emptyset$
Exchange characters: $s \mapsto s'$ (the left side and right side are different)
Given these templates, we assume that the noisy channel model introduces errors via the following generative process: Given a clean input value $v^*$, the channel samples a transformation $\phi$ from a conditional distribution $\Pi(\phi \mid v^*)$, i.e., $\phi \sim \Pi(\phi \mid v^*)$, and applies $\phi$ once to a substring or position of the input cell value. We refer to $\Pi$ as a policy. If the transformation $\phi$ can be applied to multiple positions or multiple substrings of $v^*$, one of those positions or substrings is selected uniformly at random.
For example, to transform Zip Code “60612” to “606152”, the noisy channel model we consider can apply the exchange character function $60612 \mapsto 606152$, i.e., exchange the entire string. Applying the exchange function on the entire cell value can capture misaligned attributes or errors due to completely erroneous values. However, the same transformed string can also be obtained by applying either the exchange character function $12 \mapsto 152$ on the ‘12’ substring of “60612” or the add character function $\emptyset \mapsto 5$, where the position between ‘1’ and ‘2’ in “60612” was chosen at random. The distribution $\Pi$ that corresponds to the aforementioned generative process dictates the likelihood of each of the above three cases.
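The single application step of this generative process can be sketched as below. A transformation is represented here as a pair of strings (left side, right side), with the empty string standing for the empty side of an add or remove transformation; this encoding and the function name are assumptions of the sketch.

```python
import random

def apply_transformation(value, src, dst, rng=random):
    """Apply src -> dst once, at a position chosen uniformly at random.

    An empty src inserts dst at any of the len(value) + 1 positions;
    an empty dst removes one occurrence of src. Returns None when src
    does not occur in value, i.e., the transformation is not applicable.
    """
    if src == "":
        positions = list(range(len(value) + 1))
    else:
        positions = [
            i for i in range(len(value) - len(src) + 1) if value.startswith(src, i)
        ]
    if not positions:
        return None
    i = rng.choice(positions)
    return value[:i] + dst + value[i + len(src):]

# The three ways of producing "606152" from "60612" discussed above:
whole = apply_transformation("60612", "60612", "606152")    # exchange entire string
partial = apply_transformation("60612", "12", "152")        # exchange the '12' substring
inserted = apply_transformation("60612", "", "5", random.Random(0))  # add at a random position
```

The first two calls are deterministic because their left side occurs exactly once; the third yields “606152” only when the position between ‘1’ and ‘2’ happens to be drawn.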
Given $\Phi$ and $\Pi$, we can use this noisy channel on training examples that correspond to clean tuples to augment the available training data. However, both $\Phi$ and $\Pi$ have to be learned from the limited number of training data. This is why we adopt the above simple generative process. Despite its simplicity, we find our approach to be effective during data augmentation (see Section 6). Next, we introduce algorithms to learn $\Phi$ and $\Pi$ assuming access to a list $L = \{(v^*, v)\}$ of labeled pairs of correct and erroneous values with $v \neq v^*$. We then discuss how to construct $L$ either by taking a subset of the input training data $T$ or, in the case of limited training data, via an unsupervised approach over dataset $D$. Finally, we describe how to use $\Phi$ and $\Pi$ to perform data augmentation.
5.2. Learning Transformations
We use a pattern matching approach to learn the transformations. We follow a hierarchical pattern matching approach to identify all different transformations that are valid for each example in $L$. For example, from a single pair of a correct value and its erroneous version we want to extract both the transformation that exchanges the entire string and the finer-grained transformations over its substrings. The approach we follow is similar to the Ratcliff-Obershelp pattern recognition algorithm (Ratcliff, ). Due to the generative model we described above, we are agnostic to the position of each transformation.
The procedure is outlined in Algorithm 1. Given an example $(v^*, v)$ from $L$, it returns a list of valid transformations $\Phi_e$ extracted from the example. The algorithm first extracts the string level transformation $v^* \mapsto v$, and then proceeds recursively to extract additional transformations from the substrings of $v^*$ and $v$. To form the recursion, we identify the longest common substring of $v^*$ and $v$, and use that to split each string into its prefix (denoted by $l_{v^*}$ and $l_v$) and its postfix (denoted by $r_{v^*}$ and $r_v$). Given the prefix and the postfix substrings, we recurse on the combination of substrings that have the maximum similarity (i.e., overlap). We compute the overlap of two strings as $2C/S$, where $C$ is the number of common characters in the two strings, and $S$ is the sum of their lengths. Finally, we remove all identity (i.e., trivial) transformations from the output $\Phi_e$. To construct the set of transformations $\Phi$, we take the set-union of all lists generated by applying Algorithm 1 to each entry $e \in L$.
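Since Algorithm 1 itself is not reproduced in this text, the following is only a sketch of one way to read the description above. It uses `difflib.SequenceMatcher.find_longest_match` from the Python standard library to find the longest common substring, and it interprets "the combination of substrings that have the maximum similarity" as a choice between pairing prefix with prefix and postfix with postfix, or pairing them crosswise. Both choices, and the function names, are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher

def overlap(x, y):
    """Similarity 2C/S: C = number of common characters, S = sum of lengths."""
    total = len(x) + len(y)
    if total == 0:
        return 0.0
    common = sum((Counter(x) & Counter(y)).values())
    return 2.0 * common / total

def learn_transformations(clean, dirty):
    """Extract the set of non-identity transformations implied by one example."""
    found = set()

    def recurse(a, b):
        if a == b:                       # identity transformations are dropped
            return
        found.add((a, b))
        if not a or not b:               # pure add / remove: nothing left to split
            return
        m = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
            0, len(a), 0, len(b))
        if m.size == 0:                  # no common substring to split on
            return
        left_a, right_a = a[:m.a], a[m.a + m.size:]
        left_b, right_b = b[:m.b], b[m.b + m.size:]
        aligned = overlap(left_a, left_b) + overlap(right_a, right_b)
        crossed = overlap(left_a, right_b) + overlap(right_a, left_b)
        if aligned >= crossed:
            recurse(left_a, left_b)
            recurse(right_a, right_b)
        else:
            recurse(left_a, right_b)
            recurse(right_a, left_b)

    recurse(clean, dirty)
    return found

print(sorted(learn_transformations("60612", "606152")))
```

For the pair (“60612”, “606152”) this sketch yields the whole-string exchange, the exchange of “2” by “52”, and the insertion of “5”. The set $\Phi$ would then be the union of these outputs over all examples.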
5.3. Policy Learning
The set of transformations $\Phi$ extracted by Algorithm 1 corresponds to all possible alterations our noisy channel model can perform on a clean dataset. Transformations in $\Phi$ range from specialized transformations for specific entries (e.g., one that exchanges an entire cell value) to generic transformations, such as adding a single character, that can be applied to any position of any input. Given $\Phi$, the next step is to learn the transformation policy $\Pi$, i.e., the conditional probability distribution $\Pi(\Phi \mid v)$ for any input value $v$. We next introduce an algorithm to learn $\Pi$.
We approximate $\Pi$ via a two-step process: First, we compute the empirical distribution of transformations informed by the transformation lists output by Algorithm 1. This process is described in Algorithm 2. Second, given an input string $v$, we find all transformations $s \mapsto s'$ in $\Phi$ such that $s$ is a substring of $v$. Let $\Phi_v$ be the set of such transformations. We obtain a distribution $\hat{\Pi}(\Phi_v \mid v)$ by re-normalizing the empirical probabilities from the first step. This process is outlined in Algorithm 3. Recall that we choose this simple model for $\Pi$ as the number of data points in $L$ can be limited.
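The two steps can be sketched as follows, again with transformations encoded as (left side, right side) string pairs. This is an illustration of the description above rather than a reproduction of Algorithms 2 and 3, which are not included in this text; the function names and the toy transformation lists are hypothetical.

```python
from collections import Counter

def empirical_distribution(transformation_lists):
    """Step 1: relative frequency of each transformation across all examples."""
    counts = Counter(t for lst in transformation_lists for t in lst)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def policy_for(value, distribution):
    """Step 2: keep transformations whose left side occurs in value, renormalize.

    The empty left side of an add transformation is a substring of every
    value, so add transformations are always applicable.
    """
    applicable = {t: p for t, p in distribution.items() if t[0] in value}
    mass = sum(applicable.values())
    return {t: p / mass for t, p in applicable.items()} if mass else {}

# Hypothetical outputs of the transformation-learning step for two examples.
lists = [
    [("60612", "606152"), ("2", "52"), ("", "5")],
    [("Chicago", "Cicago"), ("h", ""), ("", "5")],
]
distribution = empirical_distribution(lists)
policy = policy_for("60622", distribution)
```

For the input “60622”, only the transformations with left sides “2” and the empty string remain, and their probabilities are rescaled to sum to one.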
5.4. Generating Transformation Examples
We describe how to obtain examples $(v^*, v)$ to form the set $L$, which we use in learning the transformations $\Phi$ (Section 5.2) and the policy $\Pi$ (Section 5.3). First, any example in the training data $T$ that corresponds to an error can be used. However, given the scarcity of errors in some datasets, examples of errors can be limited. We introduce a methodology based on weak-supervision to address this challenge.
We propose a simple unsupervised data repairing model $M_R$ over dataset $D$ and use its predictions to obtain transformation examples $(v^*, v)$. We form examples $(v^*, v)$ with $v \neq v^*$ by taking an original cell value $v$ and the repair $v^*$ suggested by $M_R$. We only require that this model has relatively high precision. High precision implies that the repairs performed by $M_R$ are accurate, and thus, the predictions correspond to true errors. This approach enables us to obtain noisy training data that correspond to good samples from the distribution of errors in $D$. We do not require this simple prediction model to have high recall, since we are only after producing example errors, not repairing the whole data set.
We obtain a simple high-precision data repairing model by training a Naïve Bayes model over dataset $D$. Specifically, we iterate over each cell in $D$, pretend that its value is missing and leverage the values of other attributes in the tuple to form a Naïve Bayes model that we use to impute the value of the cell. The predicted value corresponds to the suggested repair for this cell. Effectively, this model takes into account value co-occurrence across attributes. Similar models have been proposed in the literature to form sets of potential repairs for noisy cells (hc, ). To ensure high precision, we accept only repairs with a likelihood of more than 90%. In Section 6, we evaluate our Naïve Bayes-based model and show that it achieves reasonable precision (i.e., above 70%).
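A minimal sketch of such a repair model is given below, assuming categorical attributes, Laplace smoothing with parameter `alpha`, and leave-one-out counts to "pretend the value is missing". These modeling details and all names are assumptions; the text above fixes only the Naïve Bayes idea and the 90% acceptance threshold.

```python
from collections import Counter, defaultdict

def suggest_repairs(rows, attributes, threshold=0.9, alpha=1.0):
    """Return (row index, attribute, observed value, suggested repair) tuples."""
    repairs = []
    for target in attributes:
        others = [a for a in attributes if a != target]
        prior = Counter(r[target] for r in rows)
        cond = defaultdict(Counter)  # (other attribute, candidate) -> value counts
        for r in rows:
            for a in others:
                cond[(a, r[target])][r[a]] += 1
        domain_size = {a: len({r[a] for r in rows}) for a in others}
        for i, r in enumerate(rows):
            scores = {}
            for cand, n in prior.items():
                own = 1 if cand == r[target] else 0  # leave the cell itself out
                if n - own == 0:
                    continue
                s = (n - own) / (len(rows) - 1)
                for a in others:
                    s *= (cond[(a, cand)][r[a]] - own + alpha) / (
                        n - own + alpha * domain_size[a])
                scores[cand] = s
            if not scores:
                continue
            best = max(scores, key=scores.get)
            confidence = scores[best] / sum(scores.values())
            if best != r[target] and confidence > threshold:
                repairs.append((i, target, r[target], best))
    return repairs

rows = (
    [{"Zip": "60612", "City": "Chicago"} for _ in range(10)]
    + [{"Zip": "10001", "City": "New York"} for _ in range(10)]
    + [{"Zip": "60612", "City": "Cicago"}]
)
print(suggest_repairs(rows, ["Zip", "City"]))
```

Each accepted repair gives one example pair, with the suggested repair as the correct value and the observed value as its erroneous version.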
5.5. Data Augmentation
To perform data augmentation, we leverage the learned and and use the generative model described in Section 5.1. Our approach is outlined in Algorithm 4: First, we sample a correct example with cell value from the training data . Second, we sample a transformation from distribution . If can be applied in multiple positions or substrings of input we choose one uniformly at random, and finally, compute the transformed value . Value corresponds to an error as we do not consider the identity transformation. Finally, we add in the set of augmented examples with probability . Probability is a hyper-parameter of our algorithm, which intuitively corresponds to the required balance in the overall training data. We set via cross-validation over a holdout-set that corresponds to a subset of . This is the same holdout-set used to perform Platt scaling during error classification (see Section 4.2).
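The sampling loop described above can be illustrated with the short Python sketch below. It is an assumption-laden illustration, not Algorithm 4 itself: transformations are modeled as `(source, target)` string pairs, `policy_for` is a hypothetical callable that returns the conditional distribution over applicable non-identity transformations for a value, `alpha` stands for the acceptance probability, and the sketch visits each correct value once instead of sampling correct examples repeatedly.

```python
import random


def occurrences(value, source):
    """Start positions where `source` occurs in `value` (every gap for '')."""
    if source == "":
        return list(range(len(value) + 1))
    return [i for i in range(len(value) - len(source) + 1)
            if value.startswith(source, i)]


def augment(correct_values, policy_for, alpha, rng):
    """Generate synthetic (value, "error") examples from correct cell values."""
    augmented = []
    for value in correct_values:
        policy = policy_for(value)
        if not policy:
            continue  # no learned transformation applies to this value
        candidates = list(policy)
        weights = [policy[t] for t in candidates]
        source, target = rng.choices(candidates, weights=weights)[0]
        # If the source occurs several times, pick one position uniformly.
        pos = rng.choice(occurrences(value, source))
        noisy = value[:pos] + target + value[pos + len(source):]
        # Keep the synthetic error with probability alpha.
        if rng.random() < alpha:
            augmented.append((noisy, "error"))
    return augmented
```

Passing a seeded `random.Random` instance as `rng` keeps a run reproducible, which matters when the acceptance probability is tuned on a hold-out set.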
We compare our approach against a wide variety of error detection methods on diverse datasets. The main points we seek to validate are: (1) whether weak supervision is the key to high-quality (i.e., high-precision and high-recall) error detection models, (2) what the impact of different representation contexts on error detection is, and (3) whether data augmentation is the right approach to minimizing human effort. We also perform extensive micro-benchmark experiments to examine the effectiveness and sensitivity of data augmentation.
|Dataset||Size||Attributes||Labeled Data||Errors (# of cells)|
6.1. Experimental Setup
We describe the datasets, metrics, and settings we use.
Datasets: We use five datasets from a diverse array of domains. Table 1 provides information for these datasets. As shown, the datasets span different sizes and exhibit various amounts of errors: (1) The Hospital dataset is a benchmark dataset used in several data cleaning papers (holistic, ; hc, ). Errors are artificially introduced by injecting typos. This is an easy benchmark dataset; (2) The Food dataset contains information on food establishments in Chicago. Errors correspond to conflicting zip codes for the same establishment, conflicting inspection results for the same establishment on the same day, conflicting facility types for the same establishment, and many more. Ground truth was obtained by manually labeling 3,000 tuples; (3) The Soccer dataset provides information about soccer players and their teams. The dataset and its ground truth are provided by Rammelaere and Geerts (Rammelaere:2018:ERD:3236187.3269456, ); (4) The Adult dataset contains census data and is a typical dataset from the UCI repository. Adult is also provided by Rammelaere and Geerts (Rammelaere:2018:ERD:3236187.3269456, ); (5) The Animal dataset was provided by scientists at UC Berkeley and has been used by Abedjan et al. (AbedjanCDFIOPST16, ) as a testbed for error detection. It provides information about the capture of animals, including the time and location of the capture and other information for each captured animal. The dataset comes with manually curated ground truth.
The datasets used in our experiments exhibit different error distributions. Hospital contains only typos. Soccer (Rammelaere:2018:ERD:3236187.3269456, ) and Adult (Rammelaere:2018:ERD:3236187.3269456, ) have errors that were introduced with BART (Arocena2015, ): Adult has 70% typos and 30% value swaps, and Soccer has 76% typos and 24% swaps. Finally, the two datasets with real-world errors have the following error distributions: Food has 24% typos and 76% value swaps (based on the sampled ground truth); Animal has 51% typos and 49% swaps.
|Dataset ( size)||M||AUG||CV||HC||OD||FBI||LR||SuperL||SemiL||ActiveL|
n/a = Semi-supervised learning did not terminate after two days.
Methods: We compare our approach, referred to as AUG, against several competing error detection methods. First, we consider the following baseline error detection models:
Constraint Violations (CV): This method identifies errors by leveraging violations of denial constraints. It is a proxy for rule-based error detection methods (holistic, ).
HoloClean (HC): This method combines CV with HoloClean (hc, ), a state-of-the-art data repairing engine. This method aims to improve the precision of the CV detector by considering as errors not all cells in tuples that participate in constraint violations but only those cells whose value was repaired (i.e., their initial value is changed to a different value).
Outlier Detection (OD): This method follows a correlation-based outlier detection approach. Given a cell that corresponds to an attribute, the method considers all attributes correlated with it and relies on the pair-wise conditional distributions to detect whether the value of the cell corresponds to an outlier.
Forbidden Item Sets (FBI): This method captures unlikely value co-occurrences in noisy data (rammelaere2017cleaning, ). At its core, this method leverages the lift measure from association rule mining to identify how probable a value co-occurrence is, and uses this measure to identify erroneous cell values.
Logistic Regression (LR): This method corresponds to a supervised logistic regression model that classifies cells as erroneous or correct. The features of this model correspond to pairwise co-occurrence statistics of attribute values and constraint violations. This model corresponds to a simple supervised ensemble over the previous two models.
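For reference, the lift measure mentioned in the description of FBI is the standard association-rule quantity P(x, y) / (P(x) P(y)). The sketch below computes it for a pair of attribute values; the function name `lift` and the row format are assumptions, and the sketch does not reproduce how the forbidden item set method thresholds or searches over item sets.

```python
def lift(rows, attr_a, val_a, attr_b, val_b):
    """Lift of the co-occurrence (attr_a = val_a, attr_b = val_b) in `rows`."""
    n = len(rows)
    count_a = sum(1 for r in rows if r[attr_a] == val_a)
    count_b = sum(1 for r in rows if r[attr_b] == val_b)
    count_ab = sum(1 for r in rows
                   if r[attr_a] == val_a and r[attr_b] == val_b)
    if count_a == 0 or count_b == 0:
        return 0.0
    # P(a, b) / (P(a) * P(b)) simplifies to n * count_ab / (count_a * count_b).
    return n * count_ab / (count_a * count_b)
```

A lift well below 1 means the two values co-occur less often than they would under independence, which is the signal an approach of this kind treats as suspicious.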
We also consider three variants of our model where we use different training paradigms. The goal is to compare data augmentation against other types of training. For all variations, we use the representation and the classifier introduced in Section 3. We consider the following variants:
Supervised Learning (SuperL): We train our model using only the training examples in .
Semi-supervised Learning (SemiL): We train our model using self-training (zhu2007semi, ). First, supervised learning is used to train the model on the labeled data only. The learned model is then applied to the entire dataset to generate more labeled examples as input for a subsequent round of supervised learning. Only labels with high confidence are added at each step.
Active Learning (ActiveL): We train our model using an active learning method based on uncertainty sampling (settles2012active, ). First, supervised learning is used to train the model. At each subsequent round, we use an uncertainty-based selection scheme to obtain additional training examples and re-train the model. We use to denote the number of iterations. In our implementation, we set the upper limit of labeled examples obtained per iteration to be cells.
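The selection steps that distinguish the two iterative variants can be sketched as follows. Both functions operate on predicted error probabilities for unlabeled cells; their names, the use of distance from 0.5 as the uncertainty score, and the confidence threshold are assumptions for illustration and are not taken from the implementation evaluated here.

```python
def select_uncertain(probabilities, budget):
    """Active learning: indices of the `budget` least confident predictions."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]


def select_confident(probabilities, threshold):
    """Self-training: (index, pseudo_label) pairs for confident predictions.

    The pseudo-label is True when the cell is predicted to be an error.
    """
    return [(i, p >= 0.5) for i, p in enumerate(probabilities)
            if max(p, 1 - p) >= threshold]
```

In the active learning loop the selected cells would be sent to a human for labeling, whereas in self-training the pseudo-labels are added to the training set directly.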
Evaluation Setup: To measure accuracy, we use Precision (P), defined as the fraction of error predictions that are correct; Recall (R), defined as the fraction of true errors that are predicted as errors by the different methods; and $F_1$, defined as $2PR/(P+R)$. For training, we split the available ground truth into three disjoint sets: (1) a training set , from which is always kept as a hold-out set used for hyper-parameter tuning; (2) a sampling set, which is used to obtain additional labels for active learning; and (3) a test set, which is used for evaluation. To evaluate different dataset splits, we perform runs with different random seeds for each experiment. To ensure that we maintain the coupling amongst Precision, Recall, and $F_1$, we report the median performance. The mean performance along with standard error measurements are reported in the Appendix. Seeds are sampled at the beginning of each experiment, and hence, a different set of random seeds can be used for different experiments. We use ADAM (journals/corr/KingmaB14, ) as the optimization algorithm for all learning-based models and train all models for 500 epochs with a batch-size of five examples. We run Platt Scaling for 100 epochs. All experiments were executed on a 12-core Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz with 64GB of RAM running Ubuntu 14.04.3 LTS.
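The metrics can be computed as in the sketch below. The function names are hypothetical, cells are identified by arbitrary hashable ids, and `median_run` reflects one possible reading of how the coupling between the three metrics is preserved, namely reporting the triple of the run whose $F_1$ is the median; the text does not spell out this detail.

```python
def precision_recall_f1(predicted_errors, true_errors):
    """Precision, recall, and F1 over sets of cell identifiers."""
    predicted, actual = set(predicted_errors), set(true_errors)
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def median_run(results):
    """Return the (P, R, F1) triple of the run with the median F1.

    Assumption: taking one whole run keeps P, R, and F1 mutually consistent.
    For an even number of runs the lower of the two middle runs is used.
    """
    ordered = sorted(results, key=lambda prf: prf[2])
    return ordered[(len(ordered) - 1) // 2]
```

Taking the median of each metric independently could yield a triple in which $F_1$ no longer equals $2PR/(P+R)$, which is the inconsistency that selecting a single run avoids.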
6.2. End-to-end Performance
We evaluate the performance of our approach and competing approaches on detecting errors in all five datasets. Table 2 summarizes the precision, recall, and $F_1$-score obtained by different methods. For Food, Soccer, Adult, and Animal we set the amount of training data to be of the total dataset. For Hospital we set the percentage of training data to be (corresponding to 100 tuples) since Hospital is small. For Active Learning we set the number of active learning loops to to maximize performance.
As Table 2 shows, our method consistently outperforms all competing methods, and in some cases, like Hospital and Soccer, we see improvements of 20 $F_1$ points. More importantly, we find that our method is able to achieve both high recall and high precision in all datasets despite the different error distribution in each dataset. This is something that has been particularly challenging for prior error detection methods. We see that for Food and Animal, despite the fact that most errors do not correspond to constraint violations (as implied by the performance of CV), AUG can obtain high precision and recall. This is because AUG models the actual data distribution and not the side effects of errors. For instance, for Food we see that OD can detect many of the errors (it has high recall), indicating that most errors correspond to statistical outliers. We see that AUG can successfully solve error detection for this dataset. Overall, our method achieves an average precision of ~94% and an average recall of ~93% across these diverse datasets. At the same time, we see that the performance of competing methods varies significantly across datasets. This validates the findings of prior work (AbedjanCDFIOPST16, ) that, depending on the side effects of errors, different error detection methods are more suitable for different datasets.
We now discuss the performance of individual competing methods. For CV, we see that it achieves higher recall than precision. This performance is due to the fact that CV marks as erroneous all cells in a group of cells that participate in a violation. More emphasis should be put on the recall-related results of CV. As shown, its recall varies dramatically from 0.0 for Food and Animal to 0.998 for Adult. For OD, we see that it achieves relatively high precision, but its recall is low. Similar performance is exhibited by FBI, which leverages a different measure for outlier detection. We see that FBI achieves high precision when the forbidden item sets have significant support (i.e., occur relatively often). However, FBI cannot detect errors that lead to outlier values which occur a limited number of times. This is why we find OD to outperform FBI in several cases.
Using HC as a detection tool is limited to those cells that violate integrity constraints. Hence, using HC leads to improved precision over CV (see Hospital and Adult). This result is expected as data repairing limits the number of cells detected as erroneous to only those whose values are altered. Our results also validate the fact that HC depends heavily on the quality of the error detection used (hc, ). As shown in Food and Animal, the performance of HC is limited by the recall of CV, i.e., since CV did not detect errors accurately, HC does not have the necessary training data to learn how to repair cells. At the same time, Soccer reveals that training HC on few clean cells leads to low precision (HC achieves a precision of 0.032 for Soccer): here the recall of CV is very high while its precision is very low, indicating that most cells were marked as erroneous. This validates our approach of solving error detection separately from data repairing.
We also see that LR has consistently poor performance. This result reveals that combining co-occurrence features and violation features in a linear way (i.e., via a weighted linear combination such as in LR) is not enough to capture the complex statistics of the dataset. This validates our choice of using representation learning and not engineered features.
Finally, we see that approaches that rely on representation learning models achieve consistently high precision across all datasets. This validates our hypothesis that modeling the distribution of both correct and erroneous data allows us to discriminate better. However, we see that when we rely only on the training dataset the recall is limited (see the recall for SuperL). The limited labeled examples in are not sufficient to capture the heterogeneity of errors. Providing additional training examples, either via Active Learning or via Data Augmentation, helps improve the recall. However, Data Augmentation is more effective than Active Learning at capturing the heterogeneity of errors in each dataset, and hence, achieves superior recall to Active Learning in all cases.
Takeaway: The combination of representation learning techniques with data augmentation is key to obtaining high-quality error detection models.
6.3. Representation Ablation Study
We perform an ablation study to evaluate the effect of different representation models on the quality of our model. Specifically, we compare the performance of AUG when all representation models are used versus variants of AUG where one model is removed at a time. We report the $F_1$-score of the different variants as well as the original AUG in Figure 3. Representation models that correspond to different contexts are grouped together.
Removing any feature has an impact on the quality of predictions of our model. We find that removing a single representation model results in drops of up to 9 points across datasets. More importantly, we find that different representation models have different impact on different datasets. For instance, the biggest drop for Hospital and Soccer is achieved when the character-sequence model is removed while for Adult the highest drop is achieved when the Neighborhood representation is removed. This validates our design of considering representation models from different contexts.
Takeaway: It is necessary to leverage cell representations that are informed by different contexts to provide robust and high-quality error detection solutions.
6.4. Augmentation versus Active Learning
We validate the hypothesis that data augmentation is more effective than active learning in minimizing human effort in training error detection models. In Table 2, we showed that data augmentation outperforms active learning. Furthermore, active learning needs to obtain more labeled examples to achieve comparable performance to data augmentation. In the next two experiments, we examine the performance of the two approaches as we limit their access to training data.
In the first experiment, we evaluate active learning for different numbers of loops (k) over Hospital, Soccer, and Adult. We vary in . We fix the amount of available training data to . Each time we measure the $F_1$ score of the two algorithms. We report our results in Figure 4. Reported results correspond to median performance over ten runs.
We see that when a small number of loops is used (k=), there is a significant gap between the two algorithms that ranges between and points. Active learning achieves comparable performance with data augmentation only after loops. This corresponds to an additional () labeled examples (labeled cells). This behavior is consistent across all three datasets.
In the second experiment, we seek to push data augmentation to its limits. Specifically, we seek to answer the question: can data augmentation be effective when the number of labeled examples in is extremely small? To this end, we evaluate the performance of our system on Hospital, Soccer, and Adult as we vary the size of the training data in . The results are shown in Figure 5. As expected, the performance of data augmentation improves as more training data become available. However, we see that data augmentation can achieve good performance (the $F_1$ score does not drop below 70%) even in cases where labeled examples are limited. These results provide positive evidence that data augmentation is a viable approach for minimizing user effort.
Takeaway: Our data augmentation approach is preferable to active learning for minimizing human effort.
6.5. Augmentation and Data Imbalance
We evaluate the effectiveness of data augmentation to counteract imbalance. Table 2 shows that using data augmentation yields high-quality error detection models for datasets with varying percentages of errors. Hence, data augmentation is robust to different levels of imbalance; each dataset in Table 2 has a different ratio of true errors to correct cells.
In Table 3, we compare data augmentation with a traditional method used to address the imbalance problem, namely, resampling. In all the datasets, resampling had low precision and recall, confirming our hypothesis discussed in Section 1: due to the heterogeneity of the errors, resampling from the limited number of negative examples was not enough to cover all types of errors. The best result for resampling was obtained on the Hospital dataset ( about ), since its errors are more homogeneous than those of the other datasets.
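To illustrate why resampling cannot cover unseen error types, the sketch below shows one common resampling variant, random oversampling of the minority class with replacement. The text does not specify which resampling scheme was used in Table 3, so this variant, the function name `oversample_minority`, and the toy labels are assumptions.

```python
import random


def oversample_minority(examples, labels, rng):
    """Duplicate minority-class examples until all classes have equal size."""
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_label.values())
    balanced = []
    for y, xs in by_label.items():
        # Sampling with replacement can only repeat examples already seen.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    return balanced


balanced = oversample_minority(
    ["a", "b", "c", "d", "e1"], ["ok", "ok", "ok", "ok", "err"],
    random.Random(0))
```

After balancing, every error example is a copy of the single labeled error `"e1"`: the class sizes match, but no new kind of error has been introduced, in contrast to augmentation with learned transformations.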
We also evaluate the effect of excessive data augmentation. For this experiment, we do not use hyper-parameter in Algorithm 4 to control how many artificial examples should be generated via data augmentation. Instead, we manually set the ratio between positive and negative examples in the final training examples and use augmentation to materialize this ratio.
Our results are reported in Figure 6. We show that increasing the number of generated negative examples (errors) lowers accuracy once the balance between errors and correct examples goes above , because the model again suffers from the imbalance problem, this time with too few correct examples. We see that peak performance is achieved when the training data is almost balanced for all datasets. This reveals the robustness of our approach. Nonetheless, peak performance is not achieved exactly at a 50-50 balance (peak performance for Adult is at 60%). This justifies our model for data augmentation presented in Algorithm 4 and the use of hyper-parameter .
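Materializing a fixed balance amounts to simple arithmetic. If the training data holds c correct examples and e labeled errors, and errors should make up a fraction r of the final training set, the number m of synthetic errors satisfies (e + m) / (c + e + m) = r, so m = r c / (1 - r) - e. The sketch below implements this with integer arithmetic; the function name and the percentage-based parameter are assumptions for illustration.

```python
def synthetic_errors_needed(n_correct, n_errors, error_percent):
    """Synthetic errors required so that errors make up `error_percent`
    percent of the final training set (0 <= error_percent < 100)."""
    if not 0 <= error_percent < 100:
        raise ValueError("error_percent must be in [0, 100)")
    # Ceiling of error_percent * n_correct / (100 - error_percent).
    total_errors = -(-error_percent * n_correct // (100 - error_percent))
    return max(0, total_errors - n_errors)
```

For example, with 900 correct cells and 100 labeled errors, a 50-50 balance requires 800 synthetic errors and a 60% share of errors requires 1250.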
Takeaway: Data augmentation is an effective way to counteract imbalance in error detection.
6.6. Analysis of Augmentation Learning
In this experiment, we validate the importance of learning the augmentation model (the transformations , and the policy ). We compare AUG against two alternative augmentation strategies: (1) random transformations (Rand. Trans.), where we randomly choose from a set of errors (e.g., typos, attribute value changes, attribute shifts, etc.). Here, we augment the data by using completely random transformations not inspired by the erroneous examples or the data; and (2) learned transformations , but without learning the distribution policy (AUG w/o Policy). Given an input, we find all valid transformations in and pick one uniformly at random. Table 4 shows the results for the three approaches. AUG outperforms the other two strategies. Rand. Trans. fails to capture the errors that exist in the dataset. For instance, it obtains a recall of 16.6% for Soccer. For AUG w/o Policy, even though the transformations are learned from the data, the results show that using these transformations in a way that conforms with the distribution of the data is crucial in learning an accurate classifier.
Takeaway: Learning a noisy channel model from the data, i.e., a set of transformations and a policy, is key to obtaining high-quality predictions.
|Dataset||AUG||Rand. Trans.||AUG w/o Policy|
6.7. Other Experiments
Finally, we report several benchmarking results: we (1) measure the runtime of different methods, (2) validate the performance of our unsupervised Naïve Bayes model for generating labeled examples to learn transformations and (see Section 5.4), and (3) validate the robustness of AUG to misspecified denial constraints.
The median runtime of different methods is reported in Table 5. These runtimes correspond to prototype implementations of the different methods in Python. Also recall that training corresponds to 500 epochs with a low batch-size, as reported in Section 6.1. As expected, iterative methods such as SemiL and ActiveL are significantly slower than non-iterative ones. Overall, we see that AUG exhibits runtimes that are of the same order of magnitude as supervised methods.
The performance of our Naïve Bayes-based weak supervision method on Hospital, Soccer, and Adult is reported in Table 6. Specifically, we seek to validate that the precision of our weak supervision method is reasonable, and thus, that by using it we obtain examples that are good samples from the true error distribution. We see that our weak supervision method achieves a precision of more than 70% in all cases. As expected, its recall can sometimes be low (e.g., for Soccer it is 5.3%) as emphasis is put on precision.
Finally, we evaluate AUG against missing and noisy constraints. The detailed results are presented in Appendix A.2 due to space restrictions. In summary, we find AUG to exhibit a drop of at most 6 points with missing constraints (when only 20% of the original constraints are used) and at most 8 points when noisy constraints are used.
7. Related Work
Many algorithms and prototypes have been proposed for developing data cleaning tools (Rahm00, ; IlyasC15, ; Fan2012, ; HalevyBook, ). Outlier detection and quantitative data cleaning algorithms look for data values that appear “abnormal” with respect to the data distribution (Dasu2012, ; Wu:2013, ; 2015combining, ). Entity resolution and record de-duplication focus on identifying clusters of records that represent the same real-world entity (Elmagarmid07, ; 2010Naumann, ). Example de-duplication tools include the Data Tamer system (Stonebraker13, ), which is commercialized as Tamr. Rule-based detection proposals (abedjan_2015, ; holistic, ; WangT14, ; FanLMTY12, ; Kolahi09, ) use integrity constraints (e.g., denial constraints) to identify violations, and use the overlap among these violations to detect data errors. Prototypes such as Nadeef (dallachiesa2013nadeef, ) and BigDansing (Khayyat:2015, ) are examples of extensible rule-based cleaning systems. There have also been multiple proposals that identify data cells that do not follow a data “pattern”. Example tools include OpenRefine, Data Wrangler (wrangler, ) and its commercial descendant Trifacta, Katara (Chu15, ), and DataXFormer (Abedjan_2016, ). An overview of these tools and how they can be combined for error detection is given in (AbedjanCDFIOPST16, ), where the authors show that even when all are used, these tools often achieve low recall in capturing data errors in real datasets.
Data augmentation has also been used extensively in machine learning problems. Most state-of-the-art image classification pipelines use some limited form of data augmentation (Perez2017TheEO, ). This consists of applying crops, flips, or small affine transformations in fixed order or at random. Other studies have applied heuristic data augmentation to modalities such as audio (sound_da, ) and text (lu2006enhancing, ). To our knowledge, we are the first to apply data augmentation in relational data.
Recently, several lines of work have explored the use of reinforcement learning or random search to learn more principled data augmentation policies (cubuk2018autoaugment, ; DBLP:conf/nips/RatnerEHDR17, ). Our work here is different as we do not rely on expensive procedures to learn the augmentation policies. This is because we limit our policies to applying a single transformation at a time. Finally, recent work has explored techniques based on Generative Adversarial Networks (NIPS2014_5423, ) to learn data generation models used for data augmentation from unlabeled data (Mirza2014ConditionalGA, ). This work focuses mostly on image data. Exploring this direction for relational data is an exciting future direction.
We introduced a few-shot learning framework for error detection. We adopt a noisy channel model to capture how both correct data and errors are generated and use it to develop an expressive classifier that can predict, with high accuracy, whether a cell in the data is an error. To capture the heterogeneity of data distributions, we learn a rich set of representations at various granularities (attribute-level, record-level, and dataset-level). We also showed how to address a main hurdle in this approach, which is the scarcity of error examples in the training data, and we introduced an approach based on data augmentation to generate enough examples of data errors. Our data augmentation approach learns a set of transformations and the probability distribution over these transformations from a small set of examples (or in a completely unsupervised way). We showed that our approach achieved an average precision of ~94% and an average recall of ~93% across a diverse array of datasets. We also showed how our approach outperforms previous techniques ranging from traditional rule-based methods to more complex ML-based methods such as active learning approaches.
This work was supported by Amazon under an ARA Award, by NSERC under a Discovery Grant, and by NSF under grant IIS-1755676.
- (1) Z. Abedjan, C. Akcora, M. Ouzzani, P. Papotti, and M. Stonebraker. Temporal rules discovery for web data cleaning. PVLDB, 9(4):336–347, 2015.
- (2) Z. Abedjan, X. Chu, D. Deng, R. C. Fernandez, I. F. Ilyas, M. Ouzzani, P. Papotti, M. Stonebraker, and N. Tang. Detecting data errors: Where are we and what needs to be done? Proceedings of the VLDB Endowment, 9(12):993–1004, 2016.
- (3) Z. Abedjan, J. Morcos, I. F. Ilyas, P. Papotti, M. Ouzzani, and M. Stonebraker. DataXFormer: A robust transformation discovery system. In ICDE, 2016.
- (4) P. C. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, and D. Santoro. Messing-Up with BART: Error Generation for Evaluating Data Cleaning Algorithms. PVLDB, 9(2):36–47, 2015.
- (5) Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, Aug. 2013.
- (6) Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, Mar. 2003.
- (7) P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
- (8) N. V. Chawla, N. Japkowicz, and A. Kotcz. Special issue on learning from imbalanced data sets. ACM Sigkdd Explorations Newsletter, 6(1):1–6, 2004.
- (9) H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, pages 7–10, 2016.
- (10) J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Inf. Comput., 197(1-2):90–121, Feb. 2005.
- (11) X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13):1498–1509, 2013.
- (12) X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458–469, April 2013.
- (13) X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1247–1261. ACM, 2015.
- (14) E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
- (15) M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. Nadeef: a commodity data cleaning system. In SIGMOD, pages 541–552. ACM, 2013.
- (16) T. Dasu and J. M. Loh. Statistical distortion: Consequences of data cleaning. PVLDB, 5(11):1674–1683, 2012.
- (17) A. Doan, A. Y. Halevy, and Z. G. Ives. Principles of Data Integration. Morgan Kaufmann, 2012.
- (18) A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 2007.
- (19) S. Ertekin, J. Huang, and C. L. Giles. Active learning for class imbalance problem. SIGIR ’07, pages 823–824, New York, NY, USA, 2007. ACM.
- (20) W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool, 2012.
- (21) W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. The VLDB journal, 21(2):213–238, 2012.
- (22) A. Globerson, G. Chechik, F. Pereira, and N. Tishby. Euclidean embedding of co-occurrence data. JMLR, 8:2265–2295, Dec. 2007.
- (23) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
- (24) I. J. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Cambridge, MA, USA, 2016.
- (25) C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1321–1330, 2017.
- (26) H. He and Y. Ma. Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley-IEEE Press, 1st edition, 2013.
- (27) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- (28) J. M. Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.
- (29) G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. chapter Distributed Representations, pages 77–109. MIT Press, Cambridge, MA, USA, 1986.
- (30) Z. Huang and Y. He. Auto-detect: Data-driven error detection in tables. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 1377–1392, 2018.
- (31) I. F. Ilyas and X. Chu. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends in Databases, 5(4):281–393, 2015.
- (32) A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
- (33) S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 3363–3372. ACM, 2011.
- (34) Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J.-A. Quiané-Ruiz, N. Tang, and S. Yin. Bigdansing: A system for big data cleansing. In SIGMOD, pages 1215–1230, 2015.
- (35) Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush. Character-aware neural language models. In AAAI, pages 2741–2749, 2016.
- (36) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- (37) S. Kolahi and L. V. S. Lakshmanan. On Approximating Optimum Repairs for Functional Dependency Violations. In ICDT, 2009.
- (38) R. Lebret and R. Collobert. Word embeddings through hellinger pca. EACL, 2014.
- (39) X. Lu, B. Zheng, A. Velivelli, and C. Zhai. Enhancing text categorization with semantic-enriched representation and training data augmentation. Journal of the American Medical Informatics Association, 13(5):526–535, 2006.
- (40) T. Mikolov, I. Sutskever, K. Chen, et al. Distributed representations of words and phrases and their compositionality. NIPS, 2013.
- (41) M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
- (42) S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji. Improving music source separation based on dnns through data augmentation and network blending. 2017.
- (43) F. Naumann and M. Herschel. An Introduction to Duplicate Detection. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2010.
- (44) J. W. Osborne. Best practices in data cleaning: A complete guide to everything you need to do before and after collecting your data. Sage, 2013.
- (45) L. Perez and J. Wang. The effectiveness of data augmentation in image classification using deep learning. CoRR, abs/1712.04621, 2017.
- (46) J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers, 2000.
- (47) N. Prokoshyna, J. Szlichta, F. Chiang, R. J. Miller, and D. Srivastava. Combining quantitative and logical data cleaning. PVLDB, 9(4):300–311, 2015.
- (48) E. Rahm and H.-H. Do. Data cleaning: Problems and current approaches. DE, 23(4):3–13, 2000.
- (49) J. Rammelaere and F. Geerts. Explaining repaired data with cfds. Proc. VLDB Endow., 11(11):1387–1399, July 2018.
- (50) J. Rammelaere, F. Geerts, and B. Goethals. Cleaning data with forbidden itemsets. In Data Engineering (ICDE), 2017 IEEE 33rd International Conference on, pages 897–908. IEEE, 2017.
- (51) J. W. Ratcliff and D. E. Metzener. Pattern matching: The gestalt approach. Dr. Dobb’s Journal of Software Tools, 13(7):46, 47, 59–51, 68–72, July 1988.
- (52) A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11(3):269–282, 2017.
- (53) A. J. Ratner, H. R. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré. Learning to compose domain-specific transformations for data augmentation. In NIPS, pages 3239–3249, 2017.
- (54) C. Ré. Software 2.0 and snorkel: Beyond hand-labeled data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2876–2876. ACM, 2018.
- (55) T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré. Holoclean: Holistic data repairs with probabilistic inference. Proceedings of the VLDB Endowment, 10(11):1190–1201, 2017.
- (56) C. D. Sa, I. F. Ilyas, B. Kimelfeld, C. Ré, and T. Rekatsinas. A formal framework for probabilistic unclean databases. ICDT, 2019.
- (57) B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.
- (58) R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
- (59) M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. Zdonik, A. Pagan, and S. Xu. Data curation at scale: The Data Tamer system. In CIDR, 2013.
- (60) J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD, pages 457–468, 2014.
- (61) E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553–564, June 2013.
- (62) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. 2017.
- (63) Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass. Highway long short-term memory RNNs for distant speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 5755–5759. IEEE, 2016.
- (64) X. Zhu. Semi-supervised learning tutorial. In International Conference on Machine Learning (ICML), pages 1–135, 2007.
Appendix A Appendix
We provide additional details for the representation models in our framework and present additional micro-benchmark experimental results on the robustness of our error detection approach to noisy denial constraints.
A.1. Details on Representation Models
Our model follows the wide and deep architecture of Cheng et al. (wideanddeep, ). The model can be thought of as two stages: a representation stage, in which each feature is processed in isolation, and an inference stage, in which the per-feature representations are concatenated into a joint representation. The joint representation is then fed through a two-layer neural network. At training time, we backpropagate through the entire network jointly, rather than training each representation separately. Figure 7 illustrates this model’s topology.
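As a rough illustration of the inference stage described above, the following minimal sketch concatenates per-feature representations and passes the result through a two-layer network. The function names, the hidden size, the ReLU activation, and the sigmoid output read as an error probability are assumptions made for illustration; they are not taken from our implementation.

```python
import math
import random

def dense(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def joint_forward(feature_blocks, params):
    """Concatenate per-feature representations and apply a two-layer network."""
    # Joint representation: the per-feature outputs placed side by side.
    joint = [v for block in feature_blocks for v in block]
    # Layer 1 with a ReLU activation (the activation is an assumption).
    hidden = [max(0.0, a) for a in dense(joint, params["W1"], params["b1"])]
    # Layer 2 with a single output unit.
    logit = dense(hidden, params["W2"], params["b2"])[0]
    # Assumption: the output is squashed into a probability that the cell is erroneous.
    return 1.0 / (1.0 + math.exp(-logit))

def init_params(in_dim, hidden_dim, seed=0):
    """Small random weights; a real model would learn these by backpropagation."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "W2": [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]],
        "b2": [0.0],
    }
```

Each entry of `feature_blocks` stands for the output of one representation model, so a model with output dimension 1 contributes a single-element list.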
A summary of the representation models used in our approach, along with their dimensions, is provided in Table 7. As shown, we use a variety of models that capture all three contexts: attribute-level, tuple-level, and dataset-level. We next discuss the embedding-based models and format models we use.
|Context|Representation Type|Description|Dimension|
|---|---|---|---|
|Attribute-Level|Character Embedding|FastText Embedding where tokens are characters|1|
| |Word Embedding|FastText Embedding where tokens are words in the cell|1|
| |Format models|3-Gram: Frequency of the least frequent 3-gram in the cell|1|
| |Format models|Symbolic 3-Gram; each character is replaced by a token|1|
| |Empirical distribution model|Frequency of cell value|1|
| |Empirical distribution model|One Hot Column ID; Captures per-column bias|1|
|Tuple-Level|Co-occurrence model|Co-occurrence statistics for a cell’s value|#attributes - 1|
| |Tuple representation|FastText-based embedding of the union of tokens after tokenizing each attribute value|1|
|Dataset-Level|Constraint violations|Number of violations per denial constraint|#constraints|
| |Neighborhood representation|Distance to top-1 similar word using a FastText tuple embedding over the non-tokenized attribute values|1|
Embedding-based Models: We treat different views of the data as expressing different language models, and embed each view to capture its semantics. The embeddings are computed over character-level, cell-level, and tuple-level tokens, and each uses a 50-dimensional FastText embedding (E17-2068, ; Q17-1010, ). Rather than performing inference directly on the embeddings, we employ a two-step process of non-linear transformation followed by dimensionality reduction. In the non-linear transformation step, we use a two-layer Highway Network (srivastava2015highway, ) to extract useful representations of the data. A dense layer then reduces the output to a single dimension, so that the embeddings do not dominate the joint representation. Figure 2(B) shows this module in more detail.
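The two-step module can be sketched as follows, under stated assumptions. A highway layer computes y = T(x) * H(x) + (1 - T(x)) * x, where T is a sigmoid gate and H is a non-linear transform; the ReLU used for H, the helper names, and the random initialization are illustrative choices and not details of our implementation.

```python
import math
import random

def affine(weights, x, bias):
    """Compute W x + b for a square list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def highway_layer(x, w_h, b_h, w_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x with a ReLU transform H and a sigmoid gate T."""
    h = [max(0.0, a) for a in affine(w_h, x, b_h)]
    t = [sigmoid(a) for a in affine(w_t, x, b_t)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]

def embedding_to_scalar(x, layers, w_out, b_out):
    """Apply the highway layers, then a dense layer with a single output unit."""
    for w_h, b_h, w_t, b_t in layers:
        x = highway_layer(x, w_h, b_h, w_t, b_t)
    return sum(w * v for w, v in zip(w_out, x)) + b_out

def random_matrix(n, rng, scale=0.1):
    """Hypothetical initializer: an n x n matrix of small uniform weights."""
    return [[rng.uniform(-scale, scale) for _ in range(n)] for _ in range(n)]
```

Passing two `(w_h, b_h, w_t, b_t)` tuples in `layers` gives the two-layer variant; the scalar result is what would enter the joint representation.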
In addition to these individual embeddings, we also feed a distance signal computed on the learned corpus into the model (see Neighborhood representation in Table 7). The intuition is that, when other signals suggest that a cell is erroneous, the dataset may contain a similar cell with the correct value; the distance to that cell will then be low. To compute this signal, we take the minimum distance from the cell’s embedding to any other embedding in the corpus and feed this distance to the joint representation.
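A minimal sketch of this neighborhood signal is given below. The text does not fix the distance metric, so Euclidean distance is an assumption here, as are the function name and the brute-force search over the corpus.

```python
import math

def min_neighbor_distance(index, embeddings):
    """Distance from embeddings[index] to its nearest other embedding.

    Assumption: Euclidean distance; `embeddings` is a list of equal-length vectors.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings to define a neighbor")
    target = embeddings[index]
    return min(
        math.dist(target, other)
        for j, other in enumerate(embeddings)
        if j != index
    )
```

If the corpus holds one embedding per distinct value, a result close to zero means that a nearly identical value exists elsewhere in the dataset.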
Format Models (3-Grams): We follow an approach similar to that of Huang and He (DBLP:conf/sigmod/HuangH18, ), who introduce custom language models for outlier detection. We use a simplified variation of this approach with two fixed-length language models, which correspond to the 3-Gram models shown in Table 7. To build these representation models, we estimate a distribution over the 3-grams present in each column, using the empirical distribution of the data with Laplace smoothing. For the 3-Gram model, the distribution is defined over all possible ASCII 3-grams. In the symbolic variation, each character is first replaced by a symbol token, and the distribution is defined over 3-grams of this smaller symbol alphabet. The value returned by each model is the lowest frequency among all 3-grams present in the cell value.
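The format models can be sketched as below. The add-one (Laplace) smoothing over a vocabulary of 128^3 ASCII 3-grams follows the description above; the symbol mapping in `symbolize` (letter, digit, other) is a hypothetical stand-in for the symbol alphabet, and the value returned for cells shorter than three characters is likewise an assumption.

```python
from collections import Counter

ASCII_TRIGRAMS = 128 ** 3  # number of possible ASCII 3-grams

def trigrams(value):
    """All contiguous 3-character substrings of a cell value."""
    return [value[i:i + 3] for i in range(len(value) - 2)]

def fit_trigram_model(column_values, vocab_size=ASCII_TRIGRAMS):
    """Return a Laplace-smoothed 3-gram probability function for one column."""
    counts = Counter(g for v in column_values for g in trigrams(v))
    total = sum(counts.values())

    def prob(gram):
        # Add-one smoothing: unseen 3-grams get a small non-zero probability.
        return (counts[gram] + 1) / (total + vocab_size)

    return prob

def lowest_trigram_frequency(value, prob):
    """Feature value: frequency of the least frequent 3-gram in the cell."""
    grams = trigrams(value)
    if not grams:
        return 1.0  # assumption: cells with no 3-gram are treated as unsurprising
    return min(prob(g) for g in grams)

def symbolize(value):
    """Hypothetical symbol mapping: letter -> 'A', digit -> 'N', anything else -> 'S'."""
    return "".join(
        "A" if ch.isalpha() else "N" if ch.isdigit() else "S" for ch in value
    )
```

For the symbolic variant one would fit the same model on `symbolize(v)` for every value `v`, with `vocab_size=27` under the three-symbol mapping assumed here.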
A.2. Effect of Misspecified Constraints
We conduct a series of micro-benchmark experiments to evaluate the robustness of AUG against misspecified denial constraints. First, we evaluate AUG’s performance as only a subset of constraints is given as input, and second, we evaluate AUG’s performance as constraints become noisy.
A.2.1. Limiting the Number of Constraints
We consider Hospital, Adult, and Soccer with the denial constraints used for our experiments in Section 6 and perform the following experiment: For each dataset, we vary the number of constraints given as input to AUG by taking only a proportion of the initial constraints. We vary this proportion over a range of values, where 0.2 indicates that a random subset of 20% of the constraints is used while 1.0 indicates that all constraints are used. For each value of the proportion, we obtain 21 random samples of the constraints and evaluate AUG on each of these subsets. We report the median F1, precision, and recall in Table 8. As shown, AUG’s performance gradually decreases as the number of denial constraints is reduced and converges to the performance reported in the study in Section 6.3 when no constraints are used in AUG. The results in Table 8 also show that AUG is robust to small variations in the number of constraints provided as input: for sufficiently large proportions, the F1 score of AUG does not drop by more than two points.
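The sampling protocol can be sketched as follows. The `evaluate` callback is hypothetical: it stands for training and scoring AUG on a given constraint subset and returning one metric such as F1. Rounding the subset size and keeping at least one constraint are also assumptions.

```python
import random
import statistics

def sample_constraints(constraints, proportion, rng):
    """Draw a random subset containing roughly `proportion` of the constraints."""
    k = max(1, round(proportion * len(constraints)))  # assumption: keep at least one
    return rng.sample(constraints, k)

def median_score(constraints, proportion, evaluate, n_samples=21, seed=0):
    """Median of a metric over repeated random constraint subsets.

    `evaluate` is a hypothetical callback mapping a constraint subset to a score.
    """
    rng = random.Random(seed)
    scores = [
        evaluate(sample_constraints(constraints, proportion, rng))
        for _ in range(n_samples)
    ]
    return statistics.median(scores)
```

With an odd number of samples such as 21, the median is always one of the observed scores instead of an average of two.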
A.2.2. Noisy Denial Constraints
We now turn our attention to noisy constraints. We use the following definition of noisy constraints:
Definition A.1.
A denial constraint σ is α-noisy on a dataset D if it is satisfied by α percent of all tuple pairs in D.
We want to measure the effect of noisy denial constraints on the performance of AUG. We use the following strategy to identify noisy denial constraints for each dataset: We use the denial constraint discovery method of Chu et al. (chu2013discovering, ) and group the discovered constraints into four ranges with respect to their noise level. For each range, we obtain 21 constraint-set samples, such that each sampled constraint set has the same cardinality as the original clean constraints associated with each of the Hospital, Adult, and Soccer datasets. We report the median performance of AUG in Table 9. As shown, the impact of noisy denial constraints on AUG’s performance is not significant. The reason is that during training AUG can identify that the representation associated with denial constraints corresponds to a noisy feature and thus reduce its weight in the final classifier.
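To make the noise level of Definition A.1 concrete, the sketch below computes the percentage of tuple pairs that satisfy a denial constraint. It assumes ordered pairs of distinct tuples and represents a constraint by a hypothetical predicate `violates(t1, t2)`; the quadratic scan is for illustration only.

```python
from itertools import permutations

def satisfied_percent(rows, violates):
    """Percentage of ordered pairs of distinct tuples that satisfy a denial constraint.

    `violates(t1, t2)` is a hypothetical predicate that is True when the pair
    breaks the constraint.
    """
    pairs = list(permutations(rows, 2))
    if not pairs:
        raise ValueError("need at least two tuples")
    satisfied = sum(1 for t1, t2 in pairs if not violates(t1, t2))
    return 100.0 * satisfied / len(pairs)

def zip_determines_city(t1, t2):
    """Violation check for a toy constraint: not(t1.zip = t2.zip and t1.city != t2.city)."""
    return t1["zip"] == t2["zip"] and t1["city"] != t2["city"]
```

On a toy table with rows (1, A), (1, A), (1, B), (2, C) for (zip, city), 4 of the 12 ordered pairs violate the constraint, so it is satisfied by about 66.7 percent of the pairs.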
A.3. Learned Augmentation Policies
We provide examples of learned policies for clean entries in Hospital, Adult, and Animal. For Hospital and Adult, we know how errors were introduced and hence can evaluate the performance of our methods for learning augmentation policies. Errors in Hospital correspond to typos introduced artificially by replacing a character in the clean cell values with the character ‘x’. Errors in the gender attribute of Adult, on the other hand, are introduced either by swapping the two gender values ‘Female’ and ‘Male’ or by introducing typos via injection of characters. For Animal, we do not know how errors are introduced. However, we focus on an attribute that can only take values in {R, O, Empty} to evaluate the performance of our methods.
Figure 8 depicts the top-10 entries in the conditional distribution corresponding to entry ‘scip-inf-4’ for Hospital and entry ‘Female’ for Adult. As shown, for Hospital, almost all transformations learned by our method correspond to either swapping a character with the character ‘x’ or injecting ‘x’ into the original string. The performance of our approach is similar for Adult: we observe that a mix of value swaps, e.g., ‘Female’ → ‘Male’, and character injection transformations are learned. Finally, for Animal, we see that most of the mass of the conditional distribution (almost 86%) is concentrated in the value swap transformations ‘R’ → ‘Empty’ and ‘R’ → ‘O’, while all other transformations have negligible probabilities. These results demonstrate that our methods can effectively learn how errors are introduced and distributed in noisy relational datasets.
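As a simplified illustration of what a conditional distribution over transformations looks like, the sketch below estimates it by counting (clean, noisy) value pairs and lists the top entries for one clean value. This counting estimate is a stand-in chosen for brevity, not our policy-learning procedure, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def conditional_distribution(pairs):
    """Empirical P(noisy value | clean value) from (clean, noisy) example pairs."""
    counts = defaultdict(Counter)
    for clean, noisy in pairs:
        if clean != noisy:  # only count actual transformations
            counts[clean][noisy] += 1
    return {
        clean: {noisy: c / sum(ctr.values()) for noisy, c in ctr.items()}
        for clean, ctr in counts.items()
    }

def top_k(dist, clean, k=10):
    """The k most probable transformations of `clean`, ties broken alphabetically."""
    entries = dist.get(clean, {}).items()
    return sorted(entries, key=lambda kv: (-kv[1], kv[0]))[:k]
```

Calling `top_k(dist, 'Female')` on a distribution estimated this way would list the most likely corrupted forms of that entry, which is the kind of ranking shown in Figure 8.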