Cleaning Denial Constraint Violations through Relaxation

02/14/2020 · Stella Giannakopoulou, et al. · EPFL, Facebook

Data cleaning is a time-consuming process that depends on the data analysis that users perform. Existing solutions treat data cleaning as a separate, offline process that takes place before analysis starts. Applying data cleaning before analysis assumes a priori knowledge of the inconsistencies and of the query workload, thereby requiring effort to understand and clean data that is unnecessary for the analysis. This paper proposes an approach that performs probabilistic repairing of denial constraint violations on-demand, driven by the exploratory analysis that users perform. It introduces Daisy, a system that integrates data cleaning seamlessly into the analysis by relaxing query results. Daisy executes analytical query workloads over dirty data by weaving cleaning operators into the query plan. Our evaluation shows that Daisy adapts to the workload and outperforms traditional offline cleaning on both synthetic and real-world workloads.


1. Introduction

Real-life data contain erroneous information, which leads to inaccurate data analysis (incomplete_dbs; activeclean). Data scientists spend most of their time on cleaning their data (buckley_new_2012) until they are able to extract insights. Depending on the accuracy requirements of the workload and the data they need to access, users iteratively apply cleaning tasks until they are satisfied with the resulting quality. Thus, data cleaning is subjective and time-consuming.

Data cleaning is an interactive and exploratory process which involves expensive operations. Error detection requires multiple pairwise comparisons to check the satisfiability of the constraints (cleanm). Data repairing adds an extra overhead since it requires many iterations of assigning candidate values to the erroneous cells until all rules are satisfied (holistic_dc; nadeef). At the same time, data scientists detect inconsistencies, and determine the required cleaning tasks at data exploration time (quERy). Thus, traversing the whole dataset multiple times to fix each discovered discrepancy is cost-prohibitive.

State-of-the-art approaches can be divided into offline and analysis-aware. Offline approaches (nadeef; bigdansing; holoclean) treat data cleaning as a separate process, decoupled from analysis. Applying data cleaning before analysis starts requires prior knowledge of the errors that exist. Furthermore, offline cleaning is cost-prohibitive, since it operates over the whole dataset (data_quality). Analysis-aware tools (sampleclean; activeclean; qo_dynamic_imputation) focus on entity resolution or deduplication, or limit themselves to cell-level errors. Moreover, entity resolution tools require expensive preprocessing (quERy), and others only support approximate query processing (sampleclean).

Thus, there is a need for a cleaning approach that is woven into the analysis and applies data cleaning on-demand. With on-the-fly cleaning, one only cleans the necessary data, thereby minimizing wasted effort when only a subset of the data is analyzed. Moreover, online cleaning also benefits offline cleaning approaches by improving the predictability of which cleaning tasks are required. Thus, integrating cleaning with analysis efficiently supports exploratory applications (exploratory_analysis) by reducing the number and cost of iterations required to extract insights.

Name Zip City
Jon 9001 Los Angeles
Jim 9001 San Francisco
Mary 10001 New York
Jane 10002 New York
Table 1. Employees dataset.
Example 1.1.

Consider the dataset of Table 1, which comprises employee information. Assume that a user is interested in analyzing the employees who live in Los Angeles. The insights that the user extracts might be wrong due to the conflict between the first two tuples, which have the same zip code but different city names. Specifically, the first two tuples violate the functional dependency zip→city, which states that the zip code determines the city. Thus, by extracting only the subset that refers to “Los Angeles”, the analysis ignores the second tuple, whose city value might be Los Angeles after cleaning. Having to clean the whole dataset is unnecessary since a) the user is interested only in a subset of the data, and b) the query result can be cleaned by checking only the relevant data subset.

We present an approach that intermingles the cleaning of denial constraint violations with exploratory SPJ (Select-Project-Join) and aggregate queries, and gradually cleans the data. Denial constraints (DCs) comprise a family of rules that have been widely used to capture inconsistencies in real-life data (data_quality_srivastava; data_quality). To provide correct results over dirty data, we introduce cleaning operators inside the query plan, and employ a cost model to optimally place them. To allow cleaning operators to detect errors, we use query result relaxation at the execution level. We define and use a novel query result relaxation mechanism in the context of DCs, which enhances the query result with correlated data from the dataset to allow violation detection. Query result relaxation differs from query relaxation (online_queryrelax; queryrelax_incomplete), which relaxes the query conditions to allow query processing over incomplete databases. Then, given the detected errors, we propose candidate fixes by providing probabilistic results (probabilistic_dbs). After the execution of each query, we isolate the changes and apply the delta to the original dataset. Thus, we incrementally obtain a clean instance of the data by adding an overhead to each query. We validate our approach by building Daisy, a distributed incremental cleaning framework over Spark (spark).

Contributions: Our contributions are as follows:

  • We present a query result relaxation mechanism which enables interleaving SPJ and aggregate query processing with cleaning DC violations. Our approach guarantees correctness compared to the offline approach in the case of functional dependencies, and provides accuracy guarantees in the presence of general DCs.

  • We introduce cleaning operators inside the query plan using a cost model which determines the execution order of cleaning and querying operators at query time. To decide on the order, the cost model takes into consideration information such as the type of rules, the type of query, and the number of violations.

  • We implement Daisy, a scale-out system that allows exploratory data analysis queries over data with DC violations, based on our design. Our experiments show that Daisy is faster than offline cleaning solutions on synthetic data, and supports real-world workloads that offline cleaning fails to address.

2. Related Work

Data cleaning has been an omnipresent data management challenge (error_detection; cleaning_challenges; datacleaning_survey). This section surveys related work and highlights how our work pushes the state of the art further.

Violations of integrity constraints: Offline cleaning systems follow a centralized or a distributed approach to tackle the problem of repairing denial constraint errors. NADEEF (nadeef) assumes known candidate fixes, while BigDansing (bigdansing) provides suggestions comprising the condition between the erroneous cells, or blindly assigns values until all constraints are satisfied (holistic_dc). BigDansing also introduces logical operators and optimizes them, but limits itself to offline cleaning, before queries arrive. LLUNATIC (llunatic) applies repairing using the principle of minimality. Daisy differs since it provides probabilistic candidate fixes for each erroneous cell.

Holoclean (holoclean) repairs data using probabilistic inference by combining integrity constraints, master data, and quantitative statistics. Holoclean differs from Daisy in that it relies on master data, and on training based on the clean part of the dataset. Daisy leverages the provided dependencies and computes a set of candidate fixes for the erroneous entities.

NADEEF, BigDansing and Holoclean differ from Daisy in that they are offline data cleaning systems that operate over the whole dataset before data analysis starts.

Analysis-aware cleaning: QuERy (quERy) intermingles duplicate elimination with query processing. QuERy uses blocking (blocking) as a preprocessing step, and introduces operators in the query plan which operate over the blocks. QuERy also optimizes the plan involving the cleaning operators. SampleClean (sampleclean) extracts a sample out of a dataset with duplicates and cell-level errors, asks users to clean it, and uses the sample to answer aggregate queries. SampleClean estimates the query result given the cleaned data, and corrects the error of the queries over the uncleaned data.

QuERy and SampleClean address entity resolution, duplicates or cell-level errors, whereas Daisy focuses on integrity constraints. Also, QuERy differs in that it requires a preprocessing phase to apply the blocking. Finally, SampleClean is applicable only over aggregate queries, for which sampling is allowed (quickr). ActiveClean (activeclean) incrementally updates a machine learning model as the user gradually cleans the data. ActiveClean addresses corruption cases affecting attribute values, excluding cases that involve multiple records such as integrity constraints. ImputeDB (qo_dynamic_imputation) considers query processing over data with missing values, and decides whether to keep or drop tuples by choosing the optimal solution in the efficiency/quality trade-off. ImputeDB also limits itself to cell-level errors.

Consistent Query Answering: The area of consistent query answering (arenas_cqa; cavsat) focuses on computing all possible repairs, and on providing query answers using the tuples that belong to every repair. Daisy differs from this idea in that it avoids fixing the whole dataset, since the repairs depend on the correlations driven by the queries and the denial constraints.

3. From Offline to Online Data Cleaning

Problem Statement: We need to efficiently clean exploratory query results in the presence of dirty data in real-time. We clean denial constraint (DC) violations (data_quality) since they involve a wide range of rules that detect semantic inconsistencies in the data. DCs are universally quantified first-order logic sentences that represent data dependencies, including functional dependencies (FDs). DCs are defined as: ∀t_1,…,t_k ¬(P_1 ∧ … ∧ P_m), where each t_i is a tuple, each P_j is a predicate involving conditions between the attributes of one or more tuples, k is the number of involved tuples, and m is the number of predicates.

Challenges: Cleaning exploratory query results online is challenging since it must accurately clean the result without cleaning the whole dataset. Also, during data exploration, users have partial knowledge of the rules that must hold; cleaning a value given partial knowledge of the rules affects the resulting data quality (glitches). Still, even when the rules are known, automatically fixing an error might result in inaccuracies (holoclean); human effort or master data are required. Finally, since cleaning is costly compared to query processing, adding the cleaning overhead to each query might result in a higher overall cost than cleaning the whole dataset a priori.

Solution: To efficiently and accurately provide correct query results, we integrate cleaning DC violations with query processing. Specifically, we introduce cleaning operators inside the query plan to allow clean query answers. To optimize the execution of each query, we detect the relevant subset of data that affects the cleanliness of the result. We also introduce a cost model to optimally place the cleaning operators based on the overhead they add on each query. To capture partial knowledge of the rules, we clean data by providing probabilistic fixes. Then, based on our solution, once all rules are known, one can either use inference algorithms (scare; holoclean; optimal_repairs) given that master data exist, or have humans fix the errors in the query results, given the probabilistic suggestions. Inference approaches over the probabilistic values are complementary and out of the scope of this work. Future work includes examining the human cost of cleaning the flagged dirty values of the query results.

4. Query Execution over Dirty Data

Executing queries over dirty data produces wrong query results (activeclean). A tuple might erroneously appear in, or be missing from, a query result due to a dirty value. To fix wrong query results, we detect the potentially qualifying tuples given the constraints using query result relaxation. Query result relaxation differs from query relaxation (queryrelax_incomplete) in that instead of relaxing the conditions of the query, it relaxes the result. Moreover, we relax the result to compute conflicting tuples based on the input DCs, whereas query relaxation is used to deal with failing queries and incomplete databases. After relaxing the result, we detect errors and provide probabilistic fixes based on the frequency with which each candidate value appears for a dirty cell. Then, we update the dataset in-place with the probabilities, while also maintaining provenance information to the original values in case new rules appear. Thus, we gradually transform the dataset into a probabilistic dataset.

Our probabilistic representation uses attribute-level uncertainty (probabilistic_dbs); attributes take multiple candidate values. To represent candidate tuples using the attribute-level representation, we store in each candidate value an identifier of the possible world it belongs to. Then, query operators output a tuple iff at least one candidate value qualifies. Each tuple of the result contains all candidate values to enable reasoning over the data. For (self-)joins on probabilistic join keys, a pair qualifies iff the candidate values of the join keys overlap. For joins, we employ an approach similar to the lineage used in probabilistic data (probabilistic_dbs) by storing in the result the originating tuple ids, since if a potential inference updates a join key value, a pair might no longer satisfy the join predicate.
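For concreteness, the following minimal Python sketch (illustrative names, not Daisy's actual API) shows attribute-level uncertainty: a cell stores candidate values with probabilities and a possible-world tag, a selection qualifies a tuple iff at least one candidate satisfies the predicate, and a join pair qualifies iff the candidate join-key sets overlap.

from dataclasses import dataclass

@dataclass
class ProbCell:
    # A cell with attribute-level uncertainty: candidate value -> probability.
    # `world` identifies the candidate pair (possible world) the value belongs to.
    candidates: dict
    world: int = 0

    def values(self):
        return set(self.candidates)

def qualifies(cell, predicate):
    # A tuple qualifies iff at least one candidate value satisfies the predicate.
    return any(predicate(v) for v in cell.values())

def join_pair_qualifies(left_key, right_key):
    # For (self-)joins on probabilistic keys, a pair qualifies iff the
    # candidate join-key values overlap.
    return bool(left_key.values() & right_key.values())

# Example: a zip that is 9001 or 10001 after repairing still joins with 10001.
zip_a = ProbCell({"9001": 0.5, "10001": 0.5})
zip_b = ProbCell({"10001": 1.0})
print(qualifies(zip_a, lambda v: v == "9001"))   # True
print(join_pair_qualifies(zip_a, zip_b))         # True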

In the following, we introduce cleaning operators that enable cleaning at query time.

Definition 4.1.

A cleaning operator is an update operator that takes as input a query result or a relation, and outputs the clean result or relation. When the cleaning operator takes input from a query operator, it a) relaxes the result based on the dependencies of the input DCs, b) detects and fixes errors, and c) updates the data in-place with the clean values.

Cleaning operators differentiate between SP and Join queries. For group-by queries, cleaning takes place before the aggregation. If no select or join exists, the group by operates over the whole dataset, so we push cleaning down to avoid recomputing the grouping. If there is a condition below the aggregation, cleaning is placed either before or after the condition. In the following, we present the result relaxation mechanism for SP and Join queries given one or more DCs.

4.1. Relaxing SP Query Results given a FD

The cleaning operator for SP queries is defined as follows.

Definition 4.2.

The SP cleaning operator cleans the output of a select operator by enriching the result with extra tuples from the dataset.

Consider a dataset with schema S, and an FD φ: X→Y, where X⊆S and Y⊆S. X may contain multiple attributes, whereas Y contains a single attribute; if Y contained more attributes (e.g., Y1, Y2), then φ would be mapped to multiple FDs (e.g., φ1: X→Y1, φ2: X→Y2) (datacleaning_survey). Given an SP query with projection list P⊆S and where-clause attributes W⊆S, φ affects the correctness of the query iff (X ∪ Y) ∩ (P ∪ W) ≠ ∅, that is, iff the query accesses an attribute of φ. If the query overlaps with φ, the cleaning operator augments the query result with tuples from the dataset which have the same value for the attributes X and/or Y. We will refer to the extra tuples as correlated tuples.
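As an illustration of the overlap test, a minimal sketch (hypothetical helper, not part of Daisy's interface) that checks whether an FD X→Y affects a query with projection list P and where-clause attributes W:

def fd_affects_query(X, Y, P, W):
    # An FD X -> Y affects a query iff (X u Y) intersects (P u W),
    # i.e., the query touches at least one attribute of the rule.
    return bool((set(X) | set(Y)) & (set(P) | set(W)))

# The FD Zip -> City affects a query projecting Zip and filtering on City.
print(fd_affects_query({"Zip"}, {"City"}, P={"Zip"}, W={"City"}))   # True
print(fd_affects_query({"Zip"}, {"City"}, P={"Name"}, W={"Name"}))  # False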

Algorithm 1 shows the general query result relaxation, which uses transitive closure; it iteratively computes the correlated tuples of the result set until no more correlated tuples are detected. Consider L, R being the sets of left-hand-side (lhs) and right-hand-side (rhs) values of the result respectively (lines 4,5). Algorithm 1 enhances the result with extra entities belonging to the union of the sets {t | t.lhs ∈ L} and {t | t.rhs ∈ R} (lines 6-10), where t.lhs, t.rhs are the lhs and rhs values of each tuple t. Thus, by computing the correlated tuples, the operator determines the probabilistic values of the erroneous cells.

Input: Dataset d, Query answer A, FD lhs→rhs
1: result = A
2: extra = A
3: unvisited = d − A
4: while extra ≠ ∅ do
5:     L = extra.map(t ⇒ t.lhs)
6:     R = extra.map(t ⇒ t.rhs)
7:     lhsMatch = unvisited.filter(t ⇒ L.contains(t.lhs))
8:     unvisited = unvisited − lhsMatch
9:     rhsMatch = unvisited.filter(t ⇒ R.contains(t.rhs))
10:    unvisited = unvisited − rhsMatch
11:    extra = lhsMatch ∪ rhsMatch; result = result ∪ extra
12: return result
Algorithm 1. SP query result relaxation for FDs
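A minimal executable rendering of Algorithm 1 (tuples modeled as Python dicts; attribute names are assumptions for illustration), applied to the Cities data of Table 2 shown further below, with a filter on the lhs (Zip = 9001):

def relax_sp_result(dataset, answer, lhs, rhs):
    # Relax an SP query result for an FD lhs -> rhs: iteratively add tuples from
    # the dataset that share an lhs or rhs value with the newly added tuples
    # (transitive closure), as in Algorithm 1.
    result = list(answer)
    extra = list(answer)
    unvisited = [t for t in dataset if t not in answer]
    while extra:
        L = {t[lhs] for t in extra}
        R = {t[rhs] for t in extra}
        lhs_match = [t for t in unvisited if t[lhs] in L]
        unvisited = [t for t in unvisited if t not in lhs_match]
        rhs_match = [t for t in unvisited if t[rhs] in R]
        unvisited = [t for t in unvisited if t not in rhs_match]
        extra = lhs_match + rhs_match
        result.extend(extra)
    return result

cities = [
    {"Zip": "9001", "City": "Los Angeles"},
    {"Zip": "9001", "City": "San Francisco"},
    {"Zip": "9001", "City": "Los Angeles"},
    {"Zip": "10001", "City": "San Francisco"},
    {"Zip": "10001", "City": "New York"},
]
answer = [t for t in cities if t["Zip"] == "9001"]      # filter on the lhs
print(relax_sp_result(cities, answer, "Zip", "City"))   # pulls in both 10001 tuples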

After relaxing the result, the cleaning operator detects errors and computes candidate fixes. Consider random variables LHS, RHS representing the candidate lhs and rhs values of an erroneous tuple t. LHS contains the lhs values of the tuples that have the same rhs value as t; each candidate lhs value v has probability P(v | t.rhs). Similarly, RHS contains the rhs values of the tuples that have the same lhs value as t, each with probability P(v | t.lhs). Thus, for a specific FD, based on the dependency of the attributes, each tuple can have two instances, one with the candidate lhs given the existing rhs and one with the candidate rhs given the existing lhs. Since in our internal representation we use attribute-level uncertainty, we store the dependencies inside each attribute value to denote the different candidate pairs.
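The conditional probabilities can be computed with simple frequency counts over the relaxed result; a small sketch (helper names are illustrative), reproducing the percentages used in the examples of this section:

from collections import Counter

def candidate_fixes(relaxed, tup, lhs, rhs):
    # Frequency-based candidates for an erroneous tuple under lhs -> rhs:
    # RHS candidates come from tuples sharing tup's lhs value,
    # LHS candidates from tuples sharing tup's rhs value.
    rhs_counts = Counter(t[rhs] for t in relaxed if t[lhs] == tup[lhs])
    lhs_counts = Counter(t[lhs] for t in relaxed if t[rhs] == tup[rhs])
    rhs_cand = {v: c / sum(rhs_counts.values()) for v, c in rhs_counts.items()}
    lhs_cand = {v: c / sum(lhs_counts.values()) for v, c in lhs_counts.items()}
    return lhs_cand, rhs_cand

relaxed = [
    {"Zip": "9001", "City": "Los Angeles"},
    {"Zip": "9001", "City": "San Francisco"},
    {"Zip": "9001", "City": "Los Angeles"},
    {"Zip": "10001", "City": "San Francisco"},
]
# Candidates for the erroneous tuple (9001, San Francisco):
print(candidate_fixes(relaxed, relaxed[1], "Zip", "City"))
# LHS: {9001: 0.5, 10001: 0.5}; RHS: {Los Angeles: ~0.67, San Francisco: ~0.33}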

Lemma 4.3.

Query result relaxation for SP queries with a filter on the rhs of an FD requires one iteration of Algorithm 1 to accurately compute the candidate values.

Proof.

Consider a query with a filter restricting the rhs over the range [a,b]. The correct result must include both the clean tuples with values in the range [a,b], as well as the erroneous tuples that are candidates to take values in [a,b]. Given the dirty result, Algorithm 1 computes the tuples which have matching lhs with the result (line 6). We assume that in the erroneous tuples either the lhs, or the rhs contains a reliable value in order to exploit the dependency and be able to make a prediction (scare). Thus, the extra tuples with matching lhs are the candidates to get rhs in [a,b]. Thus, enhancing the result with tuples that have the same lhs guarantees no missing tuples. We also show that the included tuples contain all candidate values. Algorithm 1 covers the rhs candidate values by computing the tuples with matching lhs. Then, Algorithm 1 computes the lhs values of all tuples with matching rhs. However, these tuples are already included in the enhanced result since they satisfy the query, thus the algorithm terminates. ∎

Zip City
9001 Los Angeles
9001 San Francisco
9001 Los Angeles
10001 San Francisco
10001 New York
(a)
Zip | City
9001 | Los Angeles 67%, San Francisco 33%
9001 | Los Angeles 67%, San Francisco 33%
- - - - - - - - - - - - - - - - - - - - -
9001 50%, 10001 50% | San Francisco
9001 | Los Angeles 67%, San Francisco 33%
10001 | San Francisco
10001 | New York
(b)
Table 2. Cities dataset: (a) Dirty version, (b) Partially clean version with candidate values. The dashed line denotes different candidate fixes for the erroneous tuple.
Example 4.4.

Consider the dataset of Table 2, the FD zip→city, and a query requesting the zip code of “Los Angeles”.

The dirty result consists of the first and the third tuple of Table 2(a). The cleaning operator, by following Algorithm 1, enhances the result with the tuples that have the same lhs as the result, that is, the tuples with Zip=9001, namely {9001, San Francisco}. Afterwards, it adds the tuples that share a rhs value with the result, that is, the tuples with City=Los Angeles. However, based on the proof of Lemma 4.3, this set is empty since the tuples with City=Los Angeles already appear in the result. Then, the operator computes the candidate fixes LHS and RHS, and their probabilities P(City|Zip) and P(Zip|City), for the tuples of the updated result. For the first and the third tuple, LHS consists of the lhs values of the tuples that have City=Los Angeles. Similarly, RHS consists of the rhs values of the tuples that have Zip=9001, that is, San Francisco and Los Angeles. The corresponding probabilities of each value are given by the conditional probabilities P(City|Zip=9001) and P(Zip|City=Los Angeles). For the second tuple there are two candidate pairs, distinguished by a dashed line in the table: {City|Zip=9001} = {San Francisco 33%, Los Angeles 67%} and {Zip|City=San Francisco} = {9001 50%, 10001 50%}. The updated version of the dataset is shown in Table 2(b).

A filter over the lhs requires multiple iterations in order to also include the erroneous tuples which qualify for the query.

Example 4.5.

Consider the dataset of Table 2 and a query requesting the city name with zip code “9001”.

Zip | City
9001 | Los Angeles 67%, San Francisco 33%
9001 | Los Angeles 67%, San Francisco 33%
- - - - - - - - - - - - - - - - - - - - -
9001 50%, 10001 50% | San Francisco
9001 | Los Angeles 67%, San Francisco 33%
10001 | San Francisco 50%, New York 50%
- - - - - - - - - - - - - - - - - - - - -
9001 50%, 10001 50% | San Francisco
Table 3. Correct query result given a condition on the lhs. The query result becomes accurate after traversing the dataset again to fetch more correlated entities.

The dirty result comprises the first three tuples of Table 2. However, given the conflict between the tuples with zip code 9001, the correct result contains the four tuples shown in Table 3. The fourth tuple qualifies because it has two worlds ({9001 50%, 10001 50%} and {10001}), and the first one satisfies the condition. Thus, Algorithm 1 adds the tuple {10001, San Francisco} since it contains a value which appears in the result. Then, the next iteration adds the tuple {10001, New York} since 10001 belongs to the relaxed result. Thus, using transitive closure, Algorithm 1 determines the whole cluster of correlated entities.

Lemma 4.6.

Consider a query with a filter on the lhs, and a relaxed result A_i with maximal size |A_i| at iteration i. Algorithm 1 requires an extra iteration to accurately compute the candidate values with probability 1 − P(X=0), where P follows the hypergeometric distribution, N is the dataset size, and the dataset contains K violations.

Proof.

Algorithm 1 terminates when the computed augmented result contains no more errors, that is, there are no tuples with the same lhs and different rhs. Consider iteration i, where the relaxed answer A_i has maximal result size |A_i|. The probability that A_i contains at least one violation is equivalent to the complement of the probability of having no violations, P(X=0). Using the hypergeometric distribution, we estimate P(X=0) over the subset A_i, given a total population of size N that contains K violations. Thus, the probability of requiring an extra iteration is 1 − P(X=0). ∎
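Under the assumptions of Lemma 4.6, this probability can be evaluated directly with a hypergeometric distribution; a sketch using scipy (the parameter values are arbitrary examples, not measurements):

from scipy.stats import hypergeom

def prob_extra_iteration(N, K, a):
    # Probability that a relaxed result of size a, drawn from a dataset of size N
    # containing K violating tuples, contains at least one violation and hence
    # triggers another iteration: 1 - P(X = 0).
    return 1.0 - hypergeom.pmf(0, N, K, a)

# e.g., 10,000 tuples, 100 of them violating, relaxed result of 200 tuples
print(prob_extra_iteration(N=10_000, K=100, a=200))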

Lemma 4.7.

Let C be the set of attributes that appear in the FDs. Let d_c be the cardinality (number of distinct values) of each attribute c ∈ C in the query result, and F_c, f_c the frequency distributions of c over the dataset and the query result respectively. The upper bound of the relaxed result size in each iteration is Σ_{c∈C} Σ_{i=1}^{d_c} F_c(v_i).

Proof.

Given that the values v_1,…,v_{d_c} of the result follow a distribution f_c, the total frequency of these values over the dataset is Σ_{i=1}^{d_c} F_c(v_i). The upper bound corresponds to the worst-case scenario where there is no overlap between the sets of correlated tuples stemming from each attribute c. In the worst case, the number of extra tuples that the relaxation adds to the result set corresponds to the number of tuples sharing the same value for each attribute of the result. Therefore, in total, the number of tuples is: Σ_{c∈C} Σ_{i=1}^{d_c} F_c(v_i). ∎

Thus, in the case of queries restricting the rhs, the upper bound is equivalent to the total dataset frequency of the lhs values that appear in the result.

Relaxation benefit: The extra tuples contain the pruned domain of values that a system, or a user needs to infer the correct value of an erroneous cell (holoclean). Specifically, a query result contains a set of tuples with a restricted set of values for the attributes of the constraints. The cleaning process can exploit this characteristic and extract all the correlated tuples of the result, instead of computing the candidate fixes separately for each violated tuple. Thus, instead of traversing the whole dataset for each erroneous value to compute the candidate fixes, relaxation iterates over the correlated tuples.

4.2. Relaxing SP Query Results given a DC

In the case of rules with arbitrary predicates, we use the holistic data cleaning approach (holistic_dc) to calculate the possible conditions that the dirty cells must satisfy. For example, given a rule with inequality predicates, the cleaning operator replaces the errors with the candidate ranges that satisfy the constraints. Then, similarly to FDs, the probability of each candidate is frequency-based, given the total number of fixes.

More formally, given a DC ∀t_1, t_2 ¬(t_1.A > t_2.A) and two tuples t_1, t_2 for which t_1.A > t_2.A, a candidate fix of the violation needs to enforce the constraint: t_1.A = {t_1.A or < t_2.A}, t_2.A = {t_2.A or > t_1.A}. Thus, each attribute value will either maintain its original value, or obtain a value satisfying the range. In the case of DCs involving more atoms, we map the DC formula to a SAT formula (nadeef), where a subset of atoms must become false (invert their condition) in order to satisfy the formula. Thus, a possible violation fix requires updating the appropriate attribute values in order to invert the condition of the subset of atoms that cause the violation. Thus, to include all the possible combinations of violated atoms, we produce all possible candidate attribute combinations. Then, a SAT solver (sat) can decide which atoms must remain true and which need to invert their conditions (become false) in order to satisfy the whole DC formula.

The probabilities of each candidate fix are based on the frequency with which each of the candidate ranges appears. We provide frequency-based probabilities to collect all possible fixes for a specific value, accompanied by their weight. Then, after having computed the candidate fixes of the data subset that affects the cleanliness of a value, an inference algorithm can repair the dirty values. Future work considers updating the probabilities after accessing more data, thereby incrementally inferring the correct value.

Example 4.8.

Consider a dataset with {salary, tax, age} values t_1:{sal:1000, tax:0.1, age:31}, t_2:{sal:3000, tax:0.2, age:32}, t_3:{sal:2000, tax:0.3, age:43} and rule φ: ∀t_i, t_j ¬(t_i.salary > t_j.salary ∧ t_i.tax < t_j.tax).

Tuples t_2, t_3 violate φ, thus the candidate fixes for t_2 are {(2000 50%, 3000 50%), 0.2, 32} and {3000, (0.2 50%, 0.3 50%), 32}, that is, t_2 must either take a salary less than 2000 or have tax greater than 0.3. The probabilities stem from the fact that there are two possible fixes. Given a DC with more than two atoms, the candidate values contain the conditional probabilities of all possible subsets of atoms. For example, given a rule φ′ which additionally involves the age attribute and requires that both the salary and the age of an employee determine her tax rate, then apart from the aforementioned candidates, we need to include the respective fix of the age field ({3000, 0.2, (32 50%, 43 50%)}), followed by the pairwise combinations of all three candidate fixes.
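The candidate-fix construction for a two-atom inequality DC can be sketched as follows (a simplified illustration of the reasoning above, not Daisy's repair code); it reproduces the two candidate versions of t_2 from Example 4.8:

def dc_candidate_fixes(ti, tj, atoms):
    # For a DC forall ti,tj not(conjunction of atoms), inverting any single violated
    # atom repairs the violation, so each atom yields one candidate version of ti in
    # which that attribute either keeps its value or crosses tj's value; with two
    # possible fixes, each candidate value gets a frequency-based probability of 50%.
    fixes = []
    for attr in atoms:
        cand = dict(ti)
        cand[attr] = {ti[attr]: 0.5, tj[attr]: 0.5}
        fixes.append(cand)
    return fixes

t2 = {"salary": 3000, "tax": 0.2, "age": 32}
t3 = {"salary": 2000, "tax": 0.3, "age": 43}
# t2 and t3 violate the rule: t2 has the higher salary but the lower tax.
print(dc_candidate_fixes(t2, t3, atoms=["salary", "tax"]))
# [{'salary': {3000: 0.5, 2000: 0.5}, 'tax': 0.2, 'age': 32},
#  {'salary': 3000, 'tax': {0.2: 0.5, 0.3: 0.5}, 'age': 32}]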

1: function Estimate_Errors(data d, partitions p, rules r)
2:     ranges = split(d, p)
3:     for r_i in ranges do
4:         for r_j in ranges do
5:             if overlap(r_i, r_j) then
6:                 range_vio(r_i, r_j) = count_overlap(r_i, r_j)
7:     return range_vio

Input: queries Q, data d, partitions p, rules r, threshold th
1: range_vio = Estimate_Errors(d, p, r)
2: for query in Q do
3:     A = execute query
4:     range = find the ranges of A in range_vio
5:     errors = {for (i in range) yield range_vio(i)}.sum
6:     accuracy = |A| / (|A| + errors)
7:     support = fraction of checked partitions
8:     if accuracy < th then full cleaning
9:     else partial cleaning
Algorithm 2. Query-driven cleaning of DC violations

The correlated tuples are the conflicting tuples with the query result. To detect the correlated tuples, a self theta-join is required. We adopt an optimized parallel theta-join approach (mr_joins) that maps the cartesian product to a matrix which it partitions into p uniform partitions. Using the matrix, we check arbitrary predicates between any pair of attributes; in our analysis, we focus on the more realistic case that involves conditions over the same attribute (bigdansing). We compute the theta-join incrementally, by checking the matrix subset that affects the result. Thus, we partition the subset that involves the query result and the part of the dataset that has not been accessed yet. We also prune the redundant symmetric parts of the matrix.

Partial theta-join prunes non-candidate pairs for violation because it operates at a finer granularity. First, it filters out matrix partitions which are non-qualifying for conflict; a partition with large boundary ranges is more likely to require inequality comparisons whereas smaller range partitions can be pruned. Second, it prunes non-qualifying intra-partition pairs; within a partition it restricts the candidate pairs to be checked for conflict. A detailed example showing the pruning of partial theta-join can be found in the technical report.
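A rough sketch of the partitioned self theta-join bookkeeping (uniform partitioning, symmetric-pair pruning, and incremental restriction to the not-yet-checked range; partition counts and ranges are illustrative assumptions):

def partition_matrix(lo, hi, p):
    # Split the join attribute's range [lo, hi] into p uniform partitions; the self
    # theta-join is a p x p matrix of partition pairs, of which only the upper
    # triangle is kept (symmetric pairs are pruned).
    step = (hi - lo) / p
    bounds = [(lo + i * step, lo + (i + 1) * step) for i in range(p)]
    cells = [(i, j) for i in range(p) for j in range(i, p)]
    return bounds, cells

def cells_to_check(bounds, cells, new_range, checked_range):
    # Keep only the cells that touch the newly queried range and are not fully
    # covered by the range already checked in previous queries.
    def overlaps(i, r):
        return bounds[i][0] < r[1] and bounds[i][1] > r[0]
    return [(i, j) for (i, j) in cells
            if (overlaps(i, new_range) or overlaps(j, new_range))
            and not (overlaps(i, checked_range) and overlaps(j, checked_range))]

bounds, cells = partition_matrix(1000, 5000, p=4)
# The first query already cleaned salaries in (2000, 3000); the second asks for (1000, 5000).
print(cells_to_check(bounds, cells, new_range=(1000, 5000), checked_range=(2000, 3000)))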

Example 4.9.

Consider the cartesian product matrix of Fig. 4, based on rule φ of Example 4.8, and two queries requesting salary ranges (2000-3000) and (1000-5000) respectively.

To clean the first query result, the theta-join checks for violations in the orange area of Fig. 4, which it divides into partitions. Then, for the second query, it constructs a matrix with vertical range (1000-5000)−(2000-3000), since the subset (2000-3000) has already been checked and is therefore excluded from the comparisons. Given a smaller range, the boundaries will be the ones of Fig. 4. The theta-join can then filter out non-qualifying partitions, such as partition (4,1). It also applies intra-partition filtering to exclude non-qualifying pairs. For example, given partition (3,1) with horizontal and vertical ranges (1500, 1750) and (1000, 1750) respectively, since we are interested in checking the inequality condition, the vertical range is reduced to (1500, 1750), as the part (1000, 1500) will not produce candidate violations.

Accuracy: DC violations affect result quality since a dirty value might get a candidate fix that satisfies the query. To compute result accuracy, Daisy estimates the theta-join selectivity using the Estimate_Errors function of Algorithm 2. The function takes as input the matrix partitions and calculates the overlap of the partition boundaries, that is, the number of conflicts between them (mr_joins; inequality_joins). For example, consider ranges 3 and 4 of Fig. 4, with salary boundaries (3000,4000), (4000,5000) and tax boundaries (0.3,0.4), (0.25,0.5) respectively. The violations lie in the overlap of the tax values, that is, (0.25,0.4). Thus, given the query answer A (line 3), we identify the ranges with which A overlaps (e.g., the result overlaps with range 2), and obtain the total estimated errors for these ranges. Assuming inequality conditions, the erroneous partitions that affect the result are the ones with both a row and a column smaller or bigger than the row/column of the current range. Otherwise, the erroneous partitions will contain either smaller or bigger candidate value ranges than the result range. Then, we compute whether the estimated accuracy (line 6) exceeds the given threshold (input by the user) and decide to fully or partially clean the data. The range overlap of the Estimate_Errors function is only applicable over the non-diagonal partitions, since for the diagonal partitions (pattern-filled boxes) the ranges are equivalent; thus we also provide the support, that is, the percentage of checked diagonal partitions. The support is defined over the total partitions to check, which in the first iteration are the upper/lower triangle partitions minus the blocks of the diagonal, and this remaining total becomes smaller depending on the accessed data (line 7). Both accuracy and support increase while accessing more entities.
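A compact sketch of the accuracy-driven decision of Algorithm 2, using boundary overlaps as the per-partition error estimate (the overlap-to-count conversion and the threshold semantics are simplified assumptions):

def estimate_errors(bounds, pairs_per_unit=1000):
    # Estimate the theta-join selectivity per pair of partitions: candidate
    # violations are proportional to the overlap of the partition boundary ranges.
    range_vio = {}
    for i, (lo_i, hi_i) in enumerate(bounds):
        for j, (lo_j, hi_j) in enumerate(bounds):
            overlap = max(0.0, min(hi_i, hi_j) - max(lo_i, lo_j))
            if i != j and overlap > 0:
                range_vio[(i, j)] = round(overlap * pairs_per_unit)
    return range_vio

def decide_cleaning(result_size, result_ranges, range_vio, threshold):
    # Estimate result accuracy from the pre-computed per-range error estimates and
    # fall back to full cleaning when the estimate drops below the threshold.
    errors = sum(v for (i, j), v in range_vio.items() if i in result_ranges)
    accuracy = result_size / (result_size + errors)
    return "full cleaning" if accuracy < threshold else "partial cleaning"

# Tax boundaries per salary partition; the last two ranges overlap, as in the example above.
tax_bounds = [(0.1, 0.2), (0.2, 0.3), (0.3, 0.4), (0.25, 0.5)]
vio = estimate_errors(tax_bounds)
print(decide_cleaning(result_size=100, result_ranges={2}, range_vio=vio, threshold=0.9))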

Figure 3. Cleaning
Figure 4. Cleaning

4.3. Relaxing SP Results given multiple DCs

In the case of multiple rules, an erroneous cell might violate multiple rules. Thus, the probability of each fix must combine the probabilities stemming from all the rules affecting the erroneous cell. For the erroneous cells which belong to the overlapping attributes of multiple constraints, we compute the candidate values in parallel, and then merge the resulting candidates. We also maintain provenance information for each erroneous cell, and thus when multiple rules exist, we execute them over the original values and then merge with the already computed probabilities. Finally, to avoid unnecessary error checks, Daisy maintains information about which tuples have been checked for each rule.

To merge the probabilities, we compute the overlap of the violating groups. We also use the union to merge the candidate values of the overlapping cells, while also adjusting the probabilities to reflect the union of the sets. Thus, given rules φ1: Y→X, φ2: Z→X, we assign P(X | Y ∨ Z) to the values that violate both rules. For example, given zip→state and city→state, and two versions of a tuple with P(CA | zip=9001) and P(CA | city=LA) respectively, one for each rule, the probability will be updated to P(CA | zip=9001 ∨ city=LA), to match the tuples that have either zip 9001 or city LA.
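A small sketch of the merge (Lemma 4.10 below relies on this union being commutative): each per-rule fix maps a candidate value to the set of conflicting tuple ids that support it, and the merged probability is the value's support over the union of all conflicting tuples (ids and values are illustrative).

def merge_candidates(fix_a, fix_b):
    # Merge the per-rule candidate fixes for the same cell by taking the union of
    # the conflicting-tuple sets per candidate value, then recomputing frequencies.
    merged = {}
    for fixes in (fix_a, fix_b):
        for value, tuple_ids in fixes.items():
            merged.setdefault(value, set()).update(tuple_ids)
    total = len(set().union(*merged.values()))
    return {value: len(ids) / total for value, ids in merged.items()}

# zip -> state suggests CA because of conflicting tuples {1, 2};
# city -> state suggests CA because of {2, 3} and NV because of {4}.
fix_zip = {"CA": {1, 2}}
fix_city = {"CA": {2, 3}, "NV": {4}}
print(merge_candidates(fix_zip, fix_city))   # {'CA': 0.75, 'NV': 0.25}
print(merge_candidates(fix_city, fix_zip))   # same result: the merge is commutative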

Lemma 4.10.

In the presence of multiple constraints, the order of computing the candidate values of the erroneous cells obeys the commutative property.

Proof.

Consider rules φ1 and φ2, which both involve attribute A, and an erroneous tuple t which violates both φ1 and φ2. The probabilistic fix of cell t.A based on both rules is the same regardless of the order in which we check the rules. Consider merging order φ1 followed by φ2. Based on φ1, t.A becomes: t.A = [(v_1, s_1), (v_2, s_2), …, (v_n, s_n)], where the v_i represent the candidate values of t.A, and each s_i is a set comprising the conflicting tuples due to which we assign value v_i. For example, for FDs, s_i involves the tuples with the same lhs and different rhs. Then, similarly, φ2 produces the corresponding set of (value, conflicting-tuple) pairs. The final output consists of the merge of the two sets, which involves pairs whose conflicting-tuple set is the union of the sets produced by φ1 and φ2. Thus, since the union is commutative, the result is independent of the order of the rules. ∎

4.4. Relaxing Join results

In the case of join queries, the cleaning operator needs to examine how the existence of errors in each individual table affects the query result. The operator is defined as follows.

Definition 4.11.

The join cleaning operator cleans the result of a join query. It a) extracts the qualifying parts of the join tables, b) cleans and updates each relation separately, c) updates the result, and d) re-checks for violations.

Consider a join between relations R and S. To clean the dirty result, the operator extracts and cleans the corresponding qualifying parts of R and S. To extract the qualifying parts of R and S, the operator keeps provenance information (data_lineage), which allows it to obtain the entities of each table from the join result, as well as to update the join result after cleaning the tables. Thus, using lineage, after cleaning both tables, the operator recomputes the join to check whether the extra tuples of each relation produce new pairs. In the case where the cleaning task transforms the join key into a probabilistic attribute, the join becomes a probabilistic join. In the following, we show that the updated join accesses only the already clean tuples.
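A minimal sketch of the lineage bookkeeping (hypothetical structures, not Daisy's implementation): each join output keeps the originating tuple ids of both sides, so that after cleaning a relation only its newly added tuples need to be re-joined and appended to the result.

def join_with_lineage(R, S, key):
    # Equi-join that records the originating tuple ids of both sides.
    out = []
    for rid, r in R.items():
        for sid, s in S.items():
            if r[key] == s[key]:
                out.append({"lineage": (rid, sid), **r, **s})
    return out

def incremental_join(result, R, S, new_R_ids, key):
    # Re-join only the tuples that cleaning/relaxation added to R; the already
    # computed pairs are kept as-is, as argued in Lemma 4.12 below.
    new_R = {rid: R[rid] for rid in new_R_ids}
    return result + join_with_lineage(new_R, S, key)

cities = {1: {"Zip": "9001", "City": "Los Angeles"}}
employees = {10: {"Zip": "9001", "Name": "Peter"}}
result = join_with_lineage(cities, employees, "Zip")
cities[2] = {"Zip": "9001", "City": "San Francisco"}     # tuple added by relaxation
print(incremental_join(result, cities, employees, new_R_ids=[2], key="Zip"))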

Lemma 4.12.

The updated join result stemming from the cleaned qualifying table parts, requires no extra violation checks.

Proof.

To prove the correctness of the result, we examine the possible correlation cases between the query and the rules. We assume that the result of any erroneous underlying operator in the plan has been cleaned. The possible scenarios depend on whether the join key appears in a constraint. When the join key is clean, the new tuples that relaxation produces will not qualify for the join because they will contain a non-qualifying join key. If the join key attribute appears in a rule, then consider a join of R and S which both involve errors on the join key. Relaxation might add new tuples to both relations. However, the extra tuples of R will match with tuples of S which already exist in the result, since they have to belong to the intersection of the join keys of R and S. Thus, no extra violations will exist. The same holds for the relaxed result of S. ∎

Example 4.13.

Consider tables Cities and Employee shown in Table 4(a) and 4(b), rules φ1: Zip→City, φ2: Phone→Zip, and a query requesting the name and zip code from both Cities and Employee, for the city of “Los Angeles”.

The dirty version of the result is shown in Table 4(c). Similarly to Example 4.4, the cleaning operator enhances the select result over Cities with tuple t2, as it shares the zip code 9001 with the tuples whose city is Los Angeles. Then, after repairing the detected errors of Cities, tuple t2 has candidate values for the Zip attribute. The result of the select over Cities is shown in Table 4(d). Then, the evaluation of the join matches the filtered set of Cities with all tuples of Employee and triggers the violation between the tuples of Employee. Thus, the operator fixes the corresponding part and updates the result. The final version of the result is shown in Table 4(e).

Zip City
t1 9001 Los Angeles
t2 9001 San Francisco
t3 10001 San Francisco
(a) Cities dataset.
Zip Name Phone
9001 Peter 23456
10001 Mary 12345
10002 Jon 12345
(b) Employee dataset.
Zip Name
9001 Peter
(c) Dirty version of the join result.
Zip
9001
9001 50%, 10001 50%
(d) Relaxed result of the Select operator over Cities.
C.Zip | E.Zip | Name
9001 | 9001 | Peter
9001 50%, 10001 50% | 9001 | Peter
9001 50%, 10001 50% | 10001 50%, 10002 50% | Mary
9001 50%, 10001 50% | 10001 50%, 10002 50% | Jon
(e) Clean join result.
Table 4. Join operation over two tables that involve violations on the join key.

5. Cleaning-aware Query Optimization & Planning

In this section, we present how we inject cleaning operators at the logical level, and show a set of optimizations that enable an optimal placement of cleaning operators in the query plan. We support queries with Select, Join and Group by clauses. The template of the supported queries is the following:

SELECT <SELECTLIST>
FROM <table name> [,(<table name>)]
[WHERE <col><op><val> [(AND/OR <col><op><val>)]]
[GROUP BY CLAUSE]

[] and () indicate optional and repeatable elements, and <op> is a comparison operator (e.g., =, <, >). In the case of joins, we assume equi-joins. We focus on flat queries to stress the overhead of the cleaning operators over the corresponding query operators.

5.1. Cleaning operators in the query plan

The logical planner detects the query operator attributes that appear in a rule, and injects the appropriate cleaning operators into the query plan. The planner pushes cleaning operators down, closer to the data, to avoid propagating errors in the plan. Deferring the execution of a cleaning task requires a) redundant cleaning to detect errors that have propagated from the underlying query operators, and/or b) recomputing the underlying query operators that are affected by the errors. For example, consider relations R and S, and a query with a select condition over an attribute of R that participates in a rule, followed by a join whose key also participates in a rule of S. Executing the cleaning task after the join might alter the qualifying part of R by adding an extra probabilistic tuple. If the extra tuple matches with an unseen tuple of S, then it will update the join result. Subsequently, the operator needs to re-check for errors over the extra accessed tuples of S, which induces a redundant overhead for query execution. Thus, placing cleaning operators early in the plan avoids extra effort to fix propagated errors.

Fig. 5 shows an example query plan with a join cleaning operator. Given two rules involving the zip code of the two input relations, the planner detects the overlap of the join operator with the rules, and injects the cleaning operator. To avoid recomputing the join, the operator sends only the new tuples of each relation to the join operator. Thus, the second join corresponds to an incremental join, which updates the already computed result. The final result consists of the union of the join outputs.

Figure 5. Before and after injecting the cleaning operator inside a plan with a join over a potentially erroneous attribute.

5.2. Cost-based optimization

In the following, we analyze the offline and incremental cleaning cost, and propose a cost model that decides on the optimal placement of cleaning operators in the query plan.

5.2.1. Traditional cleaning cost

The cost of cleaning DCs is divided into the cost of a) error detection, b) data repairing, and, optionally, c) updating the dataset with the correct values. For FDs, error detection groups data based on the lhs of the rule. The complexity of grouping, assuming a hash-based algorithm over a dataset of size n, is O(n) (inmenory_dbs). Similarly, for DCs, the cost is O(n²) since a cartesian product is required, but is reduced to n²/2 since only the upper diagonal matrix is checked to avoid re-checking symmetric pairs. Data repairing performs multiple scans to compute the candidate values for each error; given v errors, the cost is O(v·n). Finally, the update cost is equivalent to an outer join between the dataset and the fixed values; the cost is O(n+v). The overall cost is: O(n)/O(n²) + O(v·n) + O(n+v), and can be repeated multiple times if many iterations are needed (holistic_dc).

5.2.2. Incremental cleaning cost

We present the incremental cleaning cost by taking into consideration the type of query.

SP queries: The cost is equivalent to the cost of computing the correlated tuples plus the traditional cleaning cost. The cost of computing the set of correlated tuples E(Q) is O(u), where u is the number of unknown (not-yet-accessed) tuples. Given a dataset with n tuples, u is equal to n in the first query, but becomes smaller after each query. Specifically, in the i-th query, the cost is n − Σ_{j<i} |A(Q_j)|, where |A(Q_j)| is the size of the result of query Q_j.

Error detection and data repairing are applied over the result set A(Q) enhanced with the extra tuples E(Q). Thus, the cost of error detection is O(|A(Q)|+|E(Q)|) for FDs. For DCs, the incremental cost in the i-th query is (|A(Q_i)|+|E(Q_i)|) · (n − Σ_{j<i} (|A(Q_j)|+|E(Q_j)|)). In the worst case, the incremental cost is the corresponding arithmetic progression summed over the workload (we omit the division by 2 for simplicity).

The worst case is when the number of queries is minimized to one, which occurs when a single query accessing the whole dataset is executed; the cost is then equivalent to the offline cost.

For simplicity, we denote the error detection cost as D_inc for incremental and D_full for full cleaning. Then, given v_Q violated entities, where v_Q ≤ v, the data repairing cost is O(v_Q · (|A(Q)|+|E(Q)|)), since for each error it checks the enhanced result instead of the whole dataset.

The last step involves updating the original dataset with the fixed tuples stemming from cleaning the query result. The update performs a left-outer-join between the dataset and the clean result. Since the clean result contains probabilistic values, the update depends on the number of candidate values that the erroneous value might take. More specifically, assuming a partially probabilistic dataset at the i-th query with p_i probabilistic values, the update cost is O(n + p_i·c), where c is the number of candidate values stored per probabilistic value. In total, the incremental cleaning cost is:

O(u) + D_inc + O(v_Q·(|A(Q)|+|E(Q)|)) + O(n + p_i·c)   (1)

In the case of multiple rules, the cost differs in the error detection part, since it requires one iteration per rule. The computation of the probabilities for the erroneous entities is equivalent to the single rule case because it operates over the detected errors, regardless of the number of violated rules.

Join queries: The aforementioned cost represents the query and cleaning cost of each individual table that participates in a join. However, a join involves the additional cost of updating the join result. Therefore, we need to measure the maximum number of iterations. We apply formula (1) for each dataset that participates in the join, and then add the incremental join cost, which takes place between the extra tuples of one relation and the set of tuples of the other relation. We use the formula separately for each dataset, because each dataset has different characteristics, that is, a different number of violations, a different level of correlation among the entities, and, finally, the query has a different selectivity in each dataset.

5.2.3. Incremental cleaning versus Full cleaning

The decision between incremental or full cleaning depends on whether the overhead induced by the cleaning task in each query is smaller in total than applying the full cleaning followed by the execution of the queries. To estimate the costs, we employ a cost model that decides on the optimal strategy.

Cleaning at query time, without considering the relaxation and the update cost, is more efficient overall than cleaning the whole dataset upfront. Consider an unknown query workload consisting of k queries. In the case of FDs, the incremental error detection cost is smaller than the offline one since, as we show in Section 4, A(Q) and E(Q) are complementary and together smaller than the total dataset. In addition, for DCs, the incremental error detection cost is smaller than the cartesian product. However, the cost of enhancing the query result and updating the dataset after each query might exceed the full cleaning cost. Thus, we decide on cleaning the query result or the remaining dirty part of the dataset by comparing the accumulated incremental cost of formula (1) against the offline cost; in the offline cost we also add the cost of executing the k queries, which is O(k·n).

For example, when a single query accesses the whole dataset, there are no extra tuples (E(Q) is empty) and the incremental error detection cost corresponds to the full cleaning cost; the comparison then reduces to the relaxation and update overhead of that query.

We check the inequality and decide on the cleaning strategy online, while executing the queries. We estimate the number of erroneous values, as well as the number of candidate values, using statistics. To approximate them, we precompute a) a group-by based on the lhs and the rhs of the FD rules respectively, and b) a histogram to estimate the selectivity of the theta-join, following the approach of (mr_joins; inequality_joins).
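A back-of-the-envelope sketch of the comparison (the cost terms are deliberately simplified from Section 5.2; constants, error counts, and sizes are made-up inputs, not measured values):

def incremental_cost(result_sizes, n, errors_per_query, candidates_per_value):
    # Per-query incremental cost: relaxation over the unseen part of the dataset,
    # error detection/repair over the result, and a probabilistic update of the data.
    total, seen = 0, 0
    for a in result_sizes:
        relax = max(n - seen, 0)
        detect_repair = errors_per_query * a
        update = n + errors_per_query * candidates_per_value
        total += relax + detect_repair + update
        seen += a
    return total

def offline_cost(n, total_errors, num_queries):
    # Detect over the dataset, repair each error against the whole dataset,
    # then execute the workload over the cleaned data.
    return n + total_errors * n + num_queries * n

inc = incremental_cost(result_sizes=[2_000] * 50, n=100_000,
                       errors_per_query=20, candidates_per_value=3)
off = offline_cost(n=100_000, total_errors=1_000, num_queries=50)
print("incremental" if inc < off else "full", inc, off)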

Figure 6. The architecture of Daisy.

6. A System for Query-driven Data Cleaning

Fig. 6 shows the architecture of Daisy, which is a query-driven cleaning engine over Spark. Given a query and a dirty dataset, Daisy uses two processing levels to provide correct results.

In the first level, Daisy maps the query to a logical plan which comprises both query and cleaning operators. The logical plan takes into consideration the type of query, and the constraints to optimally place each operator. We implement the cleaning-aware logical plan by injecting cleaning operators before/after the corresponding filter and join operators at the RDD level of Spark. Specifically, Daisy extracts the attributes of the query operators, and checks if they overlap with the provided constraints. To apply the cost-based optimizations, Daisy collects statistics by pre-computing the size of the erroneous groups. Then, when checking the condition of each query, it evaluates whether the inequality (1) of Section 5.2 holds. Hence, at the logical-level Daisy decides whether to place the cleaning operator before or after the query operator.

Finally, Daisy executes the logical plan by cleaning the result of each query operator that is affected by the constraints. We implement the SP and join cleaning operators as extra operators at the Spark RDD level. The operators take as input the query result, relax it, and detect violations. Then, given the detected violations, Daisy transforms the query result into a probabilistic result by replacing each erroneous value with the set of values that represent candidate fixes. Daisy also accompanies each candidate value with the corresponding probability of being a fix of the erroneous cell. After cleaning each query result, the system isolates the changes made to the erroneous tuples, and updates the original dataset accordingly. By applying the changes after each query, Daisy gradually cleans the dataset.
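The following is a toy PySpark sketch of the idea (assuming pyspark is available; the operator below performs a single relaxation pass and is only an illustration of injecting cleaning between RDD operators, not Daisy's actual operator):

from pyspark import SparkContext

def clean_sp(result_rdd, dataset_rdd, lhs):
    # Relax the filter result with the dataset tuples that share an lhs value
    # (one iteration shown); downstream operators then see the enhanced result.
    lhs_vals = set(result_rdd.map(lambda t: t[lhs]).distinct().collect())
    correlated = dataset_rdd.filter(lambda t: t[lhs] in lhs_vals)
    return result_rdd.union(correlated).distinct()

sc = SparkContext(master="local[2]", appName="daisy-sketch")
cities = sc.parallelize([
    ("9001", "Los Angeles"), ("9001", "San Francisco"), ("10001", "New York"),
])
query = cities.filter(lambda t: t[1] == "Los Angeles")   # SELECT ... WHERE City = 'Los Angeles'
relaxed = clean_sp(query, cities, lhs=0)                 # cleaning step injected after the filter
print(relaxed.collect())
sc.stop()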

7. Experimental Evaluation

The experiments examine the benefits stemming from the optimizations that Daisy allows, and show how Daisy performs compared to the state-of-the-art offline cleaning approach.

Experimental Setup. All experiments run on a 7-node cluster equipped with 2× Intel(R) Xeon(R) Gold 5118 CPUs (12 cores per socket @ 2.30GHz), 64KB of L1 cache and 1024KB of L2 cache per core, 16MB of shared L3 cache, and 376GB of RAM. On top of the cluster runs Spark 2.2.0 with 7 workers and 14 executors, each using 4 cores and 150GB of memory.

We compare a single-node execution of Daisy with Holoclean (holoclean) since it is, to our knowledge, the only currently available probabilistic system for repairing integrity constraint violations. In the absence of a scale-out, probabilistic data cleaning system, we compare Daisy with our own offline implementation over Spark that combines the optimizations of the state-of-the-art error detection and probabilistic repairing systems. Our offline implementation is an optimized implementation that detects FD and DC errors, and provides probabilistic repairs. Error detection follows the optimizations of BigDansing (bigdansing) for FDs, and the optimized theta-join (mr_joins) for DCs. Specifically, the FD error detection applies a group by instead of an expensive self-join to efficiently detect violations. DC error detection follows the theta-join approach to efficiently compute the cartesian product. Directly comparing with BigDansing would be unfair to BigDansing because it applies inference to compute the correct value, whereas Daisy computes probabilistic repairs. For data repairing, we employ an alternative to Holoclean’s pruning optimization (holoclean) to restrict the domain of candidate values for each erroneous cell. Specifically, we exploit the co-occurrences of the attribute values of the erroneous tuple with the attribute values of other tuples. Thus, similarly to Daisy, the domain of the erroneous rhs of a tuple t corresponds to the rhs values of the tuples that share the same lhs with t, and similarly for the lhs.

The workload involves SP, Join, and GroupBy queries in the presence of one or more DCs. We evaluate the workload over a synthetic benchmark and three datasets derived from real-world data entries. Specifically, we use the Star Schema Benchmark (SSB) (ssb), the hospital dataset (holoclean), the Nestle dataset and a dataset with air quality data (epa-historical-air-quality).

We choose the SSB dataset to test the applicability of Daisy over a benchmark designed for data warehousing applications. We use multiple versions of the lineorder table by varying the cardinality of the orderkey and suppkey attributes; we construct different versions by varying the number of distinct orderkeys from 5K to 100K, and the number of distinct suppkeys from 100 to 10K. To measure the worst-case scenario, we add errors to all orderkeys by randomly editing 10% of the suppliers that correspond to each orderkey. Our error generation is similar to (bart), with the difference that we also add errors using a uniform distribution to evenly spread the errors across the dataset, thereby affecting all queries. The errors that we inject are detectable by the constraints that we evaluate. The size of the lineorder table is 60MB in the original version, and ranges from 110MB to 2.6GB in the probabilistic version. To evaluate cases with fewer violations, we construct datasets with 20%, 40%, 60% and 80% of erroneous orderkeys. The sizes of the probabilistic versions of those datasets are 250MB, 560MB, 1.3GB, and 1.8GB respectively.

The hospital dataset (holoclean) comprises information about US hospitals. It contains 19 attributes, and is 5% erroneous. We use two versions with 1K and 100K entries, and sizes 300KB, 25MB respectively. The probabilistic versions have size 360KB and 26MB respectively. We use hospital to evaluate accuracy since its clean version exists.

The Nestle dataset includes information about food and drink products. Each product contains 19 attributes and involves dirty categories for product materials. We scale up the dataset by randomly adding duplicate entities from the domain of each attribute. We also add extra errors by randomly editing 10% of the category attribute values that correspond to each material. We use a 20MB and a 200MB version which contain 95% of conflicting entities. The size of the datasets in the probabilistic version is 40MB and 500MB respectively.

The historical air quality dataset (air_quality_analysis) contains air quality measurements for U.S. counties. We use a subset of the hourly measurements in which we add errors to the FD φ: county_code, state_code → county_name. We edit 10% of the county_names that correspond to a county_code, state_code pair, and add the errors to the non-frequent county_code, state_code pairs. We use two versions with 0.001% and 0.003% errors, which produce 30% and 97% violations respectively; their size is 2GB in the original version, and 3.1GB and 4GB in the probabilistic versions.

We use response time and accuracy (when applicable) as metrics. Response time is the time taken to respond to the query and perform the cleaning task by providing probabilistic fixes. For accuracy, we measure precision (correct updates/total updates) and recall (correct updates/total errors).
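For reference, the two metrics can be computed as below (a hypothetical cell-level bookkeeping, shown only to make the definitions concrete):

def precision_recall(updates, ground_truth, dirty):
    # precision = correct updates / total updates; recall = correct updates / total errors
    errors = {cell for cell, v in ground_truth.items() if dirty.get(cell) != v}
    correct = sum(1 for cell, v in updates.items() if ground_truth.get(cell) == v)
    precision = correct / len(updates) if updates else 1.0
    recall = correct / len(errors) if errors else 1.0
    return precision, recall

dirty = {("t1", "City"): "San Francisco", ("t2", "City"): "New York"}
truth = {("t1", "City"): "Los Angeles", ("t2", "City"): "New York"}
updates = {("t1", "City"): "Los Angeles"}
print(precision_recall(updates, truth, dirty))   # (1.0, 1.0)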

7.1. SP queries response time

This section shows how Daisy performs compared to offline cleaning given a workload of SP queries. We measure the cost of both approaches given a) a FD, b) two overlapping FDs, and c) a DC. We evaluate all cases over SSB, by executing queries requesting information for a specific supplier/order, or for suppliers/orders in a given range. In all FD experiments, Daisy outputs the same results with the offline approach.

Single FD with varying selectivity of rule attributes. We examine how the orderkey and suppkey selectivity affect the response time of the cleaning task. We use three versions of lineorder with 5K, 10K, and 100K distinct orderkey values respectively, and three versions with 100, 1K, and 10K distinct suppkey values respectively. We use these selectivities since they involve extreme response time cases depending on the query. We clean violations of rule φ: orderkey→suppkey. We consider the worst-case scenario where each orderkey participates in a violation. We execute 50 non-overlapping queries, each with selectivity 2%. The workload accesses the whole dataset.

Figure 7. Cost when varying orderkey selectivity.
Figure 8. SP Cost when varying suppkey selectivity.

Fig. 7 shows the response time of Daisy and full cleaning when varying the orderkey selectivity. To maintain a fixed query selectivity, queries contain range filters over the attributes of φ. We observe that as the selectivity increases, the response time of both approaches increases. However, on average, Daisy is faster than the offline approach. The difference is due to the fact that, when combining cleaning with querying, query result relaxation restricts the number of comparisons required to repair the erroneous tuples by computing the correlated tuples. On the other hand, the offline approach traverses the dataset for each erroneous value to compute the candidate values. We also observe that as the selectivity increases, the difference between the two approaches decreases because each erroneous cell ends up having more candidate values, thereby increasing the incremental cost of formula (1) in Section 5.2.

Fig. 8 shows the response time of Daisy and offline cleaning when varying the suppkey selectivity. To maintain a fixed query selectivity, queries contain range filters over the attributes of φ. Daisy is faster despite the transitive closure it requires to detect the correlated values. The difference is due to the fact that, when combining cleaning with querying, query result relaxation restricts the number of comparisons required to repair the errors by computing the correlated tuples. On the other hand, the offline approach traverses the dataset for each erroneous value to compute the candidates. When the suppkey selectivity is smaller, the cost becomes higher since each erroneous suppkey might match with multiple orderkeys, thereby increasing the number of candidate values.

Fig. 9 evaluates the scenario in which applying cleaning offline outperforms incremental cleaning (Daisy without the cost model). We execute 90 queries over the lineorder version with 100K distinct orderkeys. The queries are non-overlapping, involve equality and range conditions, and have random selectivities. Cleaning the whole dataset is more efficient in this case because the suppkey selectivity is low compared to the orderkey, thus each suppkey appears with multiple orderkeys throughout the dataset. Thus, a violating suppkey takes multiple candidate values, thereby increasing the update cost shown in formula (1). Still, we observe that, overall, Daisy outperforms both the incremental and the offline cleaning. Daisy initially applies data cleaning incrementally, and then, by evaluating the total cost after each query, switches strategy and applies the cleaning task over the rest of the dataset. The total cost is lower than the offline approach because cleaning is applied over the remaining dirty part of the dataset.

Figure 9. Switching from incremental to full cleaning.
Figure 10. Cost when increasing number of rules.

Single rule vs. Multiple rules. In this experiment, we measure the response time in the presence of rules with overlapping attributes. We construct the dataset by joining lineorder with suppliers. The end result is a 67MB dataset in its raw form, and 2.8GB in its probabilistic form. We evaluate the FDs orderkey → suppkey and address → suppkey; the address attribute appears after joining the tables. The workload consists of 50 non-overlapping queries which access the whole dataset.

Fig. 10 shows the response time when we examine only the FD orderkey → suppkey, compared to examining both FDs. We observe that, for both Daisy and the offline approach, response time increases when we clean errors of both rules instead of one, due to the extra work required for the second rule. When Daisy executes the queries, it identifies the corresponding correlated tuples for both rules and then fixes the errors based on the correlated tuples. Initially, the difference between one and two rules is noticeable, but it drops as we clean more data. On the other hand, offline cleaning fixes the errors on address and orderkey separately, since the tuples involved in violations of the one rule might differ from those involved in violations of the other. Thus, offline cleaning needs more traversals over the data.

Increasing number of violations. In this experiment we evaluate Daisy as we vary the number of violations. Specifically, we vary the erroneous orderkeys from 20% to 80%. We use the same query workload consisting of 50 non-overlapping SP queries with selectivity 2%.

Fig. 11 shows that Daisy outperforms the offline approach regardless of the number of erroneous entities. Daisy is faster due to the statistics that it precomputes to prune unnecessary checks. The statistics comprise the orderkeys that participate in a violation; Daisy precomputes a group-by on orderkey and calculates the size of each group. Then, at query time, when it accesses a specific orderkey, it checks whether the orderkey belongs to a dirty group. Thus, Daisy avoids detecting violations when the entity does not belong to the list of dirty values. We also observe that as errors increase, the difference between the two approaches becomes more significant. The difference stems from the fact that, for full cleaning, the number of iterations over the dataset is proportional to the number of detected erroneous groups, in order to compute the probabilities of each candidate value. On the other hand, depending on the values that the query accesses, Daisy traverses the data once and brings in the correlated tuples that correspond to multiple erroneous groups at the same time. As we increase the number of erroneous groups, offline cleaning therefore performs more traversals and becomes slower.
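
The statistics-based pruning can be sketched as follows: a one-off group-by over orderkey records which groups violate orderkey → suppkey, and at query time an accessed orderkey is checked against that set before any violation detection runs. This is a minimal Python illustration; the data layout and names are our own assumptions.

    from collections import defaultdict

    def build_dirty_groups(tuples):
        """One-off statistics: group tuples by orderkey, record group sizes,
        and keep the orderkeys whose group has more than one distinct suppkey."""
        supps_per_order = defaultdict(set)
        group_sizes = defaultdict(int)
        for t in tuples:
            supps_per_order[t["orderkey"]].add(t["suppkey"])
            group_sizes[t["orderkey"]] += 1
        dirty = {ok for ok, supps in supps_per_order.items() if len(supps) > 1}
        return dirty, group_sizes

    def needs_cleaning(orderkey, dirty):
        # Query-time check: skip violation detection for clean groups.
        return orderkey in dirty

    data = [{"orderkey": 1, "suppkey": 10}, {"orderkey": 1, "suppkey": 11},
            {"orderkey": 2, "suppkey": 12}]
    dirty, sizes = build_dirty_groups(data)
    print(needs_cleaning(1, dirty), needs_cleaning(2, dirty))   # True False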

Figure 11. Cost with increasing number of violations.
Figure 12. Cost for DCs with inequality conditions.

Denial constraints. In this experiment, we evaluate the cost in the presence of rules with inequality predicates. Specifically, we consider a DC with inequality predicates over the extended_price and discount attributes of pairs of lineorder tuples. We check the rule over the lineorder table, in which we inject errors by editing the discount value of 10% of the entries. We construct three different versions, with 0.2%, 2%, and 20% violations respectively, by varying the number of violations that the erroneous entities induce. We execute 60 non-overlapping SP range queries that access the whole dataset.

Fig. 12 presents the response time of both Daisy and the optimized offline approach. We observe that for the 0.2% and 2% versions, Daisy is faster because it restricts the size of the partitions of the matrix that are required for the computation of the Cartesian product. The end result is 99% and 80% accurate, respectively, compared to the offline case. For the 20% case, Daisy predicts only 23% accuracy using the statistics; it therefore decides to clean the whole dataset, and thus has the same response time as the offline case and is 100% accurate compared to the offline approach.
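
The saving for DCs comes from restricting the Cartesian product: only the partitions that can pair with the query-relevant tuples need to be compared. The sketch below checks an inequality DC pairwise over such a restricted set of partitions; the concrete predicates over extended_price and discount are assumed purely for illustration, since the exact rule is not spelled out above.

    def violates(t1, t2):
        # Illustrative inequality predicates; the actual DC may differ.
        return (t1["extended_price"] > t2["extended_price"]
                and t1["discount"] < t2["discount"])

    def detect_violations(query_tuples, partitions):
        """Compare the query-relevant tuples only against the partitions that may
        contain a violating counterpart, instead of the full Cartesian product."""
        found = []
        for part in partitions:              # restricted set of partitions
            for t1 in query_tuples:
                for t2 in part:
                    if violates(t1, t2) or violates(t2, t1):
                        found.append((t1, t2))
        return found

    q = [{"extended_price": 100, "discount": 1}]
    parts = [[{"extended_price": 90, "discount": 5}]]
    print(detect_violations(q, parts))   # one violating pair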

Figure 13. Cost for join queries.
Figure 14. Cost for mixed workload.

7.2. Join queries response time

This section demonstrates how Daisy performs when Join queries appear in the workload.

Fig. 13 shows the response time of join queries using Daisy and the offline approach. Daisy outperforms full cleaning for two reasons: First, as in the SP-query case, Daisy benefits from computing the set of correlated tuples, thereby restricting the number of comparisons. Second, Daisy benefits from incrementally updating the join result when extra tuples are added. On the other hand, offline cleaning performs a probabilistic join, which is expensive.
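
A hedged sketch of the second point: when relaxation adds extra tuples on one join side, the existing join result can be extended by probing only the delta against a hash table of the other side, rather than recomputing the join. This is a generic incremental hash-join illustration, not Daisy's actual operator, and the data layout is assumed.

    from collections import defaultdict

    def build_hash(side, key):
        idx = defaultdict(list)
        for t in side:
            idx[t[key]].append(t)
        return idx

    def incremental_join(result, supp_index, delta_lineorder, key="suppkey"):
        """Extend an existing join result with newly added lineorder tuples by
        probing only the delta against the suppliers hash table."""
        for t in delta_lineorder:
            for s in supp_index.get(t[key], []):
                result.append({**t, **s})
        return result

    suppliers = [{"suppkey": 10, "address": "a"}, {"suppkey": 11, "address": "b"}]
    idx = build_hash(suppliers, "suppkey")
    result = incremental_join([], idx, [{"orderkey": 1, "suppkey": 10}])
    result = incremental_join(result, idx, [{"orderkey": 1, "suppkey": 11}])  # delta
    print(result)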

Fig. 14 shows the time taken to execute a workload with both SP and Join queries. The lineorder table violates the FD orderkey → suppkey, and the suppliers table violates the FD address → suppkey. We use the scenario of Fig. 9, where we execute 90 queries over the 100K version of lineorder, and the suppkey attribute contains 500 distinct values. Both the SP and join queries are non-overlapping, involve equality and range conditions, and have random selectivities. We observe that Daisy predicts that it is more efficient to clean the full dataset after 30 queries; thus, by penalizing some queries, it is overall faster than both incremental and full cleaning.

Fig. 15 compares the response time of three query workloads from the SSB family to evaluate how Daisy behaves with more complex queries. We use the same setup as in Fig. 14. Q1 is a join between lineorder and suppliers and contains a range filter on suppkey. Q2 additionally joins the result of Q1 with the part and date tables and groups by year and brand. Q3 contains a fourth join with customer. All queries project the keys of the involved tables, and thus the probabilistic orderkey/suppkey attributes as well. We observe that, regardless of the query complexity, since Daisy pushes down the cleaning operator, cleaning affects only the join between lineorder and suppliers. The breakdown of the cost of the overall plan shows that the time difference between Q1 and Q2, Q3 stems from the fact that in Q2 and Q3 the initial join projects the extra attributes required for the subsequent joins. Thus, cleaning is more expensive, since Spark requires outer joins to split and stitch back the clean and dirty parts of the query result.

Figure 15. Cost for complex queries of SSB workload.

7.3. Real-world scenarios

In this set of experiments, we compare Daisy against Holoclean, and we also measure the cost of executing a realistic exploratory analysis scenario.

Hospital: In this experiment, we evaluate the efficiency and accuracy of Daisy and Holoclean. We use the rules φ1: ∀t1,t2 ¬(t1.zip = t2.zip ∧ t1.city ≠ t2.city), φ2: ∀t1,t2 ¬(t1.hospitalName = t2.hospitalName ∧ t1.zip ≠ t2.zip), and φ3: ∀t1,t2 ¬(t1.phone = t2.phone ∧ t1.zip ≠ t2.zip). To obtain a fair comparison, we execute Daisy on a single node, and we disable the inference of Holoclean when measuring response time; we only obtain the candidate values for each cell. For accuracy, we apply Holoclean's inference based on Daisy's domain generation (DaisyH), and compare it with the original Holoclean. To integrate Daisy with Holoclean, we populate the cell_domain table that Holoclean uses with the candidate values that Daisy computes. We also report Daisy's accuracy when selecting the most probable value (DaisyP). For accuracy, we use the 1K version, for which master data exists. For efficiency, we use the 100K version.
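
The integration step can be pictured as exporting Daisy's per-cell candidate values into a cell_domain-style table that Holoclean's inference then consumes. The sketch below is only an illustration of that hand-off: the column names are our own assumptions, and Holoclean's actual cell_domain schema and backing database may differ.

    import sqlite3  # lightweight stand-in for the database Holoclean reads from

    def export_candidates(daisy_candidates, db_path="holoclean.db"):
        """Populate a cell_domain-style table with the candidate values Daisy
        computed per erroneous cell (illustrative schema, not Holoclean's)."""
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS cell_domain
                        (tid INTEGER, attribute TEXT, candidate TEXT, prob REAL)""")
        conn.executemany(
            "INSERT INTO cell_domain VALUES (?, ?, ?, ?)",
            [(c["tid"], c["attribute"], v, p)
             for c in daisy_candidates for v, p in c["domain"].items()])
        conn.commit()
        conn.close()

    # Example: one erroneous zip cell with two candidate values.
    export_candidates([{"tid": 3, "attribute": "zip",
                        "domain": {"10001": 0.7, "10002": 0.3}}])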

Table 5 shows the precision, recall, and F1-measure for Daisy and Holoclean. Daisy executes a workload of 4 SP queries that access the whole dataset. Each tuple is accessed only once, and is cleaned at query time. For Holoclean, we clean violations a priori and measure the accuracy of the corresponding attributes. We observe that both systems exhibit comparable accuracy. When not all rules are known, such as when only φ1 is given, Holoclean performs better because it generates the domain based on quantitative statistics, whereas DaisyH uses the correlations driven by the dependencies. DaisyP performs worse since it does not exploit master data. However, when more rules are known, Daisy is more accurate, because Holoclean prunes the domain of each value using a threshold for performance reasons. Thus, using Daisy's optimizations, one can avoid trading accuracy at this level.

Table 6 shows the response time when cleaning violations of different rule combinations. Daisy outperforms both Holoclean and the full cleaning approach due to the optimizations that it enables. Holoclean exhibits higher response times since the tuples of the hospital dataset are highly correlated, so it performs multiple comparisons to compute the candidate values. Holoclean also traverses the dataset multiple times, once per erroneous group, to compute the fixes. Thus, Daisy's optimizations can be applied to the domain construction of Holoclean in an analysis-aware scenario.

Table 7 shows the benefit of maintaining provenance to the original data and incrementally updating the probabilistic data when new rules appear. We measure the total cost as the rules appear one after the other. We compare the scenario where we execute Daisy and Holoclean three times, once per rule set, with a single execution of Daisy that incrementally updates the probabilistic data. We evaluate the case where a user queries the whole dataset and executes the cleaning task, so the cost of Daisy is equivalent to the offline cost. For Holoclean, we measure only the cost of computing the candidate fixes for each cleaning task. We observe that the single execution of Daisy outperforms the three separate executions, since it can reuse the existing probabilistic data and pays only the overhead of merging the new probabilistic fixes.

              φ1                     φ1 + φ2                φ1 + φ2 + φ3
              Prec.  Rec.   F1       Prec.  Rec.   F1       Prec.  Rec.   F1
Holoclean     1      0.55   0.71     0.98   0.95   0.96     0.98   0.92   0.95
DaisyH        0.97   0.52   0.68     1      0.98   0.99     1      0.98   0.99
DaisyP        0.41   0.51   0.45     1      0.97   0.98     1      0.98   0.99
Table 5. Accuracy.

Nestle exploratory analysis: Data scientists working for Nestle often need to run analyses to discover information about different coffee products. We simulate this scenario and execute a query workload of 37 SP queries in which the analyst requests the details of a given coffee product through a selection on a single attribute. The dataset contains violations of an FD whose left-hand-side attribute represents the material out of which each product is made (for coffee products, the type of beans) and whose right-hand-side attribute is the type of product.

Table 8 shows the response time of the analysis over the two versions of the dataset. In both cases, the queries access 40% of the dataset. We observe that for the smaller dataset (20MB), the difference in response time stems only from the fact that the analysis accesses 40% of the dataset. However, when the dataset becomes bigger, the difference is more significant. Daisy is faster because the selectivity of the filtering attribute is very small; since it appears with multiple erroneous values, the full cleaning approach ends up iterating through the dataset multiple times.

                 φ1          φ1 + φ2     φ1 + φ2 + φ3
Full cleaning    51 sec      49 sec      118 sec
Daisy            49 sec      40 sec      92 sec
Holoclean        1020 sec    1108 sec    1188 sec
Table 6. Response time when increasing the number of rules.
                       φ1          φ1 + φ2     φ1 + φ2 + φ3    Total
Daisy (3 executions)   51 sec      49 sec      118 sec         218 sec
Daisy (1 execution)    51 sec      41 sec      40 sec          132 sec
Holoclean              1020 sec    1108 sec    1188 sec        3316 sec
Table 7. Response time when increasing the number of rules. Daisy maintains provenance information and updates the probabilistic data based on each new rule without having to execute the task from scratch.

Air quality exploratory analysis: This scenario is similar to the analysis that data scientists perform on Kaggle (air_quality_analysis), where they need to observe how air pollution evolves over the years across US states. To perform the analysis, an analyst checks the CO measurements at specific locations, one location per state. Thus, the query workload consists of 52 queries, each of which outputs the average CO measurement for a given county, grouped by year.

Table 8 shows that offline cleaning is unable to terminate within a one-day timeout for the aforementioned reason: it has to perform multiple iterations, one per erroneous group, over a bigger dataset in order to clean it.

Dataset Daisy Offline
Nestle (20MB) 2.9 min 3.97 min
Nestle (200MB) 26.8 min 8.5 hours
Air quality 30% 10.5 min -
Air quality 97% 49 min -
Table 8. Response time on realistic scenarios.

Summary. The optimizations at the executor level ensure that Daisy scales better than offline approaches by restricting the comparisons required to clean the data. In addition, by maintaining a partially probabilistic dataset, Daisy can clean and execute queries faster than having to compute the full set of probabilities. Finally, the logical-level optimizations allow Daisy to configure the optimal placement of cleaning operators, depending on the query workload and the errors.

8. Conclusion

Data scientists usually perform multiple iterations over a dataset in order to understand and prepare it for data analysis. Having to apply each cleaning task over the whole dataset each time is tedious and time-consuming. In addition, having data cleaning decoupled from data analysis increases human effort since data cleaning is a subjective process which highly depends on the data analysis that users need to perform.

This work introduces Daisy, a system that partially cleans the dataset through exploratory queries. Daisy integrates cleaning operators inside the query plan, and efficiently executes them over dirty data by providing probabilistic answers for the erroneous entities. We evaluate Daisy using both synthetic and real workloads, and show that it scales better than approaches that fully clean the dataset as an offline process.

References