Data cleaning and data integration are essential tasks in most data science and data analytics applications. Due to magnitude and the diversity of the problem, most enterprises rely on externally developed tools to unify, clean, and transform their data. Examples include data wrangling tools, such as Trifacta, data integration tools, such as Tamr , and many other transformation tools [3, 10, 27]. Most of these tools are not turn-key solutions, and have a large number of tunable parameters and configurations. Self-tuned tools are often limited in their applicability and need to make many assumptions on the underlying data, which are often violated in practice. Hence, commercial cleaning tools often rely on field engineers examining the data, e.g., by eyeballing a sample, or by trying different configurations and design choices, and examining their effect on the data involved in the cleaning project. This interaction is guided mainly by two human roles: the data owner who has knowledge of the domain and can judge the quality of the cleaning process; and the cleaning engineer who is familiar with the cleaning tool and understands the implications of the configuration parameters. This interactive model customizes the cleaning solution and ensures the most effective deployment, and has been widely adopted in numerous systems [19, 28, 31].
This model works well if the cleaning engineer has direct access to the data, but presents serious challenges when dealing with sensitive and private data (e.g., intelligence data collected by agencies, or health records in a hospital). In these settings, enterprises often adopt one of two approaches: (i) rely on internally developed tools that can be tuned using the enterprise employees; or (ii) use commercial tools, and ask external cleaning experts who are familiar with the tool to configure it using public data sets that are close in structure to the actual data. Both approaches allow the enterprise employees or the cleaning tools experts to work directly on some data, but they suffer from major shortcomings: internally developed tools present a high cost of internal development and maintenance; and working with mockup data often results in poor tuning and sub-optimal results when the solution is deployed on the actual data. In our interaction with multiple data cleaning companies and experts, these two challenges are currently severe pain points in cleaning private and sensitive data sets.
In this work, we propose a new model for cleaning private and sensitive data sets. We observe that engineers can accurately tune cleaning workflows by combining their expertise with access to just a (noisy) aggregate query interface, rather than access to the exact data (or a sample of it). We present comprehensive empirical evidence of this observation in Section 7, and highlight our key findings in Figure 1. We asked 8 humans (real cleaning engineers) and used 200 programs that simulate human cleaners (called robots; described in Section 7) to tune workflows for two exemplar data cleaning tasks of blocking and matching in entity resolution. Figure 0(a) shows that humans and robots alike achieve high task quality (recall for blocking and F1-score for matching, defined in Section 2) with access to an aggregate queries interface on the database with no error. Moreover, the same is true even when aggregate queries are answered with tolerable error (Figures 0(b) and 0(c)
) with the highest achievable and median accuracy degrading monotonically as the error in the queries increases. The variance in the quality results from the variance in decisions made by cleaners (in Figure0(a)), and the noise in the aggregates in Figures 0(b) and 0(c).
Inspired by these results, we propose DPClean, a system that allows external cleaning engineers tune cleaning workflows on sensitive datasets while ensuring strong privacy guarantees. Cleaning engineers explore the dataset and the space of configuration parameters for the cleaning task by adaptively posing aggregate queries to DPClean. Rather than thinking in terms of privacy bounds, the cleaning engineers are only required to specify an error tolerance on each of their queries. As the name suggests, DPClean translates queries with tolerance to query answering strategies that satisfy -differential privacy , a strong privacy guarantee that permits noisy but accurate answers to aggregate queries, while provably bounding the information leakage about any single record in the database. Noisier aggregates ensure stronger privacy guarantees (smaller ). DPClean allows the data owner to specify a bound on the privacy leakage about any single record to the cleaning engineer, and ensures that every sequence of queries it answers has a total privacy leakage below this bound.
The key technical contributions of this paper are :
DPClean provides the first solution (to the best of our knowledge) that enables general data cleaning activities on private data via a private and secure interaction between the data owner and the cleaning engineer, each expressing their constraints in the most natural and effective way to carry out the cleaning task.
Primitives: We propose a small set of aggregate query primitives that a cleaning engineer can use to privately explore the sensitive dataset. We formulate the semantics of error tolerance for each of these query types.
Translating Error to Privacy: We formulate and solve the novel problem of translating aggregate queries with error tolerances to differentially private mechanisms that answer these queries within the error bounds, but with the least privacy loss. This is in contrast to almost all prior work in differentially private algorithm design that assumes a constraint on privacy and minimizes error.
We conduct a thorough empirical evaluation to demonstrate the effectiveness and efficiency of DPClean for real data cleaning scenario on real data sets.
Organization. Section 2 establishes notation and running examples used in this work. Section 3 presents an overview of DPClean. The three key technical contributions of this work, namely aggregate query primitives, translating error to privacy and composition are presented in Sections 4, 5, and 6, respectively. Section 7 shows the evaluation of DPClean on real datasets and cleaning tasks.
2 Notation and Background
Throughout the paper we consider a single table , with schema . Each row is drawn from a domain consisting of attributes, . Our algorithms can be extended to relations with multiple tables, which is discussed in Section 9.
2.1 Differential Privacy
Definition 1 (()-Differential Privacy )
A randomized mechanism satisfies -differential privacy if
for any set of outputs , and any pair of neighboring databases such that and differ by adding or removing a record; i.e., .
Intuitively, smaller values of and result in stronger privacy guarantees as and are harder to distinguish from the output. When , this privacy guarantee is called -differential privacy.
Differential privacy has emerged as the state-of-the-art privacy notion since (i) it is a strong mathematical privacy guarantee that hides information of individual records, but allows statistics of the dataset to be learnt, (ii) it ensures privacy even in the presence of side information about the data, and (iii) most importantly it allows one to bound the information leakage across multiple data accesses. We will describe composition and postprocessing theorems of differential privacy  that are used to bound privacy across multiple accesses in Section 6. Differential privacy has seen adoption in a number of real products at the US Census Bureau [12, 22, 30], Google  and Apple . We did not choose semantically secure or property preserving encryption, as the former does not allow the cleaning engineer to learn anything about the data, and the latter is susceptible to attacks using side information .
2.2 Use Case: Entity Resolution
To illustrate the interactive process for data cleaning we consider entity resolution (also referred to as record linkage, entity matching, reference reconciliation or de-duplication) [8, 16], as an important data cleaning application, which aims to identify entities (or objects) referring to the same real-world entity. For concreteness, we will assume that the data owner has two set of records and with the same schema of attributes and the owner already posses a labeled training dataset with schema . We also assume that is constructed by (a) picking a sample of records from , (b) identifying for each , another record , and (c) for each pair , labeling the pair as either a match or a non-match (like in ). Throughout this paper, we will assume that every record in and appears at most times in the training dataset ( in our experiments). Let and denote the set of positive and negative examples in the training set.
A typical entity resolution solution usually involves two tasks: blocking and matching. Both these tasks are achieved by learning a boolean formula (e.g. in DNF) over similarity predicates. We express a similarity predicate as a tuple . First, is an attribute in the schema and is a transformation of the attribute value. Next, is a similarity function that usually takes a real value often restricted to , and is a threshold Given a pair of records , a similarity predicate returns either ‘True’ or ‘False’ with semantics:
For instance, the similarity predicate
would return true whenever the cosine similarity of the character q-gram representation of the Name attribute in two records is greater than. Blocking and matching tasks typically use a rich library of attribute and context dependent transformations and similarity functions. There are an exponential number of possible predicates , and an even larger number of formulas that can be constructed by combining them with boolean operators. Cleaning engineers judiciously limit the search space by their experience and domain expertise to create accurate cleaning workflows.
For blocking, our goal is to find a boolean formula that identifies a small set of candidate matching pairs that cover most of the true matches.
Definition 2 (Blocking Task)
Discover a boolean formula over similarity predicates that achieves high recall, and low blocking cost. Recall is the fraction of true matching pairs that are labeled true by .
Blocking cost is the fraction of pairs in that return true for .
A matcher is a boolean formula that identifies matching records with the least false positives and false negatives.
Definition 3 (Matching Task)
Discover a boolean formula over similarity predicates that achieves high recall and precision on . Precision measures whether classifies true non-matches as matches, and recall measures the fraction of true matches that are captured by .
where denotes the set of rows in that satisfy .
3 DPClean: System Overview
Figure 2 illustrates the architecture of DPClean. The data owner and the cleaning engineer are the two parties involved in this system. The data owner owns the private dirty data, while the cleaning engineer is on the other side of the ‘firewall’ with no direct access to the dirty data. A cleaning engineer can interact with the sensitive data through the local exploration module, and the privacy engine answers these queries and tracks the privacy loss.
3.1 Local Exploration
Local exploration is a human-guided process, in which a cleaning engineer explores the dirty data with the goal of judiciously choosing parameters for automatic cleaning softwate in the data cleaning workflow.
Consider a blocking task (Def. 2) where the cleaner’s goal is to find a disjunction of predicates from the set of all predicates . A typical cleaning engineer interacts with the data to limit the space of predicates as follows:
Example 1 (Data Profiling)
The cleaning engineer starts with profiling the data, for example, to quantify the number of missing values for each attribute, or to know the distribution of values of some attributes.
Example 2 (Picking Predicates)
Based on the acquired data profile, the cleaning engineer chooses a suitable subset of attributes (e.g., ones with least number of nulls), a subset of transformations (e.g., lower case, character -grams), a subset of similarity functions (e.g., Jaccard or Cosine similarity) along with reasonable thresholds to construct a set of candidate predicates to use to build the blocker.
Example 3 (Evaluating Predicates)
Suppose and have been chosen as two predicates in (from Example 2). The cleaner may want to measure the recall and blocking cost of , and .
Example 4 (Updating Configuration)
Suppose has a recall of and blocking cost of , and has a recall of and a blocking cost of . The cleaner may choose to include in the solution and not (as the latter has a high cost). On the other hand, if can catch of the matching pairs and of the non-matching pairs, the cleaner may choose to retain in the solution.
Some of these aforementioned example actions – e.g. profiling and predicate evaluation – can be achieved by the cleaning engineer posing aggregate queries over the (training) data. Other actions – e.g. selecting attributes or transformation, and setting thresholds on recall/cost – need the expertise of a human cleaning engineer. These two types of actions are formulated as follows.
Type-I: Queries with Tolerance. This type of actions allow the cleaning engineer to query aggregate statistics on the database with some error tolerance. Given an aggregate query , we let denote a general distance function over the output space of the query that represent the difference between the true answer of a query over , referred to as , and a noisy answer to . We define error tolerance on query as:
Definition 4 (-.type tolerance)
Given a query, , and a distance function over the query output, . We say a mechanism satisfies -q.type tolerance for and if for any table
We fix to be a very small value throughout the paper. DPClean supports the following query types: linear counting queries, linear counting queries with conditions and linear top-k queries. The query types and their tolerances are defined in Section 4. We denote a sequence of query-based actions by (and keep the same for all queries).
Type-II: Smart Cleaning Choices. This type of actions are the smart choices made by the cleaning engineer based on the knowledge of the engineer (including the sequence query answers learned so far in addition to the initial expertise and the preference of this engineer). Consider Example 2, the space of choices on predicates are exponential in terms of the number of attributes, transformations, similarity functions, and possible thresholds. Different cleaning engineers can choose different sets of candidate predicates, and even have different orders to evaluate these predicates. Moreover, there exist cleaner dependent criterion on whether to keep or prune a predicate in Example 4 and these criterion vary among engineers. We denote these engineer specific actions by , where corresponds to a set of choices made by the engineer after the th type-I action for .
Cleaning Strategy. During the interactive process, we observe a sequence of type-I and type-II actions interleaved . We use a strategy to denote a class of local explorations that using the same set of type-I actions (query types) but different type-II actions. Figure 3 shows two strategies iterates over example actions given in Example 1 to 4. These two strategies are different as they use different set of type-I actions (labeled by ), though they share the same type-II actions (labeled by ). In particular, the queries in the strategy shown in Figure 2(a) are count queries, while the queries in Figure 2(b) are top-k and counting with condition queries respectively. The same engineer who applies the second strategy will see less information and may continue the exploration differently. If changing type-II actions of the strategy instance in Figure 2(a), (e.g., changes to ‘Cleaner chooses a criterion for predicate p: if it catches of remaining matches and of remaining non-matches), the resulted exploration still belongs to the same strategy. The differences in type-II actions are mainly due to the differences in the expertise or preference of cleaning engineers.
3.2 Privacy Engine
The privacy engine in DPClean ensures that every sequence of queries (with tolerances) posed by the cleaning engineer is answered while satisfying -differential privacy, where and are privacy budget parameters specified by the data owner. The privacy engine serves two important functions, and its operations are summarized in Algorithm 1.
The first component of this module is an error tolerance to privacy translator (Line 5) that takes each query and its tolerance requirement , and translates it into a differentially private (noise) mechanism or simply represented as . The translation is such that: (i) satisfies -.type tolerance (Def. 4), (ii) satisfies -differential privacy with minimal privacy loss, i.e., the same mechanism with different noise parameters cannot both simultaneously achieve -.type tolerance and -differential privacy for . Note that constraints (i) and (ii) can be individually ensured by releasing the true answer or releasing an extremely noisy answer, respectively. Satisfying both the constraints simultaneously is a key technical innovation in this paper, and we propose a set of translation algorithms with minimal privacy loss in Section 5. While our algorithms translate individual queries with tolerances to privacy mechanisms, translating a group of queries simultaneously can result in higher accuracy (and lower privacy loss) – we defer this interesting direction for future work.
The second component named as privacy analyzer (Line 6-12) analyzes the accumulated privacy cost. Given a sequence of mechanisms already executed by the privacy engine that satisfy an overall -differential privacy, the privacy engine calls function (Line 6
) to estimate the privacy loss that would be incurred by running the next mechanism(to answer the next query). If exceeds the privacy budget , then the privacy does not answer the query, and the cleaning engineer might either halt, or ask a query with a higher error tolerance. If , then the mechanism will be executed and will be returned to the cleaning engineer. The privacy engine then calls to compute the actual privacy loss. In most cases, will be the same as , but in some cases the two can be different. In some of our tranlsated mechanisms , different execution paths have different privacy loss. represents the worst case privacy loss across all execution paths (computed before running ), while represents the actual privacy loss (after running ). The privacy analyzer guarantees that the execution of any sequence of mechanisms () before it halts is -differential privacy (see Section 6).
The following sections will describe the set of aggregate query primitives supported by DPClean, the privacy translator and the privacy analyzer.
4 Privacy Primitives
This section formalizes the aggregate query primitives supported in DPClean: (i) linear counting queries, (ii) linear counting queries with conditions, and (iii) top-k linear counting queries. We consider these primitives because: (1) the definition of tolerance is different for these queries, and (2) the differentially private algorithms that support these primitives are different.
4.1 Linear Counting Query (LC)
There is much work on answering linear counting queries under differential privacy [20, 13]. An LC query on a table , denoted by , returns the number of rows in the table that satisfy a boolean predicate . An LC query can be expressed in SQL as
Given an LC query , we define the distance between the true output and an approximate answer as the absolute difference . The error tolerance for a mechanism for answering an LC query is defined as:
Definition 5 (-LC tolerance)
Given a linear counting query , we say a mechanism satisfies -LC tolerance, if for any table ,
LC queries can be used to guide several cleaning tasks.
Example 5 (LC Queries for NULLs)
Counting the number of rows with NULL for an attribute is an LC query on a base table which counts the number of rows in where the given attribute has ‘NULL’ value. If the true answer is 10000, and the tolerance is , then the approximate answer should be within with probability.
The sensitivity of an LC query on a base table, or the max change in its output due to adding or removing a tuple in the input, is always . LC queries can also be posed on views over the base table, and its sensitivity is bounded by the stability of the view. Stability, defined as the number of tuples added or removed from a view due to adding or removing a single tuple in the base table, can be computed using well known methods [23, 14]. In all our experiments, queries are either on base tables or views with stability 1; hence, LC queries always have sensitivity 1.
Example 6 (LC Queries for Blocking)
Given a dataset , let be a training set that consists of pairs of records in and whether they are matching or non-matching records (as described in Sec. 2.2). Then, recall of the blocking predicate can be computed by first answering a LC over the training set that counts the number of matching records, , that satisfy the predicate , and then dividing by the number of matching records in , a known constant. Note that if each record in appears at most times in , then the stability of (and the sensitivity of ) is at most .
4.2 LC Query with Condition (LCC)
Given a table , a linear counting with condition (LCC) query returns a binary output () based on comparing an LC query output with a given threshold . The comparison operator can be ‘’. The error tolerance for a mechanism answering an LCC query is:
Definition 6 (-LCC tolerance)
A mechanism satisfies -LCC tolerance for an LCC query if for any table ,
Error tolerance for other are defined analogously.
Example 7 (LCC Queries for NULLs)
Continuing with Example 5, instead of querying for the number of NULLs for an attribute, the cleaner can ask whether the number of NULLs is greater than some threshold, say 1000. Given a tolerance of , then when the answer is , with 95% probability the true count is ; and when the answer is , with 95% probability the true count is ;
Similarly, LCC queries can be used to check if the quality of a blocking scheme or a matching predicate reaches a threshold.
4.3 LC Top-K Query (LCT)
Given a table , a top- linear counting query takes in a set of boolean formulae , and outputs the top- boolean formulae that are satisfied by the largest (or the least) number of rows in the table, denoted by (or , respectively).
Definition 7 (-LCT tolerance)
Given a top- linear counting query , we say that a mechanism with output satisfies -LCT tolerance if for any table ,
where is the answer to the largest LC query.
This definition is equivalent to saying that with high probability all boolean formulae in the output of the mechanism have , and every with is in the output.
Example 8 (LCT Queries for Missing Values)
Again continuing with Example 5, suppose the cleaner wants to prune out attributes with the most number of NULLs. Rather than using LC queries counting the number of NULLs in each attribute, the cleaner could directly ask an LCT query. For instance, suppose have NULLs respectively. The top attributes with the most number of missing values are . The number of NULLs in , the th largest LC query answer, is 8000. Given a tolerance of , then with high probability, attributes selected in the output should have a true count larger than 8000-10, and attributes with true count larger than 8000+10 should be appear in the output.
5 Translating Error to Privacy
In this section, we provide translation algorithms for the primitives including LC, LCC, and LCT defined in the previous section. Given a query , where .type on a table with -.type tolerance requirement for , we propose a differentially private mechanism , or simply denoted as , such that (i) is answered with mechanism that satisfies -.type tolerance, (ii) mechanism satisfies -differential privacy, with minimal privacy loss. i.e., the same mechanism with different noise parameters cannot both simultaneously achieve -.type tolerance and -differential privacy for . Note that this section considers mechanisms for one query at a time. The overall privacy of a sequence of queries will be presented in Section 6. We first present Laplace Mechanism based translations for all primitives. The privacy loss of mechanisms resulting from these translations depend only on the query type and tolerance, and not on the input data (Sec 5.1). We then discuss data-dependent translations for LCC queries that result in mechanisms with lower privacy loss for certain datasets and queries (Sec 5.2).
5.1 Laplace Mechanism based Translation
5.1.1 Laplace Mechanism For LC
The Laplace Mechanism is a classic differentially private algorithm that is widely used as a building block.
Definition 8 (Laplace Mechanism)
Let be a LC query on database . The Laplace mechanism releases . Noise is drawn from the Laplace distribution with parameters that has a pdf .
The Laplace mechanism run with satisfies -differential privacy , if the sensitivity of the query is bounded by . For ease of presentation, we LC queries are answered on base tables or views with stability 1, resulting in . The Laplace distribution has a mean of 0 and a variance of , and . As the noise parameter increases, the noise is more likely to be outside the range .
DPClean translates LC queries with tolerance into the Laplace mechanism with parameter
. Based on the properties of the Laplace random variable, it is easy to see that it satisfies the tolerance requirement and has minimal privacy loss.
Given a linear counting query , for table , adding noise to , where , denoted by , can achieve -LC tolerance, i.e. with the minimal -differential privacy cost of .
5.1.2 Laplace Comparison Mechanism for LCC
An LCC query can also be answered using Laplace noise, similar to LC. As shown in Algorithm 2, noise drawn from Laplace distribution with parameter is added to the difference between and . If the noisy difference is greater than 0, the mechanism outputs ‘True’; otherwise, ‘False’. This mechanism, referred as Laplace comparison mechanism has the following property. All proofs in this section are deferred to the Appendix B.
Given a linear counting query with condition for any , the Laplace comparison mechanism (Algorithm 2) denoted by , can achieve -LCC tolerance with minimal -differential privacy cost of .
Alternatively, the cleaning engineer can pose a linear counting query with -LC tolerance via DPClean, and then use the noisy answer of to learn locally. This approach adds a smaller expected noise to , and hence achieves -LCC tolerance, i.e.,
Using the output of a Laplace mechanism to answer can achieve -LCC tolerance, where .
This approach also allows a cleaning engineer to make more local decisions, for example, how much to adjust the threshold of a similarity function that consumes a larger privacy cost of compared to with cost of .
5.1.3 Laplace Top- Mechanism for LCT
Given a set of boolean formulae LCT aims to find top- boolean formulae that are satisfied by the largest number of rows in the table. This primitive can be treated as linear counting queries and be answered as shown in Algorithm 3. Each LC query is perturbed with noise drawn from , where . These boolean formulae are then sorted based on their corresponding noisy counts in descending order and the first boolean formulae are outputted. This approach, referred as Top-k Laplace Mechanism, has the following property.
Given a top-k linear counting query, , for any table , Laplace top- mechanism (Algorithm 3) denoted by , can achieve -LCT tolerance with minimal -differential privacy cost, where .
Alternatively, the cleaning engineer can pose a list of linear counting query with -LC tolerance for each linear counting query, and then answer the top- linear counting query locally. However, this approach does not achieve -LCT tolerance, i.e.,
Using the outputs of for to answer can achieve -LCT tolerance, where .
5.2 Data Dependent Translation for LCC
The translation mechanisms shown above are all data-independent as the derivation of the noise parameter and hence the privacy cost only depends on the query type and the tolerance requirement. Given the same query and tolerance requirement, it is possible to achieve smaller privacy cost using a different mechanism for certain datasets. In this section, we use linear counting query with condition LCC, , as an example to explore such so-called data-dependent translations.
Intuitively, when is much larger (or smaller) than , then a much larger (smaller resp.) noise can be added to without changing the ‘True’ or ‘False’ decision of the system. Consider the following example.
Consider an LCC query , where . To achieve tolerance for this query, where , the Laplace comparison mechanism requires a privacy cost of by Theorem 2, regardless of input . Suppose . In this case, is much larger than the threshold , and the difference is times of the accuracy bound . Hence, even when applying Laplace comparison mechanism with a privacy cost equals to wherein the noise added is bounded by with high probability , the noisy difference will still be greater than 0 with high probability.
This is an example where a different mechanism rather than achieves the same LCC tolerance with a smaller privacy cost. Note that the tightening of the privacy cost in this example requires to know the value of , and thus this privacy cost estimation is a data dependent approach. We need to ensure this step is also differentially private and propose two approaches next.
5.2.1 LCM with Poking
The first approach is summarized in Algorithm 4, named as Laplace Comparison Mechanism with Poking (LCMP). This algorithm first computes the privacy cost of based on Theorem 2, denoted by (Line 1). It then chooses to run LCM with a small fraction of this privacy cost: i.e., it adds to the difference , with (Line 2,3). If the noisy difference is too large (Line 5), then LCMP returns ‘True’. If it is too small (Line 6), LCMP return ‘False’. In both these cases, LCMP incurs a fraction of the privacy loss of LCM. If the noisy difference is neither too small or too large, it runs LCM (Line 10), and incurs an additional privacy loss of .
Given a LCC query , for any table , LCM with Poking (Algorithm 4), denoted by achieves -LCC tolerance, and satisfies -differential privacy cost, where .
Note that LCMP has a higher privacy loss than LCM in the worst case. However, if LCMP returns in either Line 6 or Line 8, then the privacy loss is much smaller, and it occurs often in our experiments. The privacy engine (Algorithm 1) would use the worst case privacy loss to decide whether to answer a query using LCMP (in estimateLoss), but use the actual loss (which could be much smaller) to compute the overall loss (in analyzeLoss) if LCMP has been run.
5.2.2 LCM with Multi-Poking
In the Laplace Comparison Mechanism with Poking (Algorithm 4), the prepaid privacy budget needs to be specified, but it is difficult to determine a value without looking at the query answer. To tackle this challenge, we propose an alternative approach that allows of pokes with increasing privacy cost. This approach is summarized in Algorithm 5, as Laplace Comparison Mechanism with Multi-Poking (LCMMP). This approach first computes the privacy cost if all pokes are needed, . The first poke is the same as LCMP which checks if the noisy difference is sufficiently greater (or smaller) than the tolerance for the current privacy cost. If this is true (Lines 8-12), then an answer ‘True’ (or ‘False’) is returned; otherwise, the privacy budget is relaxed with additional . At th iteration, instead of sampling independent noise, we apply the NoiseDown Algorithm  (details refer to Algorithm 6 in Appendix) to correlate the new noise with noise from the previous iteration. In this way, the privacy loss of the first iterations is , and the noise added in the th iteration is equivalent to a noise generated with Laplace distribution with privacy parameter . This approach allows the data cleaner to learn the query answer with a gradual relaxation of privacy cost. This process repeats until all is spent. We show that Algorithm 5 achieves both tolerance and privacy requirements.
Given a LCC query , for any table , LCM with Multi-Poking (Algorithm 5), denoted by , achieves -LCC tolerance with -differential privacy, where .
Like LCMP, the worst case privacy loss of LCMMP is larger than that of LCM, but this mechanism may stop before is used up, and hence it potentially saves privacy budget for the subsequent queries. In fact, the privacy loss of LCMMP can be a fraction of LCM’s privacy loss, if the mechanism returns in the first iteration.
In DPClean, the default translation algorithms for LC, LCC, and LCT are respectively LM, LCM, and LTM described in Section 5.1. We empirically show these optimizations for LCC, but when to turn them on is an interesting future direction. Moreover, Laplace noise is used as an example for translation which achieves -differential privacy per query in this section. Other type of noises are also possible, such as Guassian noise which provides -differential privacy per query, and can be adapted into this system.
6 Privacy Analyzer
Privacy analyzer is the second component in the privacy engine that analyzes the accumulated privacy cost of all interactive process. There are two functions called by this component as shown in Algorithm 1, (i) (Line 6) to estimate the privacy loss that would be incurred by running the next mechanism (to answer the next query); (ii) to compute the actual privacy loss after running . The output of these two functions are the same in most cases, except when data-dependent translation algorithms are applied, such as Algorithms 4 and 5 which have different execution paths and hence different privacy loss. represents the worst case privacy loss across all execution paths (computed before running ) if running , while represents the actual privacy loss (after running ). We show two types of privacy composition techniques for a sequence of differentially private algorithms.
6.1 Sequential Composition
We first present sequential composition, a simple but useful composition technique.
Theorem 8 (Sequential Composition )
Let and be algorithms with independent sources of randomness that ensure - and -differential privacy respectively. An algorithm that outputs both and ensures -differential privacy.
Given a sequence of data independent translation mechanisms shown in Section 5.1 that satisfy differential privacy with cost of respectively, both functions and output . For data dependent translation mechanisms, considers the worst privacy loss of these mechanisms. For instance, if is LCM with poking (Algorithms 4), estimateLoss increments the privacy loss by (the privacy loss derived in Theorem 6) to obtain while analyzeLoss considers the actual privacy loss: if is not called (Line 10), then the actual privacy loss increments by ; otherwise, the actual privacy cost increments by . Similarly, if is LCM with multi-poking (Algorithm 5), then estimateLoss increments the privacy loss by while analyzeLoss considers only if the algorithm stops at loop .
6.2 Advanced Composition
Advanced composition techniques  allows the privacy parameters to degrade more slowly than sequential composition by considering the privacy loss as a random variable rather than as a fixed worst case cost. DPClean uses the Moments Accountant technique [2, 24] summarized below. Unlike sequential composition, which applies to blackbox mechanisms, advance composition techniques requires knowledge of the mechanisms (and the noise distributions used within).
The sequence of differentially private mechanisms run by the privacy engine can be considered as an instance of adaptive composition which is modeled by letting the auxiliary input of the th mechanism be the output of all the previous mechanisms, i.e., The moment accountant keeps track of a bound on the moments of the privacy loss random variable of each mechanism defined below.
For neighboring databases , a given mechanism , an auxiliary input , the privacy loss of outputting can be captured using the random variable:
If satisfies -DP, then with probability , but could be much smaller.
Let called the th moment be the log of the moment generating function of evaluated at :
The privacy loss of a mechanism requires to bound all possible , i.e.,
where the maximum is taken over all possible aux and all the neighboring databases .
Theorem 9 (Moments Accountant )
Let be defined as above. Then
Composability: Suppose that a mechanism consists of a sequence of adaptive mechanisms where Then, for any , .
Tail bound: The mechanism is -differentially private for , for any .
By the theorem above, the privacy analyzer only needs to bound at each step and sum them to bound the moments of the sequence of mechanisms . Then the tail bound can be used to convert the moment bound to the -differential privacy guarantee. The analysis of a single Laplace noise based Laplace Mechanism  that adds leads to a privacy loss
when ; when . When goes to , goes to . As the translation mechanisms shown in Section 5 are all based Laplace noise, a sequence of these mechanisms can be analyzed as a sequence of adaptive Laplace mechanisms that add noises to the corresponding linear counting queries. In particular, the Laplace top- mechanism (Algorithm 3
) for LCT draws a vector of noisesfrom , where can be shown as .
Similar to sequential composition, the outputs of the two functions estimateLoss and analyzeLoss are the same if all mechanisms are data independent. When data-dependent algorithms are run, these two functions vary in output. Suppose is LCM with poking (Algorithm 4) that may run only one Laplace mechanism with or two Laplace mechanisms. Function estimateLoss considers the worst case where two Laplace mechanisms are run while function analyzeLoss depends on the execution path. If only one Laplace mechanism was run, then the moments accountant in analyzeLoss considers the privacy loss of this Laplace mechanism alone, and the privacy cost can be saved for later mechanisms. Similarly, Laplace mechanism with multi-poking (Algorithm 5) may also stops at before the maximum privacy budget is used up. As the noises drawn before this mechanisms are all correlated and the last noise follows the distribution of , analyzeLoss considers the privacy loss of as the increment to the actual privacy loss, while estimateLoss considers the privacy loss of as the worst case before running .
6.3 Overall Privacy Guarantee
We show that the interactive process by DPClean satisfies the privacy constraint specified by the data owner.
Theorem 10 (Privacy Guarantee of DPClean)
The interactive process by DPClean including privacy engine represented by Algorithm 1 and local exploration satisfies -differential privacy.
(sketch) When , no mechanism has been executed yet, and hence , therefore the interactions are definitely -differentially private. Suppose is -differential privacy with an actual privacy loss . Given a new mechanism is proposed, the function estimateLoss considers the worst privacy loss of regardless the data and outputs . This step does not cost additional budget. If the privacy analyzer decides to run this mechanism as , then the actual privacy loss outputted by analyzeLoss after executing is which should be smaller than and hence no more than . If the privacy analyzer decides to deny the query, the actual privacy cost remains unchanged to . By induction, the actual privacy cost is always no more than . Moreover, the local exploration is a post-processing step, called as post-processing immunity , which does not consume additional privacy cost. Therefore, the interactive process by DPClean satisfies -differential privacy.
7 Empirical Evaluations
This section evaluates the usability and effectiveness of DPClean with real data cleaning scenarios on real data sets. Our experiments demonstrate the following:
7.1 Experiment Setup
Cleaning Tasks & Datasets. The cleaning tasks for entity resolution defined in Section 2.2 including blocking and matching are considered in this evaluation. Two datasets  are used: restaurants and citations (Table 1). For each dataset, the cleaning engineer would like to identify entities from two set of records and under the same schema with attributes . The training data for entity resolution tasks is sampled with the method described in Section 2.2 and half of the training record pairs are positive examples. is a view over the base tables, and we make it have stability equal to 1 by ensuring that each record in the base tables appears exactly once in . Thus, the sens