Private Exploration Primitives for Data Cleaning

12/29/2017
by Chang Ge, et al.

Data cleaning is the process of detecting and repairing inaccurate or corrupt records in data. It is inherently human-driven, and state-of-the-art systems assume that cleaning experts can access the data to tune the cleaning process. However, for sensitive datasets such as electronic medical records, privacy constraints disallow unfettered access to the data. To address this challenge, we propose a utility-aware differentially private framework that allows a data cleaner to query the private data for a given cleaning task, while the data owner tracks the privacy loss incurred by these queries. In this paper, we first identify a set of primitives based on counting queries for general data cleaning tasks and show that, even when the query answers contain some error, these cleaning tasks can be completed with reasonably good quality. We also design a privacy engine that translates the per-query accuracy requirement specified by the data cleaner into a differential privacy loss parameter ϵ and ensures that all queries are answered under differential privacy. With extensive experiments using blocking and matching as examples, we demonstrate that our approach achieves plausible cleaning quality and outperforms prior approaches to cleaning private data.
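
To make the accuracy-to-ϵ translation concrete, the following is a minimal sketch for a single counting query, assuming the standard Laplace mechanism with sensitivity 1. Under that assumption, noise of scale 1/ϵ exceeds α in absolute value with probability exp(-αϵ), so an error bound α that may fail with probability at most β needs ϵ = ln(1/β)/α. The names `epsilon_for_accuracy`, `laplace_noise`, and `PrivacyLedger` are hypothetical, and the paper's actual privacy engine may use a different translation.

```python
import math
import random


def epsilon_for_accuracy(alpha, beta, sensitivity=1.0):
    """Smallest epsilon such that Laplace noise of scale sensitivity/epsilon
    stays within +/- alpha with probability at least 1 - beta.
    Assumption: plain Laplace mechanism, not necessarily the paper's engine."""
    if alpha <= 0 or not (0.0 < beta < 1.0):
        raise ValueError("need alpha > 0 and 0 < beta < 1")
    return sensitivity * math.log(1.0 / beta) / alpha


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials
    with rate 1/scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class PrivacyLedger:
    """Hypothetical data-owner-side tracker. It relies on sequential
    composition: the total privacy loss is the sum of per-query epsilons."""

    def __init__(self, budget):
        self.budget = budget
        self.spent = 0.0

    def noisy_count(self, true_count, alpha, beta, rng):
        eps = epsilon_for_accuracy(alpha, beta)
        if self.spent + eps > self.budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += eps
        # Counting queries have sensitivity 1, so the noise scale is 1/eps.
        return true_count + laplace_noise(1.0 / eps, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    ledger = PrivacyLedger(budget=1.0)
    # The cleaner asks for a count accurate to +/- 10 with 95% confidence.
    answer = ledger.noisy_count(true_count=100, alpha=10.0, beta=0.05, rng=rng)
    print(f"noisy count: {answer:.2f}, epsilon spent: {ledger.spent:.4f}")
```

With α = 10 and β = 0.05, each query costs ϵ = ln(20)/10 ≈ 0.30, so a total budget of 1.0 admits three such queries before the ledger refuses a fourth. Tighter accuracy requirements (smaller α or β) cost proportionally more privacy.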
