Pattern-Driven Data Cleaning

12/26/2017
by El Kindi Rezig, et al.

Data is inherently dirty, and there has been a sustained effort to devise approaches to clean it. A large class of data repair algorithms relies on data-quality rules and integrity constraints to detect and repair errors in the data. A well-studied class of integrity constraints is Functional Dependencies (FDs, for short), which specify dependencies among attributes in a relation. In this paper, we address three major challenges in data repairing: (1) Accuracy: Most existing techniques strive to produce repairs that minimize changes to the data. However, this process may produce incorrect combinations of attribute values (or patterns). In this work, we formalize the interaction of FD-induced patterns and select repairs that preserve frequent patterns found in the original data. This has the potential to yield better repair quality in terms of both precision and recall. (2) Interpretability of repairs: Current data repair algorithms produce repairs in the form of data updates that are not necessarily understandable. This makes it hard to debug repair decisions and trace the chain of steps that produced them. To this end, we define a new formalism to declaratively express repairs that are easy for users to reason about. (3) Scalability: We propose a linear-time algorithm to compute repairs that outperforms state-of-the-art FD repairing algorithms by orders of magnitude in repair time. Our experiments using both real-world and synthetic data demonstrate that our new repair approach consistently outperforms existing techniques in terms of both repair quality and scalability.
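To make the notions of an FD violation and a frequent pattern concrete, the following is a minimal Python sketch. It is not the algorithm proposed in the paper, whose details the abstract does not give. It assumes a single FD `zip -> city` over a toy relation, treats each (left-hand side, right-hand side) value combination as a pattern, and rewrites each violating tuple to the most frequent pattern for its left-hand-side value. The function name `repair_fd`, the attribute names, and the data are all hypothetical.

```python
from collections import Counter, defaultdict


def repair_fd(rows, lhs, rhs):
    """Illustrative majority-vote repair for one FD lhs -> rhs.

    Assumption: this is a simplified stand-in, not the paper's method.
    It makes two passes over the data using hash lookups.
    """
    # Pass 1: count how often each (lhs, rhs) pattern occurs in the original data.
    pattern_counts = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        pattern_counts[key][row[rhs]] += 1

    # For each lhs value, prefer the most frequent rhs value.
    preferred = {
        key: counts.most_common(1)[0][0]
        for key, counts in pattern_counts.items()
    }

    # Pass 2: rewrite tuples that disagree with the preferred pattern,
    # and record each change so the repair can be inspected afterwards.
    repaired, changes = [], []
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in lhs)
        new_row = dict(row)
        if row[rhs] != preferred[key]:
            new_row[rhs] = preferred[key]
            changes.append((i, rhs, row[rhs], preferred[key]))
        repaired.append(new_row)
    return repaired, changes


# Hypothetical toy relation: row 2 violates zip -> city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "60601", "city": "Chicago"},
]

repaired, changes = repair_fd(rows, lhs=["zip"], rhs="city")
for index, attribute, old, new in changes:
    print(f"row {index}: {attribute} {old!r} -> {new!r}")
```

The `changes` list is a simple log of updates. The paper's declarative repair formalism is aimed at the same concern, explaining why a value was changed, but its actual form is not described in the abstract.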

research
12/20/2017

Computing Optimal Repairs for Functional Dependencies

We investigate the complexity of computing an optimal repair of an incon...
research
02/24/2022

Consistent data fusion with Parker

When combining data from multiple sources, inconsistent data complicates...
research
08/30/2017

The Complexity of Computing a Cardinality Repair for Functional Dependencies

For a relation that violates a set of functional dependencies, we consid...
research
04/05/2020

Learning Over Dirty Data Without Cleaning

Real-world datasets are dirty and contain many errors. Examples of these...
research
08/25/2022

LinCQA: Faster Consistent Query Answering with Linear Time Guarantees

Most data analytical pipelines often encounter the problem of querying i...
research
07/08/2020

T-REx: Table Repair Explanations

Data repair is a common and crucial step in many frameworks today, as ap...
research
07/31/2018

Improve3C: Data Cleaning on Consistency and Completeness with Currency

Data quality plays a key role in big data management today. With the exp...
