Improve3C: Data Cleaning on Consistency and Completeness with Currency

07/31/2018
by Xiaoou Ding, et al.

Data quality plays a key role in big data management today. With the explosive growth of data from a variety of sources, data often suffers from several quality problems at once. Motivated by this, we study joint data quality improvement on completeness, consistency, and currency. We introduce a 4-step framework, named Improve3C, for detecting and repairing incomplete and inconsistent data without timestamps. We compute a relative currency order among records from given currency constraints, and use this order to repair inconsistent and incomplete data while taking the temporal impact into account. For both effectiveness and efficiency, we carry out inconsistency repair before incompleteness repair. We define a currency-related consistency distance to measure the similarity between dirty records and clean ones more accurately. In addition, currency orders are treated as an important feature in the training process of incompleteness repair. The solution algorithms are introduced in detail with examples. Experiments on one real-life dataset and one synthetic dataset verify that the proposed method improves the cleaning of dirty data with multiple quality problems, which existing approaches struggle to clean effectively.
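To make the idea of a relative currency order concrete, the following is a minimal sketch, not the paper's algorithm. It assumes currency constraints can be written as pairs of attribute values where the first is known to be older than the second. The names `CONSTRAINTS` and `currency_order`, and the example attribute `title`, are hypothetical and chosen only for illustration. A topological sort stands in here for whatever ordering procedure Improve3C actually uses.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical currency constraints (an assumption for this sketch):
# for each attribute, a pair (older_value, newer_value) states that a record
# holding older_value is less current than a record holding newer_value.
CONSTRAINTS = {
    "title": [("Engineer", "Senior Engineer"), ("Senior Engineer", "Manager")],
}


def currency_order(records, constraints):
    """Return record indices ordered from least current to most current.

    Records carry no timestamps; the order is derived only from the
    value-level constraints. Records not related by any constraint may
    appear in any relative position.
    """
    # predecessors[i] holds the indices of records known to be older than i.
    predecessors = {i: set() for i in range(len(records))}
    for attr, pairs in constraints.items():
        for older, newer in pairs:
            for i, rec in enumerate(records):
                if rec.get(attr) != newer:
                    continue
                for j, other in enumerate(records):
                    if other.get(attr) == older:
                        predecessors[i].add(j)
    # static_order() yields each node after all of its predecessors and
    # raises graphlib.CycleError if the constraints contradict each other.
    return list(TopologicalSorter(predecessors).static_order())
```

For three records whose `title` values are `Manager`, `Engineer`, and `Senior Engineer`, the constraints form a single chain, so the sketch places the `Engineer` record first and the `Manager` record last. A missing value simply matches no constraint, which leaves that record unordered relative to the others.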

research
02/24/2022

Consistent data fusion with Parker

When combining data from multiple sources, inconsistent data complicates...
research
08/27/2021

Cleaning Inconsistent Data in Temporal DL-Lite Under Best Repair Semantics

In this paper, we address the problem of handling inconsistent data in T...
research
10/01/2019

Distance-Based Approaches to Repair Semantics in Ontology-based Data Access

In the presence of inconsistencies, repair techniques thrive to restore ...
research
01/01/2019

Probery: A Probability-based Incomplete Query Optimization for Big Data

Nowadays, query optimization has been highly concerned in big data manag...
research
12/26/2017

Pattern-Driven Data Cleaning

Data is inherently dirty and there has been a sustained effort to come u...
research
01/13/2022

Certifiable Robustness for Nearest Neighbor Classifiers

ML models are typically trained using large datasets of high quality. Ho...
research
01/02/2020

Complexity and Efficient Algorithms for Data Inconsistency Evaluating and Repairing

Data inconsistency evaluating and repairing are major concerns in data q...
