Distance-based Data Cleaning: A Survey (Technical Report)

11/23/2020
by Yu Sun, et al.

With the rapid development of internet technology, dirty data are commonly observed in real-world scenarios, e.g., owing to unreliable sensor readings, transmission, and collection from heterogeneous sources. To limit their negative effects on downstream applications, data cleaning approaches preprocess the dirty data before the applications run. Most data cleaning methods identify or correct dirty data by referring to the values of their neighbors, i.e., tuples that share the same information. Unfortunately, owing to data sparsity and heterogeneity, the number of neighbors under an equality relationship is rather limited, especially when data values exhibit variance. To tackle this problem, distance-based data cleaning approaches consider similarity neighbors defined by value distance. By tolerating small variations, they identify a richer set of similarity neighbors for use in data cleaning tasks. In addition, the distance relationship between tuples helps guide data cleaning, since it carries more information than the equality relationship and subsumes it. Distance-based techniques therefore play an important role in data cleaning, and we have reason to believe that they will attract more attention in future data preprocessing research. This survey classifies the work into four main data cleaning tasks, i.e., rule profiling, error detection, data repair, and data imputation, and comprehensively reviews the state of the art for each class.
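The contrast between equality neighbors and similarity neighbors can be illustrated with a minimal sketch. It is not taken from the survey: the function names, the sample sensor readings, and the tolerance `eps` are all hypothetical, and absolute difference is assumed as the value distance.

```python
from typing import List, Sequence


def equality_neighbors(values: Sequence[float], i: int) -> List[int]:
    """Indices of the other values that are exactly equal to values[i]."""
    return [j for j, v in enumerate(values) if j != i and v == values[i]]


def similarity_neighbors(values: Sequence[float], i: int, eps: float) -> List[int]:
    """Indices of the other values within distance eps of values[i].

    Absolute difference is used as the distance (an assumption for this sketch).
    """
    return [j for j, v in enumerate(values) if j != i and abs(v - values[i]) <= eps]


# Hypothetical sensor readings: four values with small variance and one outlier.
readings = [20.1, 20.3, 19.9, 20.2, 35.0]

# No reading equals 20.1 exactly, so equality yields no neighbors.
exact = equality_neighbors(readings, 0)

# Tolerating a distance of 0.5 recovers the three nearby readings.
similar = similarity_neighbors(readings, 0, eps=0.5)

# The reading 35.0 has no neighbor within 0.5, which marks it as a candidate error.
outlier_support = similarity_neighbors(readings, 4, eps=0.5)
```

With `eps = 0`, the similarity neighbors coincide with the equality neighbors, which is one way to see that the distance relationship includes the equality relationship.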

research
04/07/2020

Learning Individual Models for Imputation (Technical Report)

Missing numerical values are prevalent, e.g., owing to unreliable sensor...
research
08/13/2016

An approach to dealing with missing values in heterogeneous data using k-nearest neighbors

Techniques such as clusterization, neural networks and decision making u...
research
05/08/2020

A Survey on Sampling and Profiling over Big Data (Technical Report)

Due to the development of internet technology and computer science, data...
research
09/16/2019

Hierarchic Neighbors Embedding

Manifold learning now plays a very important role in machine learning an...
research
03/21/2021

Escaping the Time Pit: Pitfalls and Guidelines for Using Time-Based Git Data

Many software engineering research papers rely on time-based data (e.g.,...
research
10/01/2019

Distance-Based Approaches to Repair Semantics in Ontology-based Data Access

In the presence of inconsistencies, repair techniques thrive to restore ...
research
07/29/2018

Information Distance Revisited

We consider the notion of information distance between two objects x and...