A probabilistic database approach to autoencoder-based data cleaning

06/17/2021
by R. R. Mauritz, et al.

Data quality problems are a major threat in data science. In this paper, we propose a data-cleaning autoencoder capable of near-automatic data quality improvement. It learns the structure and dependencies in the data and uses them as evidence to identify and correct doubtful values. We apply a probabilistic database approach to represent weak and strong evidence for attribute value repairs. We provide a theoretical framework, and experiments show that the method can remove significant amounts of noise (i.e., data quality problems) from categorical and numeric probabilistic data. Our method does not require clean data. We do, however, show that manually cleaning a small fraction of the data significantly improves performance.
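
To make the idea of weak and strong evidence for a repair concrete, here is a minimal sketch. It is not the paper's autoencoder. It assumes a probabilistic cell is stored as a distribution over candidate values and that evidence from some learned model arrives as a second distribution over the same values. The function name `combine_evidence`, the example values, and the product-and-renormalize rule are all illustrative assumptions.

```python
def combine_evidence(cell, evidence):
    """Combine two distributions over candidate values for one cell.

    Assumption: both inputs map candidate values to probabilities.
    The rule used here (pointwise product, then renormalize) is a
    simple stand-in, not the method described in the paper.
    """
    values = set(cell) | set(evidence)
    product = {v: cell.get(v, 0.0) * evidence.get(v, 0.0) for v in values}
    total = sum(product.values())
    if total == 0.0:
        # The two sources share no support; leave the cell unchanged.
        return dict(cell)
    return {v: p / total for v, p in product.items()}


# Hypothetical uncertain cell: the recorded value "red" is only weakly trusted.
cell = {"red": 0.5, "green": 0.25, "blue": 0.25}

# Hypothetical strong evidence from the rest of the record favouring "green".
evidence = {"red": 0.1, "green": 0.8, "blue": 0.1}

repaired = combine_evidence(cell, evidence)
most_likely = max(repaired, key=repaired.get)
print(most_likely, round(repaired[most_likely], 3))
```

In this sketch a flat evidence distribution leaves the cell's own distribution unchanged, while a peaked one shifts the probability mass, which is one simple way to read "weak" versus "strong" evidence.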

research
04/26/2021

Provenance-based Data Skipping (TechReport)

Database systems analyze queries to determine upfront which data is need...
research
06/20/2022

Autoencoder-based Attribute Noise Handling Method for Medical Data

Medical datasets are particularly subject to attribute noise, that is, m...
research
06/21/2021

Facilitating team-based data science: lessons learned from the DSC-WAV project

While coursework provides undergraduate data science students with some ...
research
01/21/2018

A Formal Framework For Probabilistic Unclean Databases

Traditional modeling of inconsistency in database theory casts all possi...
research
09/29/2021

An epistemic approach to model uncertainty in data-graphs

Graph databases are becoming widely successful as data models that allow...
research
11/09/2018

Evidence Transfer for Improving Clustering Tasks Using External Categorical Evidence

In this paper we introduce evidence transfer for clustering, a deep lear...
research
12/24/2018

Motion Blur removal via Coupled Autoencoder

In this paper a joint optimization technique has been proposed for coupl...
