Learning Over Dirty Data Without Cleaning

04/05/2020
by Jose Picado, et al.

Real-world datasets are dirty and contain many errors, such as violations of integrity constraints, duplicates, and inconsistent representations of data values and entities. Learning over dirty databases may result in inaccurate models. Users have to spend a great deal of time and effort repairing data errors to create a clean database for learning. Moreover, because the information required to repair these errors is often unavailable, a dirty database may have numerous possible clean versions. We propose DLearn, a novel relational learning system that learns directly over dirty databases effectively and efficiently, without any preprocessing. DLearn leverages database constraints to learn accurate relational models over inconsistent and heterogeneous data. Its learned models represent patterns over all possible clean instances of the data in a usable form. Our empirical study indicates that DLearn learns accurate models over large real-world databases efficiently.

research
09/29/2021

An epistemic approach to model uncertainty in data-graphs

Graph databases are becoming widely successful as data models that allow...
research
08/21/2023

DataVinci: Learning Syntactic and Semantic String Repairs

String data is common in real-world datasets: 67.6 1.8 million real Exce...
research
06/15/2022

On the complexity of finding set repairs for data-graphs

In the deeply interconnected world we live in, pieces of information lin...
research
07/17/2022

Repairing Systematic Outliers by Learning Clean Subspaces in VAEs

Data cleaning often comprises outlier detection and data repair. Systema...
research
12/26/2017

Pattern-Driven Data Cleaning

Data is inherently dirty and there has been a sustained effort to come u...
research
08/21/2020

Spitz: A Verifiable Database System

Databases in the past have helped businesses maintain and extract insigh...
research
05/24/2017

On Patterns and Re-Use in Bioinformatics Databases

As the quantity of data being deposited into biological databases conti...
