Batchwise Probabilistic Incremental Data Cleaning

11/09/2020
by Paulo H. Oliveira, et al.

Lack of data and data quality issues are among the main bottlenecks hindering wider adoption of artificial intelligence within many organizations, pushing data scientists to spend most of their time cleaning data before they can answer analytical questions. Hence, there is a need for more effective and efficient data cleaning solutions, an area that, not surprisingly, is rife with theoretical and engineering problems. This report addresses the problem of performing holistic data cleaning incrementally, given a fixed rule set and an evolving categorical relational dataset acquired in sequential batches. To the best of our knowledge, our contributions constitute the first incremental framework that cleans data (i) independently of user interventions, (ii) without requiring knowledge about the incoming dataset, such as the number of classes per attribute, and (iii) holistically, enabling multiple error types to be repaired simultaneously and thus avoiding conflicting repairs. Extensive experiments show that our approach outperforms its competitors with respect to repair quality, execution time, and memory consumption.


research
02/08/2019

EILearn: Learning Incrementally Using Previous Knowledge Obtained From an Ensemble of Classifiers

We propose an algorithm for incremental learning of classifiers. The pro...
research
11/23/2016

iCaRL: Incremental Classifier and Representation Learning

A major open problem on the road to artificial intelligence is the devel...
research
04/20/2019

Mining Rules Incrementally over Large Knowledge Bases

Multiple web-scale Knowledge Bases, e.g., Freebase, YAGO, NELL, have bee...
research
10/16/2020

Class-incremental Learning with Pre-allocated Fixed Classifiers

In class-incremental learning, a learning agent faces a stream of data w...
research
09/11/2023

MultIOD: Rehearsal-free Multihead Incremental Object Detector

Class-Incremental learning (CIL) is the ability of artificial agents to ...
research
04/10/2017

ROSA: R Optimizations with Static Analysis

R is a popular language and programming environment for data scientists....
research
09/09/2021

AutoSmart: An Efficient and Automatic Machine Learning framework for Temporal Relational Data

Temporal relational data, perhaps the most commonly used data type in in...
