PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming

07/23/2020
by Alexander K. Lew, et al.

Data cleaning can be naturally framed as probabilistic inference in a generative model, combining a prior distribution over ground-truth databases with a likelihood that models the noisy channel by which the data are filtered and corrupted to yield incomplete, dirty, and denormalized datasets. Based on this view, we present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data. PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis. We show empirically that short (< 50-line) PClean programs can be faster and more accurate than generic PPL inference on multiple data-cleaning benchmarks; perform comparably in terms of accuracy and runtime to state-of-the-art data-cleaning systems (unlike generic PPL inference given the same runtime); and scale to real-world datasets with millions of records.
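
The framing in the first sentence, a prior over clean values combined with a noisy-channel likelihood, can be illustrated on a single field. The following is a minimal sketch in plain Python, not PClean code and not the paper's model: the city names, the prior weights, the `typo_rate` parameter, and the choice of an unnormalized `exp(-typo_rate * edit_distance)` typo score are all assumptions made for illustration.

```python
from math import exp


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def posterior(observed: str, prior: dict, typo_rate: float = 1.0) -> dict:
    """Posterior over candidate clean values given one dirty observation.

    Assumption (hypothetical noise model): the typo score decays as
    exp(-typo_rate * edit_distance) and is used as an unnormalized likelihood.
    """
    scores = {
        clean: p * exp(-typo_rate * edit_distance(clean, observed))
        for clean, p in prior.items()
    }
    total = sum(scores.values())
    return {clean: s / total for clean, s in scores.items()}


# Hypothetical prior over clean city names (e.g. from how often each appears).
prior = {"Boston": 0.6, "Austin": 0.3, "Houston": 0.1}
post = posterior("Bostn", prior)
best = max(post, key=post.get)
print(best, post)
```

Here the candidate closest to the observation and most probable under the prior receives the highest posterior mass. A full system along the lines the abstract describes would place the prior over entire relational database instances and use sequential Monte Carlo rather than enumerating candidates, which this sketch does not attempt.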
