
PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming

by   Alexander K. Lew, et al.

Data cleaning can be naturally framed as probabilistic inference in a generative model, combining a prior distribution over ground-truth databases with a likelihood that models the noisy channel by which the data are filtered and corrupted to yield incomplete, dirty, and denormalized datasets. Based on this view, we present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data. PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis. We show empirically that short (< 50-line) PClean programs can be faster and more accurate than generic PPL inference on multiple data-cleaning benchmarks; perform comparably in terms of accuracy and runtime to state-of-the-art data-cleaning systems (unlike generic PPL inference given the same runtime); and scale to real-world datasets with millions of records.



1 Introduction

Real-world data is often noisy and incomplete, littered with NULL values, typos, duplicates, and other inconsistencies. This can make it difficult to integrate multiple sources of data, or to extract useful information even from a single dataset. Cleaning dirty data—e.g. detecting and correcting errors, imputing missing values, or linking duplicate records—is thus an important first step in most data analysis workflows. Unfortunately, data cleaning has proven remarkably resistant to reliable automation, due to the heterogeneity of error patterns in real-world applications Abedjan2016.

Researchers have long recognized probabilistic generative modeling as an appealing approach to data cleaning problems De2016; DeSa2019; Hu2012; Kubica2003; Zhao2012. A generative model for data cleaning specifies a prior probability distribution over latent clean data, together with a likelihood model describing how the clean data is noisily observed. Bayesian inference algorithms can then be applied to infer a posterior distribution over latent clean datasets from dirty observations. In theory, generative models can exploit modularly encoded domain knowledge about particular datasets or error types, and weigh heterogeneous factors to detect errors and propose fixes in dirty data. But existing generative modeling techniques have not seen widespread adoption, compared to discriminative and weighted-logic approaches Dallachiesat2013; Mccallum2003; Wellner2004; Wick2013; Rekatsinas2017, some of which power industry data cleaning solutions. This is the case even though discriminative methods have significant drawbacks: they make it difficult to incorporate domain knowledge (and thus achieve acceptable accuracy), quantify uncertainty about proposed corrections, or, in some machine-learning-based approaches, audit the process by which cleaning decisions are made. Why isn’t the more flexible generative modeling approach more widely used? Several key challenges for generative data-cleaning remain to be solved.
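The prior-times-likelihood view above can be made concrete with a toy example. The following is a minimal sketch, not PClean's model: the city prior, the per-character error factor `eps`, and the Hamming-style distance are all assumptions chosen for illustration.

```python
def mismatch_count(a: str, b: str) -> int:
    """Number of differing positions; strings of unequal length count as fully different."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def likelihood(observed: str, clean: str, eps: float = 0.05) -> float:
    """Toy, unnormalized noisy channel: each differing character costs a factor eps."""
    return eps ** mismatch_count(observed, clean)


def posterior(observed: str, prior: dict, eps: float = 0.05) -> dict:
    """Bayes' rule over a finite set of candidate clean values."""
    scores = {value: p * likelihood(observed, value, eps) for value, p in prior.items()}
    total = sum(scores.values())
    return {value: s / total for value, s in scores.items()}


# Hypothetical prior over clean city names, and one dirty observation.
prior = {"Boston": 0.6, "Austin": 0.4}
post = posterior("Bostan", prior)
best_guess = max(post, key=post.get)
```

Here the posterior concentrates on the candidate that is both plausible a priori and close to the dirty observation, which is the trade-off a generative cleaning model encodes.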

Challenge 1. It is not feasible to create bespoke generative models and inference algorithms for each new dataset. Over decades, researchers have built special-purpose generative models and inference algorithms for narrow domains or particular types of errors Kiasari2017; MayfieldJenniferNeville2009; Pasula2003; Sontag2007; Steorts2016; Winn2017; Xiong2011; Zhao2012, leveraging carefully encoded domain knowledge to deliver high accuracy. Unfortunately, designing such models, and deriving and implementing effective inference algorithms for them, is a time-consuming task that requires significant expertise. Probabilistic programming tools Cusumano-Towner2019; ge2018turing; goodman2012church; Gordon2014; milch2007; wood2014new aim to ease this burden by providing languages for concisely specifying probabilistic models and tools for automating aspects of inference. But today’s automatic inference technology is not sufficient for automatic data cleaning: it typically relies either on gradient-based sampling and optimization (e.g., using Hamiltonian Monte Carlo or ADVI), which is not directly applicable in models with many discrete latent variables, or on domain-general stochastic search algorithms (e.g., single-site Metropolis-Hastings) that can be prohibitively slow to converge in complex models.

Challenge 2. One-size-fits-all models cannot exploit dataset-specific domain knowledge, and so tend to underfit on real-world data. An alternative to creating bespoke models per dataset is to design a single, one-size-fits-all model that encodes assumptions about data cleaning in general but not about specific datasets De2016; Hu2012; Matsakis2010. However, this approach strips generative models of one of their key advantages: the ability to incorporate arbitrary knowledge about a problem, leading to improved accuracy and results that are interpretable in the context of a domain. One-size-fits-all models instead make simplifying assumptions about the data (e.g., independence of data entries or sometimes columns; limited error types; no continuous variables), limiting their applicability to or performance on real-world datasets. This challenge arises also for approaches that attempt to learn a model from data Mansinghka2015; Saad2016; Vergari2019: in this case, the simplifying assumptions are baked into the class of models over which the learning algorithm searches.

Challenge 3. Furthermore, systems support for efficient Bayesian inference is lacking. Bayesian inference algorithms, and especially Monte Carlo algorithms for posterior sampling, have a reputation for being slow. There is no fundamental reason why this must be the case: for particular models and in particular data regimes, it is often possible to develop efficient algorithms that yield accurate results quickly in practice (even if existing theory cannot accurately characterize the regimes in which they work well). But little tooling exists for deriving these fast algorithms, or for implementing them using efficient data structures and computation strategies (though see Cusumano-Towner2019; Huang2017; Mansinghka2009; Walia2019 for some work in this direction). Compare this to the state of the art in deep learning, in which specialized hardware, software libraries, and compilers help to ensure that compute-intensive training algorithms can be run in a reasonable amount of time, with little or no performance engineering by the user.

Our approach. In this work, we present PClean, a domain-specific probabilistic programming language for Bayesian data cleaning. PClean’s architecture is based on three modeling and inference contributions, each addressing limitations of prior work in probabilistic data cleaning and probabilistic programming:

  1. A domain-general generative model for cleaning data, customizable via dataset-specific priors. This model posits a non-parametric, relational generative process in which latent database tables are generated, joined, and corrupted to yield dirty, denormalized data.

  2. A domain-general particle MCMC inference algorithm combining sequential Monte Carlo initialization with particle Gibbs rejuvenation. This algorithm is user-configurable via concise dataset-specific inference hints that inform how the problem is decomposed and what data-driven proposals are used.

  3. A domain-specific probabilistic programming language for augmenting the domain-general model with dataset-specific priors about latent relational structure and likely errors, and dataset-specific inference hints to improve inference performance.

This paper also contributes empirical demonstrations that short (< 50 line) probabilistic programs deliver higher accuracy than state-of-the-art machine learning and weighted logic baselines, with accuracy that improves as more domain knowledge is incorporated. Finally, we show that PClean can scale to handle large, real-world datasets, by applying it to detect and correct errors in Medicare’s 2.2-million-row database of health care professionals.

To the best of our knowledge, this paper is the first to show that a generative Bayesian approach (jointly modeling identity uncertainty, record linkage, and data errors) can be deployed across a broad range of real-world problems with modest problem-specific effort, including instances with millions of rows, to yield higher accuracy and comparable performance relative to strong weighted logic baselines. Our results show that it is feasible and useful to integrate modeling and inference insights from the Bayesian non-parametrics, relational learning, data cleaning, and Monte Carlo inference literatures, into a single domain-specific probabilistic programming language.

2 Modeling

PClean is based on a domain-general generative model for dirty data that can be specialized to particular datasets via domain-specific probabilistic programs. The generative model is relational: it posits a latent network of interrelated objects underlying the observed data (e.g. landlords and neighborhoods in a dataset of apartment listings), organized into a set of latent classes. The model is also non-parametric: the number of latent objects in each class is unbounded. Observed data records are modeled as depending on attributes of one or more of these latent objects, via a noisy channel. Domain knowledge is incorporated via generative probabilistic programs that define the latent relational domain, probable connections, attribute-level priors, and likely errors.

2.1 PClean Modeling Language

[Figure 1. Left panel, "Dataset-Specific PClean Program": a program declares one or more classes. The class dependency graph must be acyclic. Each class declares parameters shared among all objects of the class, reference slots that refer to other classes, and attributes sampled from primitive distributions. Right panel, "Domain-General Generative Model": given a probabilistic program declaring latent classes, the generative process for tabular data is given by three procedures, GenerateDataset (which iterates over the latent classes), GenerateCollection, and GenerateObject (which iterates over reference slots, then attributes).]

Figure 1: PClean’s generative model for tabular data (right) is parameterized by a probabilistic program encoding dataset-specific domain knowledge (left). The program declares a set of latent classes representing the kinds of objects reflected in the dataset. Each class is equipped with parameters, attributes, and reference slots. The generative process proceeds along the dependency graph induced by the reference slots, generating objects of each class only once it has finished processing the class’s parents. For each latent class, class-wide parameters are generated first, followed by an infinite weighted collection of objects, sampled from a Pitman-Yor Process. Finally, the observed data entries are generated from the Obs class.

The precise generative process our model describes depends on the details of a PClean probabilistic program (Figure 1, left), which defines a set of latent classes representing the types of object (e.g. Neighborhood, Landlord) that populate the latent object network, as well as an observation class Obs modeling the records of the observed dataset (e.g. apartment listings).

The declaration of a PClean class may include three kinds of statement: reference statements, which define a foreign key or reference slot that connects objects of the class to objects of a target class; attribute statements, which define a new field or attribute that objects of the class possess, and declare an assumption about the probability distribution that the attribute typically follows; and parameter statements, which introduce global parameters shared among all objects of the class, to be learned from the noisy dataset. As in probabilistic relational models Friedman1999, the distribution of an attribute may depend on the values of a parent set of attributes, potentially accessed via reference slots. For example, in Figure 1, the Obs class has a complex reference slot with target class Complex, and a desc attribute whose value depends on

Programs may also invoke arbitrary deterministic computations, e.g. string manipulation or arithmetic. Figure 2 shows how PClean’s modeling language can be used to capture diverse data and error patterns.
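Because the class dependency graph must be acyclic, classes can always be ordered so that each class comes after the classes its reference slots target. A minimal sketch of that check follows, using Python's standard `graphlib`. The schema dictionary is hypothetical: the class names come from the running apartment-listing example, but the particular reference slots are assumptions, and this is not PClean's own syntax or implementation.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical schema: each class maps to the classes targeted by its reference slots.
schema = {
    "Neighborhood": set(),
    "Landlord": set(),
    "Complex": {"Neighborhood", "Landlord"},
    "Obs": {"Complex"},
}


def processing_order(references: dict) -> list:
    """Return classes so that every class appears after all classes it references."""
    try:
        return list(TopologicalSorter(references).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes forming the detected cycle.
        raise ValueError(f"class dependency graph has a cycle: {err.args[1]}") from err


order = processing_order(schema)
```

With this ordering, a generative process can finish each parent class before generating objects of any class that refers to it.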

2.2 Generative Process

Given a program defining latent classes and observation class Obs, the generative process for a dataset is depicted in the right panel of Figure 1. We describe two representations of this process:

Discrete Random Measure Representation. We process classes one at a time, in topological order. For each latent class, we (1) generate class-wide parameters from their corresponding priors, and (2) generate an infinite weighted collection of objects of that class. An object of a class is an assignment of each of the class's attributes to a value and of each of its reference slots to an object of the slot's target class. The infinite collection of latent objects is generated via a Pitman-Yor Process Teh2011.

The Pitman-Yor Process is a discrete random measure that generalizes the Dirichlet Process. It can be understood as first sampling an infinite vector of probabilities from a two-parameter GEM distribution, then attaching each probability to one of infinitely many objects, each drawn from a base distribution. This base distribution is itself a distribution over objects, which first samples reference slots and then attributes (see Figure 1).

To generate the observed dataset, we sample the parameters of the Obs class from their prior distribution, then, for each record, generate the observed entry as an object of class Obs.
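The two-parameter GEM construction can be illustrated with truncated stick-breaking. This is a minimal sketch under the standard parameterization (a `strength` parameter and a `discount` parameter in [0, 1)); the function names, the truncation level, and the toy base sampler are assumptions for illustration, and the true process has infinitely many atoms.

```python
import random


def gem_weights(strength: float, discount: float, truncation: int, rng: random.Random):
    """First `truncation` stick-breaking weights of a two-parameter GEM,
    plus the probability mass left on the unbroken remainder of the stick."""
    weights, remaining = [], 1.0
    for k in range(1, truncation + 1):
        # Standard Pitman-Yor stick-breaking: V_k ~ Beta(1 - discount, strength + k * discount).
        v = rng.betavariate(1.0 - discount, strength + k * discount)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining


def truncated_pitman_yor(strength, discount, base_sampler, truncation, rng):
    """Pair each weight with an object drawn independently from the base distribution."""
    weights, leftover = gem_weights(strength, discount, truncation, rng)
    atoms = [base_sampler(rng) for _ in weights]
    return list(zip(atoms, weights)), leftover


rng = random.Random(0)
# Toy base distribution standing in for the per-class object generator.
collection, leftover = truncated_pitman_yor(
    1.0, 0.5, lambda r: r.choice(["studio", "1BR", "2BR"]), 50, rng
)
```

A larger discount leaves more mass in the tail, so more distinct objects receive non-negligible weight.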

Chinese Restaurant Process Representation. We can also describe a finitely representable Chinese Restaurant version of this process. Consider a collection of restaurants, one for each class, where each table serves a dish representing an object of that class. Upon entering a restaurant, customers either sit at an existing table or start a new one, as in the usual generalized CRP construction. But these restaurants require that to start a new table, customers must first send friends to other restaurants (one to the target of each reference slot). Once they are seated at these parent restaurants, they phone the original customer to help decide what to order, i.e., how to sample the attributes of the new table’s object, informed by their dishes (the objects of the target classes). The process starts with customers at the observation class Obs’s restaurant, one per observed record, who sit at separate tables.
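The seating rule of the generalized CRP for a single restaurant can be sketched as follows. This is a simplified illustration using the standard Pitman-Yor seating probabilities; it omits the step in which a new table sends friends to parent restaurants, and the names `strength` and `discount` are assumptions of the sketch (with `strength` assumed positive).

```python
import random


def seat_customer(table_counts: list, strength: float, discount: float, rng: random.Random) -> int:
    """Choose a table index; an index equal to len(table_counts) means "start a new table".

    Existing table with c customers: weight c - discount.
    New table: weight strength + discount * (number of tables).
    The weights sum to (total customers + strength).
    """
    num_tables = len(table_counts)
    weights = [count - discount for count in table_counts]
    weights.append(strength + discount * num_tables)
    return rng.choices(range(num_tables + 1), weights=weights)[0]


def simulate(num_customers: int, strength: float, discount: float, seed: int = 0) -> list:
    """Seat customers one at a time and return the number of customers at each table."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_customers):
        table = seat_customer(counts, strength, discount, rng)
        if table == len(counts):
            counts.append(1)  # new table; in the full model this also creates parent objects
        else:
            counts[table] += 1
    return counts


counts = simulate(200, strength=1.0, discount=0.5)
```

Only the occupied tables are ever stored, which is what makes this representation finite.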

Figure 2: PClean programs can model a variety of data cleaning scenarios. All of these patterns can be used individually or combined in a single script, depending on the user’s dataset.

3 Inference

Initialize each particle with an empty set of objects
Initialize particle weights
for each data record do                            // process each data record in sequence
     for each particle do                          // update each particle
          Incorporate the observation
     Resample using particle weights
for each sweep do                                  // particle Gibbs sweeps
     for each object in the current state do       // update each latent object
          Remove the object and anything that only it references
          Set the retained particle
          for each remaining particle do           // propose other particles
          Select a particle and update state
procedure DataDrivenProposal(class, direct observations, partial state)
     for each subproblem do
          Propose new attributes and reference slots
          for each newly created object do         // add newly created objects
          Add proposed attributes and reference slots to the partial object
     Score downstream observations
Algorithm 1: Sequential Monte Carlo inference with particle Gibbs rejuvenation and data-driven proposals

Generic sequential Monte Carlo and particle Gibbs algorithms are supported by several probabilistic programming languages ge2018turing; Mansinghka2014; Murray2015; VanDeMeent2015; wood2014new. PClean’s inference algorithm (Algorithm 1) is based on sequential Monte Carlo initialization and particle Gibbs rejuvenation, but differs from generic PPL implementations in two ways. First, PClean uses per-object PGibbs updates that exploit the exchangeability of the domain-general model Lloyd2014. These updates allow for joint sampling of all latent variables associated with a single object. In contrast, generic PPL implementations operate over the complete state space. For our model, this would entail repeated costly iterations, each maintaining multiple copies of the complete latent state, rather than fast sweeps limited to a single latent object. (The Venture inference engine can perform PGibbs over subproblems defined by single objects, but Venture’s general-purpose implementation is not fast enough for large-scale problems.) Second, PClean uses data-driven proposals that lead to accurate results much faster than proposing from the prior (Figure 3). These proposals are informed by dataset-specific “inference hints” embedded in PClean programs, enabling end-users to concisely customize PClean’s algorithm and empirically optimize performance, without deriving custom proposals themselves.
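The "resample using particle weights" step of sequential Monte Carlo can be sketched generically. The following is a minimal multinomial resampler over log-weights, not PClean's implementation; particle contents are left abstract, and a real implementation would copy particle states rather than share references.

```python
import math
import random


def normalize_log_weights(log_weights: list) -> list:
    """Convert log-weights to probabilities, subtracting the max for numerical stability."""
    top = max(log_weights)
    unnormalized = [math.exp(lw - top) for lw in log_weights]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def resample(particles: list, log_weights: list, rng: random.Random):
    """Multinomial resampling: draw particles in proportion to their weights,
    then reset all log-weights to zero (equal weights)."""
    probabilities = normalize_log_weights(log_weights)
    survivors = rng.choices(particles, weights=probabilities, k=len(particles))
    return survivors, [0.0] * len(particles)


rng = random.Random(0)
survivors, new_log_weights = resample(["state_a", "state_b"], [-1000.0, 0.0], rng)
```

With only two particles, as in the experiments reported below, resampling mostly serves to discard a particle whose proposal explained the latest record much worse than the other.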

Figure 3: Accuracy (recall, precision, and F1) vs. time (seconds) for three independent runs of two algorithms: PClean’s data-driven SMC+PG with two particles, and 100-particle SMC+PG that uses generic proposals from general-purpose probabilistic programming languages. The default algorithm samples proposals from the prior, so it sometimes fails to propose the correct clean value even in cases where the cell is already clean. This leads to very low precision and thus low F1. PClean’s precision is also poor after a single SMC pass (around 15-20 seconds), but improves greatly with a single particle Gibbs sweep (around 40 seconds later). The dataset is a version of Hospital (19,000 cells) with synthetically introduced typos and 20% of the values deleted.

Notation. The algorithm operates over a finite representation of the model’s state space based on the Chinese Restaurant Process analogue described in Section 2.2. We maintain for each particle an initially empty collection of currently instantiated objects of all classes. We also define a removal operation that deletes an object from this collection: all reference slots pointing to the removed object are filled with a placeholder value, and if the removed object was the only object that referred to some other object, then this subroutine is invoked recursively to remove that object as well.
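The recursive removal operation can be sketched over a simple dictionary-based state. The representation here (object ids mapped to their reference slots, and a `PLACEHOLDER` sentinel) is an assumption made for illustration rather than PClean's data structure.

```python
PLACEHOLDER = "<placeholder>"


def remove_object(state: dict, obj_id: str) -> None:
    """Remove obj_id from state, where state maps object id -> {slot name: target id}.

    Slots that pointed at the removed object are set to PLACEHOLDER, and any
    object that was referenced only by the removed object is removed recursively.
    """
    targets = [t for t in state.pop(obj_id).values() if t != PLACEHOLDER]
    for slots in state.values():
        for slot, target in slots.items():
            if target == obj_id:
                slots[slot] = PLACEHOLDER
    for target in targets:
        still_referenced = any(target in slots.values() for slots in state.values())
        if target in state and not still_referenced:
            remove_object(state, target)


# Hypothetical state: two listings, two complexes, one shared landlord.
state = {
    "obs1": {"complex": "c1"},
    "obs2": {"complex": "c2"},
    "c1": {"landlord": "l1"},
    "c2": {"landlord": "l1"},
    "l1": {},
}
remove_object(state, "c1")  # l1 survives: c2 still references it
after_first = {key: dict(value) for key, value in state.items()}
remove_object(state, "c2")  # now l1 is orphaned and is removed as well
```

The linear scans over the state keep the sketch short; an implementation aiming at millions of rows would maintain reverse-reference counts instead.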

Data-driven proposals. During SMC and particle Gibbs sweeps (Algorithm 1), object attributes and reference slots are proposed using a data-driven proposal. The proposal is a distribution over a new object of a given class, together with a (potentially empty) set of new objects that may be created as targets for the new object’s reference slots. It is informed by the object's directly observed attributes, as well as by the partial state, a collection of existing objects of all classes, with some objects having placeholder reference slots targeting the object to be proposed. The proposal uses subproblem decomposition Cusumano-Towner2019; Mansinghka2014; Mansinghka2018 based on user inference hints (subproblem begin ... end in Figure 1), which partition the class’s attributes and reference slots into an ordered set of subproblems, each introducing some of the attributes and some of the reference slots. The proposal factors as a sequence of sub-proposals, one for each subproblem.

Each subproposal targets the local conditional distribution associated with the subproblem’s attributes and reference slots, which is defined by weighting a subproblem prior by relevant likelihood terms. In the prior, the subproblem's reference slots follow the Pitman-Yor prior; new objects created during their generation, if any, constitute the set of newly created objects. The subproblem's attributes are distributed according to their declared distributions, evaluated using any reference slots and attribute values already fixed as part of the partial object. The local conditional is then obtained by combining this subproblem prior distribution over reference slots and attributes with likelihood terms arising from direct observations and from objects in the partial state with placeholder reference slots (which are intended to reference the object being proposed). In particular, we consider likelihoods that are possible to evaluate given only the partial object within the domain of this subproblem. For each likelihood term that is evaluable in the subproblem, the subproblem conditional is weighted by that term, where the placeholder is evaluated using the available portion of the partial object.

Ideally, the proposal would sample values according to this local subproblem posterior, but this requires marginalization and will be intractably slow if distributions have large domains. For such distributions, users may provide as inference hints smaller sets of preferred values on which they believe posterior mass will concentrate. These preferred values do not require careful tuning; in this paper, our experiments use the default choice of all values empirically observed in the dataset. We define surrogate distributions that equal the original attribute distribution on the preferred values and place the remaining mass on a special other token. By replacing all samples of latent attribute variables in the subproblem with draws from the surrogate instead of from the original distribution, we obtain a surrogate subproblem posterior that is tractable, since all variables have small discrete domains. (PClean also supports continuous variables, which are pruned from these subproblems and proposed from their prior distributions, but for ease of presentation we describe the discrete case here.) The proposal is then sampled in two steps: first a sample is drawn from the tractable surrogate subproblem posterior. Then, if any other tokens were sampled, their values are proposed from the corresponding attribute priors.
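The surrogate construction and two-step sampling can be sketched for a single discrete attribute. This is an illustrative simplification, not PClean's code: the `other_likelihood` argument (the likelihood assigned to the other token) and the choice to restrict the fallback draw to non-preferred values are assumptions of the sketch.

```python
import random

OTHER = "<other>"


def surrogate(prior: dict, preferred: set) -> dict:
    """Keep prior mass on preferred values; lump all remaining mass into OTHER."""
    kept = {value: p for value, p in prior.items() if value in preferred}
    kept[OTHER] = max(0.0, 1.0 - sum(kept.values()))
    return kept


def propose(prior, preferred, likelihood, other_likelihood, rng):
    """Step 1: sample from the surrogate posterior (surrogate prior times likelihood).
    Step 2: if OTHER was drawn, fall back to the prior over non-preferred values."""
    weights = {
        value: p * (other_likelihood if value == OTHER else likelihood(value))
        for value, p in surrogate(prior, preferred).items()
    }
    values = list(weights)
    choice = rng.choices(values, weights=[weights[v] for v in values])[0]
    if choice != OTHER:
        return choice
    rest = {value: p for value, p in prior.items() if value not in preferred}
    return rng.choices(list(rest), weights=list(rest.values()))[0]


rng = random.Random(0)
prior = {"MD": 0.6, "DO": 0.3, "PhD": 0.1}  # hypothetical attribute prior
preferred = {"MD", "DO"}
sur = surrogate(prior, preferred)
draw = propose(prior, preferred, lambda v: 1.0 if v == "DO" else 0.0, 0.0, rng)
fallback = propose(prior, preferred, lambda v: 0.0, 1.0, rng)
```

The enumeration in step 1 only ever touches the preferred values plus one extra token, regardless of how large the attribute's full domain is.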

Time complexity of inference. In the worst case, a fully connected subproblem over discrete variables can take time exponential in the number of variables to solve, though in practice, conditional independence enables much more efficient variable elimination routines. When a reference slot is involved, the complexity also depends on the number of objects in the current state that might be the target of the reference slot. The overall complexity of the algorithm is expressed in terms of this count of candidate targets, with the domain sizes of discrete random variables folded into the constant. The count is bounded above, but will often grow much more slowly than its bound. PClean hashes all objects into “canopies” mccallum2000efficient by their directly observed attributes, so that the set of possible (i.e., non-zero likelihood) target objects of each reference slot can be retrieved by hash lookup.
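The idea of hashing objects by their directly observed attributes can be sketched with an exact-match index. This is a deliberate simplification: canopies in the cited sense may use cheap approximate matching with overlapping groups, whereas this hypothetical `CanopyIndex` class only groups objects whose key attributes are identical.

```python
from collections import defaultdict


class CanopyIndex:
    """Group object ids by the values of a fixed tuple of observed attributes."""

    def __init__(self, key_fields):
        self.key_fields = tuple(key_fields)
        self.buckets = defaultdict(set)

    def _key(self, attributes: dict) -> tuple:
        return tuple(attributes.get(field) for field in self.key_fields)

    def add(self, obj_id: str, attributes: dict) -> None:
        self.buckets[self._key(attributes)].add(obj_id)

    def candidates(self, observed: dict) -> set:
        """Objects that could be the target of a reference slot for this record."""
        return set(self.buckets.get(self._key(observed), ()))


# Hypothetical latent objects keyed by an observed address field.
index = CanopyIndex(["address"])
index.add("practice_1", {"address": "12 Main St", "city": "Abingdon"})
index.add("practice_2", {"address": "12 Main St", "city": "Abington"})
index.add("practice_3", {"address": "9 Elm Ave", "city": "Bel Air"})
matches = index.candidates({"address": "12 Main St"})
misses = index.candidates({"address": "1 Unknown Rd"})
```

A lookup touches one bucket rather than every instantiated object, which is what keeps per-record proposal cost from growing with the full state.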

4 Results

We empirically evaluate PClean’s accuracy, customizability, and scalability against state-of-the-art baselines. All experiments were performed on one laptop running macOS Catalina with a 2.6 GHz Core i7 processor and 32 GB of memory, with PClean experiments using two-particle data-driven sequential Monte Carlo with particle Gibbs sweeps.

4.1 Quantitative measurements of accuracy and runtime

Tasks. We first evaluate PClean’s accuracy and performance on three tasks with known ground truth: the Flights and Hospital tasks, which are standard benchmarks from the data cleaning literature Rekatsinas2017, and a Rent task that we developed. Hospital contains artificial typos in 5% of cells. Flights lists flight departure and arrival times from often conflicting websites. We use the version from Mahdavi2019. Rents consists of 50,000 apartment listings created using county-level HUD and census statistics USCensusBureau2019, with misspellings of counties, missing states and number of bedrooms, and incorrect units on rent prices. Additional details (links to data, PClean source code, and all hyperparameters used) can be found in the Supplementary Materials.

Task     Metric  PClean   HoloClean*  HoloClean          NADEEF               PClean
                                      (Rekatsinas2017)   (Dallachiesat2013)   Ablation
Flights  Prec    0.91     0.79        0.39               0.76                 0.41
         Rec     0.89     0.55        0.45               0.03                 0.36
         F1      0.90     0.64        0.41               0.07                 0.38
         Time    13.5s    45.4s       32.6s              9.1s                 9.2s
Hosp.    Prec    1.0      0.95        1.0                0.99                 0.92
         Rec     0.83     0.85        0.71               0.73                 0.17
         F1      0.91     0.90        0.83               0.84                 0.29
         Time    39.9s    1m 10s      1m 32s             27.6s                59.8s
Rent     Prec    0.68     0.83        0.83               0                    0.48
         Rec     0.69     0.34        0.34               0                    0.44
         F1      0.69     0.48        0.48               N/A                  0.46
         Time    9m 28s, 20m 16s, 13m 43s, 6m 10s (only four of the five values are present; their column assignment is not recoverable)

* Unpublished version of HoloClean, on the dev branch of

Table 1: Results of PClean and various baseline systems on three diverse cleaning tasks.
Model                                     F1     Lines of Code
Baseline. Sites equally reliable.         0.56   16
+ Modeling of timestamp format            0.60   16
+ Learned per-site reliability scores     0.69   17
+ Prior that airline’s site is better     0.90   18
Table 2: Four PClean models on Flights. Each model (top to bottom) uses more domain knowledge.

Baselines. We compare against three baselines. HoloClean is a data cleaning system based on probabilistic machine learning Rekatsinas2017, which compiles custom integrity constraints into a factor graph with learned weights. We use both the latest (unpublished) version from the authors’ GitHub repository and the version published in Rekatsinas2017. NADEEF is a data cleaning system that applies user-specified cleaning rules using an algorithm based on a variable-weighted MAX-SAT solver Dallachiesat2013. PClean Ablation is an ablation of PClean that removes the relational structure and only allows the single class Obs.

PClean models. We briefly describe the key features of the models we use for each dataset; see Supplementary Materials for the probabilistic programs. For Hospital, latent classes are City, Place, Hospital, Metric, Condition, and HospitalType, and each field is modeled as a potentially misspelled version of a latent value. For Flights, latent classes are Flight and TrackingWebsite. Each site has a learned reliability property, and is presumed to be more reliable if it is the site of the airline running the flight. For Rents, the latent class is County, whose attributes are a name, a state, and an average rent for each apartment type (studio, 1BR, etc.).

Results. Table 1 records recall, precision, and F1 for each experiment. PClean achieves the highest F1 score on all tasks.
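For reference, the metrics in Table 1 can be computed as in the sketch below. The definitions used here (precision as correct repairs over all repairs made, recall as correct repairs over all true errors, F1 as their harmonic mean) are the usual convention in the data cleaning literature and are stated as an assumption; the function names are illustrative.

```python
def precision_recall(correct_repairs: int, total_repairs: int, total_errors: int):
    """Precision = correct / repairs made; recall = correct / errors present."""
    precision = correct_repairs / total_repairs if total_repairs else 0.0
    recall = correct_repairs / total_errors if total_errors else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; zero when both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Consistency check against the Flights row for PClean in Table 1.
flights_f1 = round(f1_score(0.91, 0.89), 2)
```

Because the table reports rounded precision and recall, recomputing F1 from them can differ from the reported F1 in the last digit.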

4.2 Improving results by incorporating additional domain knowledge

Table 2 shows that small source code additions to incorporate additional domain knowledge can improve cleaning quality. See Supplementary Materials for full PClean source code for each example.

4.3 Scaling to large, real-world data

Figure 4: PClean applied to Medicare’s Physician Compare National database. Displayed is the input, the actual inferred latent entities, and cleaned output. PClean corrects systematic errors (e.g. the misspelled Abington, MD appears 152 times in the dataset) and imputes missing values.

Figure 4 shows results from applying PClean to Medicare’s 2.2-million-row Physician Compare database Medicare. This dataset contains many missing values and systematic errors.

Results. PClean took 7h36m, changing 8,245 values and imputing 1,535,415 missing cells. In a random subsample of 100 imputed cells, we found 90% agreed with manually obtained ground truth values. We also manually checked PClean’s changes, finding that 7,954 were correct, for a precision of 96.5%. Of these, some were correct normalization (e.g. choosing a single spelling for cities whose names could be spelled in multiple ways). We cannot quantify recall without ground truth, but to calibrate overall result quality, we ran NADEEF and HoloClean on this data. NADEEF changed only 88 cells, and HoloClean domain initialization alone takes 28 hours.

Figure 4 illustrates PClean’s behavior on four rows from the dataset (showing 8/38 columns). Consider the misspelling Abington, MD, which appears in 152 entries. The correct spelling Abingdon, MD occurs in only 42. However, PClean recognizes Abington, MD as an error because all 152 instances share a single address, and errors are modeled as happening systematically at the address level. Now consider PClean’s correct inference that K. Ryan’s degree is DO. PClean leverages the fact that her school PCOM awards more DOs than MDs, even though more Family Medicine doctors are MDs than DOs. The parameters that enable this reasoning are learned from the dirty dataset itself.

5 Discussion

Limitations and future work. PClean has limitations relative to probabilistic relational models Friedman1999: (1) PClean handles only acyclic class dependency graphs; (2) one object’s attributes can depend on another’s, but not on aggregate properties of sets of related objects; and (3) PClean models cannot encode priors over reference slots that depend on attributes, e.g. cinemas selecting movies to show based on genre (though such dependencies can arise in the posterior). Hierarchical Pitman-Yor processes and more sophisticated probabilistic programming architectures could potentially be used to lift these restrictions, without sacrificing scalability. PClean could also be extended in other ways, e.g. to leverage uncertainty quantification for interactive cleaning Kobren2019, or to synthesize error models and functional dependencies automatically Guo2019; Heidari2019.


This paper has shown that it is possible to clean dirty, denormalized data from diverse domains, via domain-specific probabilistic programs. The modeling assumptions and inference hints in these programs require under 50 lines of code, and yield state-of-the-art accuracy compared to strong baselines using weighted logic and machine learning. This language also scales to datasets with millions of rows. The power of domain-specific probabilistic programming languages that integrate sophisticated styles of modeling and inference developed over years of research into simple languages has been demonstrated in other challenging domains such as 3D computer vision Kulkarni2015; Mansinghka2015. PClean offers analogous benefits for Bayesian data cleaning. We hope that PClean is useful to practitioners, helping non-experts clean data that would otherwise be unusable for analysis, and that the success of this approach encourages AI and probabilistic programming researchers to develop domain-specific probabilistic programming languages for other important problem domains.

6 Broader Impact

The widespread need for clean data. Many organizations struggle with dirty data, including large corporations, schools, non-profits, and governmental agencies Rahm2000. In surveys, data analysts report that cleaning dirty data takes up more than sixty percent of their time, and the economic cost of unclean data has been estimated at trillions of dollars Press2016; Redman2016. PClean represents a step toward a reliable, broadly accessible data-cleaning approach that is simultaneously customizable and robust enough to be applicable to a broad range of problems in multiple domains.

Public interest uses for data cleaning. Data cleaning is a central bottleneck in data journalism, social science, and policy evaluation halevy2012data. We intend to assist people seeking to apply PClean to these kinds of public interest use cases, especially where the data comes from manual entry or scraping from volunteers. We plan to open-source the software as a Julia package, eliminating financial obstacles to PClean’s use.

Harmful uses for data cleaning. PClean could make it easier for companies, governments, and political organizations to link personal information from multiple sources Sweeney2001. Research on de-anonymization has shown repeatedly that vulnerable populations are easier to de-anonymize Horawalavithana2019. Better data cleaning technology, applied by oppressive "surveillance states" and/or for-profit companies whose business models depend on surveillance of consumers, could make these vulnerable people even easier to identify and exploit or persecute.

Data biases. PClean’s inference algorithm may infer parameters or impute missing values according to biases encoded in the source data. PClean could thus produce imputations that reflect systemic discrimination. One potentially mitigating factor is that unlike many machine learning approaches to data cleaning, PClean makes explicit generative assumptions. It is thus conceptually straightforward to prevent PClean from incorporating known biases, e.g. by requiring columns containing information about race, gender, or sexual orientation to be modeled independently of outcomes.


The authors are grateful to Zia Abedjan, Divya Gopinath, Marco Cusumano-Towner, Raul Castro Fernandez, Cameron Freer, Christina Ji, Tim Kraska, Feras Saad, Michael Stonebraker, and Josh Tenenbaum for useful conversations and feedback. This work is supported in part by a Facebook Probability and Programming Award, the National Science Foundation, and a philanthropic gift from the Aphorism Foundation.


Appendix A PClean model programs

In this section, we give the PClean source code for all models used in experiments. Code for PClean’s implementation itself, as well as for running these PClean models, is also included in the accompanying .zip file.

A.1 Hospital

The Hospital dataset is modeled with six latent classes. Typos are modeled as independently introduced for each cell of the dataset. Some fields are modeled as draws from broad priors over strings, whereas others are modeled as categorical draws whose domain is the set of unique observed values in the relevant column (some of which are in fact typos).

Inference hints are used to focus proposals for string_prior choices on the set of strings that have actually been observed in a given column, and also to set a custom subproblem decomposition for the Obs class (all other classes use the default decomposition).

latent class County
parameter state_proportions ~ dirichlet(ones(num_states))
state ~ discrete(observed_values[:State], state_proportions)
county ~ string_prior(3, 30, observed_values[:CountyName])
latent class Place
county ~ County
city ~ string_prior(3, 30, observed_values[:City])
latent class Condition
desc ~ string_prior(5, 35, observed_values[:Condition])
latent class Measure
code ~ uniform(observed_values[:MeasureCode])
name ~ uniform(observed_values[:MeasureName])
condition ~ Condition
latent class HospitalType
desc ~ string_prior(10, 30, observed_values[:HospitalType])
latent class Hospital
parameter owner_dist ~ dirichlet(ones(num_owners))
parameter service_dist ~ dirichlet(ones(num_services))
loc ~ Place
type ~ HospitalType
id ~ uniform(observed_values[:ProviderNumber])
name ~ string_prior(3, 50, observed_values[:HospitalName])
addr ~ string_prior(10, 30, observed_values[:Address1])
phone ~ string_prior(10, 10, observed_values[:PhoneNumber])
owner ~ discrete(observed_values[:HospitalOwner], owner_dist)
zip ~ uniform(observed_values[:ZipCode])
service ~ discrete(observed_values[:EmergencyService], service_dist)
observation class Obs
subproblem begin
hosp ~ Hospital; service ~ typos(hosp.service)
id ~ typos(; name ~ typos(
addr ~ typos(hosp.addr); city ~ typos(
state ~ typos(hosp.loc.county.state); zip ~ typos(
county ~ typos(hosp.loc.county.county); phone ~ typos(
type ~ typos(hosp.type.desc); owner ~ typos(hosp.owner)
subproblem begin
metric ~ Measure
code ~ typos(metric.code); mname ~ typos(;
condition ~ typos(metric.condition.desc)
stateavg = "$(hosp.loc.county.state)_$(metric.code)"
stateavg_obs ~ typos(stateavg)

A.2 Flights

We first introduce the model for Flights that we used to obtain the results presented in Section 4.1. We then show variants of the model with less domain knowledge incorporated, which are shown to perform worse.

The primary model is shown below. In the parameter declaration for error_probs, we use the syntax error_probs[_] ~ beta(10, 50) to introduce a collection of parameters; the declared variable becomes a dictionary, and each time it is used with a new index, a new parameter is instantiated. We use this to learn a different error_prob parameter for each tracking website. We could alternatively declare error_prob as an attribute of the TrackingWebsite class. However, PClean’s inference engine uses smarter proposals for declared parameters (taking advantage of conjugacy relationships), so for our experiments, we use the parameter declaration instead. We hope to extend automatic conjugacy detection to all attributes, not just parameters, in the near future.

latent class TrackingWebsite
name ~ string_prior(2, 30, websites)
latent class Flight
flight_id ~ string_prior(10, 20, flight_ids); guaranteed flight_id
sdt ~ time_prior(observed_values["$flight_id-sched_dep_time"])
sat ~ time_prior(observed_values["$flight_id-sched_arr_time"])
adt ~ time_prior(observed_values["$flight_id-act_dep_time"])
aat ~ time_prior(observed_values["$flight_id-act_arr_time"])
observation class Obs
parameter error_probs[_] ~ beta(10, 50)
flight ~ Flight; src ~ TrackingWebsite
error_prob = lowercase( == lowercase(flight.flight_id[1:2]) ? 1e-5 : error_probs[]
sdt ~ maybe_swap(flight.sdt, observed_values["$(flight.flight_id)-sched_dep_time"], error_prob)
sat ~ maybe_swap(flight.sat, observed_values["$(flight.flight_id)-sched_arr_time"], error_prob)
adt ~ maybe_swap(flight.adt, observed_values["$(flight.flight_id)-act_dep_time"], error_prob)
aat ~ maybe_swap(flight.aat, observed_values["$(flight.flight_id)-act_arr_time"], error_prob)

As in Hospital, we use observed_values to provide inference hints to the broad time_prior; this expresses a belief that the true timestamp for a certain field is likely one of the timestamps that has actually been observed, in the dirty dataset, with the given flight ID.
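The indexed declaration error_probs[_] described above behaves like a dictionary whose entries are created on first use. The following Python sketch illustrates only that lazy-instantiation behavior by drawing each new entry from the beta(10, 50) prior; the class name LazyParams and the site names are hypothetical, and the sketch does not reproduce PClean's inference over these parameters.

```python
import random


class LazyParams:
    """Dictionary-like collection of parameters, each drawn from the prior on first use."""

    def __init__(self, sample_prior):
        self._sample_prior = sample_prior
        self._values = {}

    def __getitem__(self, index):
        # A new index instantiates a new parameter; a known index reuses it.
        if index not in self._values:
            self._values[index] = self._sample_prior()
        return self._values[index]

    def __len__(self):
        return len(self._values)


rng = random.Random(0)
# Prior from the listing: beta(10, 50), whose mean is 10 / (10 + 50), about 0.17.
error_probs = LazyParams(lambda: rng.betavariate(10, 50))

p_a = error_probs["site-a"]        # first use: instantiates a parameter
p_b = error_probs["site-b"]        # new index: a second, independent parameter
p_a_again = error_probs["site-a"]  # repeated index: same parameter as before
```

In this sketch each tracking website gets its own error probability the first time it is looked up, which mirrors how the model learns a separate error_prob per website.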

Alternative models. We now briefly list the alternative models (Models 1-3) from Section 3.2. The unmodeled keyword is used to introduce a value that is guaranteed to be observed, but which is not modeled as a random variable.

# Model 1
latent class TrackingWebsite
name ~ string_prior(2, 30, websites)
latent class Flight
flight_id ~ unmodeled(); guaranteed flight_id
sdt ~ uniform(observed_values["$flight_id-sched_dep_time"])
sat ~ uniform(observed_values["$flight_id-sched_arr_time"])
adt ~ uniform(observed_values["$flight_id-act_dep_time"])
aat ~ uniform(observed_values["$flight_id-act_arr_time"])
observation class Obs begin
flight ~ Flight; src ~ TrackingWebsite
sdt ~ maybe_swap(flight.sdt, observed_values["$(flight.flight_id)-sched_dep_time"], 0.1)
sat ~ maybe_swap(flight.sat, observed_values["$(flight.flight_id)-sched_arr_time"], 0.1)
adt ~ maybe_swap(flight.adt, observed_values["$(flight.flight_id)-act_dep_time"], 0.1)
aat ~ maybe_swap(flight.aat, observed_values["$(flight.flight_id)-act_arr_time"], 0.1)
# Model 2
latent class TrackingWebsite
name ~ string_prior(2, 30, websites)
latent class Flight
flight_id ~ unmodeled(); guaranteed flight_id
sdt ~ time_prior(observed_values["$flight_id-sched_dep_time"])
sat ~ time_prior(observed_values["$flight_id-sched_arr_time"])
adt ~ time_prior(observed_values["$flight_id-act_dep_time"])
aat ~ time_prior(observed_values["$flight_id-act_arr_time"])
observation class Obs begin
flight ~ Flight; src ~ TrackingWebsite
sdt ~ maybe_swap(flight.sdt, observed_values["$(flight.flight_id)-sched_dep_time"], 0.1)
sat ~ maybe_swap(flight.sat, observed_values["$(flight.flight_id)-sched_arr_time"], 0.1)
adt ~ maybe_swap(flight.adt, observed_values["$(flight.flight_id)-act_dep_time"], 0.1)
aat ~ maybe_swap(flight.aat, observed_values["$(flight.flight_id)-act_arr_time"], 0.1)
# Model 3
latent class TrackingWebsite
name ~ string_prior(2, 30, websites)
latent class Flight
flight_id ~ unmodeled(); guaranteed flight_id
sdt ~ time_prior(observed_values["$flight_id-sched_dep_time"])
sat ~ time_prior(observed_values["$flight_id-sched_arr_time"])
adt ~ time_prior(observed_values["$flight_id-act_dep_time"])
aat ~ time_prior(observed_values["$flight_id-act_arr_time"])
observation class Obs begin
parameter error_probs[_] ~ beta(10, 50)
flight ~ Flight; src ~ TrackingWebsite
sdt ~ maybe_swap(flight.sdt, observed_values["$(flight.flight_id)-sched_dep_time"], error_probs[])
sat ~ maybe_swap(flight.sat, observed_values["$(flight.flight_id)-sched_arr_time"], error_probs[])
adt ~ maybe_swap(flight.adt, observed_values["$(flight.flight_id)-act_dep_time"], error_probs[])
aat ~ maybe_swap(flight.aat, observed_values["$(flight.flight_id)-act_arr_time"], error_probs[])

A.3 Rents

The program we use for Rents is shown below. We model the fact that the rent may be listed in thousands of dollars instead of dollars, as well as the fact that the county name may contain typos. We introduce an artificial field, block, consisting of the first and last letters of the observed (possibly erroneous) County field, and use it to inform an inference hint: we hint that posterior mass for a county’s name concentrates on those strings observed somewhere in the dataset that share their first and last letters with the observed county name for this row. Without this approximation, inference is much slower (but potentially more accurate).

data_table.block = map(x -> "$(x[1])$(x[end])", data_table.County)
units = [Transformation(identity, identity, x -> 1.0),
Transformation(x -> x/1000.0, x -> x*1000.0, x -> 1/1000.0)]
latent class County
parameter state_pops ~ dirichlet(ones(num_states))
block ~ unmodeled(); guaranteed block
name ~ string_prior(10, 35, observed_values[block])
state ~ discrete(states, state_pops)
observation class Obs
parameter avg_rent[_] ~ normal(1500, 1000)
subproblem begin
county ~ County
county_name ~ typos(, 2)
br ~ uniform(room_types)
unit ~ uniform(units)
rent_base = avg_rent["$(county.state)_$($(br)"]
observed_rent ~ transformed_normal(rent_base, 150.0, unit)
rent = round(unit.backward(observed_rent))
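The blocking hint used in the Rents program can be illustrated outside PClean. The Python sketch below computes the same first-and-last-letter key as the data_table.block line and groups the distinct observed county names by that key; the function names and the example column are hypothetical and serve only to show which candidate strings the hint would focus on.

```python
from collections import defaultdict


def block_key(name):
    """First and last characters of a (possibly misspelled) name."""
    return f"{name[0]}{name[-1]}"


def candidates_by_block(observed_names):
    """Group the distinct observed names by their block key."""
    groups = defaultdict(set)
    for name in observed_names:
        groups[block_key(name)].add(name)
    return {key: sorted(names) for key, names in groups.items()}


# Hypothetical observed County column, including one typo ("Washingtxn").
observed = ["Washington", "Washingtxn", "Wilson", "Warren", "Madison", "Washington"]
blocks = candidates_by_block(observed)

# Candidate clean names for a row whose observed county is "Washingtxn".
candidates = blocks[block_key("Washingtxn")]
```

Here the candidates for the misspelled row are the four names starting with "W" and ending with "n", so "Madison" is never considered. A typo in the first or last letter would place a name in a different block, which is one way such an approximation could trade accuracy for speed.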

A.4 Physicians

The model for Physicians is given below. Many columns are not modeled. Similar to Rents, we use a parameter in the Physician class for degree_proportions, although it might seem more natural to use an attribute of the School class; the resulting model is the same, but using parameter allows PClean to exploit conjugacy.

latent class School
name ~ unmodeled(); guaranteed name
latent class Physician
parameter error_prob ~ beta(1.0, 1000.0)
parameter degree_proportions[_] ~ dirichlet(3 * ones(num_degrees))
parameter specialty_proportions[_] ~ dirichlet(3 * ones(num_specialties))
npi ~ number_code_prior()
school ~ School
subproblem begin
degree ~ discrete(observed_values[:Credential], degree_proportions[])
specialty ~ discrete(observed_values[Symbol("Primary specialty")], specialty_proportions[degree])
degree_obs ~ maybe_swap(degree, observed_values[:Credential], error_prob)
latent class City
c2z3 ~ unmodeled(); guaranteed c2z3
name ~ string_prior(3, 30, cities[c2z3])
latent class BusinessAddr
addr ~ unmodeled(); guaranteed addr
addr2 ~ unmodeled(); guaranteed addr2
zip ~ string_prior(3, 10); guaranteed zip
legal_name ~ unmodeled(); guaranteed legal_name
subproblem begin
city ~ City
city_name ~ typos(
observation class Obs
p ~ Physician
a ~ BusinessAddr

Appendix B Inference hints

PClean supports three types of inference hint: subproblem declarations, preferred values arguments, and guaranteed observation statements. In this section, we describe each, and give an empirical demonstration of how the results in Section 3 depend on them (Table 3).

B.1 Subproblem declarations

Subproblem declarations allow users to explicitly control which attributes and reference slots are included within each subproblem (see Section 3). Larger subproblems lead to more expensive subproblem proposals, but can lead to more accurate results. Users declare a subproblem by wrapping adjacent attribute and reference statements into a subproblem begin ... end block.

To see the value of subproblem declarations, consider the Hospital program above. If we remove the subproblem begin ... end inference hint around the second subproblem in that model, then each line is treated as its own subproblem. We ran cleaning using this modified model (but kept other settings equal) and obtained an F1 score of only 0.18 (recall = 0.71, precision = 0.11).

B.2 Preferred values arguments

Preferred values arguments are optional arguments to distributions like string_prior that have infinite or very large support. The model itself does not change as a result of providing a preferred values argument. However, the proposal is adapted to reflect that posterior mass is expected to concentrate on the user-provided list of preferred values. In the models for this paper, we often specified preferred values equal to all observed values in a particular column, or values observed to co-occur with another value in a separate column. For example, we model the name of a city in the Hospital dataset as a string generated from a broad prior, but indicate that we expect posterior mass to concentrate on the set of strings that have actually been observed as city names in the dataset.
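The two kinds of preferred values lists mentioned above (all values observed in a column, and values observed to co-occur with a given value of another column) can be assembled with a few lines of code. The Python sketch below is one way to do so; the function names, column names, and rows are hypothetical, and it shows only how such lists might be built, not how PClean adapts its proposals to them.

```python
from collections import defaultdict


def observed_values_by_column(rows):
    """Distinct non-missing values seen in each column."""
    values = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            if value is not None:
                values[column].add(value)
    return {column: sorted(vals) for column, vals in values.items()}


def cooccurring_values(rows, key_column, value_column):
    """Distinct values of value_column observed alongside each key_column value."""
    values = defaultdict(set)
    for row in rows:
        if row[key_column] is not None and row[value_column] is not None:
            values[row[key_column]].add(row[value_column])
    return {key: sorted(vals) for key, vals in values.items()}


# Hypothetical dirty rows; the column names are illustrative only.
rows = [
    {"flight_id": "AA-100", "sched_dep_time": "9:00"},
    {"flight_id": "AA-100", "sched_dep_time": "9:05"},
    {"flight_id": "UA-200", "sched_dep_time": "14:30"},
    {"flight_id": "AA-100", "sched_dep_time": None},
]
per_column = observed_values_by_column(rows)
per_flight = cooccurring_values(rows, "flight_id", "sched_dep_time")
```

The per-flight list is the more targeted of the two: for flight "AA-100" it contains only the two departure times reported for that flight, whereas the per-column list contains every departure time in the data.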

To illustrate the value of preferred values arguments, we perform two targeted experiments. First, for the Flights dataset, we consider removing the preferred values argument from the time_prior call for the scheduled departure time of a flight. This yields a diminished overall F1 of 0.62 (recall = 0.70, precision = 0.55), due to mostly incorrect inferences about the scheduled departure time field. (Runtime is also slightly longer, because of the increased number of sampling operations from the time_prior proposal.) Second, we consider the effect of a preferred values argument that is very broad: we replace the more targeted preferred values argument for scheduled departure time with a list of all timestamps appearing anywhere in the Flights data (around 800 options). Inference quality is unaffected (F1 = 0.90, precision = 0.91, recall = 0.89), but running time is 7x longer: completing five iterations of particle Gibbs requires 95 seconds, instead of 13.

Preferred values arguments are a simple way to make inference in large discrete domains (e.g., strings) tractable. Researchers have also explored more sophisticated techniques for inference with strings [Winn2017, Yangel]. It may be possible to incorporate such strategies into a future version of PClean for more complex string-based inference, alleviating the need for preferred values arguments in some cases.

B.3 Guaranteed observation statements

Guaranteed observations are declared using the guaranteed keyword, which tells PClean that the value of a particular variable in a PClean model is guaranteed to be observed. This allows PClean to index object collections by these observed variable values, enabling fast lookups of all existing latent objects that are consistent with a particular observation of the variable. In the Flights dataset, removing the guaranteed statement from the flight_id attribute yields no change in inference results, but a modest 31% increase in running time (up to 17s from 13s). On a larger dataset, this runtime difference would be more pronounced.
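The benefit of indexing latent objects by a guaranteed attribute can be illustrated with a plain dictionary. The Python sketch below contrasts an indexed lookup with a full scan; the class LatentObjectIndex and its methods are hypothetical and are not PClean's internal data structures.

```python
from collections import defaultdict


class LatentObjectIndex:
    """Latent objects indexed by an attribute whose value is always observed."""

    def __init__(self, key_attribute):
        self.key_attribute = key_attribute
        self._by_key = defaultdict(list)
        self._all = []

    def add(self, obj):
        self._all.append(obj)
        self._by_key[obj[self.key_attribute]].append(obj)

    def consistent_with(self, observed_key):
        # Indexed lookup: only objects sharing the observed key are returned.
        return self._by_key.get(observed_key, [])

    def consistent_with_scan(self, observed_key):
        # Without the index, every existing object must be examined.
        return [o for o in self._all if o[self.key_attribute] == observed_key]


index = LatentObjectIndex("flight_id")
for i in range(1000):
    index.add({"flight_id": f"F{i % 100}", "sdt": None})

matches = index.consistent_with("F7")
scanned = index.consistent_with_scan("F7")
```

Both calls return the same objects, but the scan touches all 1000 objects while the indexed lookup goes directly to the matching ones, which is consistent with the expectation that the runtime gap grows with dataset size.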

| Dataset + Inference Hints | F1 | Recall | Precision | Runtime |
|---|---|---|---|---|
| Hospital (with both subproblem blocks) | 0.91 | 0.83 | 1.0 | 39.9s |
| Hospital (with only first subproblem block) | 0.18 | 0.71 | 0.11 | 37.8s |
| Flights (with preferred values, guaranteed flight ID) | 0.90 | 0.89 | 0.91 | 13.5s |
| Flights (no preferred values for departure time) | 0.62 | 0.70 | 0.55 | 17.6s |
| Flights (overly broad preferred values for departure time) | 0.90 | 0.89 | 0.91 | 95.4s |
| Flights (no guaranteed flight ID) | 0.90 | 0.89 | 0.91 | 17.2s |

Table 3: Effect of removing each kind of inference hint on accuracy and runtime of inference.
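As a reading aid for Table 3, F1 is the harmonic mean of precision and recall. The Python sketch below recomputes it from the reported (rounded) precision and recall values for three rows; the function name and the shortened row labels are ours.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in Table 3.
table3 = {
    "Hospital (both subproblem blocks)": (1.0, 0.83),
    "Flights (preferred values, guaranteed flight ID)": (0.91, 0.89),
    "Flights (no preferred values for departure time)": (0.55, 0.70),
}
recomputed = {name: round(f1_score(p, r), 2) for name, (p, r) in table3.items()}
```

Because the inputs are themselves rounded to two decimals, a recomputed value can differ from the reported one in the last digit; for example, precision 0.11 and recall 0.71 give roughly 0.19 rather than the reported 0.18.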

Appendix C Synthetic Rent Dataset

Here we detail the generative process for the synthetic apartment rent dataset used to empirically evaluate PClean. The dataset was assembled using the County Population Totals table from the U.S. census and the 50th Percentile Rent Estimate table from the U.S. Department of Housing and Urban Development [USCensusBureau2019].

The clean dataset consists of 50,000 rows that were each generated in the following manner:

  • The county-state combination is chosen proportionally to its population in the United States.

  • The size of the apartment is chosen uniformly from (studio, 1 bedroom, 2 bedroom, 3 bedroom, 4 bedroom).

  • The rent is chosen according to a normal distribution in which the mean is the median rent for an apartment of the chosen size in the chosen county and the standard deviation is chosen to be 10% of the mean.

The dataset was then dirtied in the following ways:

  • 10% of state names are deleted (many county names occur in multiple states, e.g. 30 states have a Washington County).

  • Approximately 1-2% of county names are misspelled

  • 10% of apartment sizes are deleted

  • 1% of apartment prices are listed in the incorrect units (thousands of dollars, instead of dollars)
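The generative and dirtying steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used to build the dataset: the COUNTIES table is a tiny hypothetical stand-in for the census and HUD tables, each corruption is assumed to be applied independently per row, the misspelling rate is set to 1.5% as a midpoint of the stated 1-2%, and the typo mechanism (a single random character replacement) is our own choice since the text does not specify one.

```python
import random

SIZES = ["studio", "1 bedroom", "2 bedroom", "3 bedroom", "4 bedroom"]

# Hypothetical stand-in for the census and HUD tables:
# (county, state) -> (population, {size: median rent in dollars}).
COUNTIES = {
    ("Washington County", "OR"): (600_000, dict(zip(SIZES, [1200, 1400, 1600, 2300, 2800]))),
    ("Washington County", "PA"): (200_000, dict(zip(SIZES, [650, 750, 950, 1200, 1300]))),
    ("Cook County", "IL"): (5_000_000, dict(zip(SIZES, [1000, 1100, 1300, 1650, 1950]))),
}


def clean_row(rng):
    keys = list(COUNTIES)
    weights = [COUNTIES[k][0] for k in keys]
    county, state = rng.choices(keys, weights=weights)[0]  # proportional to population
    size = rng.choice(SIZES)                               # uniform over sizes
    mean = COUNTIES[(county, state)][1][size]
    rent = rng.gauss(mean, 0.10 * mean)                    # std is 10% of the mean
    return {"county": county, "state": state, "size": size, "rent": round(rent)}


def misspell(name, rng):
    # Assumed typo model: overwrite one position with a random lowercase letter.
    i = rng.randrange(len(name))
    return name[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + name[i + 1:]


def dirty_row(row, rng):
    row = dict(row)
    if rng.random() < 0.10:
        row["state"] = None                           # 10% of state names deleted
    if rng.random() < 0.015:
        row["county"] = misspell(row["county"], rng)  # roughly 1-2% misspelled
    if rng.random() < 0.10:
        row["size"] = None                            # 10% of sizes deleted
    if rng.random() < 0.01:
        row["rent"] = row["rent"] / 1000.0            # 1% in thousands of dollars
    return row


rng = random.Random(0)
clean = [clean_row(rng) for _ in range(50_000)]
dirty = [dirty_row(r, rng) for r in clean]
```

With two Washington Counties in the stand-in table, deleting a state name leaves a row whose county is ambiguous, which is the situation the 10% state deletions are meant to create.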