1 Introduction
Real-world data is often noisy and incomplete, littered with NULL values, typos, duplicates, and other inconsistencies. This can make it difficult to integrate multiple sources of data, or to extract useful information even from a single dataset. Cleaning dirty data—e.g. detecting and correcting errors, imputing missing values, or linking duplicate records—is thus an important first step in most data analysis workflows. Unfortunately, data cleaning has proven remarkably resistant to reliable automation, due to the heterogeneity of error patterns in real-world applications Abedjan2016.
Researchers have long recognized probabilistic generative modeling as an appealing approach to data cleaning problems De2016; DeSa2019; Hu2012; Kubica2003; Zhao2012
. A generative model for data cleaning specifies a prior probability distribution over latent clean data, together with a likelihood model describing how the clean data is noisily observed. Bayesian inference algorithms can then be applied to infer a posterior distribution over latent clean datasets from dirty observations. In theory, generative models can exploit modularly encoded domain knowledge about particular datasets or error types, and weigh heterogeneous factors to detect errors and propose fixes in dirty data. But existing generative modeling techniques have not seen widespread adoption, compared to discriminative and weighted-logic approaches
Dallachiesat2013; Mccallum2003; Wellner2004; Wick2013; Rekatsinas2017, some of which power industry data cleaning solutions. This is the case even though discriminative methods have significant drawbacks: they make it difficult to incorporate domain knowledge (and thus achieve acceptable accuracy), quantify uncertainty about proposed corrections, or, in some machine-learning-based approaches, audit the process by which cleaning decisions are made. Why isn’t the more flexible generative modeling approach more widely used? Several key challenges for generative data-cleaning remain to be solved.Challenge 1. It is not feasible to create bespoke generative models and inference algorithms for each new dataset. Over decades, researchers have built special-purpose generative models and inference algorithms for narrow domains or particular types of errors Kiasari2017; MayfieldJenniferNeville2009; Pasula2003; Sontag2007; Steorts2016; Winn2017; Xiong2011; Zhao2012, leveraging carefully encoded domain knowledge to deliver high accuracy. Unfortunately, designing such models, and deriving and implementing effective inference algorithms for them, is a time-consuming task that requires significant expertise. Probabilistic programming tools Cusumano-Towner2019; ge2018turing; goodman2012church; Gordon2014; milch2007; wood2014new aim to ease this burden by providing languages for concisely specifying probabilistic models and tools for automating aspects of inference. But today’s automatic inference technology is not sufficient for automatic data cleaning: it typically relies either on gradient-based sampling and optimization (e.g., using Hamiltonian Monte Carlo or ADVI), which is not directly applicable in models with many discrete latent variables, or on domain-general stochastic search algorithms (e.g., single-site Metropolis-Hastings) that can be prohibitively slow to converge in complex models.
Challenge 2. One-size-fits-all models cannot exploit dataset-specific domain knowledge, and so tend to underfit on real-world data. An alternative to creating bespoke models per dataset is to design a single, one-size-fits-all model that encodes assumptions about data cleaning in general but not about specific datasets De2016; Hu2012; Matsakis2010. However, this approach strips generative models of one of their key advantages: the ability to incorporate arbitrary knowledge about a problem, leading to improved accuracy and results that are interpretable in the context of a domain. One-size-fits-all models instead make simplifying assumptions about the data (e.g., independence of data entries or sometimes columns; limited error types; no continuous variables), limiting their applicability to or performance on real-world datasets. This challenge arises also for approaches that attempt to learn a model from data Mansinghka2015; Saad2016; Vergari2019: in this case, the simplifying assumptions are baked into the class of models over which the learning algorithm searches.
Challenge 3. Furthermore, systems support for efficient Bayesian inference is lacking. Bayesian inference algorithms, and especially Monte Carlo algorithms for posterior sampling, have a reputation for being slow. There is no fundamental reason why this must be the case: for particular models and in particular data regimes, it is often possible to develop efficient algorithms that yield accurate results quickly in practice (even if existing theory cannot accurately characterize the regimes in which they work well). But little tooling exists for deriving these fast algorithms, or for implementing them using efficient data structures and computation strategies (though see Cusumano-Towner2019; Huang2017; Mansinghka2009; Walia2019
for some work in this direction). Compare this to the state-of-the-art in deep learning, in which specialized hardware, software libraries, and compilers help to ensure that compute-intensive training algorithms can be run in a reasonable amount of time, with little or no performance engineering by the user.
Our approach. In this work, we present PClean, a domain-specific probabilistic programming language for Bayesian data cleaning. PClean’s architecture is based on three modeling and inference contributions, each addressing limitations of prior work in probabilistic data cleaning and probabilistic programming:
-
A domain-general generative model for cleaning data, customizable via dataset-specific priors. This model posits a non-parametric, relational generative process in which latent database tables are generated, joined, and corrupted to yield dirty, denormalized data.
-
A domain-general particle MCMC inference algorithm combining sequential Monte Carlo initialization with particle Gibbs rejuvenation. This algorithm is user-configurable via concise dataset-specific inference hints that inform how the problem is decomposed and what data-driven proposals are used.
-
A domain-specific probabilistic programming language for augmenting the domain-general model with dataset-specific priors about latent relational structure and likely errors, and dataset-specific inference hints to improve inference performance.
This paper also contributes empirical demonstrations that short (< 50 line) probabilistic programs deliver higher accuracy than state-of-the-art machine learning and weighted logic baselines, with accuracy that improves as more domain knowledge is incorporated. Finally, we show that PClean can scale to handle large, real-world datasets, by applying it to detect and correct errors in Medicare’s 2.2-million-row database of health care professionals.
To the best of our knowledge, this paper is the first to show that a generative Bayesian approach (jointly modeling identity uncertainty, record linkage, and data errors) can be deployed across a broad range of real-world problems with modest problem-specific effort, including instances with millions of rows, to yield higher accuracy and comparable performance relative to strong weighted logic baselines. Our results show that it is feasible and useful to integrate modeling and inference insights from the Bayesian non-parametrics, relational learning, data cleaning, and Monte Carlo inference literatures, into a single domain-specific probabilistic programming language.
2 Modeling
PClean is based on a domain-general generative model for dirty data, that can be specialized to particular datasets via domain-specific probabilistic programs. The generative model is relational: it posits a latent network of interrelated objects underlying the observed data (e.g. landlords and neighborhoods in a dataset of apartment listings), organized into a set of latent classes. The model is also non-parametric: the number of latent objects in each class is unbounded. Observed data records are modeled as depending on attributes of one or more of these latent objects, via a noisy channel. Domain knowledge is incorporated via generative probabilistic programs that define the latent relational domain, probable connections, attribute-level priors, and likely errors.
2.1 PClean Modeling Language

The precise generative process our model describes depends on the details of a PClean probabilistic program (Figure 1, left), which defines a set of latent classes representing the types of object (e.g. Neighborhood, Landlord) that populate the latent object network, as well as an observation class Obs modeling the records of the observed dataset (e.g. apartment listings).
The declaration of a PClean class may include three kinds of statement: reference statements (), which define a foreign key or reference slot that connects objects of class to objects of a target class ; attribute statements (), which define a new field or attribute that objects of the class possess, and declare an assumption about the probability distribution that the attribute typically follows; and parameter statements (), which introduce global parameters shared among all objects of the class , to be learned from the noisy dataset. As in probabilistic relational models Friedman1999, the distribution of an attribute may depend on the values of a parent set of attributes, potentially accessed via reference slots. For example, in Figure 1, the Obs class has a complex reference slot with target class Complex, and a desc attribute whose value depends on complex.loc.name.
Programs may also invoke arbitrary deterministic computations, e.g. string manipulation or arithmetic. Figure 2 shows how PClean’s modeling language can be used to capture diverse data and error patterns.
2.2 Generative Process
Given a program defining latent classes and observation class Obs, the generative process for a dataset is depicted in the right panel of Figure 1. We describe two representations of this process:
Discrete Random Measure Representation. We process classes one at a time, in topological order. For each latent class, we (1) generate class-wide parameters from their corresponding priors, and (2) generate an infinite weighted collection of objects of class . An object of class is an assignment of each attribute to a value and of each reference slot to an object of class . An infinite collection of latent objects is generated via a Pitman-Yor Process Teh2011:
The Pitman-Yor Process is a discrete random measure that generalizes the Dirichlet Process. It can be understood as first sampling an infinite vector of probabilities
from a two-parameter GEM distribution, then setting , where each of the infinitely many objects is distributed according to . This itself is a distribution over objects, which first samples reference slots and then attributes (see Figure 1).To generate the observed dataset, we sample from its prior distribution, then, for , generate the observed entry: .
Chinese Restaurant Process Representation. We can also describe a finitely representable Chinese Restaurant version of this process. Consider a collection of restaurants, one for each class , where each table serves a dish representing an object of class . Upon entering a restaurant, customers either sit at an existing table or start a new one, as in the usual generalized CRP construction. But these restaurants require that to start a new table, customers must first send friends to other restaurants (one to the target of each reference slot). Once they are seated at these parent restaurants, they phone the original customer to help decide what to order, i.e., how to sample the attributes of the new table’s object, informed by their dishes (the objects of class ). The process starts with customers at the observation class Obs’s restaurant, who sit at separate tables.

3 Inference
Generic sequential Monte Carlo and particle Gibbs algorithms are supported by several probabilistic programming languages ge2018turing; Mansinghka2014; Murray2015; VanDeMeent2015; wood2014new. PClean’s inference algorithm (Algorithm 1) is based on sequential Monte Carlo initialization and particle Gibbs rejuvenation, but differs from generic PPL implementations in two ways. First, PClean uses per-object PGibbs updates that exploit the exchangeability of the domain-general model Lloyd2014. These updates allow for joint sampling of all latent variables associated with a single object. In contrast, generic PPL implementations operate over the complete state space. For our model, this would entail repeated costly iterations, each maintaining multiple copies of the complete latent state, rather than fast sweeps limited to a single latent object111The Venture inference engine can perform PGibbs over subproblems defined by single objects, but Venture’s general-purpose implementation is not fast enough for large-scale problems.. Second, PClean uses data-driven proposals that lead to accurate results much faster than proposing from the prior (Figure 3). These proposals are informed by dataset-specific “inference hints” embedded in PClean programs, enabling end-users to concisely customize PClean’s algorithm and empirically optimize performance, without deriving custom proposals themselves.

Notation. The algorithm operates over a finite representation of the model’s state space based on the Chinese Restaurant Process analogue described in Section 2.2. We maintain for each particle an initially empty collection of currently instantiated objects of all classes. We write to denote the operation of removing an object from : all reference slots pointing to are filled with the placeholder value , and if was the only object that referred to some other object , then this subroutine is invoked recursively to remove as well.
Data-driven proposals. During SMC and particle Gibbs sweeps (Algorithm 1), object attributes and reference slots are proposed using a data-driven proposal . is a distribution over a new object of class , and a (potentially empty) set of new objects that may be created as targets for ’s reference slots. It is informed by observed attributes , as well as by , a collection of existing objects of all classes, with some objects having placeholder reference slots targeting the object to be proposed by . uses subproblem decomposition Cusumano-Towner2019; Mansinghka2014; Mansinghka2018 based on user inference hints (subproblem begin…end in Figure 1), which partition ’s attributes and reference slots into an ordered set of subproblems . We write for the attributes introduced in subproblem , and for the reference slots . The proposal factors as a sequence of sub-proposals, one for each subproblem.
Each subproposal targets the local conditional distribution associated with the subproblem’s attributes and reference slots, , which is defined by weighting a subproblem prior by relevant likelihood terms. In the prior, the reference slots for
follow the Pitman-Yor prior; new objects generated during their generation, if any, constitute the set-valued random variable
. Attributes for are distributed according towhere is evaluated using any reference slots and attribute values already fixed as part of . The local conditional is then obtained by combining this subproblem prior distribution over reference slots and attributes with likelihood terms arising from direct observations and from objects in with placeholder reference slots (which are intended to reference ). In particular, we consider likelihoods that are possible to evaluate given only the partial object within the domain of this subproblem. Writing to indicate that is evaluable in subproblem , we have that for each tuple in , the subproblem conditional is weighted by the likelihood term , where the placeholder is evaluated using and , the available portion of the partial object .
Ideally, would propose values according to this local subproblem posterior, but this requires marginalization and will be intractably slow if distributions have large domains. For such distributions, users may provide as inference hints smaller sets of preferred values on which they believe posterior mass will concentrate. These preferred values do not require careful tuning; in this paper, our experiments use the default choice of all values empirically observed in the dataset. We define the surrogate distributions to equal when and to place mass on a special other token. By replacing all samples of latent attribute variables in with draws from instead of from , we obtain a surrogate subproblem posterior that is tractable, since all variables have lower-dimensional discrete domains. (PClean also supports continuous variables, which are pruned from these subproblems and proposed from their prior distributions, but for ease of presentation we describe the discrete case here.) is then sampled in two steps: first a sample is drawn from the tractable surrogate subproblem posterior . Then, if any other tokens were sampled, samples proposals from the corresponding attribute priors .
Time complexity of inference. In the worst case, a fully connected subproblem with discrete variables of domain size can take time to solve, though in practice, conditional independence enables much more efficient variable elimination routines. When a reference slot is involved, the complexity will depend on , the number of objects in the current state that might be the target of the reference slot. Letting , the overall complexity of the algorithm becomes where we treat domain sizes
of discrete random variables as folded into the constant. Note that
is upper-bounded by , but will often grow much more slowly than . PClean hashes all objects into “canopies” mccallum2000efficient by their directly observed attributes, to access the set of possible (i.e., non-zero likelihood) target objects of each reference slot in time.4 Results
We empirically evaluate PClean’s accuracy, customizability, and scalability, versus state-of-the-art baselines. All experiments were performed on one laptop running macOS Catalina with a 2.6 GHz core i7 processor and 32 GB of memory, with PClean experiments using two-particle data-driven sequential Monte Carlo with particle Gibbs sweeps.
4.1 Quantitative measurements of accuracy and runtime
Tasks. We first evaluate PClean’s accuracy and performance on three tasks with known ground-truth: the Flights and Hospital tasks, which are standard benchmarks from the data cleaning literature Rekatsinas2017, and a Rent task that we developed. Hospital contains artificial typos in 5% of cells. Flights lists flight departure and arrival times from often conflicting websites. We use the version from Mahdavi2019. Rents consists of 50,000 apartment listings created using county-level HUD and census statistics USCensusBureau2019
, with misspellings of counties, missing states and number of bedrooms, and incorrect units on rent prices. Additional details (links to data, PClean source code, and all hyperparameters used) can be found in the Supplementary Materials.
Task | Metric | PClean | HoloClean22footnotemark: 2 | HoloClean Rekatsinas2017 | NADEEF Dallachiesat2013 | PClean Ablation |
---|---|---|---|---|---|---|
Flights | Prec | 0.91 | 0.79 | 0.39 | 0.76 | 0.41 |
Rec | 0.89 | 0.55 | 0.45 | 0.03 | 0.36 | |
F1 | 0.90 | 0.64 | 0.41 | 0.07 | 0.38 | |
Time | 13.5s | 45.4s | 32.6s | 9.1s | 9.2s | |
Hosp. | Prec | 1.0 | 0.95 | 1.0 | 0.99 | 0.92 |
Rec | 0.83 | 0.85 | 0.71 | 0.73 | 0.17 | |
F1 | 0.91 | 0.90 | 0.83 | 0.84 | 0.29 | |
Time | 39.9s | 1m 10s | 1m 32s | 27.6s | 59.8s | |
Rent | Prec | 0.68 | 0.83 | 0.83 | 0 | 0.48 |
Rec | 0.69 | 0.34 | 0.34 | 0 | 0.44 | |
F1 | 0.69 | 0.48 | 0.48 | N/A | 0.46 | |
Time | 9m 28s | 20m 16s | 13m 43s | 6m 10s |
Unpublished version of HoloClean, on the dev branch of https://github.com/HoloClean/holoclean
|
|
|
|
|
||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
F1 | 0.56 | 0.60 | 0.69 | 0.90 | ||||||||||
Lines of Code | 16 | 16 | 17 | 18 |
Baselines. We compare against three baselines. HoloClean is a data cleaning system based on probabilistic machine learning Rekatsinas2017, which compiles custom integrity constraints into a factor graph with learned weights. We use both the latest (unpublished) version from the author’s Github and the version published in Rekatsinas2017. NADEEF is a data cleaning system that leverages user-specified cleaning rules via a variable-weighted MAX-SAT solver based algorithm Dallachiesat2013. PClean Ablation is an ablation of PClean that removes the relational structure and only allows the single class Obs.
PClean models. We briefly describe the key features of the models we use for each dataset; see Supplementary Materials for the probabilistic programs. For Hospital, latent classes are City, Place, Hospital, Metric, Condition, and HospitalType, and each field is modeled as a potentially misspelled version of a latent value. For Flights, latent classes are Flight and TrackingWebsite. Each site has a learned reliability property, and is presumed to be more reliable if it is the site of the airline running the flight. For Rents, the latent class is County, whose attributes are a name, a state, and an average rent for each apartment type (studio, 1BR, etc.).
Results. Table 1 records recall (, precision (, and () for each experiment. PClean achieves the highest score on all tasks.
4.2 Improving results by incorporating additional domain knowledge
Table 2 shows that small source code additions to incorporate additional domain knowledge can improve cleaning quality. See Supplementary Materials for full PClean source code for each example.
4.3 Scaling to large, real-world data

Figure 4 shows results from applying PClean to Medicare’s 2.2-million-row Physician Compare database Medicare. This dataset contains many missing values and systematic errors.
Results. PClean took 7h36m, changing 8,245 values and imputing 1,535,415 missing cells. In a random subsample of 100 imputed cells, we found 90% agreed with manually obtained ground truth values. We also manually checked PClean’s changes, finding that 7,954 were correct, for a precision of 96.5%. Of these, some were correct normalization (e.g. choosing a single spelling for cities whose names could be spelled in multiple ways). We cannot quantify recall without ground truth, but to calibrate overall result quality, we ran NADEEF and HoloClean on this data. NADEEF changed only 88 cells, and HoloClean domain initialization alone takes 28 hours.
Figure 4 illustrates PClean’s behavior on four rows from the dataset (showing 8/38 columns). Consider the misspelling Abington, MD, which appears in 152 entries. The correct spelling Abingdon, MD occurs in only 42. However, PClean recognizes Abington, MD as an error because all 152 instances share a single address, and errors are modeled as happening systematically at the address level. Now consider PClean’s correct inference that K. Ryan’s degree is DO. PClean leverages the fact that her school PCOM awards more DOs than MDs, even though more Family Medicine doctors are MDs than DOs. The parameters that enable this reasoning are learned from the dirty dataset itself.
5 Discussion
Limitations and future work. PClean has limitations relative to probabilistic relational models Friedman1999: (1) PClean handles only acyclic class dependency graphs; (2) one object’s attributes can depend on another’s, but not on aggregate properties of sets of related objects; and (3) PClean models cannot encode priors over reference slots that depend on attributes, e.g. cinemas selecting movies to show based on genre (though such dependencies can arise in the posterior). Hierarchical Pitman-Yor processes and more sophisticated probabilistic programming architectures could potentially be used to lift these restrictions, without sacrificing scalability. PClean could also be extended in other ways, e.g. to leverage uncertainty quantification for interactive cleaning Kobren2019, or to synthesize error models and functional dependencies automatically Guo2019; Heidari2019.
Contributions
This paper has shown that it is possible to clean dirty, denormalized data from diverse domains, via domain-specific probabilistic programs. The modeling assumptions and inference hints in these programs require under 50 lines of code, and yield state-of-the-art accuracy compared to strong baselines using weighted logic and machine learning. This language also scales to datasets with millions of rows. The power of domain-specific probabilistic programming languages, that integrate sophisticated styles of modeling and inference developed over years of research into simple languages, has been demonstrated in other challenging domains such as 3D computer vision
Kulkarni2015; Mansinghka2015. PClean offers analogous benefits for Bayesian data cleaning. We hope that PClean is useful to practitioners, helping non-experts clean data that would otherwise be unusable for analysis, and that the success of this approach encourages AI and probabilistic programming researchers to develop domain-specific probabilistic programming languages for other important problem domains.6 Broader Impact
The widespread need for clean data. Many organizations struggle with dirty data — including large corporations, schools, non-profits, and governmental agencies Rahm2000
. In surveys, data analysts report that cleaning dirty data takes up more than sixty percent of their time, and the economic cost of unclean data has been estimated at trillions of dollars
Press2016; Redman2016. PClean represents a step toward a reliable, broadly accessible data-cleaning approach that is simultaneously customizable and robust enough to be applicable to a broad range of problems in multiple domains.Public interest uses for data cleaning. Data cleaning is a central bottleneck in data journalism, social science, and policy evaluation halevy2012data
. We intend to assist people seeking to apply PClean to these kinds of public interest use cases, especially where the data comes from manual entry or scraping from volunteers. We plan to open-source the software as a Julia package, eliminating financial obstacles to PClean’s use.
Harmful uses for data cleaning. PClean could make it easier for companies, governments, and political organizations to link personal information from multiple sources Sweeney2001. Research on de-anonymization has shown repeatedly that vulnerable populations are easier to de-anonymize Horawalavithana2019. Better data cleaning technology, applied by oppressive "surveillance states" and/or for-profit companies whose business models depend on surveillance of consumers, could make these vulnerable people even easier to identify and exploit or persecute.
Data biases. PClean’s inference algorithm may infer parameters or impute missing values according to biases encoded in the source data. PClean could thus produce imputations that reflect systemic discrimination. One potentially mitigating factor is that unlike many machine learning approaches to data cleaning, PClean makes explicit generative assumptions. It is thus conceptually straightforward to prevent PClean from incorporating known biases, e.g. by requiring columns containing information about race, gender, or sexual orientation to be modeled independently of outcomes.
Acknowledgments
The authors are grateful to Zia Abedjan, Divya Gopinath, Marco Cusumano-Towner, Raul Castro Fernandez, Cameron Freer, Christina Ji, Tim Kraska, Feras Saad, Michael Stonebraker, and Josh Tenenbaum for useful conversations and feedback. This work is supported in part by a Facebook Probability and Programming Award, the National Science Foundation, and a philanthropic gift from the Aphorism Foundation.
References
Appendix A PClean model programs
In this section, we give the PClean source code for all models used in experiments. Code for PClean’s implementation itself, as well as for running these PClean models, is also included in the accompanying .zip file.
a.1 Hospital
The Hospital dataset is modeled with six latent classes. Typos are modeled as independently introduced for each cell of the dataset. Some fields are modeled as draws from broad priors over strings, whereas others are modeled as categorical draws whose domain is the set of unique observed values in the relevant column (some of which are in fact typos).
Inference hints are used to focus proposals for string_prior choices on the set of strings that have actually been observed in a given column, and also to set a custom subproblem decomposition for the Obs class (all other classes use the default decomposition).
a.2 Flights
We first introduce the model for Flights that we used to obtain the results presented in Section 4.1. We then show variants of the model with less domain knowledge incorporated, which are shown to perform worse.
The primary model is shown below. In the parameter declaration for error_probs, we use the syntax error_probs[_] beta(10, 50) to introduce a collection of parameters; the declared variable becomes a dictionary, and each time it is used with a new index, a new parameter is instantiated. We use this to learn a different error_prob parameter for each tracking website. We could alternatively declare error_prob as an attribute of the TrackingWebsite class. However, PClean’s inference engine uses smarter proposals for declared parameters (taking advantage of conjugacy relationships), so for our experiments, we use the parameter declaration instead. We hope to extend automatic conjugacy detection to all attributes, not just parameters, in the near future.
As in Hospital, we use observed_values to provide inference hints to the broad time_prior; this expresses a belief that the true timestamp for a certain field is likely one of the timestamps that has actually been observed, in the dirty dataset, with the given flight ID.
Alternative models. We now briefly list the alternative models (Models 1-3) from Section 3.2. The unmodeled keyword is used to introduce a value that is guaranteed to be observed, but which is not modeled as a random variable.
a.3 Rents
The program we use for Rents is shown below. We model the fact that the rent may be in grand instead of dollars, as well as that the county name may contain typos. We introduce an artificial field, block, consisting of the first and last letters of the observed (possibly erroneous) County field, and use it to inform an inference hint: we hint that posterior mass for a county’s name concentrates on those strings observed somewhere in the dataset that share a first and last letter in common with the observed county name for this row. Without this approximation, inference is much slower (but potentially more accurate).
a.4 Physicians
The model for Physicians is given below. Many columns are not modeled. Similar to Rents, we use a parameter in the Physician class for degree_probs, although it might seem more natural to use an attribute of the School class; the resulting model is the same, but using parameter allows PClean to exploit conjugacy.
Appendix B Inference hints
PClean supports three types of inference hint: subproblem declarations, preferred values arguments, and guaranteed observation statements. In this section, we describe each, and give an empirical demonstration of how the results in Section 3 depend on them (Table 3).
b.1 Subproblem declarations
Subproblem declarations allow users to explicitly control which attributes and reference slots are included within each subproblem (see Section 3). Larger subproblems lead to more expensive subproblem proposals , but can lead to more accurate results. Users declare a subproblem by wrapping adjacent attribute and reference statements into a subproblem begin … end block.
To see the value of subproblem declarations, consider the Hospital program above. If we remove the subproblem begin … end inference hint around the second subproblem in that model, then each line is treated as its own subproblem. We ran cleaning using this modified model (but kept other settings equal) and obtained an score of only 0.18 (recall = 0.71, precision = 0.11).
b.2 Preferred values arguments
Preferred values arguments are optional arguments to distributions like string_prior that have infinite or very large support. The model itself does not change as a result of providing a preferred values argument. However, the proposal is adapted to reflect that posterior mass is expected to concentrate on the user-provided list of preferred values. In the models for this paper, we often specified preferred values equal to all observed values in a particular column, or values observed to co-occur with another value in a separate column. For example, we model the name of a city in the Hospital dataset as a string generated from a broad prior, but indicate that we expect posterior mass to concentrate on the set of strings that have actually been observed as city names in the dataset.
To illustrate the value of preferred values arguments, we perform two targeted experiments. First, for the Flights dataset, we consider removing the preferred values argument from the time_prior call for the scheduled departure time of a flight. This yields a diminished overall of 0.62 (recall = 0.70, precision = 0.55), due to mostly incorrect inferences about the scheduled departure time field. (Runtime is also slightly longer, because of the increased number of sampling operations from the time_prior proposal.) Second, we consider the effect of a preferred values argument that is very broad: we replace the more targeted preferred values argument for scheduled departure time with a list of all timestamps appearing anywhere in the Flights data (around 800 options). Inference quality is unaffected ( = 0.90, precision = 0.91, recall = 0.89), but running time is 7x longer: completing five iterations of particle Gibbs requires 95 seconds, instead of 13.
Preferred values arguments are a simple way to make inference in large discrete domains (e.g., strings) tractable. Researchers have also explored more sophisticated techniques for inference with strings [Winn2017, Yangel]. It may be possible to incorporate such strategies into a future version of PClean for more complex string-based inference, alleviating the need for preferred values arguments in some cases.
b.3 Guaranteed observation statements
Guaranteed observations are declared using the guaranteed keyword, which tells PClean that the value of a particular variable in a PClean model is guaranteed to be observed. This allows PClean to index object collections by these observed variable values, enabling fast lookups of all existing latent objects that are consistent with a particular observation of the variable. In the Flights dataset, removing the guaranteed statement from the flight_id attribute yields no change in inference results, but a modest 31% increase in running time (up to 17s from 13s). On a larger dataset, this runtime difference would be more pronounced.
Dataset + Inference Hints | F1 | Recall | Precision | Runtime |
---|---|---|---|---|
Hospital (with both subproblem blocks) | 0.91 | 0.83 | 1.0 | 39.9s |
Hospital (with only first subproblem block) | 0.18 | 0.71 | 0.11 | 37.8s |
Flights (with preferred values, guaranteed flight ID) | 0.90 | 0.89 | 0.91 | 13.5s |
Flights (no preferred values for departure time) | 0.62 | 0.70 | 0.55 | 17.6s |
Flights (overly broad preferred values for departure time) | 0.90 | 0.89 | 0.91 | 95.4s |
Flights (no guaranteed flight ID) | 0.90 | 0.89 | 0.91 | 17.2s |
Appendix C Synthetic Rent Dataset
Here we detail the generative process for the synthetic apartment rent dataset used to empirically evaluate PClean. The dataset was assembled using the County Population Totals table from the U.S. census and the 50th Percentile Rent Estimate table from the U.S. Department of Housing and Urban Development [USCensusBureau2019].
The clean dataset consists of 50,000 rows that were each generated in the following manner:
-
The county-state combination is chosen proportionally to its population in the United State
-
The size of the apartment is chosen uniformly from studio, 1 bedroom, 2 bedroom, 3 bedroom, 4 bedroom)
-
The rent is chosen according to a normal distribution in which the mean is the median rent for an apartment of the chosen size in the chosen country and the standard deviation is chosen to be 10% of the mean
The dataset was then dirtied in the following ways:
-
10% of state names are deleted (many counties exist across multiple states, e.g. 30 states have a Washington County).
-
Approximately 1-2% of county names are misspelled
-
10% of apartment sizes are deleted
-
1% of apartment prices are listed in the incorrect units (thousands of dollars, instead of dollars)