1 Introduction
Is it possible to make statistical inference broadly accessible to nonstatisticians without sacrificing mathematical rigor or inference quality? This paper describes BayesDB, a system that enables users to query the probable implications of their data as directly as SQL databases enable them to query the data itself. By combining ordinary SQL with three new primitives — SIMULATE, INFER, and ESTIMATE — users of BayesDB can detect predictive relationships between variables, retrieve statistically similar data items, identify anomalous data points and variables, infer missing values, and synthesize hypothetical subpopulations. The default modeling assumptions that BayesDB makes are suitable for a broad class of problems (Mansinghka et al., 2015; Wasserman, 2011), but statisticians can customize these assumptions when necessary. BayesDB also enables domain experts that lack statistical expertise to perform qualitative model checking (Gelman et al., 1995) and encode simple forms of qualitative prior knowledge.
BayesDB consists of four components, integrated into a single probabilistic programming system:

The Bayesian Query Language (BQL), an SQLlike query language for Bayesian data analysis. BQL programs can solve a broad class of data analysis problems using statistically rigorous formulations of cleaning, exploration, confirmatory analysis, and predictive modeling. BQL defines these primitive operations for these workflows in terms of Bayesian model averaging over results from an implicit set of multivariate probabilistic models.

A mathematical interface that enables a broad class of multivariate probabilistic models, called generative population models, to be used to implement BQL. According to this interface, a data generating process defined over a fixed set of variables is represented by (i) an infinite array of random realizations of the process, including any observed data, and (ii) algorithms for simulating from arbitrary conditional distributions and calculating arbitrary conditional densities. This interface permits many statistical operations to be implemented once, independent of the specific models that will be used to apply these operations in the context of a particular data table.

The BayesDB Metamodeling Language (MML), a minimal probabilistic programming language. MML includes constructs that enable statisticians to integrate custom statistical models — including arbitrary algorithmic models contained in external software — with the output of a broad class of Bayesian model building techniques. MML also includes constructs for specifying qualitative dependence and independence constraints.

A hierarchical, semiparametric Bayesian “metamodel” that automatically builds ensembles of generalized mixture models from database tables. These ensembles serve as baseline data generators that BQL can use for data cleaning, initial exploration, and other routine applications.
This design insulates end users from most statistical considerations. Queries are posed in a qualitative probabilistic programming language for Bayesian data analysis that hides the details of probabilistic modeling and inference. Baseline models can be built automatically and customized by statisticians when necessary. All models can be critically assessed and qualitatively validated via predictive checks that compare synthetic rows (generated via BQL’s SIMULATE operation) with rows from the original data. Instead of hypothesis testing, dependencies between variables are obtained via Bayesian model selection.
BayesDB is “Bayesian” in two ways:

In BQL, the objects of inference are rows, and the underlying probability model forms a “prior” probability distribution on the fields of these rows. This is then constrained by rowspecific observations to create a posterior distribution of field values. Without this prior, it would be impossible to simulate rows or infer missing values from partial observations.

In MML, the default metamodel is Bayesian in that it assigns a prior probability to a very broad class of probabilistic models and narrows down on probable models via Bayesian inference. This prior is unusual in that it encodes a state of ignorance rather than a strong inductive constraint. MML also provides instructions for augmenting this prior to incorporate qualitative and quantitative domain knowledge.
In practice, it is useful to use BQL for Bayesian queries against models built using nonBayesian or only partially Bayesian techniques. For example, MML supports composing the default metamodel with modeling techniques specified in external code that need not be Bayesian. However, the default is to be Bayesian for both model building and query interpretation, as this ensures the broadest applicability of the results.
This paper focuses on the technical details of BQL, the data generator interface, the metamodel, and the MML. It also illustrates the capabilities of BayesDB using three applications: cleaning and exploring a public database of Earth satellites, discovering relationships in measurements of macroeconomic development of countries, and analyzing salary survey data. Empirical results are based on a prototype implementation that embeds BQL into sqlite3, a lightweight, opensource, inmemory database.
1.1 A conceptual illustration
This section illustrates data analysis using the MML and BQL on a synthetic example based on analysis of electronic health records. SQL databases make it easy to load data from disk and run queries that filter and retrieve the contents. The first step in using BayesDB is to load data that describes a statistical (sub)population into a table, with one row per member of the population, and one column per variable:
CREATE POPULATION patients WITH DATA FROM patients.csv;
SELECT age, has_heart_disease FROM patients WHERE age > 30 LIMIT 3;
age has_heart_disease 66 ??? 44 yes 31 ???
Once data has been loaded, a population schema needs to be specified. This schema specifies the statistical characteristics of each example. For example, whether it is categorical or numerical, and if it is categorical, how many outcomes are there and how is each outcome represented. After an initial schema has been specified — using a mix of automatic inference and manual specification — the schema can be customized using instructions in the Metamodeling Language (MML).
GUESS POPULATION SCHEMA FOR patients;
ALTER POPULATION SCHEMA FOR patients
SET DATATYPE FOR num_hosp_visits TO COUNT;
CREATE DEFAULT METAMODEL FOR patients;
ALTER METAMODEL FOR patients ENSURE will_readmit DEPENDENT ON dialysis;
ALTER METAMODEL FOR patients
MODEL infarction GIVEN gender, age, weight, height, cholesterol, bp USING CUSTOM MODEL FROM infarction_regression.py;
One distinctive feature of MML is that it includes instructions for qualitative probabilistic programming. These instructions control the behavior of the automatic modeling machinery in the MML runtime. In this example, these constraints include the assertion of a dependence between the presence of a chronic kidney condition and future hospital readmissions. They also include the specification of a custom statistical model for the infarction variable, illustrating one way that discriminative and nonprobabilistic approaches to inference can be integrated into BayesDB.
The next step is to use the MML to build an ensemble of generalpurpose models for the data, subject to the specified constraints:
INITIALIZE 100 MODELS FOR patients;
ANALYZE patients FOR 3 HOURS CHECKPOINT EVERY 10 MINUTES;
Each of these 100 models is a generative population model
(GPM) that represents the joint distribution on all possible measurements of an infinite population with the given population schema. These models are initially drawn accordingt to a broad prior probability distribution over a large hypothesis space of possible GPMs. Until the observed data has been analyzed, BQL will thus report broad uncertainty for all its query responses.
Once the models are sufficiently adapted to the data, it is possible to query its probable implications. The following query quantifies over columns, rather than rows, and retrieves the probability of a marginal dependence between three (arbitrary) variables and height:
ESTIMATE COLUMN NAME, PROBABILITY OF DEPENDENCE WITH height
FROM COLUMNS OF patients LIMIT 3;
column name p( dep. with height ) height 1.0 infarction 0.08 gender 0.99
Point predictions can be accessed by using the INFER instruction, a natural generalization of SELECT from SQL:
INFER age, has_heart_disease FROM patients
WHERE age > 30 WITH CONFIDENCE 0.8 LIMIT 3;
age has_heart_disease 66 yes 44 yes 31 ???
In this example, only one of the missing values could be inferred with the specified confidence level. The probabilistic semantics of CONFIDENCE will be discussed later in this paper.
BQL also makes it straightforward to generate synthetic subpopulations subject to a broad class of constraints:
SIMULATE height, weight, blood_pressure FROM patients 3 TIMES
GIVEN gender = male AND age < 10
height weight blood_pressure 46 80 110 38 60 80 39 119 120
The SIMULATE
operator gives BQL users access to samples from the posterior predictive distribution induced by the implicit underlying set of models. This is directly useful for predictive modeling and also decisiontheoretic choice implemented using Monte Carlo estimation of expected utility
(Russell and Norvig, 2003). It also enables predictive checking: samples from SIMULATE can be compared to the results returned by SELECT. Finally, domain experts can use SIMULATE to scrutinize the implications of the underlying model ensemble, both quantitatively and qualitatively.2 Example Analyses
This section describes three applications of the current BayesDB prototype:

Exploring and cleaning a public database of Earth satellites.

Assessing the evidence for dependencies between indicators of global poverty

Analyzing data from a salary survey.
BQL and MML constructs are introduced via realworld uses; a discussion of their formal interpretation is provided in later sections.
2.1 Exploring and cleaning a public database of Earth satellites
The Union of Concerned Scientists maintains a database of 1000 Earth satellites. For the majority of satellites, it includes kinematic, material, electrical, political, functional, and economic characteristics, such as dry mass, launch date, orbit type, country of operator, and purpose. Here we show a sequence of interactions with a snapshot of this database using the bayeslite implementation of BayesDB.
2.1.1 Inspecting the data.
We start by loading the data and looking at a sample. This process uses a combination of ordinary SQL and convenience functions built into bayeslite. The first step is to create a population from the raw data:
CREATE POPULATION satellites FROM ucsdatabase.csv
One natural query is to find the International Space Station, a wellknown satellite:
SELECT * FROM satellites WHERE Name LIKE ’International Space Station%’
Variable  Value 

Name  International Space Station (ISS …) 
Country_of_Operator  Multinational 
Operator_Owner  NASA/Multinational 
Users  Government 
Purpose  Scientific Research 
Class_of_Orbit  LEO 
Type_of_Orbit  Intermediate 
Perigee_km  401 
Apogee_km  422 
Eccentricity  0.00155 
Period_minutes  92.8 
Launch_Mass_kg  NaN 
Dry_Mass_kg  NaN 
Power_watts  NaN 
Date_of_Launch  36119 
Anticipated_Lifetime  30 
Contractor  Boeing Satellite Systems (prime)/Multinational 
Country_of_Contractor  Multinational 
Launch_Site  Baikonur Cosmodrome 
Launch_Vehicle  Proton 
Source_Used_for_Orbital_Data  www.satellitedebris.net 12/12 
longitude_radians_of_geo  NaN 
Inclination_radians  0.9005899 
This result row illustrates typical characteristics of realworld databases such as heterogeneous data types and missing values.
2.1.2 Building baseline models.
Before exploring the implications of the data, it is necessary to obtain a collection of probabilistic models. The next two MML instructions produce a collection of 16 models, using roughly 4 minutes of analysis total.
INITIALIZE 16 MODELS FOR satellites;
ANALYZE satellites FOR 4 MINUTES WAIT;
Each of the 16 models is a separate GPM produced by an independent Markov chain for approximate posterior sampling in the default semiparametric factorial mixture metamodel described earlier. This number of models and amount of computation is typical for the exploratory analyses done with our prototype implementation; this is sufficient for roughly 100 full sweeps of all latent variables.
2.1.3 Answering hypotheticals.
The satellites database should in principle inform the answers to a broad class of hypothetical or “what if?” questions. For example, consider the following question:
Suppose you receive a report indicating the presence of a previously undetected satellite in geosynchronous orbit with a dry mass of 500 kilograms. What countries are most likely to have launched it, and what are its likely purposes?
Answering this question requires knowledge of satellite engineering, orbital mechanics, and the geopolitics of the satellite industry. It is straightforward to answer this question using BQL. The key step is to generate a synthetic population of satellites that reflect the given constraints:
SIMULATE country_of_operator, purpose FROM satellites
GIVEN Class_of_orbit = GEO, Dry_mass_kg = 500 LIMIT 1000;
Figure 1 shows the results of a simple aggregation of these results, counting the marginal frequencies of various countries and purposes and sorting accordingly. The most probable explanation, carrying roughly 25% of the probability mass, is that it is a communications satellite launched by the USA. It is also plausible that it might have been launched other major space powers such as Russia or China, and that it might have a military purpose.
The satellites data are too sparse and ambiguous for frequency counting to be a viable alternative. Consider an approach based on finding satellites that match the discrete GEO constraint and are within some adhoc tolerance around the observed dry mass:
SELECT country_of_operator, purpose, Class_of_orbit, Dry_mass_kg
FROM satellites
WHERE Class_of_orbit = "GEO" AND Dry_Mass_kg BETWEEN 400 AND 600;
This SQL query returns just 2 satellites, both Indian:
Country_of_Operator  Purpose  Class_of_Orbit  Dry_Mass_kg  

0  India  Communications  GEO  559 
1  India  Meteorology  GEO  500 
Presuming our intuition about satellite mass is flawed, we might issue another query to look at a broader range of satellites:
SELECT country_of_operator, purpose, Class_of_orbit, Dry_mass_kg
FROM satellites
WHERE Class_of_orbit = ’GEO’ AND Dry_Mass_kg BETWEEN 300 AND 700
The results still do not give any real insight into the likely purpose of this satellite:
Country_of_Operator  Purpose  Class_of_Orbit  Dry_Mass_kg  

0  Malaysia  Communications  GEO  650 
1  Israel  Communications  GEO  646 
2  Luxembourg  Communications  GEO  700 
3  Russia  Communications  GEO  620 
4  China (PR)  Earth Science  GEO  620 
5  China (PR)  Earth Science  GEO  620 
6  China (PR)  Earth Science  GEO  620 
7  India  Communications  GEO  559 
8  India  Navigation  GEO  614 
9  India  Meteorology  GEO  500 
10  Malaysia  Communications  GEO  650 
11  Multinational  Earth Science/Meteorology  GEO  320 
12  United Kingdom  Communications  GEO  660 
13  Norway  Communications  GEO  646 
Without deep expertise in satellites, and significant expertise in statistics, it is difficult to know whether or not these results can be trusted. How does the set of satellites vary as the thresholds on Dry_Mass_kg are adjusted? How locally representative and comprehensive is the coverage afforded by the data? Are there indirect, multivariate dependencies that ought to be taken into account, to determine which satellites are most similar? How should existing satellites be weighted to make an appropriate weighted sample against which to calculate frequencies? In fact, small modifications to the tolerance on Dry_Mass_kg yield large changes in the result set.
2.1.4 Identifying predictive relationships between variables.
A key exploratory task is to identify those variables in the database that appear to predict one another. This is closely related to the key confirmatory analysis question of assessing the evidence for a predictive relationship between any two particular variables.
To quantify the evidence for (or against) a predictive relationship between two pairs of variables, BQL relies on information theory. The notion of dependence between two variables A and B is taken to be mutual information; the amount of evidence for dependence is then the probability that the mutual information between A and B is nonzero. If the population models are obtained by posterior inference in a metamodel — as is the case with MML — then this probability approximates the posterior probability (or strength of evidence) that the mutual information is nonzero.
ESTIMATE DEPENDENCE PROBABILITY FROM PAIRWISE COLUMNS OF satellites;
Figure 2 shows the results from this query. There are several groups of variables with high probability of mutual interdependence. For example, we see a definite block of geopolitically related variables, such as the country of contractor & operator, the contractor’s identity, and the location of the satellite (if it is in geosynchronous orbit). The kinematic variables describing orbits, such as perigee, apogee, period, and orbit class, are also shown as strongly interdependent. A domain expert with sufficiently confident domain knowledge can use this overview of the predictive relationships to assess the value of the data and the validity of the baseline models.
It is also instructive to compare the heatmap of pairwise dependence probabilities with alternatives from statistics. Figure 2 also shows heatmap that results from datatypeappropriate measures of correlation. The results from correlation are sufficiently noisy that it would be difficult to trust inferences from techniques that use correlation to select variables. Furthermore, the most causally unambiguous relationships, such as the kinematic constraints relating perigee, apogee, and orbital period, not detected by correlation.
2.1.5 Detecting multivariate anomalies.
Another key aspect of exploratory analysis is identifying anomalous values, including both (univariate) outliers and multivariate anomalies. Anomalies can arise due to errors in data acquisition, bugs in upstream preprocessing software (including binning of continuous variables or translating between different discrete outcomes), and runtime failures. Anomalies can also arise due to genuine surprises or changes in the external environment.
Using BQL, multivariate anomalies can be detected by assessing the predictive probability density of each measurement, and ordering from least to most probable. Here we illustrate this using a simple example: ordering the geosynchronous satellites according to the probability of their recorded orbital period:
ESTIMATE name, class_of_orbit, period_minutes AS TAU,
PREDICTIVE PROBABILITY OF period_minutes AS "Pr[TAU]"
FROM satellites
ORDER BY ‘‘Pr[TAU]’’ ASCENDING LIMIT 10
This BQL query produces the following table of results:
Name  Class_of_Orbit  TAU  Pr[TAU]  

0  AEHF3 (Advanced Extremely High Frequency sate…  GEO  1306.29  0.001279 
1  AEHF2 (Advanced Extremely High Frequency sate…  GEO  1306.29  0.001292 
2  DSP 20 (USA 149) (Defense Support Program)  GEO  142.08  0.002656 
3  Intelsat 903  GEO  1436.16  0.003239 
4  BSAT3B  GEO  1365.61  0.003440 
5  Intelsat 902  GEO  1436.10  0.003492 
6  SDS III6 (Satellite Data System) NRO L27, Gr…  GEO  14.36  0.003811 
7  Advanced Orion 6 (NRO L15, USA 237)  GEO  23.94  0.003938 
8  SDS III7 (Satellite Data System) NRO L38, Dr…  GEO  23.94  0.003938 
9  QZS1 (QuaziZenith Satellite System, Michibiki)  GEO  1436.00  0.004446 
Recall that a geosynchronous orbit should take 24 hours or 1440 minutes. Rows 7 and 8 appear to be unit conversion errors (hours rather than minutes). Rows 2 and 6 appear to be decimal placement errors. Note that row 2 is not an outlier: some satellites have an orbital period of roughly two hours. It is only anomalous in the context of the other variables that are probably predictive of orbital period, such as orbit class.
2.1.6 Inferring missing values.
A key application of predictive modeling is inferring point predictions for missing measurements. This can be necessary for cleaning data before downstream processing. It can also be of intrinsic interest, e.g. in classification problems. The satellites database has many missing values. Here we show an INFER query that infers missing orbit types and returns both a point estimate and the confidence in that point estimate:
INFER EXPLICIT
anticipated_lifetime, perigee_km, period_minutes, class_of_orbit,
PREDICT type_of_orbit AS inferred_orbit_type CONFIDENCE inferred_orbit_type_conf
FROM satellites
WHERE type_of_orbit IS NULL;
This form of INFER uses the EXPLICIT modifier that exposes both predicted values and their associated confidence levels to be included in the output. Figure 3 shows a visualization of the results. The panel on the bottom left shows that the confidence depends on the orbit class and on the predicted value for the inferred orbit type. For example, there is typically moderate to high confidence for the orbit type of LEO satellites — and high confidence (but some variability in confidence) for those with SunSynchronous orbits. Satellites with Elliptical orbits may be assigned a SunSynchronous type with moderate confidence, but for other target labels confidence is generally lower. After examining the overall distribution on confidences, it can be natural to filter INFER
results based on a manually specified confidence threshold. Note that many standard techniques for imputation from statistics correspond to
INFER ... WITH CONFIDENCE 0.2.1.7 Integrating a kinematic model for elliptical orbits.
Can we improve over the baseline models by integrating causal knowledge about satellites? MML can be used to compose GPMs built by the default model builder with algorithmic and/or statistical models specified as external software. Here we integrate a simple model for elliptical orbits:
ALTER METAMODEL FOR satellites
MODEL perigee_km, apogee_km GIVEN period_minutes, eccentricity
USING CUSTOM MODEL FROM stochastic_kepler.py;
ANALYZE FOREIGN PREDICTORS FOR 1 MINUTE;
The underlying foreign predictor implements Kepler’s laws:
Here, is set via the eccentricity measurement for row , and is set via the period_minutes measurement. is a fixed constant inside the foreign predictor representing the radius of the Earth in kilometers. is a parameter that determines the noise and is set when the variable subset for the foreign predictor instantiation is ANALYZEd. Note that foreign predictors are essentially GPMs and accordingly must implement generic simulate(…) and logpdf(…) methods. For algorithmic forward models with numerical outputs, MML provides a default wrapper that uses importance sampling with resampling to approximately generate conditional samples and estimate marginal densities. This is how Kepler’s laws — in the form of a forward simulation — are turned into a generative model for kinematic variables that can be conditionally simulated in arbitrary directions.
Figure 4 and Figure 5 show the results. A detailed discussion of the relative merits of empirical versus analytical modeling is beyond the scope of this paper. However, it is clear that neither the empirical approach nor the analytical approach is universally dominant. The empirical approach is able to correctly locate the empirical probability mass — including multiple modes — but underfits. The orbital mechanics approach yields inferences that are typically in much tighter accord with the kinematic data. This is unsurprising: these are the patterns of covariation that led to the development of quantitative models and “hard” natural sciences. However, there are many satellites for which Kepler’s laws are not in accord with the data. This reflects many factors, including data quality errors as well as legitimate gaps between idealized mathematical laws and finegrained empirical records of realworld phenomena. For example, at the time Kepler’s laws were formulated, orbiting bodies lacked engines.
2.1.8 Combing random forests, causal models, and nonparametric Bayes.
Because MML supports model composition, it is straightforward to build hybrid models that integrate techniques from subfields of machine learning that might seem to be in conflict. Figure 6
shows the transcript of a complete MML session that builds such a hybrid model. Random forests are used to classify orbits into types; Kepler’s Laws are used to relate period, perigee, and apogee; and the default semiparametric Bayesian metamodel is used for all remaining variables (with two variables coming with overridden datatypes).
Later sections of this paper explain how these three modeling approaches are combined to answer individual BQL queries.
2.2 Assessing the evidence for dependencies between indicators of global poverty
In the early 21st century it is widely believed that resolving extreme poverty around the world will be accomplished by empowering individuals to resolve their poverty. Governments and NGOs encourage this process through a variety of interventions, many of them combining material assistance with policy changes. In principle, policies should be driven by quantitative datadriven understanding of international economic development. In practice, international economic data is sparse, unreliable, and highly aggregated. These data limitations create substantial obstacles to understanding the situational context of successful or unsuccessful interventions and policies.
The “Gapminder” data set, collected and curated by Hans Rosling at the Karolinska Institutet, is the most well known and extensive sets of longitudinal global developmental indicators. Representing over 500 indicators, 400 countries, and 500 years of data, it covers the colonial era, industrial revolution, sociopolitical upheavals around the world in the 20th century, and the first decade of the 21st. Containing over 2 million observations, the data has been used as the basis for a compelling set of data animations and the most widely viewed TED talk on statistics.
To date, analysis of this data has been minimal, as it requires intensive preprocessing and cleaning. Different analytical methdologies require different approaches to imputation and variable selection; as a result, results from different teams are difficult to compare. Here we show how to explore the data with BayesDB and assess the evidence for predictive relationships between different macroeconomic measures of development.
2.2.1 Exploring the Data with SQL
The raw form of the data is 500 Excel spreadsheets, each containing longitudinal data for 300 countries over 100 years. However, the dataset only contains 2 million observations, i.e. 97% of the data is missing. Figure 7
shows key indicators of the data around size, missing records, and the relationship between data availability and countries, records, and years. The primary data is mmodeled in SQL as a “fact” table structure. This relativelynormalized representation easily models the sparse matrix and allows us to use a combination of SQL and Python data science tools to craft our population structures.
The histograms in Figure 7 show the breadth and also the variability of the data. The histogram by year in Figure 6(a) shows that data is complete for only recent history, and in fact that some predicted data continues into the future, and that data for some indicators is only available every 10 years. The histogram by country in Figure 6(a) shows that data availability varies by country (the most described country is Sweden), that many countries have reasonably complete data, but that there is a long tail of countries with sparse data, including countries that no longer exist and with inconsistent or disputed naming. Figure 6(c)
shows that there is also a variance by indicators, because different measurements are collected by different agencies with different expectations and data policies.
The data has already been subject to extensive visualization and descriptive analytics by the Gapminder project. This paper focuses on the use of MML to model the data and BQL to query its probable implications.
2.2.2 Detecting Basic and Longitudinal Dependence
Our analysis focuses on the 53 variables with most complete data for the years 19992008. It is straightforward to create an ensemble of models for this subset:
GUESS POPULATION SCHEMA FOR dense_gapminder;
INITIALIZE 64 MODELS FOR dense_gapminder;
ANALYZE todo FOR 300 MINUTES WAIT;
The probability of dependence heatmap that results is shown in Figure 8. Indicators such as total population and urban percentage form blocks containing their values for all 10 years contained in the dataset. This shows that the default GPM was able to extract the temporal dependence in these indicators. In other cases, such as measurements of the number of people killed in floods, the year to year dependence is much weaker. The heatmap also shows dependence between indicators, such as the block in the top right corner combining indicators of stress, urbanization, and fertility rate. Finally, it segregates data according to type sof indicators, as can be seen in Figure 7(b) where there is a sharp break from total measurements to percapital.
If we analyze just the data for the year 2002, using 32 models for 3 minutes, we get a heatmap that highlights the dependence and independence between indicators. Figure 9 shows the details.
2.2.3 Measuring the Similarity of Countries
In order to help with the delivery of international aid and the design and analysis of interventions, decision makers often want a richer understanding of the similarities between countries. With BQL we can formulate these queries in general or against specific attributes. Figure 10 shows country similarities for different indicators. As expected, changing the indicator of interest can produce a very different similarity structure. Analyses that presume a single global similarity measure cannot pick up this contextspecific structure.
The authors are involved in an ongoing research partnership with the Bill and Melinda Gates Foundation aimed at integrating the Gapminder data with other relevant sources, including qualitative knowledge from domain eperts, and using it to drive empirically grounded policy and aid interventions.
2.3 Analyzing a salary survey
Surveys are a common source of multivariate data and a potentially appealing application for BayesDB. Here we show a preliminary analysis of a webadministered anonymous salary survey. Participants shared their compensation details along with information about their title, years of service, acheivements, employer, and geography.
2.4 Controlling Models with Qualitative Assumptions
This salary population provides an instructive example of applying qualitiative assumptions to a model. In this case, the first analysis of compensation data finds that geographic location (state, region) is not a factor in compensation. Domain experts suggest that is implausible, that cost of living and the competitive market in different cities is a significant factor in compensation of the survey participants. The following code can be used to apply this qualitative assumption:
ALTER METAMODEL FOR salary ENSURE total, equity, base, bonus DEPENDENT ON state;
Without asserting the dependence, state is inferred to be dependent on region and independent of performance. After asserting a qualitative constraint, the probability of dependence heat map changes. Not only are the squares implied by that depencence colored to 1.0, but other columns have realigned in their modeling. In particular, given this assumption, there appears to be more evidence of dependence between the 2012 and 2013 measures and core indicators such as years in the job, bonus, the presence of an equity stake, etc. Also, there appears to be less evidence that job title impacts the key compensation variables.
3 The Bayesian Query Language
The Bayesian Query Language (BQL) formalizes Bayesian data analysis without exposing the end user to model parameters, priors, and posteriors. This section describes the statistical operations that are implemented by the core BQL instruction set. It also describes the modeling formalism that is used to implement BQL.
3.1 Generative Population Models
BQL programs are executed against a weighted collection of generative population models (GPMs). At present, GPMs can be built in two ways:

Specified directly as external software libraries.

Inferred from data via probabilistic inference in a metamodel written in BayesDB’s Metamodeling Language.
GPMs can respond to queries about the joint distribution of the underlying data generating process as a whole or about the predictive distribution for a specific member of the population. The population can be thought of as a table, where individual members are specified by row indexes.
Each GPM induces a random table with a finite number of columns and an infinite number of rows, where each cell contains a random variable. BQL treats each BayesDB generator as a model of the data generating process underlying its associated table of observations. It is sometimes useful to query a GPM about hypothetical members of the population. This can be performed by using a row whose index
may not be associated with any actual member; this can be guaranteed by generating a unique row index.Each GPM is described by a schema that must be compatible with the population schema for the population to which it is being applied. This schema is a tuple containing . The typedoutputs component specifies the column indexes and statistical types of each column that the data generator will be responsible for producing. The typedinputs component specifies the indexes and statistical types of each column that the data generator can read from. The body is an opaque binary that contains any GPMspecific configuration information, such as a probabilistic program.
Mathematically, the internals of a GPM consists of three parts:

Measurementspecific latent variables .
There may be overlap between the latent variables for different measurements. If a GPM cannot track dependencies internally — or if it is based on a model class in which all measurements are coupled — then .

Populationlevel latent variables .
These are all latent variables that remain welldefined in the absence of all measurements. Examples include hyperparameters and mixture component parameters.

Observations .
These correspond to the observed measurements.
For example, a naive Bayesian GPM lacks any measurementspecific variables, i.e.
, and is completely characterized by a single vector of parameters
for the probability models for each column. A finite mixture GPM would have be the cluster assignment for each row, and have be the component model parameters for each cluster.Generative population models are required to satisfy the following conditional independence constraint:
Note in particular that the observations need not be conditioned on directly, given and
. This formalizes the requirement that the dependencies between the measurements in the population are completely mediated by the populationlevel latent variables and all relevant measurementspecific latent variables. In general, no other independence constraints are enforced by the interface. GPMs can thus be built around dense, highlycoupled model families such as lowdimensional latent spaces and convolutional neural networks.
3.1.1 An interface to generative population models
A GPM must implement the following interface:

= = initialize( schema = )
Initialize a data generator with the given schema and return the resulting data generator . It ensures that storage has been allocated for the random variables and , storing the global latent variables and the local latent variables, respectively.

= simulate(, givens = , targets = , )
Generate sampled values from the specified distribution:
The set of valid distributions includes all finitedimensional joint distributions obtainable by conditioning on the values of arbitrary measurements and marginalizing over another arbitrary set.

= logpdf(, givens = , query = )
Evaluate the log probability density of the specified conditional/marginal distribution at a target point:

= kldivergencegivenG(, measurementsA = , measurementsB = , conditionsC = )
This estimates the KL divergence of the set of measurements from the set of measurements , conditioned on the given constraints . KL calculations are central to modelindependent data analysis. For example, to detect predictive relationships, it suffices to check for nonzero mutual information, which can be reduced to calculating the KL between the joint distribution over two variables and the product of the marginals.
It is included in the GPM interface because that allows a GPM implementer to supply an optimized implementation. Where such an implementation is not available, the KL can be estimated via simple Monte Carlo estimation:
This interface is intentionally quite general. It needs to support an open set of primitives for Bayesian data analysis. This paper focuses on the subset of this interface where all measurements come from the same row. All the BQL operations used in this paper can be reduced to explicit invocations of ,
, and to Monte Carlo estimates of KullbackLeibler divergences implemented in terms them. Some GPMs can significantly optimize some of these operations relative to Monte Carlo baselines; such optimizations are likely to be important in practice but are beyond the scope of this paper.
3.1.2 Weighted collections of generative population models.
BQL is executed against a weighted collection of GPMs :
In principle, these collections can include GPMs drawn from different model classes. The weights are treated as prior probabilities. This paper focuses on the case where the GPMs come from a single metamodel, each produced by independent runs of a single Markov chain for posterior inference in the metamodel given all available measurements. In this case, assigning unit weights to all models results in BQL queries based on a Monte Carlo approximation to Bayesian model averaging.
3.2 Core instructions: Simulate, Estimate, and Infer
Data analysis workflows in BQL are built around three core classes of statistical operations:

Generating samples from predictive probability distributions, including both completions of existing rows in a data table as well as predictive distributions over hypothetical rows.

Estimating predictive probability densities and approximating derived informationtheoretic quantities.

Summarizing multimodal probability distributions with single values.
These capabilities are exposed via three basic extensions to SQL that each combine results from individual GPMs in different ways. They can be composed with ordinary SQL to solve a broad range of data analysis tasks:

Detecting predictive relationships between variables: ESTIMATE COLUMN PROBABILITY OF DEPENDENCE WITH ...
This yields an estimate of the marginal probability of dependence between the specified columns. This is equivalent to the probability that the mutual information between those two variables is nonzero, integrating over the weighted collection of GPMs that BayesDB maintains. If the GPMs are produced by an asymptotically consistent estimator of the joint distribution, then these probabilities will reflect nonlinear, heteroscedastic, or contextspecific dependencies that statistical aggregates (such as correlation or linear regression coefficients) will not.

Each of these predictive modeling tasks requires filling in point estimates in different conditions. All of these can be viewed as special cases of INFER, which handles arbitrary patterns of missing values and both continuous and discrete prediction targets.

Anomalous cells can be found by predictive checking: identify the cells that are least likely under the inferred constellation of models. These may not be outliers in the standard univariate sense: the low probability may be due to interactions between several variables, even though each variable on its own is marginally typical.

Retrieving similar rows: ORDER BY SIMILARITY TO row
A broad class of structured search operations can be performed via informationtheoretic measures of similarity between rows. These are useful in both data exploration and in more targeted search.

Predictive model checking: SIMULATE ...
By comparing aggregates from the output of SIMULATE to the output of the analogous SELECT statements, it is possible to do predictive checking without having to mention models, parameters, priors, or posteriors.
3.2.1 Simulate: generating samples from arbitrary predictive distributions.
The first, called SIMULATE, provides a flexible interface to sampling from posterior predictive distributions:
SIMULATE target columns FROM population [WHERE row filter] [ASSUMING constraint] [k TIMES]
The WHERE clause is interpreted as a constraint to test against all members of the population that have been observed so far. If it is not supplied, the SIMULATE command is executed against an arbitrary asyetunobserved member of the population, i.e. a unique row id from the standpoint of the GPM interface. The ASSUME clause is interpreted as an additional set of constraints to condition each row on before generating the simulations.
For example, to generate a proxy dataset of two variables varA and varB, one can write SIMULATE varA, varB FROM population 100 TIMES. As another example, consider the BQL command SIMULATE varA, varD FROM population 20 TIMES WHERE varB = True AND varC IS MISSING ASSUMING varC = 3.4. This generates 20 simulated values from for each member of the population where varB is equal to True and varC is missing. This behavior may seem nonintuitive. For example, a SIMULATE invocation with WHERE true returns rows, where is the number of rows in the database and is the number of output samples specified with the query. On the other hand, WHERE false yields an empty result set, always. However, this semantics allows SQL aggregates to reduce the predictions for individual source rows by grouping on the row identifiers.
To formally describe the meaning of simulate, we first introduce some notation. Let be the predicate denoted by the WHERE clause, i.e. if the predicate is satisfied and 0 otherwise. Let be the set of rows for which there is at least one measurement, i.e. , and let be the set of rows that satisfy the WHERE clause’s filter. If a WHERE clause is not provided, then for all existing rows , and be a set containing a single distinguished row about which no measurements are known. Let be the set of target columns, and let be the set of assumed equality constraints. Also let be the set of all columns referenced in the SIMULATE command.
For each , the SIMULATE primitive produces a set of returned realizations of the following generative process:
This corresponds to choosing a GPM at random according to the probabilities given by their weights and then generating from the conditioned distribution in that model. If the models are equally weighted, i.e. , and if all the GPMs are drawn from their posterior distribution given the observations , then this procedure implements sampling from the Bayesian posterior predictive distribution over the targets given all the observed data plus the additional constraints from the ASSUME clause:
3.2.2 Estimate: approximating posterior averages.
The second core BQL primitive, called ESTIMATE, allows clients to query the posterior expectations of stochastic functions that are defined over the rows and the columns:
ESTIMATE target properties FROM [COLUMNS OF] table [WHERE row/col filter]
Rowwise estimands provided by BQL. Consider the case where the rows are being queried, i.e. COLUMNS OF does not occur in the query. Let be the set of properties whose values are requested. These properties can depend on observed measurements as well as latent components of the GPM. Let implement the WHERE clause’s filter, as with SIMULATE. If a WHERE clause is not provided, then for all existing rows .
Given these definitions, each row in the output of this class of ESTIMATE invocations is defined as follows:
The total set of returned rows is defined by the where clause:

= predictiveprobability
This estimand is denoted PREDICTIVE PROBABILITY OF col, and applied against an implicitly specified row, thus picking out a single measurement in the population. It can be implemented by delegation to the underlying GPM:
predictiveprobability() = logpdf()
This can be used to identify outliers — measurements that are unlikely under their marginal distribution — as well as anomalous measurements that are marginally likely but unlikely given the other measurements for the same row.

sim(a,b) = generativesimilarity(, context = , rowA = , rowB = )
Data analysts frequently want to retrieve rows from a table that are “statistically similar” to some preexisting or hypothetical row. This is a key problem in data exploration. It is also useful when trying to explain surprising inference results or when trying to diagnose and repair data or inference quality issues. Many machine learning techniques treat similarity as a central primitive, and use a metric formulation of similarity as the basis for inductive generalization.
Information theory provides appealing alternatives: measure similarity in terms of the amount of information one row contains about the values in another. This can be assessed against all variables or just against a “context” that is defined by a particular subset of variables. One approach, leading to a directional measure, is to measure the divergence of the distribution over values in one row from the distribution over values in another:
Columnwise estimands provided by BQL. Another use of ESTIMATE is to query properties of the columns, via ESTIMATE ... FROM COLUMNS OF .... Let be a distinguished row about which no measurements are known, i.e. . Let be a function of a set of measurements from a fresh row and the underlying GPM. It is then straightforward to define the set of values needed to check the WHERE filter, the set of columns that satisfy the filter, and the set of returned values containing all the target expressions for each satisfying column.

= marginaldependenceprob
This estimand characterizes the amount of evidence for the existence of a predictive relationship between the pair of variables and . It is defined according to the informationtheoretic definition of conditional independence:
If each weighted GPM is sampled approximately from some Bayesian posterior (and identically), then simple Monte Carlo estimation of the marginal dependence probability yields an estimate of the posterior marginal dependence probability:

= mutualinformation
The mutual information between two columns can be estimated by the standard reduction to KL divergence (Cover and Thomas, 2012). This complements the marginal dependence probability, providing one measure of the strength of a dependence.
For convenience, some of the quantities that are ordinarily accessed via ESTIMATE are also made available via SELECT.
3.2.3 Infer: summarizing distributions with point estimates.
Predictive modeling applications sometimes require access to point predictions rather than samples from predictive distributions. BQL provides these capabilities using the INFER primitive. The difference between INFER and SELECT is that INFER incorporates automatic implicit imputation from the underlying collection of GPMs, plus filtering based on userspecified confidence thresholds. For simplicity, this paper describes a simplified version with a single threshold:
INFER target columns FROM table [WHERE row filter] WITH CONFIDENCE confidence level
This operation returns a set of measurements where unobserved measurements are filled in with point estimates if a prescribed confidence threshold is reached. More formally:
For discrete measurements, BQL implements in terms of predictive probability:
Optimal candidate estimates can be found by optimization, implemented via enumeration:
For continuous measurements, there is no canonical definition of confidence that applies to all GPMs. Here we define as the probability that there is a useful unimodal summary of the distribution of that captures at least percent of the predictive probability mass. This can be formalized in terms of mixture modeling. Let be the parameters of mixture component ; for continuous data, we will use Gaussian component models, so . Let be the mass associated with component . We will choose and as follows:
Note that this approach can recover the behavior of the chosen strategy for discrete data by using component models that place all their probability mass on single values. The current prototype implementation of BayesDB uses a standard Gibbs sampler for a Dirichlet process mixture model (Mansinghka et al., 2015; Neal, 1998; Rasmussen, 2000) to sample , with by default. Adjusting and the amount of inference done in this mixture model can yield a broad class of tradeoffs between time, accuracy, and variance; the current values are chosen for simplicity.
3.3 Model and data independence
Relational databases revolutionized the processing and analysis of business data by enabling a single centrally managed data base to shared by multiple applications and also shared between operational and analytic workloads. This in turn accelerated the development of high performance and efficient databases, because the common abstraction became a target for researchers and industrial practitioners looking to build high performance system software with a broad impact. The relational model enabled sharing and infrastructure reuse because interactions with the data, queries, are expressed in a notation (most popularly SQL) that is independent of the physical representation of the data (Codd, 1970). Without this independence, physical data layout must be carefully tailored to particular workloads, specialized code written to manipulate the layout, and these data formats and access methods cannot easily be shared.
BayesDB aims to provide additional abstraction barriers that insulate clients from the statistical underpinnings of data analysis. Clients need to be able to specify data analysis steps and workflows in a notation that is independent of the models and runtime inference strategies used to implement individual primitives, and (where possible) the modeling strategies used to produce models from the original data.
Recall that the complete persistent state of a single population in BayesDB is characterized by two mathematical objects:

The complete set of observed measurements .

The weighted collection of GPMs . Note that this notation makes no commitment as to the content of the GPMs, the weights, or the procedures by which they were obtained.
The independencies provided by BayesDB can be described in terms of these objects:

Physical data independence. The notation for makes no commitment as to the physical representation of the measurements. The definitions of BQL primitives given above therefore do not depend on details of the data representation to define their values. However, as with SQL, small changes in representation may yield large changes in runtime performance.

Physical model independence. The notation for each makes no commitment as to the specific probability distribution induced over the set of random measurements . The definitions of BQL primitives given above therefore do not depend on the details of the probabilistic models used to define the random result set for each query. In principle, the mathematical properties of the models as well as their software implementation (or even implementing platform) can be changed without invalidating end user queries. However, small changes in the generative population model may yield large changes in the results of and .
Databases provide other finergrained independence properties that may have useful analogs in BayesDB. For example, let us partition the random variables induced by a given GPM into two subsets and . An example of a desirable datadependent independence property is that if in the “true” GPM, then in any inferred models. Informally, this rests on the modelbuilding strategy: if the modelbuilder recovers the correct independencies, then the independence of query results follows. This can be thought of as an analogue of logical data independence, which stipulates that e.g. adding new features should not affect the behavior of existing applications whose results do not depend on the value of these new features. Formalizing and verifying these properties is an important challenge for future research.
4 Modeling with the MetaModeling Language
BayesDB also provides the MetaModeling Language (MML), a probabilistic programming language for building models of data tables. MML programs consist of modeling tactics that control the behavior of an automatic modelbuilding engine. These tactics take several forms: statistical datatypes; initialization of weighted collections of random models; approximately Bayesian updating of the model collection; qualitative assertions about dependence and independence between variables; and the use of custom statistical models for specific conditional distributions. All these tactics are currently implemented in terms of a unifying semiparametric Bayesian model that fills in all unspecified aspects.
4.1 Statistical datatypes
This metadata constrains the probability models that will be used for each column of data and can also be used to choose appropriate visualizations. It is straightforward to support several different kinds of data:

Categorical values from a closed set.
This datatype includes a dictionary that maps the raw data values (often strings) to canonical numerical indexes for efficient storage and processing. This information can also be used to inform modeling tactics. For example, in the current version of MML, closedset categorical variables are modeled generatively via a multinomial component model with a symmetric Dirichlet prior on the parameters
(Mansinghka et al., 2015). Discriminative models for closedset categorical columns could potentially use a multinomial logit link function, or an appropriate multiclass classification scheme.

Binary data. Data of this type is generatively modeled using an asymmetric BetaBernoulli model (Mansinghka et al., 2015)
that can better handle sparse or marginally biased variables than a symmetric alternative. Also, a broad class of discriminative learning techniques can natively handle the binary classification problems induced by binary variables.

Count data. Nonnegative counts can be naturally modeled generatively by a PoissonGamma model or discriminatively by a GLM with the appropriate link function.

Numerical data. By default, data of this type is generatively modeled using a standard NormalGamma model. It is straightforward to add numerical ranges to enforce truncation posthoc, and to add numerical pretransformations that are appropriate for data that is naturally viewed as normal only on a log scale.
We have performed preliminary experiments on other datatypes built on standard statistical models. For example, cyclic data can be handled via a von Mises model (Gopal and Yang, 2014). Many other datatypes can be handled by the appropriate generalized linear model (McCullagh and Nelder, 1989). Broadening the set of primitive data types and assessing coverage on a representative corpus of realworld databases will be a key research challenge going forwards.
4.2 Bayesian generative population metamodels
Some data generators can be learned from data. Often the learning mechanism will be based on approximate probabilistic inference in a metamodel: a probabilistic model defined over a space of data generators, each of which is also a probabilistic model in its own right. Thus far, all BayesDB metamodels have been Markov chain metamodels. These metamodels internally maintain a single sample from an approximate posterior, and provide a Markov chain transition operator that updates this sample stochastically.

initialize(metaschema = )
Initializes a new metamodel with arbitrary parameters and an associated tabular data store.

incorporate(id = , values = )
Creates a new member of the population with the given row index and values and stores it. Errors result from duplicate indexes or variables whose values are not compatible with the metaschema (e.g. because the expected data type is incompatible with a provided value).

remove(id = )
Removes a member of the population from the data store.

infer(program = )
Simulate an internal Markov chain transition operator to improve the quality of the current sampled model representation:
Some Markov chain metamodels are asymptotically Bayesian, i.e. the distribution that results from sequences of updates converges to the posterior over metamodels as goes to infinity:
A sufficiently expressive Markov chain metamodel may also be asymptotically consistent in the usual sense. The default semiparametric GPM provided by BayesDB is designed to be both asymptotically consistent and asymptotically Bayesian; these invariants are crucial for its robustness, broad applicability, and suitability for use by nonexperts. Formally specifying and validating these properties is an important challenge for future research.
4.2.1 Controlling inference via INITIALIZE and ANALYZE.
The MML allows users to control the process by which models are created and updated to reflect the data. These capabilities are exposed via two commands:

INITIALIZE k MODELS FOR population
This command creates models by sampling their structure and parameters from the underlying GPM’s prior. This is implemented by delegation to initialize() where is the entire MML schema so far.

ANALYZE [variable subset OF] population FOR timelimit
This command performs approximately Bayesian updating of the models in the weighted collection by delegating to the infer() procedure from the underlying GPM. Here is a typical invocation:
ANALYZE mypopulation FOR 10 MINUTES
When no variable subset is provided, analysis is done on all the latent variables associated with every GPM in the weighted collection. Finergrained control is also possible using variable subset specifiers that pick out particular portions of the latent state in the GPM; these details are beyond the scope of this paper.
4.3 Qualitative constraints
The BayesDB MML provides constructs for specifying qualitative constraints on the dependence and independence relationships (Pearl, 1988). The modelbuilding engine attempts to enforce them in all GPMs^{1}^{1}1The current implementation does not attempt to detect contradictions.. These constraints are specified as follows:
ALTER METAMODEL FOR population ENSURE colA IS [NOT] MARGINALLY DEPENDENT ON colB
It is also possible to INITIALIZE and ANALYZE models that do not respect the constraints, and then enforce them after the fact:
ALTER MODELS FOR population ENSURE colA IS [NOT] MARGINALLY DEPENDENT ON colB
These commands enable domain experts to apply qualitative knowledge to make better use of sparse data. This can be crucial for improving analysis and model credibility in the eyes of domain experts. They also create the opportunity for false or unjustified knowledge to influence the results of analysis. This can reduce credibility in the eyes of statisticians or domain skeptics who want to see all assumptions in an analysis scrutinized empirically.
4.4 Incorporating foreign statistical models
A crucial aspect of MML is that it permits experts to override the automatic modelbuilding machinery using custombuilt statistical models. Feedforward networks of such models can be specified as follows:
ALTER SCHEMA FOR population MODEL output variables GIVEN input variables USING FOREIGN PREDICTOR FROM source file
Presently these models are presumed to be discriminative. They are only required to be able to simulate from a probability distribution over the output variables conditioned on the inputs, and to evaluate the probability density induced by this distribution.
4.5 A semiparametric factorial mixture GPM
The current implementation of MML implements all the above commands in terms of approximate inference in single, unusually flexible, semiparametric Bayesian metamodel. This GPM is closely related to CrossCat (Mansinghka et al., 2015). The CrossCat model is a factorial Dirichlet process mixture model, where variables are assigned to specific Dirichlet process mixtures by inference in another Dirichlet process mixture model over the columns. The version used for implementing MML adds two key components:

Deterministic constraints on model structure. Users can specify constraints on the marginal dependence or independence of arbitrary pairs of variables.

Feedforward networks of discriminative models conditioned on the outputs of the generative model. This allows users to combine generalpurpose density estimation with standard statistical techniques such as regression as well as complex computational models with noisy outputs.
Thus in MML, the CrossCat probability model is used as the root node in a directed graphical model. Each other node in the graph corresponds to specific discriminative model, directly conditioned on the inputs of its immediate ancestors. Undirected terms attached to the root node enforce deterministic constraints.
It is helpful to view this model in terms of a “divide and conquer” modeling strategy that bottoms out in foreign predictors and other standard parametric models from Bayesian statistics:

All variables not explicitly assigned to a custom model are divided into marginally independent groups. Variables in the same group are assumed to be marginally dependent. Partitions of variables that do not respect the given marginal dependence and independence constraints are rejected. Each group of variables induces an independent subproblem that will typically be far lower dimensional than the original highdimensional problem.

For each subproblem, divide the rows into clusters whose values are marginally dependent given any variablespecific hyperparameters.

For each cluster, use a simple product of parametric models — i.e. a “naive Bayes” approach
(Duda et al., 2001) — to estimate the joint distribution.
Inference in the GPM thus addresses modeling tradeoffs that resemble the decisions faced in exploratory analysis, confirmatory analysis, and predictive modeling. The most crucial decisions involve defining which subset of the data is relevant for answering each question. A secondary issue is what probabilistic model to use for each subset; absent prior knowledge, these are chosen generically, based on the type of the data in the column.
4.5.1 A “divideandconquer” generative process
The generative process that induces the default GPM can be described using the following notation:
Name  Description 

Concentration hyperparameter for CRP that slices the columns  
Hyperparameters for column (datatypedependent)  
Slice (column partition) assigned to column  
Concentration hyperparameter for CRP that clusters rows for slice  
Cluster assigned to row with respect to slice  
Model parameters for column cluster (datatypedependent)  
Values in cluster for column , i.e.  
An indicator such that iff is modeled by a foreign predictor  
The set of input dimensions for the foreign predictor  
conditionally modeling variable  
Parameters for the foreign predictor conditionally modeling variable  
The stochastic model for the foreign predictor used for  
variable (with density ) with  
Characteristic function enforcing marginal (in)dependence constraint  
A generic hyperprior of the appropriate type for variable or dimension .  
A datatypeappropriate parameter prior (e.g. a Beta prior for binary data,  
NormalGamma for continuous data, or Dirichlet for discrete data),  
and likelihood model (e.g. Bernoulli, Normal or Multinomial). 
Using this notation, the unconstrained generative process for the default metamodel can be concisely described in statistician’s notation as follows:
The true generative process also must ensure that for all of the (in)dependence constraints. This is enforced by conditioning on the event . A generative model for this constrained process can be given trivially by embedding the unconstrained generative process in the inner loop of a rejection sampler for (Mansinghka, 2009; Murray et al., 2009).
4.5.2 The joint density
Here we use to denote all the latent information in a semiparametric GPM needed to capture its dependence on the data . This includes the concentration parameter for the CRP over columns, the variablespecific hyperparameters , the column partition , the columnpartitionspecific concentration parameters and row partition , and the categoryspecific parameters . Note that in this section, , and
each represent probability density functions rather than stochastic simulators.
Given this notation, we have:
4.5.3 Inference via sequential Monte Carlo with Gibbs proposals and Gibbs rejuvenation
Inference in this metamodel is performed via a sequential Monte Carlo scheme, in which each row is incorporated incrementally, with all latent variables proposed from their conditional distribution. Additionally, clients can control the frequency and target latent variables for rejuvenation kernels based on Gibbs sampling, turning the overall scheme into a resamplemove algorithm (Andrieu et al., 2003; Smith et al., 2013). This combination enables parallel inference and estimation of marginal probabilities while allowing the bulk of the inferential work to be done via a suitable Markov chain.

incorporate(id = , values = )
Each row is incorporated via a single Gibbs step that numerically marginalizes out all the latent variables associated with the row (Smith et al., 2013; Murphy, 2002). The associated weight is the marginal probability of the measurements to be incorporated:
This operation is linear in the number of observed cells for the record being incorporated, the number of total slices, and the maximum number of clusters associated with any slice.

infer(iterations = , type = rows  columns  parameters  hyperparameters  foreign  resample, slice =  NA, cluster =  NA, foreign_predictor =  NA)
This operation applies a particular transition operator, specified by the arguments, to a selected subset of the latent variables. Each invocation affects all particles in the sequential Monte Carlo scheme. By varying the type parameter, a client can control whether inference is performed over the rowcluster assignment variables, the columnslice assignment variables, the cluster parameters, the columnspecific hyperparameters, or all latent variables associated with a specific foreign predictor. An invocation with type = resample applies multinomial resampling to the weighted collection of models.
This allows for a limited form of inference programming (Mansinghka et al., 2014), as follows. By varying the slice, cluster, or foreign_predictor
variables, clients can instruct the GPM to only perform inference on a specific subset of the latent variables. Computational effort can thus be focused on those latent variables that are most relevant for a given analysis, rather than uniformly distributed across all latent variables in the GPM. This is most useful when the queries of interest focus on a subset of the variables, or when the clusters are wellseparated.
The prototype implementation of BayesDB uses rowcluster, columnslice, clusterparameter, and columnhyperparameter transition operators from Mansinghka et al. (2015). The only modification is that the log joint density now includes terms for enforcing each of the (in)dependence constraints, and also terms for the likelihood induced by each foreign predictor, as described above.
This interface allows clients to specify multiple MCMC, SMC and hybrid strategies for inference. The default inference program that is invoked by the ANALYZE command in BQL does no resampling and selects slices and clusters to do inference on via systematic scans. It thus can be thought of as an MCMC scheme with multiple parallel chains. This approach is conservative and makes it easier to assess the stability and reproducibility of inference, although it is unlikely to be the most efficient approach in some cases.
5 Discussion
This paper has described BayesDB, a probabilistic programming platform that allows users to directly query the probable implications of statistical data. The query language can solve statistical inference problems such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries. Statisticians and domain experts can incorporate (in)dependence constraints and custom models using a qualitative language for probabilistic models. The default metamodel frees users from needing to know how to choose modeling approaches, remove records with missing values, detect outliers, or tune model parameters. The prototype implementation is suitable for analyzing complex, heterogeneous data tables with up to tens of thousands of rows and hundreds of variables.
5.1 Related work in probabilistic programming
Most probabilistic programming languages are intended for model specification (Goodman et al., 2008; Stan Development Team, 2015; Milch et al., 2007; Pfeffer, 2009). This is fundamentally different from BQL and MML:

In BQL, probabilistic models are never explicitly specified. Instead, an implicit set of models is averaged over (or sampled from) as needed.

With MML, users specify constraints on an algorithm for model discovery and need not explicitly select any specific models. These constraints generally do not uniquely identify the structure of the model that will ultimately be used.
In contrast, with languages such as Stan (Stan Development Team, 2015), each program corresponds to a specific probabilistic model whose structure is fixed by the program source. Tabular (Gordon et al., 2014), a probabilistic language designed for embedding into spreadsheets that applies userspecified factor graph models defined in terms of observed and latent variables to datasets represented as subtables, seems closest in structure to BQL. However, like BUGS and Stan, Tabular does not aim to hide the conceptual vocabulary of probabilistic modeling from its end users, and it focuses on userspecified models. Other integrations of probabilistic modeling with databases such as (Singh and Graepel, 2013) are similarly focused on sophisticated modeling but do not provide a modelindependent abstraction for queries or support for general Bayesian data analysis.
It is straightforward to extend MML to allow syntactic escapes into all these languages that allow external probabilistic programs to be used as foreign predictors.
5.2 Related work in probabilistic databases
BayesDB takes a complementary approach to several recent projects that integrate aspects of probabilistic inference with databases. The most closely related systems are MauveDB (Deshpande and Madden, 2006) and BBQ (Deshpande et al., 2004). They provide modelbased views that enable users to run standard SQL queries on the outputs of statistical models. These models must be explicitly specified as part of the schema. This is useful for some machine learning applications but does not address the core problems of applied inference, such as data exploration, data cleaning, and confirmatory analysis. Both systems also use restricted model classes that can easily introduce substantial for adhoc predictive queries.
Other systems such as MLBase (Kraska et al., 2013) and GraphLab (Low et al., 2012) aim to simplify atscale development and deployment of machine learning algorithms. MLBase and GraphLab host data in a distributed database environment and provide operators for scalable ML algorithms. Systems such as SimSQL (Cai et al., 2013) and its ancestor, MCDB (Jampani et al., 2008), provide SQL operators for efficient Monte Carlo sampling. In principle, several of these systems could serve as runtime platforms for optimized implementations of BQL and the MML.
5.2.1 Uncertain data versus uncertain inference
The database research community has proposed several probabilistic databases that aim to simplify the management and querying of data that is “uncertain” or “imprecise” (Dalvi et al., 2009). This “data uncertainty” is different from the inferential uncertainty that motivates BayesDB. Even when the data is known with certainty, it is rarely possible to uniquely identify a single model that can be used with complete certainty. Second, each probable model is likely to have uncertain implications. Extensions of BayesDB that augment GPMs with probabilistically coherent treatments of data uncertainty are an important area for future research.
5.3 Limitations and future work
Additional GPMs and metamodels are needed for some applications. There are specialized SQL databases that strike different tradeoffs between query latency, workload variability, and storage efficiency. Similarly, we expect that future GPMs and metamodels will strike different tradeoffs between prediction speed, prediction accuracy, statistical model capacity, and the amount of available data. In some cases, the semiparametric metamodel presented here may be adequate in principle but producing an appropriate implementation is a significant systems research project. For example, it may be possible to build versions suitable for adhoc exploration of distributed databases such as Dremel (Melnik et al., 2010) or Spark (Zaharia et al., 2010). In other cases, fundamentally different model classes may be more appropriate. For example, it seems appealing to jointly model populations of web browsing sessions and web assets with lowdimensional latent space models (Stern et al., 2009).
It will be challenging to develop query planners that can handle GPMs given by arbitrary probabilistic programs. A key issue is that the full GPM interface allows for complex conditional queries over composite GPMs that may require datadependent inference strategies. One potential approach is to specify GPMs as probabilistic programs in a language with programmable inference; currently, the only such language is VentureScript. The inference strategy needed to answer a given query could then be assembled ondemand.
BQL and MML have yet to incorporate key ideas from several significant subfields of statistics. For example, neither language has explicit support for causal relationships and arbitrary counterfactuals (Pearl, 1988, 2009, 2001). Both BQL and MML make the standard, simplistic assumption that data is missing at random. Neither BQL nor MML has native support for longitudinal or panel data or for timeseries; instead, users must apply standard workarounds or implement custom data types. A minor limitation is that hierarchical models are currently supported by merging subpopulations, retaining an indicator variable, and treating any variables unique to a given subpopulation as missing. It should instead be possible to build GPMs that jointly model subpopulations that are separately represented (and that therefore may not share the same set of observable variables). It will also be important to develop a formal semantics and cost model for both BQL and MML.
Qualitative probabilistic programming. BQL and MML are qualitative languages for quantitative reasoning. They make it possible for users to perform Bayesian data analysis without needing to know how to specify quantitative probabilities or model parameters. However, the set of qualitative constructs that they support is limited, and needs to be expanded. For example, in MML, it will be important to support conditional dependence constraints. These could be specified generatively, e.g. by defining a directed acyclic graph over subsets of variables, and leaving the model builder to fill in the (conditional) joint distributions over each subset of variables. In BQL, it would be interesting to explore the addition of commands for optimization and decisiontheoretic choice, with objective functions specified both explicitly and implicitly. Finally, it will be interesting to explore elicitation strategies based on “programming by example”. For example, users could create datasets by iteratively specifying prototypical examples and turn them into large datasets by treating each as the seed for a separate synthetic population, produced via SIMULATE.
5.4 Conclusion
Traditional databases protect consumers of data from “having to know how the data is organized in the machine” Codd (1970) and provide automated data representations and retrieval algorithms that perform well enough for a broad class of applications. Although this abstraction barrier is only imperfectly achieved, it has proved useful enough to serve as the basis of multiple generations of software and data systems. This decoupling of task specification from implementation made it possible to improve performance and reliability — of individual database indexes, and in some cases of entire database systems — without needing to notify end users. It also created a simple conceptual vocabulary and query language for data management and data processing that spread far farther than the systems programming knowledge needed to implement it.
BayesDB aims to insulate consumers of statistical inference from the concepts of modeling and statistics and provide a simple, qualitative interface for solving problems that currently seem quantitative and complex. It also allows models, analyses, and data resources to be improved independently. It is not yet clear how deeply the analogy with traditional databases will run. However, we hope that BayesDB represents a significant step towards making statistically rigorous empirical inference more credible, transparent and ubiquitous.
References
 Andrieu et al. (2003) Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to mcmc for machine learning. Machine learning, 50(12):5–43, 2003.
 Cai et al. (2013) Zhuhua Cai, Zografoula Vagena, Luis Perez, Subramanian Arumugam, Peter J Haas, and Christopher Jermaine. Simulation of databasevalued markov chains using simsql. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 637–648. ACM, 2013.
 Codd (1970) Edgar F Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377–387, 1970.
 Cover and Thomas (2012) Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
 Dalvi et al. (2009) Nilesh Dalvi, Christopher Ré, and Dan Suciu. Probabilistic databases: diamonds in the dirt. Communications of the ACM, 52(7):86–94, 2009.
 Deshpande and Madden (2006) Amol Deshpande and Samuel Madden. Mauvedb: supporting modelbased user views in database systems. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 73–84. ACM, 2006.
 Deshpande et al. (2004) Amol Deshpande, Carlos Guestrin, Samuel R Madden, Joseph M Hellerstein, and Wei Hong. Modeldriven data acquisition in sensor networks. In Proceedings of the Thirtieth international conference on Very large data basesVolume 30, pages 588–599. VLDB Endowment, 2004.
 Duda et al. (2001) R.O. Duda, P.E. Hart, D.G. Stork, et al. Pattern classification, volume 2. wiley New York, 2001.
 Gelman et al. (1995) Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman and Hall, London, 1995.
 Goodman et al. (2008) Noah D. Goodman, Vikash K. Mansinghka, Daniel Roy, Keith Bonawitz, and Joshua B. Tenenbaum. Church: a language for generative models. In Uncertainty in Artificial Intelligence, 2008.
 Gopal and Yang (2014) Siddharth Gopal and Yiming Yang. von misesfisher clustering models. In Proceedings of the 31st International Conference on Machine Learning (ICML14), pages 154–162, 2014.
 Gordon et al. (2014) Andrew D Gordon, Thore Graepel, Nicolas Rolland, Claudio Russo, Johannes Borgstrom, and John Guiver. Tabular: a schemadriven probabilistic programming language. In ACM SIGPLAN Notices, volume 49, pages 321–334. ACM, 2014.
 Jampani et al. (2008) Ravi Jampani, Fei Xu, Mingxi Wu, Luis Leopoldo Perez, Christopher Jermaine, and Peter J Haas. Mcdb: a monte carlo approach to managing uncertain data. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 687–700. ACM, 2008.
 Kraska et al. (2013) Tim Kraska, Ameet Talwalkar, John C Duchi, Rean Griffith, Michael J Franklin, and Michael I Jordan. Mlbase: A distributed machinelearning system. In CIDR, 2013.
 Low et al. (2012) Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M Hellerstein. Distributed graphlab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727, 2012.
 Mansinghka et al. (2014) Vikash Mansinghka, Daniel Selsam, and Yura Perov. Venture: a higherorder probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099, 2014.

Mansinghka et al. (2015)
Vikash Mansinghka, Patrick Shafto, Eric Jonas, Cap Petschulat, Max Gasner, and
Joshua Tenenbaum.
Crosscat: A fully bayesian nonparametric method for analyzing heterogeneous, high dimensional data.
Journal of Machine Learning Research, 2015.  Mansinghka (2009) Vikash Kumar Mansinghka. Natively probabilistic computation. PhD thesis, Massachusetts Institute of Technology, 2009.
 McCullagh and Nelder (1989) Peter McCullagh and John A Nelder. Generalized linear models, volume 37. CRC press, 1989.
 Melnik et al. (2010) Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: interactive analysis of webscale datasets. Proceedings of the VLDB Endowment, 3(12):330–339, 2010.
 Milch et al. (2007) Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L Ong, and Andrey Kolobov. 1 blog: Probabilistic models with unknown objects. Statistical relational learning, page 373, 2007.

Murphy (2002)
Kevin Patrick Murphy.
Dynamic bayesian networks: representation, inference and learning
. PhD thesis, University of California, Berkeley, 2002.  Murray et al. (2009) Iain Murray, David MacKay, and Ryan P Adams. The gaussian process density sampler. In Advances in Neural Information Processing Systems, pages 9–16, 2009.
 Neal (1998) R. Neal. Markov chain sampling methods for dirichlet process mixture models, 1998.
 Pearl (1988) Judea Pearl. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, San Mateo, California, 1988.
 Pearl (2001) Judea Pearl. Bayesianism and causality, or, why i am only a halfbayesian. In Foundations of bayesianism, pages 19–36. Springer, 2001.
 Pearl (2009) Judea Pearl. Causality. Cambridge university press, 2009.
 Pfeffer (2009) Avi Pfeffer. Figaro: An objectoriented probabilistic programming language. Charles River Analytics Technical Report, 137, 2009.

Rasmussen (2000)
C. Rasmussen.
The infinite gaussian mixture model.
In Advances in Neural Processing Systems 12, 2000.  Russell and Norvig (2003) Stuart Russell and Peter Norvig. Artificial intelligence: A modern approach. Prentice Hall, Upper Saddle River, New Jersey, 2003.
 Singh and Graepel (2013) Sameer Singh and Thore Graepel. Automated probabilistic modeling for relational data. In Proceedings of the ACM of Information and Knowledge Management CIKM 2013. ACM, 2013. URL http://research.microsoft.com/apps/pubs/default.aspx?id=200220.
 Smith et al. (2013) Adrian Smith, Arnaud Doucet, Nando de Freitas, and Neil Gordon. Sequential Monte Carlo methods in practice. Springer Science & Business Media, 2013.
 Stan Development Team (2015) Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, Version 2.8.0, 2015. URL http://mcstan.org/.
 Stern et al. (2009) David H Stern, Ralf Herbrich, and Thore Graepel. Matchbox: large scale online bayesian recommendations. In Proceedings of the 18th international conference on World wide web, pages 111–120. ACM, 2009.
 Wasserman (2011) L. Wasserman. Low assumptions, high dimensions. Rationality, Markets and Morals, 2, 2011.
 Zaharia et al. (2010) Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, volume 10, page 10, 2010.
Comments
There are no comments yet.