cgpm
Library of composable generative population models which serve as the modeling and inference backend of BayesDB.
Databases are widespread, yet extracting relevant data can be difficult. Without substantial domain knowledge, multivariate search queries often return sparse or uninformative results. This paper introduces an approach for searching structured data based on probabilistic programming and nonparametric Bayes. Users specify queries in a probabilistic language that combines standard SQL database search operators with an information-theoretic ranking function called predictive relevance. Predictive relevance can be calculated by a fast sparse matrix algorithm based on posterior samples from CrossCat, a nonparametric Bayesian model for high-dimensional, heterogeneously-typed data tables. The result is a flexible search technique that applies to a broad class of information retrieval problems, which we integrate into BayesDB, a probabilistic programming platform for probabilistic data analysis. This paper demonstrates applications to databases of US colleges, global macroeconomic indicators of public health, and classic cars. We found that human evaluators often prefer the results from probabilistic search to results from a standard baseline.
We are surrounded by multivariate data, yet it is difficult to search. Consider the problem of finding a university with a city campus, low student debt, high investment in student instruction, and tuition fees within a certain budget. The US College Scorecard dataset (Council of Economic Advisers, 2015) contains these variables plus hundreds of others. However, choosing thresholds for the quantitative variables (debt, investment, tuition, and so on) requires domain knowledge. Furthermore, results grow sparse as more constraints are added. Figure 1 shows an SQL SELECT query with plausible thresholds for this question; it yields only a single match.
This paper shows how to formulate a broad class of probabilistic search queries on structured data using probabilistic programming and information theory. The core technical idea combines SQL search operators with a ranking function called predictive relevance, which assesses the relevance of database records to a set of query records in a context defined by a variable of interest. Figure 1 shows two further examples that expand and then refine the initial result by combining predictive relevance with SQL. Predictive relevance is the probability that a candidate record is informative about the answers to a specific class of predictive queries about unknown fields in the query records.
The paper presents an efficient implementation applying a simple sparse matrix algorithm to the results of inference in CrossCat (Mansinghka et al., 2016). The result is a scalable, domain-general search technique for sparse, multivariate, structured data that combines the strengths of SQL search with probabilistic approaches to information retrieval. Users can query by example, using real records in the database if they are familiar with the domain, or partially-specified hypothetical records if they are less familiar. Users can then narrow search results by adding Boolean filters, and by including multiple records in the query set rather than a single record. An overview of the technique and its integration into BayesDB (Mansinghka et al., 2015) is shown in Figure 3.
We demonstrate the proposed technique with databases of (i) US colleges, (ii) public health and macroeconomic indicators, and (iii) cars from the late 1980s. The paper empirically confirms the scalability of the technique and shows that human evaluators often prefer results from the proposed technique to results from a standard baseline.



BayesDB workflow for computing context-specific predictive relevance between database records. Modeling and inference in BayesDB produces an ensemble of posterior CrossCat model structures. Each structure specifies (i) a column partition for the factorization of the joint distribution of all variables in the database, using a Chinese restaurant process; and (ii) a separate row partition within each block of variables, using a Dirichlet process mixture. The column partition clusters variables into different "contexts", where all variables in a context are probably dependent on one another. Within each context, the row partition clusters records which are probably informative of one another. End-user queries for predictive relevance are expressed in the Bayesian Query Language. The BQL interpreter aggregates relevance probabilities across the ensemble, and can use them as a ranking function in a probabilistic ORDER BY query.

In this section, we outline the basic setup and notation for the database search problem, and establish a formal definition of the probability of "predictive relevance" between records in the database.
Suppose we are given a sparse dataset x = {x_1, ..., x_N} containing N records, where each x_i is an instantiation of a D-dimensional random vector, possibly with missing values. For notational convenience, we refer to arbitrary collections of observations using sets as indices, so that x_A = {x_i : i in A}. Boldface symbols denote multivariate entities, and variables are capitalized (as X_i) when they are unobserved (i.e. random).

Let Q index a small collection of "query records" x_Q. Our objective is to rank each item x_r by how relevant it is for formulating predictions about values of x_Q, "in the context" of a particular dimension c. We formally define the context of c as a subset of dimensions C(c) such that for an arbitrary record i and each c' in C(c), the random variable X_i[c'] is statistically dependent with X_i[c].^1

^1 A general definition of statistical dependence is having nonzero mutual information with the context variable. However, the method for detecting dependence to find variables in the context can be arbitrary, e.g. using linear statistics such as Pearson correlation, directly estimating mutual information, or others.

In other words, we are searching for records x_r where knowledge of x_r is useful for predicting x_Q, had we not known the values of these observations.
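The context-detection step described in the footnote can be sketched concretely. The snippet below is an illustrative stdlib-only sketch, not the BayesDB implementation: function names, the 0.5 threshold, and the toy samples are all assumptions. It estimates the dependence probability of two columns as the fraction of posterior column partitions that place them in the same block, then collects the context of a variable:

```python
# Illustrative sketch (not the BayesDB API): estimate the "context" of a
# variable from an ensemble of posterior column partitions. Each partition
# is a dict mapping column name -> block index.

def dependence_probability(partitions, col_a, col_b):
    """Fraction of posterior samples in which two columns share a block."""
    same = sum(1 for p in partitions if p[col_a] == p[col_b])
    return same / len(partitions)

def context_of(partitions, col, threshold=0.5):
    """Columns whose dependence probability with `col` exceeds the threshold."""
    columns = partitions[0].keys()
    return sorted(
        c for c in columns
        if c != col and dependence_probability(partitions, col, c) > threshold
    )

# Three hypothetical posterior samples over four columns.
samples = [
    {"debt": 0, "tuition": 0, "latitude": 1, "investment": 0},
    {"debt": 0, "tuition": 0, "latitude": 1, "investment": 1},
    {"debt": 0, "tuition": 0, "latitude": 1, "investment": 0},
]
print(context_of(samples, "tuition"))  # ['debt', 'investment']
```

Here "latitude" never shares a block with "tuition", so it falls outside the context; "investment" does so in two of three samples, clearing the threshold.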
We now formalize the intuition from the previous section more precisely. Let R_c(r, Q) denote the probability that x_r is predictively relevant to x_Q, in the context of c. Furthermore, let d' denote the index of a new dimension in the (now length D+1) random vectors, which is statistically dependent on dimension c (i.e. is in its context) but is not one of the existing variables in the database. Since d' indexes a novel variable, its value for each row i is itself a random variable, which we denote X_i[d']. We now define the probability that x_r is predictively relevant to x_Q in the context of c as the posterior probability that the mutual information of X_r[d'] and each query record X_q[d'] is nonzero:

    R_c(r, Q) := Pr[ I(X_r[d'] ; X_q[d']) > 0 for all q in Q | x ].    (1)
The symbol θ_{d'} refers to an arbitrary set of hyperparameters which govern the distribution of dimension d', and λ_c is a context-specific hyperparameter which controls the prior on structural dependencies between the random variables X_i[d']. Moreover, the mutual information I(X; Y), a well-established measure of the strength of predictive relationships between random variables (Cover and Thomas, 2012), is defined in the usual way:

    I(X; Y) := sum_{x, y} p(x, y) log [ p(x, y) / (p(x) p(y)) ].    (2)
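As a concrete companion to the definition of mutual information, the following self-contained routine computes it for a discrete joint distribution. This is a generic textbook calculation, not code from the BayesDB implementation; the example distributions are illustrative:

```python
from math import log2

def mutual_information(joint):
    """I(X;Y) in bits for a discrete joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal of X
        py[y] = py.get(y, 0.0) + p   # marginal of Y
    return sum(
        p * log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items() if p > 0
    )

# Perfectly dependent binary variables carry one bit of mutual information;
# independent ones carry zero.
dependent = {(0, 0): 0.5, (1, 1): 0.5}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(dependent))    # 1.0
print(mutual_information(independent))  # 0.0
```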
Figure 2 illustrates the predictive relevance probability in terms of a hypothesis test on two competing graphical models, where the mutual information is nonzero in the first panel, indicating predictive relevance, and zero in the second, indicating predictive irrelevance.
Our formulation of predictive relevance in terms of mutual information between new variables is related to the idea of “property induction” from the cognitive science literature (Rips, 1975; Osherson et al., 1990; Shafto et al., 2008), where subjects are asked to predict whether an entity has a property, given that some other entity has that property; e.g. how likely are cats to have some new disease, given that mice are known to have the disease?
It is also informative to consider the relationship between the predictive relevance in Eq (1) and the Bayesian Sets ranking function from the statistical modeling literature (Ghahramani and Heller, 2005):

    score(x_r) := p(x_r, x_Q) / (p(x_r) p(x_Q)).    (3)
Bayes Sets defines a Bayes factor, or ratio of marginal likelihoods, which is used for hypothesis testing without assuming a structure prior. Predictive relevance, on the other hand, defines a posterior probability, whose value is between 0 and 1, and therefore requires a prior over dependence structures between records (our approach, outlined in Section 3, is based on nonparametric Bayes). While Bayes Sets draws inferences using only the query and candidate rows without considering the rest of the data, predictive relevance probabilities are necessarily conditioned on the full dataset x, as in Eq (1). Finally, Bayes Sets considers the entire data vectors for scoring, whereas predictive relevance considers only dimensions in the context of a variable c, making it possible for two records to be predictively relevant in one context but probably predictively irrelevant in another.

This section describes the cross-categorization prior (CrossCat; Mansinghka et al., 2016) and outlines algorithms which use CrossCat to efficiently estimate the predictive relevance probabilities of Eq (1) for sparse, high-dimensional, and heterogeneously-typed data tables.
CrossCat is a nonparametric Bayesian model which learns the full joint distribution of variables using structure learning and divide-and-conquer. The generative model begins by partitioning the set of variables into blocks using a Chinese restaurant process. This step is CrossCat's "outer" clustering, since it partitions the columns of a data table, where variables correspond to columns and records correspond to rows. Let π denote the partition whose v-th block is B_v: for v ≠ v', all variables in B_v are mutually (marginally and conditionally) independent of all variables in B_{v'}. Within block B_v, the variables follow a Dirichlet process mixture model (Escobar and West, 1995), where we focus on the case in which the joint distribution factorizes given the latent cluster assignment z_{v,i} of each row i. This step is an "inner" clustering in CrossCat, since it specifies a cluster assignment for each row in block B_v. CrossCat's combinatorial structure requires detailed notation to track the latent variables and dependencies between them. The generative process for an exchangeable sequence of random vectors is summarized below.
Symbol | Description
α | Concentration hyperparameter of column CRP
α_v | Concentration hyperparameter of row CRP in block v
v_d | Index of variable d in column partition
B_v | List of variables in block v of column partition
z_{v,i} | Cluster index of row i in row partition of block v
C_{v,k} | List of rows in cluster k of block v
F_d | Joint distribution of data for variable d
θ_d | Hyperparameters of F_d
x_{id} | i-th observation of variable d
unique(L) | Unique items in list L
CrossCat Prior

1. Sample column partition into blocks.
   (v_1, ..., v_D) ~ CRP(α)
   foreach v in unique(v_1, ..., v_D): B_v = [d : v_d = v]

2. Sample row partitions within each block.
   foreach v in unique(v_1, ..., v_D):
       (z_{v,1}, ..., z_{v,N}) ~ CRP(α_v)
       foreach k in unique(z_{v,1}, ..., z_{v,N}): C_{v,k} = [i : z_{v,i} = k]

3. Sample data jointly within row cluster.
   foreach v in unique(v_1, ..., v_D):
       foreach k in unique(z_{v,1}, ..., z_{v,N}):
           foreach d in B_v:
               x_{C_{v,k}, d} ~ F_d(· | θ_d)
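The generative process above can be simulated directly. The sketch below is illustrative and stdlib-only: it draws a column partition and one row partition per block from Chinese restaurant processes, and omits step 3 (data) and the hyperpriors on the concentrations; all names are assumptions, not the cgpm API:

```python
import random

def crp_partition(n, alpha, rng):
    """Sample a partition of n items from a Chinese restaurant process."""
    assignments, counts = [], []
    for i in range(n):
        # Join existing cluster k with weight counts[k];
        # open a new cluster with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        assignments.append(k)
    return assignments

def sample_crosscat_structure(n_rows, n_cols, alpha, rng):
    """Steps 1-2 of the prior: a column partition, then one row partition
    per block. (The data step and the hyperpriors are omitted.)"""
    column_of = crp_partition(n_cols, alpha, rng)
    row_partitions = {v: crp_partition(n_rows, alpha, rng)
                      for v in sorted(set(column_of))}
    return column_of, row_partitions

rng = random.Random(0)
cols, rows = sample_crosscat_structure(n_rows=10, n_cols=6, alpha=1.0, rng=rng)
```

Each sampled structure pairs a block index per column with a row-cluster assignment per row per block, which is exactly the latent state the relevance estimator consumes later.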
The representation of CrossCat in this paper assumes that data within a cluster is sampled jointly (step 3), marginalizing over cluster-specific distributional parameters:

    F_d(x_{C,d} | θ_d) = ∫ [ prod_{i in C} f(x_{id} | φ) ] p(φ | θ_d) dφ.

This assumption suffices for our development of predictive relevance, and is applicable to a broad class of statistical data types (Saad and Mansinghka, 2016) with conjugate prior-likelihood representations such as Beta-Bernoulli for binary data, Dirichlet-Multinomial for categorical data, Normal-Inverse-Gamma-Normal for real values, and Gamma-Poisson for counts.
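For intuition about these collapsed cluster likelihoods, here is a sketch of the Beta-Bernoulli case, where the integral above has the closed form B(a + heads, b + tails) / B(a, b). The function name and the default Beta(1, 1) prior are illustrative assumptions:

```python
from math import exp, lgamma

def beta_bernoulli_marginal(data, a=1.0, b=1.0):
    """Marginal probability of binary data under a Beta(a, b) prior,
    with the Bernoulli parameter integrated out:
    B(a + heads, b + tails) / B(a, b), computed via log-gamma."""
    heads = sum(data)
    tails = len(data) - heads
    log_p = (lgamma(a + heads) + lgamma(b + tails) - lgamma(a + b + len(data))
             - (lgamma(a) + lgamma(b) - lgamma(a + b)))
    return exp(log_p)

# Under a uniform Beta(1, 1) prior, a single observation has probability 1/2,
# and two agreeing observations have probability 1/3.
print(beta_bernoulli_marginal([1]))     # approximately 0.5
print(beta_bernoulli_marginal([1, 1]))  # approximately 1/3
```

Working in log-gamma space keeps the computation stable for large clusters, which matters because step 3 evaluates this quantity once per cluster per variable.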
Given dataset x, we refer to Obermeyer et al. (2014) and Mansinghka et al. (2016) for scalable algorithms for posterior inference in CrossCat, and assume we have access to an ensemble of H posterior samples {G^(1), ..., G^(H)}, where each G^(h) is a realization of all variables in Table 1.
We now describe how to use posterior samples of CrossCat to efficiently estimate the predictive relevance probability from Eq (1). Letting c denote the context variable, we formalize the novel variable as a fresh column d' in the tabular population which is assigned to the same block as c (i.e. v_{d'} = v_c). As shown by Saad and Mansinghka (2017), the structural dependencies induced by CrossCat's variable partition are related to an upper bound on the probability that there exists a statistical dependence between two variables. To estimate Eq (1), we first treat the mutual information between X_r[d'] and X_q[d'] as a derived random variable, a function of their random cluster assignments Z_{v,r} and Z_{v,q} in the context block v = v_c:

    I(X_r[d'] ; X_q[d']) = g(Z_{v,r}, Z_{v,q}).    (4)
The key insight, implied by step 3 of the CrossCat prior, is that, conditioned on their assignments, rows from different clusters are sampled independently, which gives

    Z_{v,r} ≠ Z_{v,q}  ==>  p(X_r[d'], X_q[d']) = p(X_r[d']) p(X_q[d'])  ==>  I(X_r[d'] ; X_q[d']) = 0,    (5)

where the final implication follows directly from the definition of mutual information in Eq (2). Note that Eq (5) does not depend on the particular choice of θ_{d'}, and indeed this hyperparameter is never represented explicitly. Moreover, the hyperparameter α_v (the structure hyperparameter shown in Figure 2) is the concentration of the Dirichlet process for CrossCat row partitions.
Eq (5) implies that we can estimate the probability of nonzero mutual information between X_r[d'] and each X_q[d'] for q in Q by forming a Monte Carlo estimate from the ensemble of H posterior CrossCat samples,

    R_c(r, Q) ≈ (1/H) sum_{h=1}^{H} 1[ z^(h)_{v,r} = z^(h)_{v,q} for all q in Q ],    (6)

where v = v_c indexes the context block, and z^(h)_{v,i} denotes the cluster assignment of row i in the row partition of block v according to sample G^(h). Algorithm 1 outlines a procedure (used by the BayesDB query engine from Figure 3) for formulating this Monte Carlo estimator for a predictive relevance query using CrossCat.
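The Monte Carlo estimator reduces to counting, across posterior samples, how often the candidate row shares its context-block cluster with every query row. The following is a minimal illustrative sketch with hypothetical data, not Algorithm 1 itself:

```python
def relevance_probability(samples, r, query):
    """Fraction of posterior samples in which row r shares its cluster
    (within the context block) with every query row."""
    hits = sum(1 for z in samples if all(z[r] == z[q] for q in query))
    return hits / len(samples)

# Four hypothetical posterior samples of row-cluster assignments for the
# context block, over four rows.
samples = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
]
# Row 0 shares a cluster with query row 1 in three of four samples.
print(relevance_probability(samples, r=0, query=[1]))  # 0.75
```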
In this section, we show how to substantially optimize the naive, nested for-loop implementation in Algorithm 1 by instead computing the predictive relevance for all rows through a single matrix-vector multiplication.
Define the pairwise cluster co-occurrence matrix S^(h)_v for block v of CrossCat sample G^(h) to have binary entries S^(h)_v[i, j] = 1[z^(h)_{v,i} = z^(h)_{v,j}]. Furthermore, let e_Q denote a length-N vector with a 1 at indexes Q and 0 otherwise. We vectorize Eq (6) across all rows r by:

    u^(h) := 1[ S^(h)_v e_Q = |Q| ]  (elementwise),    (7)

    r_hat := (1/H) sum_{h=1}^{H} u^(h).    (8)

The resulting length-N vector u^(h) in Eq (7) satisfies u^(h)[r] = 1 if and only if z^(h)_{v,r} = z^(h)_{v,q} for all q in Q, which we identify as the argument of the indicator function in Eq (6). Finally, by averaging across the H samples in Eq (8), we arrive at the vector of relevance probabilities.
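The vectorized computation can be sketched in plain Python as follows. This is a dense, illustrative version with hypothetical data (the paper's implementation uses sparse matrices); building the co-occurrence matrix, multiplying by the query indicator vector, and testing against |Q| mirrors the two equations above:

```python
def cooccurrence_matrix(z):
    """Binary matrix: entry [i][j] = 1 iff rows i and j share a cluster."""
    n = len(z)
    return [[1 if z[i] == z[j] else 0 for j in range(n)] for i in range(n)]

def relevance_vector(samples, query):
    """For each sample, multiply S by the query indicator vector and test the
    result against |Q|; then average the indicator vectors over samples."""
    n = len(samples[0])
    members = set(query)
    e_q = [1 if i in members else 0 for i in range(n)]
    totals = [0.0] * n
    for z in samples:
        s = cooccurrence_matrix(z)
        # Matrix-vector product: counts[i] = number of query rows sharing i's cluster.
        counts = [sum(sij * ej for sij, ej in zip(row, e_q)) for row in s]
        for i, c in enumerate(counts):
            if c == len(query):  # row i co-occurs with every query row
                totals[i] += 1.0
    return [t / len(samples) for t in totals]

samples = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 0],
]
print(relevance_vector(samples, query=[1]))  # [0.75, 1.0, 0.75, 0.5]
```

Because the per-sample work is a single matrix-vector product, the per-row results agree exactly with the nested-loop estimator while exposing the structure that the sparse implementation accelerates.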
For large datasets, constructing the matrix S^(h)_v using O(N^2) operations is prohibitively expensive. Algorithm 2 describes an efficient procedure that exploits CrossCat's sparsity to build S^(h)_v in expected O(N^2 / (α_v log N)) time by using (i) a sparse matrix representation, and (ii) CrossCat's partition data structures to avoid considering all pairs of rows. This fast construction makes Eq (7) practical for large data tables.

The algorithm's running time depends on (i) the number of clusters in line 3; (ii) the average number of rows per cluster in line 4; and (iii) the data structures used to represent S^(h)_v in line 5. Under the CRP prior, the expected number of clusters is O(α_v log N), which implies an average occupancy of O(N / (α_v log N)) rows per cluster. If the sparse binary matrix is stored with a list-of-lists representation, then the update in line 5 requires O(N / (α_v log N)) time. Furthermore, we emphasize that since S^(h)_v does not depend on the query Q, its construction cost is amortized over an arbitrary number of queries.
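The cluster-by-cluster construction that makes this fast can be sketched as follows: grouping rows by cluster first means only within-cluster pairs are ever touched, so the total work is the sum of squared cluster sizes rather than all N^2 pairs. This is an illustrative sketch, not Algorithm 2 itself:

```python
from collections import defaultdict

def sparse_cooccurrence(z):
    """Build the co-occurrence structure by iterating over clusters,
    touching only within-cluster row pairs (work is the sum of squared
    cluster sizes, not N**2)."""
    clusters = defaultdict(list)
    for row, k in enumerate(z):
        clusters[k].append(row)
    s = defaultdict(set)  # sparse rows: s[i] = rows sharing i's cluster
    for members in clusters.values():
        for i in members:
            s[i].update(members)
    return s

s = sparse_cooccurrence([0, 0, 1, 0])
print(sorted(s[0]))  # [0, 1, 3]
print(sorted(s[2]))  # [2]
```

Since the structure depends only on the partition and not on any query, it can be built once per posterior sample and reused across arbitrarily many relevance queries.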
We have so far assumed that the query records must consist of items that already exist in the database. This section relaxes this restrictive assumption by illustrating how to compute relevance probabilities for search records which do not exist in x, and are instead specified by the user on a per-query basis (refer to the BQL query in Figure 3 for an example of a hypothetical query record). The key idea is to (i) incorporate the new records into each CrossCat sample by using a Gibbs step to sample cluster assignments from the joint posterior (Neal, 2000); (ii) compute Eq (7) on the updated samples; and (iii) unincorporate the records, leaving the original samples unmutated.

Letting x_{N+1}, ..., x_{N+m} denote the (partially observed) new rows and Q the query, we compute R_c(r, Q) for all rows r by first applying CrossCatIncorporateRecord (Algorithm 3) to each new row sequentially. Sequential incorporation corresponds to sampling from the sequence of predictive distributions, which, by exchangeability, ensures that each updated G^(h) contains a sample of cluster assignments from the joint distribution, guaranteeing the correctness of the Monte Carlo estimator in Eq (6). Note that since CrossCat specifies a nonparametric mixture, the proposal clusters include all existing clusters, plus one singleton cluster. We next update the co-occurrence matrices in time linear in the size of the sampled cluster and then evaluate Eq (7) and (8). To unincorporate, we reverse lines 9-11 and restore the co-occurrence matrices. Figure 4 confirms that the runtime scaling is asymptotically linear, varying (i) the number of new rows; (ii) the fraction of variables specified for the new rows that are in the context block (i.e. query sparsity); (iii) the number of clusters in the context block; and (iv) the number of variables in the context block.
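Incorporating a hypothetical record via the CRP predictive distribution can be sketched as below. Proposal weights are proportional to existing cluster sizes, plus the concentration for a fresh singleton, optionally multiplied by the record's likelihood under each proposal. This is an illustrative sketch of the Gibbs step, not Algorithm 3; all names are assumptions:

```python
import random

def incorporate_row(counts, alpha, rng, likelihoods=None):
    """Sample a cluster assignment for one new row from the CRP predictive:
    weight counts[k] for each existing cluster k, weight alpha for a fresh
    singleton. `likelihoods` (one entry per proposal, singleton last) would
    weight each proposal by the new row's observed values."""
    weights = list(counts) + [alpha]
    if likelihoods is not None:
        weights = [w * l for w, l in zip(weights, likelihoods)]
    k = rng.choices(range(len(weights)), weights=weights)[0]
    new_counts = list(counts) + ([0] if k == len(counts) else [])
    new_counts[k] += 1
    return k, new_counts

# Force the singleton proposal by zeroing the other likelihoods.
k, counts = incorporate_row([3, 1], alpha=1.0, rng=random.Random(0),
                            likelihoods=[0.0, 0.0, 1.0])
print(k, counts)  # 2 [3, 1, 1]
```

Unincorporation is the reverse bookkeeping: decrement the sampled cluster's count (dropping it if it becomes empty), restoring the original sample unchanged.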

(Figure 5 panel: pairwise cosine similarities in different contexts.)


This section illustrates the efficacy of predictive relevance in BayesDB by applying the technique to several search problems in real-world, sparse, and high-dimensional datasets of public interest.^2 Appendix A formally describes the integration of RELEVANCE PROBABILITY into BayesDB as an expression in the Bayesian Query Language (Figure 3).

^2 Appendix D contains a further application to a dataset of classic cars from 1987.
The College Scorecard (Council of Economic Advisers, 2015) is a federal dataset consisting of over 7000 colleges and 1700 variables, and is used to measure and improve the performance of US institutions of higher education. These variables span a broad set of categories, including campus characteristics, academic programs, student debt, tuition fees, admission rates, instructional investments, ethnic distributions, and completion rates. We analyzed a subset of 2000 schools (four-year institutions) and 100 variables from the categories listed above.
Suppose a student is interested in attending a city university with a set of desired specifications. Starting with a standard SQL Boolean search (Figure 1), they find only one matching record, and retrieving more results requires iteratively rewriting the search conditions.
Figure 1 instead expresses the search query as a hypothetical row in a BQL PREDICTIVE RELEVANCE query (which invokes the technique in Section 3.3). The top-ranking records contain first-rate schools, but their admission rates are much too stringent. The user then re-expresses the BQL query to rank schools by predictive relevance, in the context of instructional investment, to a subset of the first-rate schools discovered in the previous query. Combining ORDER BY PREDICTIVE RELEVANCE with Boolean conditions in the WHERE clause returns another set of top-quality schools with city campuses that are less competitive than the earlier results, but whose quantitative metrics are much better than national averages.
Gapminder (Rosling, 2008) is an extensive longitudinal dataset of over 320 global macroeconomic variables of population growth, education, climate, trade, welfare, and health for 225 countries. Our experiments are based on a cross-section of the data from the year 2002. The data is sparse, with 35% of values missing. Figure 5 shows heatmaps of the pairwise predictive relevances for all countries in the dataset under different contexts, and compares the results to cosine similarity. Clusters of predictively relevant countries form commonsense taxonomies; refer to the caption for further discussion.
Figure 6 shows the top 15 countries in the dataset ordered by their predictive relevance to the United States, in the context of "life expectancy at birth". Table 6 shows representative variables in the context; these variables have the highest dependence probability with the context variable, according to a Monte Carlo estimate using 64 posterior CrossCat samples. The countries in Figure 6 are all rich, Western democracies with highly developed economies and advanced healthcare systems.
To quantitatively evaluate the quality of the top-ranked countries returned by predictive relevance, we ran the technique on 10 representative search queries (varying the country and context variable) and obtained the top 10 results for each query. Figure 7 shows the queries, and human preferences for the results from predictive relevance versus results from cosine similarity between the country vectors. We defined the context for cosine similarity by reducing the 320-dimensional vectors down to 10 dimensions, selecting the variables most dependent with the context variable according to CrossCat's dependence probabilities. To deal with sparsity, which cosine similarity cannot handle natively, we imputed missing values using sample medians; imputation techniques like MICE (Buuren and Groothuis-Oudshoorn, 2011) resulted in little difference (Appendix C).

This paper has shown how to perform probabilistic searches of structured data by combining ideas from probabilistic programming, information theory, and nonparametric Bayes. The demonstrations suggest the technique can be effective on sparse, real-world databases from multiple domains, producing results that human evaluators often preferred to a standard baseline.
More empirical evaluation is clearly needed, ideally including tests of hundreds or thousands of queries, more complex query types, and comparisons with query results manually provided by human domain experts. In fact, search via predictive relevance in the context of variables drawn from learned representations of data could potentially provide a meaningful way to compare representation learning techniques. It also may be fruitful to build a distributed implementation suitable for database representations of web-scale data, including photos, social network users, and web pages.
Relatively unstructured probabilistic models, such as topic models, proved sufficient for making unstructured text data far more accessible and useful. We hope this paper helps illustrate the potential for structured probabilistic models to improve the accessibility and usefulness of structured data.
The authors wish to acknowledge Ryan Rifkin, Anna Comerford, Marie Huber, and Richard Tibbetts for helpful comments on early drafts. This research was supported by DARPA (PPAML program, contract number FA87501420004), IARPA (under research contract 201515061000003), the Office of Naval Research (under research contract N000141310333), the Army Research Office (under agreement number W911NF1310212), and gifts from Analog Devices and Google.
References

Mansinghka et al. CrossCat: A fully Bayesian nonparametric method for analyzing heterogeneous, high dimensional data. Journal of Machine Learning Research, 17(138):1–49, 2016.

Obermeyer et al. Scaling nonparametric Bayesian inference via subsample-annealing. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 696–705. JMLR.org, 2014.

This section describes the integration of predictive relevance into BayesDB (Mansinghka et al., 2015; Saad and Mansinghka, 2016), a probabilistic programming platform for probabilistic data analysis.
New syntaxes in the Bayesian Query Language (BQL) allow a user to express predictive relevance queries where the query set can be an arbitrary combination of existing and hypothetical records. We implement predictive relevance in BQL as an expression with the following syntaxes, depending on the specification of the query records.
Query records are existing rows.
Query records are hypothetical rows.
Query records are existing and hypothetical rows.
The expression is formally implemented as a 1-row BQL estimand, which specifies a map from each record in the table to a relevance probability. As shown in the expressions above, query records are specified by the user in two ways: (i) by giving a collection of EXISTING ROWS, whose primary key indexes are either specified manually or retrieved using an arbitrary BQL <expression>; (ii) by specifying one or more HYPOTHETICAL ROWS with their <values> as a list of column-value pairs. These new rows are first incorporated using Algorithm 3 from Section 3.3, and are unincorporated after the query finishes. The <context-var> can be any variable in the tabular population.
As a 1-row function in the structured query language, the RELEVANCE PROBABILITY expression can be used in a variety of settings. Some typical use cases are shown in the following examples, where we use only existing query rows for simplicity.
As a column in an ESTIMATE query.
As a filter in a WHERE clause.
As a comparator in an ORDER BY clause.
It is also possible to perform arithmetic operations and Boolean comparisons on relevance probabilities.
Finding the mean relevance probability for a set of rowids of interest.
Finding rows which are more relevant in one context than in another.










make  price  wheels  doors  engine  horsepower  body 

mercedes  40,960  rear  four  308  184  sedan 
make  price  wheels  doors  engine  horsepower  body 

jaguar  35,550  rear  four  258  176  sedan 
jaguar  32,250  rear  four  258  176  sedan 
mercedes  40,960  rear  four  308  184  sedan 
mercedes  45,400  rear  two  304  184  hardtop 
mercedes  34,184  rear  four  234  155  sedan 
mercedes  35,056  rear  two  234  155  convertible 
bmw  36,880  rear  four  209  182  sedan 
bmw  41,315  rear  two  209  182  sedan 
bmw  30,760  rear  four  209  182  sedan 
jaguar  36,000  rear  two  326  262  sedan 









