DeepAI AI Chat
Log In Sign Up

CrossCat: A Fully Bayesian Nonparametric Method for Analyzing Heterogeneous, High Dimensional Data

by   Vikash Mansinghka, et al.
University of Louisville

There is a widespread need for statistical methods that can analyze high-dimensional datasets with- out imposing restrictive or opaque modeling assumptions. This paper describes a domain-general data analysis method called CrossCat. CrossCat infers multiple non-overlapping views of the data, each consisting of a subset of the variables, and uses a separate nonparametric mixture to model each view. CrossCat is based on approximately Bayesian inference in a hierarchical, nonparamet- ric model for data tables. This model consists of a Dirichlet process mixture over the columns of a data table in which each mixture component is itself an independent Dirichlet process mixture over the rows; the inner mixture components are simple parametric models whose form depends on the types of data in the table. CrossCat combines strengths of mixture modeling and Bayesian net- work structure learning. Like mixture modeling, CrossCat can model a broad class of distributions by positing latent variables, and produces representations that can be efficiently conditioned and sampled from for prediction. Like Bayesian networks, CrossCat represents the dependencies and independencies between variables, and thus remains accurate when there are multiple statistical signals. Inference is done via a scalable Gibbs sampling scheme; this paper shows that it works well in practice. This paper also includes empirical results on heterogeneous tabular data of up to 10 million cells, such as hospital cost and quality measures, voting records, unemployment rates, gene expression measurements, and images of handwritten digits. CrossCat infers structure that is consistent with accepted findings and common-sense knowledge in multiple domains and yields predictive accuracy competitive with generative, discriminative, and model-free alternatives.


page 3

page 24

page 25

page 30

page 34

page 35

page 36

page 37


BNPdensity: Bayesian nonparametric mixture modeling in R

Robust statistical data modelling under potential model mis-specificatio...

Fast and Order-invariant Inference in Bayesian VARs with Non-Parametric Shocks

The shocks which hit macroeconomic models such as Vector Autoregressions...

Hierarchical Infinite Relational Model

This paper describes the hierarchical infinite relational model (HIRM), ...

A Dirichlet Process Mixture Model for Directional-Linear Data

Directional data require specialized probability models because of the n...

Bayesian Hierarchical Mixture Clustering using Multilevel Hierarchical Dirichlet Processes

This paper focuses on the problem of hierarchical non-overlapping cluste...

Bayesian Nonparametric Mixtures of Exponential Random Graph Models for Ensembles of Networks

Ensembles of networks arise in various fields where multiple independent...