CrossCat: A Fully Bayesian Nonparametric Method for Analyzing Heterogeneous, High Dimensional Data

12/03/2015
by   Vikash Mansinghka, et al.

There is a widespread need for statistical methods that can analyze high-dimensional datasets without imposing restrictive or opaque modeling assumptions. This paper describes a domain-general data analysis method called CrossCat. CrossCat infers multiple non-overlapping views of the data, each consisting of a subset of the variables, and uses a separate nonparametric mixture to model each view. CrossCat is based on approximately Bayesian inference in a hierarchical, nonparametric model for data tables. This model consists of a Dirichlet process mixture over the columns of a data table in which each mixture component is itself an independent Dirichlet process mixture over the rows; the inner mixture components are simple parametric models whose form depends on the types of data in the table. CrossCat combines strengths of mixture modeling and Bayesian network structure learning. Like mixture modeling, CrossCat can model a broad class of distributions by positing latent variables, and produces representations that can be efficiently conditioned and sampled from for prediction. Like Bayesian networks, CrossCat represents the dependencies and independencies between variables, and thus remains accurate when there are multiple statistical signals. Inference is done via a scalable Gibbs sampling scheme; this paper shows that it works well in practice. This paper also includes empirical results on heterogeneous tabular data of up to 10 million cells, such as hospital cost and quality measures, voting records, unemployment rates, gene expression measurements, and images of handwritten digits. CrossCat infers structure that is consistent with accepted findings and common-sense knowledge in multiple domains and yields predictive accuracy competitive with generative, discriminative, and model-free alternatives.
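The nested structure described in the abstract (a Dirichlet process mixture over columns, each component of which is its own Dirichlet process mixture over rows) can be illustrated by sampling the two levels of partitions from their priors. The sketch below is not the authors' implementation: it uses the Chinese restaurant process representation of the Dirichlet process, covers only the latent partition structure (no parametric component models and no Gibbs inference), and the names `crp_partition`, `sample_crosscat_structure`, `alpha_cols`, and `alpha_rows` are hypothetical, chosen for illustration.

```python
import random


def crp_partition(n, alpha, rng):
    """Partition n items with a Chinese restaurant process (concentration alpha)."""
    assignments = []
    counts = []  # counts[k] = number of items already placed in cluster k
    for _ in range(n):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments


def sample_crosscat_structure(n_rows, n_cols, alpha_cols=1.0, alpha_rows=1.0, seed=0):
    """Sample a prior draw of the two-level partition (illustrative assumption).

    Returns:
      col_to_view: view index for each column (outer partition over columns).
      row_to_category: for each view, a category index for each row
        (an independent inner partition over rows per view).
    """
    rng = random.Random(seed)
    col_to_view = crp_partition(n_cols, alpha_cols, rng)
    n_views = max(col_to_view) + 1
    row_to_category = [
        crp_partition(n_rows, alpha_rows, rng) for _ in range(n_views)
    ]
    return col_to_view, row_to_category


if __name__ == "__main__":
    views, categories = sample_crosscat_structure(n_rows=6, n_cols=5, seed=1)
    print("column -> view:", views)
    for v, cats in enumerate(categories):
        print(f"view {v}: row -> category:", cats)
```

Each column belongs to exactly one view, so the views are non-overlapping, and rows are clustered separately within each view, so two rows can share a category under one view and differ under another. In the full model, each (view, category) pair would additionally carry a simple parametric model matched to the data type of each column.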
