Bayesian data selection

09/06/2021
by Eli N. Weinstein, et al.

Insights into complex, high-dimensional data can be obtained by discovering features of the data that match or do not match a model of interest. To formalize this task, we introduce the "data selection" problem: finding a lower-dimensional statistic, such as a subset of variables, that is well fit by a given parametric model of interest. A fully Bayesian approach to data selection would be to parametrically model the value of the statistic, nonparametrically model the remaining "background" components of the data, and perform standard Bayesian model selection for the choice of statistic. However, fitting a nonparametric model to high-dimensional data tends to be highly inefficient, both statistically and computationally. We propose a novel score for performing both data selection and model selection, the "Stein volume criterion", that takes the form of a generalized marginal likelihood with a kernelized Stein discrepancy in place of the Kullback-Leibler divergence. The Stein volume criterion does not require one to fit or even specify a nonparametric background model, making it straightforward to compute: in many cases it is as simple as fitting the parametric model of interest with an alternative objective function. We prove that the Stein volume criterion is consistent for both data selection and model selection, and we establish consistency and asymptotic normality (Bernstein-von Mises) of the corresponding generalized posterior on parameters. We validate our method in simulation and apply it to the analysis of single-cell RNA sequencing datasets using probabilistic principal components analysis and a spin glass model of gene regulation.
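
To make the central ingredient concrete, the sketch below estimates a plain kernelized Stein discrepancy (KSD) between one-dimensional samples and a Gaussian model. It is an illustration only, not the paper's Stein volume criterion or its exact discrepancy: the RBF base kernel, the bandwidth, the V-statistic estimator, and the function names `gaussian_score` and `ksd_v_statistic` are all assumptions made for this example. What it does show is that the discrepancy needs only the model's score function (the gradient of its log density), so no model of the data distribution itself has to be fit.

```python
import numpy as np


def gaussian_score(x, mean, std):
    """Score function d/dx log N(x | mean, std^2) of the parametric model."""
    return -(x - mean) / std**2


def ksd_v_statistic(x, score, bandwidth=1.0):
    """V-statistic estimate of the squared KSD for 1-D samples x.

    Assumes an RBF base kernel k(x, y) = exp(-(x - y)^2 / (2 h^2));
    this kernel choice is illustrative, not taken from the paper.
    """
    x = np.asarray(x, dtype=float)
    s = score(x)                          # model score at each sample
    d = x[:, None] - x[None, :]           # pairwise differences x_i - x_j
    h2 = bandwidth**2
    k = np.exp(-d**2 / (2.0 * h2))        # base kernel k(x_i, x_j)
    dk_dx = -d / h2 * k                   # derivative in the first argument
    dk_dy = d / h2 * k                    # derivative in the second argument
    d2k = (1.0 / h2 - d**2 / h2**2) * k   # mixed second derivative
    # Stein kernel: s(x)s(y)k + s(x) d_y k + s(y) d_x k + d_x d_y k
    stein = (s[:, None] * s[None, :] * k
             + s[:, None] * dk_dy
             + s[None, :] * dk_dx
             + d2k)
    return stein.mean()


# Usage: data drawn from N(0, 1), scored against a matching and a shifted model.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)
ksd_match = ksd_v_statistic(samples, lambda x: gaussian_score(x, 0.0, 1.0))
ksd_shift = ksd_v_statistic(samples, lambda x: gaussian_score(x, 3.0, 1.0))
print(ksd_match, ksd_shift)
```

The well-specified model should give the smaller discrepancy. A score of this kind could then stand in for the Kullback-Leibler divergence inside a generalized posterior, which is the role the abstract describes for the kernelized Stein discrepancy.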

research
03/17/2018

Large-Scale Model Selection with Misspecification

Model selection is crucial to high-dimensional learning and inference fo...
research
09/12/2023

A Consistent and Scalable Algorithm for Best Subset Selection in Single Index Models

Analysis of high-dimensional data has led to increased interest in both ...
research
03/28/2018

On Model Selection with Summary Statistics

Recently, many authors have cast doubts on the validity of ABC model cho...
research
04/16/2020

Assessing the Significance of Model Selection in Ecology

Model Selection is a key part of many ecological studies, with Akaike's ...
research
06/29/2020

Data integration in high dimension with multiple quantiles

This article deals with the analysis of high dimensional data that come ...
research
02/17/2023

Approximate Bayes Optimal Pseudo-Label Selection

Semi-supervised learning by self-training heavily relies on pseudo-label...
research
12/23/2014

Model Selection in High-Dimensional Misspecified Models

Model selection is indispensable to high-dimensional sparse modeling in ...
