Data thinning for convolution-closed distributions

01/18/2023
by   Anna Neufeld, et al.

We propose data thinning, a new approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This proposal is very general, and can be applied to any observation drawn from a "convolution closed" distribution, a class that includes the Gaussian, Poisson, negative binomial, Gamma, and binomial distributions, among others. It is similar in spirit to – but distinct from, and more easily applicable than – a recent proposal known as data fission. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the "usual" approach of cross-validation via sample splitting, especially in unsupervised settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis.
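
As an illustration of the splitting described above, the following is a minimal sketch for the Poisson case only. It relies on the standard Poisson thinning fact: if X ~ Poisson(λ) and X1 | X ~ Binomial(X, ε), then X1 ~ Poisson(ελ) and X2 = X − X1 ~ Poisson((1 − ε)λ), with X1 and X2 independent. The function name `thin_poisson`, the parameter name `eps`, and the simulated values of `lam` and `eps` are assumptions made for this example and are not taken from the paper or any accompanying software.

```python
import numpy as np


def thin_poisson(x, eps, rng):
    """Split Poisson counts x into two parts that sum to x.

    Hypothetical helper for illustration: each count is divided by a
    Binomial(x, eps) draw, so that (for Poisson data) the two parts are
    independent Poisson variables with means eps*lam and (1-eps)*lam.
    """
    x = np.asarray(x)
    x1 = rng.binomial(x, eps)  # first part, drawn conditionally on x
    x2 = x - x1                # second part is the remainder
    return x1, x2


# Simulated data; lam and eps are arbitrary choices for the demonstration.
rng = np.random.default_rng(0)
lam, eps = 5.0, 0.3
x = rng.poisson(lam, size=100_000)

x1, x2 = thin_poisson(x, eps, rng)

print(np.array_equal(x1 + x2, x))   # the parts sum to the original counts
print(x1.mean(), x2.mean())         # should be near eps*lam and (1-eps)*lam
print(np.corrcoef(x1, x2)[0, 1])    # should be near zero
```

In a cross-validation setting, one part (for example `x1`) could be used to fit a model and the other (`x2`) to evaluate it, since both share the same underlying parameter up to the known scaling `eps`.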
