Bayesian nonparametric estimation of coverage probabilities and distinct counts from sketched data

09/05/2022
by Stefano Favaro, et al.

The estimation of coverage probabilities, and in particular of the missing mass, is a classical statistical problem with applications in numerous scientific fields. In this paper, we study this problem in relation to randomized data compression, or sketching. This novel but practically relevant perspective covers situations in which coverage probabilities must be estimated from a compressed and imperfect summary, or sketch, of the true data, because neither the full data nor the empirical frequencies of distinct symbols can be observed directly. Our contribution is a Bayesian nonparametric methodology to estimate coverage probabilities from data sketched through random hashing. The same methodology also solves two challenging related problems: recovering the number of distinct symbols in the true data, and recovering the number of distinct symbols with a specified empirical frequency of interest. The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior, although they involve some open computational challenges under the more general Pitman-Yor process prior. The empirical effectiveness of our methodology is demonstrated through numerical experiments and applications to real data sets of Covid DNA sequences, classic English literature, and IP addresses.
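To make the setting concrete, the following is a minimal illustrative sketch in Python and not the estimator proposed in the paper. It assumes a single hash function built from SHA-256 (the helper names `hash_sketch` and `good_turing_missing_mass`, the seed, and the toy data are all hypothetical) and contrasts what is available with and without sketching: the classical Good-Turing missing-mass estimate needs the empirical frequencies of the true data, while the sketch only exposes bucket totals in which distinct symbols may collide.

```python
import hashlib
from collections import Counter


def hash_sketch(data, num_buckets, seed=0):
    """Compress a sequence of symbols into bucket counts via a seeded hash.

    Assumption: one hash function and plain counts per bucket; the paper's
    exact sketching scheme may differ.
    """
    sketch = [0] * num_buckets
    for symbol in data:
        digest = hashlib.sha256(f"{seed}:{symbol}".encode()).hexdigest()
        sketch[int(digest, 16) % num_buckets] += 1
    return sketch


def good_turing_missing_mass(data):
    """Classical Good-Turing estimate: share of observations that are singletons.

    This requires the unsketched data, which is exactly what is unavailable
    in the sketched setting.
    """
    counts = Counter(data)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(data)


# Toy data: 8 observations, 5 distinct symbols, 3 of them seen once.
data = ["a", "b", "a", "c", "d", "a", "b", "e"]
sketch = hash_sketch(data, num_buckets=3)

print("bucket counts:", sketch)                          # totals only
print("Good-Turing:", good_turing_missing_mass(data))   # 3 / 8 = 0.375
```

With three buckets and five distinct symbols, at least two symbols must share a bucket, so neither the number of distinct symbols nor the number of singletons can be read off the sketch directly. That is the gap a model-based estimator has to fill.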
