Random measure priors in Bayesian frequency recovery from sketches
Given a lossy-compressed representation, or sketch, of data with values in a set of symbols, the frequency recovery problem is to estimate the empirical frequency of a new data point. Recent studies have applied Bayesian nonparametrics (BNP) to develop learning-augmented versions of the popular count-min sketch (CMS) recovery algorithm. In this paper, we present a novel BNP approach to frequency recovery, which is not built from the CMS but still relies on a sketch obtained by random hashing. Assuming the data are modeled as random samples from an unknown discrete distribution, which is endowed with a Poisson-Kingman (PK) prior, we provide the posterior distribution of the empirical frequency of a symbol, given the sketch. Estimates are then obtained as mean functionals. We apply our result to the Dirichlet process (DP) and Pitman-Yor process (PYP) priors, and in particular: i) we characterize the DP prior as the sole PK prior featuring a property of sufficiency with respect to the sketch, leading to a simple posterior distribution; ii) we identify a large sample regime under which the PYP prior leads to a simple approximation of the posterior distribution. We then extend our BNP approach to a "traits" formulation of the frequency recovery problem, not yet studied in the CMS literature, in which each data point may belong to more than one symbol (trait) and exhibits nonnegative integer levels of association with each trait. In particular, by modeling data as random samples from a generalized Indian buffet process, we provide the posterior distribution of the empirical frequency level of a trait, given the sketch. This result is then applied under Poisson and Bernoulli distributions for the levels of association, leading to a simple posterior distribution and a simple approximation of the posterior distribution, respectively.
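To make the setting concrete, the following is a minimal illustrative sketch, in Python, of how a sketch obtained by random hashing might be built from a stream of symbols. It is not the paper's estimator: the function names (`make_hash`, `build_sketch`), the keyed BLAKE2b hash, and the number of buckets are all assumptions chosen for illustration. It only shows the compressed object on which a posterior distribution for the frequency of a symbol would be conditioned.

```python
import hashlib
import random


def make_hash(num_buckets, seed):
    """Return a hypothetical random hash function from symbols to bucket indices.

    The salt is drawn from a seeded generator so the mapping is reproducible.
    """
    rng = random.Random(seed)
    salt = rng.getrandbits(128).to_bytes(16, "big")

    def h(symbol):
        # Keyed BLAKE2b stands in for a randomly drawn hash function.
        digest = hashlib.blake2b(
            str(symbol).encode("utf-8"), key=salt, digest_size=8
        ).digest()
        return int.from_bytes(digest, "big") % num_buckets

    return h


def build_sketch(data, h, num_buckets):
    """Compress the data into a vector of bucket counts (the sketch)."""
    counts = [0] * num_buckets
    for x in data:
        counts[h(x)] += 1
    return counts


# Usage: the bucket count for a symbol is an upper bound on its true
# frequency, since other symbols may collide into the same bucket.
data = ["a", "b", "a", "c", "a", "b"]
h = make_hash(num_buckets=4, seed=0)
sketch = build_sketch(data, h, num_buckets=4)
upper_bound_a = sketch[h("a")]
```

Because of hash collisions, the bucket count only bounds the true frequency from above; a Bayesian approach of the kind described here replaces that crude bound with a posterior distribution over the frequency, given the bucket counts.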