Learning-augmented count-min sketches via Bayesian nonparametrics

02/08/2021
by   Emanuele Dolera, et al.

The count-min sketch (CMS) is a time- and memory-efficient randomized data structure that provides estimates of tokens' frequencies in a data stream, i.e. point queries, based on randomly hashed data. Learning-augmented CMSs improve on the CMS by learning models that better exploit properties of the data. In this paper, we focus on the learning-augmented CMS of Cai, Mitzenmacher and Adams (NeurIPS 2018), which relies on Bayesian nonparametric (BNP) modeling of a data stream via Dirichlet process (DP) priors. This is referred to as the CMS-DP, and it leads to BNP estimates of a point query as posterior means of the point query given the hashed data. While BNP modeling has proved to be a powerful tool for developing robust learning-augmented CMSs, the ideas and methods behind the CMS-DP are tailored to point queries under DP priors, and they cannot be used for other priors or for more general queries. In this paper, we present an alternative, and more flexible, derivation of the CMS-DP such that: i) it allows the use of the Pitman-Yor process (PYP) prior, which is arguably the most popular generalization of the DP prior; ii) it can be readily applied to the more general problem of estimating range queries. This leads to a novel learning-augmented CMS for power-law data streams, referred to as the CMS-PYP, which relies on BNP modeling of the stream via PYP priors. Applications to synthetic and real data show that the CMS-PYP outperforms both the CMS and the CMS-DP in the estimation of low-frequency tokens; this is known to be a critical feature in natural language processing, where it is common to encounter power-law data streams.
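To make the abstract's starting point concrete, here is a minimal sketch of the classical (non-learning-augmented) CMS that the paper builds on: a table of `depth` hash rows of width `width`, where a point query returns the minimum counter across rows. The hash scheme and parameters below are illustrative choices, not those of the paper.

```python
import hashlib


class CountMinSketch:
    """Minimal count-min sketch: `depth` hash rows of width `width`.
    A point query never underestimates the true frequency."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, token, row):
        # Derive an independent-looking hash per row by salting
        # the token with the row index (illustrative scheme).
        h = hashlib.blake2b(f"{row}:{token}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def update(self, token, count=1):
        # Add `count` to one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(token, row)] += count

    def query(self, token):
        # Classical CMS point-query estimate: the minimum over rows.
        # Hash collisions can only inflate counters, so this is an
        # upper bound on the token's true frequency.
        return min(self.table[row][self._index(token, row)]
                   for row in range(self.depth))
```

The CMS-DP and CMS-PYP discussed in the paper keep this hashed-data structure but replace the deterministic minimum rule with a posterior mean of the point query under a DP or PYP prior on the stream.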


