A Bayesian nonparametric approach to count-min sketch under power-law data streams

02/07/2021
by Emanuele Dolera, et al.

The count-min sketch (CMS) is a randomized data structure that estimates the frequencies of tokens in a large data stream from a compressed representation of the data obtained by random hashing. In this paper, we rely on a recent Bayesian nonparametric (BNP) view of the CMS to develop a novel learning-augmented CMS for power-law data streams. We assume that tokens in the stream are drawn from an unknown discrete distribution, which is endowed with a normalized inverse Gaussian process (NIGP) prior. Then, using distributional properties of the NIGP, we compute the posterior distribution of a token's frequency in the stream, given the hashed data, and in turn the corresponding BNP estimates. Applications to synthetic and real data show that our approach achieves remarkable performance in the estimation of low-frequency tokens. This is known to be a desirable feature in natural language processing, where low-frequency tokens are common because of the power-law behaviour of the data.
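
For readers unfamiliar with the data structure, the following is a minimal Python sketch of the classical CMS that the abstract takes as its starting point. It is not the BNP estimator developed in the paper: the class name `CountMinSketch`, the parameters `width` and `depth`, and the use of salted BLAKE2b digests as hash functions are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Classical count-min sketch: `depth` hash rows of `width` counters each."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, token, row):
        # Assumption: one salted BLAKE2b digest per row stands in for
        # a family of independent random hash functions.
        digest = hashlib.blake2b(
            token.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, token, count=1):
        # Each token increments exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(token, row)] += count

    def estimate(self, token):
        # Collisions can only inflate a counter, so the minimum over rows
        # is an upper bound on the true frequency.
        return min(
            self.table[row][self._bucket(token, row)]
            for row in range(self.depth)
        )


# Toy stream with a few frequent tokens and some rare ones.
stream = ["the"] * 50 + ["of"] * 20 + ["sketch"] * 3 + ["nigp"]
cms = CountMinSketch(width=64, depth=4)
for token in stream:
    cms.add(token)

print(cms.estimate("the"), cms.estimate("nigp"))
```

The classical estimate never falls below the true frequency, and the overestimation caused by collisions is proportionally largest for rare tokens. This is the regime in which the abstract reports the strongest gains for the BNP estimates.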


research
02/08/2021

Learning-augmented count-min sketches via Bayesian nonparametrics

The count-min sketch (CMS) is a time and memory efficient randomized dat...
research
03/27/2023

Random measure priors in Bayesian frequency recovery from sketches

Given a lossy-compressed representation, or sketch, of data with values ...
research
04/01/2022

Double-Hashing Algorithm for Frequency Estimation in Data Streams

Frequency estimation of elements is an important task for summarizing da...
research
07/04/2022

Learning state machines via efficient hashing of future traces

State machines are popular models to model and visualize discrete system...
research
09/05/2022

Bayesian nonparametric estimation of coverage probabilities and distinct counts from sketched data

The estimation of coverage probabilities, and in particular of the missi...
research
11/20/2022

Pragmatic Constraint on Distributional Semantics

This paper studies the limits of language models' statistical learning i...
research
03/28/2022

A Formal Analysis of the Count-Min Sketch with Conservative Updates

Count-Min Sketch with Conservative Updates (CMS-CU) is a popular algorit...
