Count-Min: Optimal Estimation and Tight Error Bounds using Empirical Error Distributions

11/09/2018
by Daniel Ting, et al.

The Count-Min sketch is an important and well-studied data summarization method. It allows one to estimate the count of any item in a stream using a small, fixed-size data sketch. However, the accuracy of the sketch depends on characteristics of the underlying data. This has led to a number of count estimation procedures that work well in one scenario but perform poorly in others. A practitioner is faced with two basic, unanswered questions. Which variant should be chosen when the characteristics of the data are unknown? Given an estimate, is its error sufficiently small to be trustworthy? We provide answers to these questions. We derive new count estimators, including a provably optimal estimator, which beat or match previous estimators in all scenarios. We also provide practical, tight error bounds at query time for both new and existing estimators. Because these error estimates can extrapolate the error to different choices of sketch width and depth, they also yield procedures for choosing the sketch tuning parameters optimally. The key observation is that the distribution of errors in each counter can be empirically estimated from the sketch itself. Once this distribution has been estimated, count estimation becomes a statistical estimation and inference problem with a known error distribution. This provides both a principled way to derive new and optimal estimators and a way to study the error and properties of existing estimators.
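For readers unfamiliar with the data structure, the following is a minimal sketch of a classical Count-Min sketch with the standard minimum-over-rows estimator. It is not one of the new estimators described in the abstract. The class name `CountMinSketch`, its method names, and the use of a salted BLAKE2b hash for the per-row hash functions are illustrative assumptions, not details taken from the paper.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch: `depth` rows of `width` counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Assumption: one salted BLAKE2b hash per row stands in for the
        # independent hash functions used in the usual analysis.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Each item increments exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Classical estimator: with non-negative updates, collisions only
        # inflate counters, so the minimum over rows never underestimates.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


sketch = CountMinSketch(width=64, depth=4)
stream = ["a"] * 50 + ["b"] * 20 + [f"x{i}" for i in range(200)]
for item in stream:
    sketch.add(item)

print(sketch.estimate("a"))  # at least 50; any excess comes from collisions
```

The width and depth passed to the constructor are the tuning parameters the abstract refers to: a wider table reduces collisions per counter, while a deeper one gives the estimator more counters to combine.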


research
08/14/2019

(Learned) Frequency Estimation Algorithms under Zipfian Distribution

The frequencies of the elements in a data stream are an important statis...
research
11/06/2021

Frequency Estimation with One-Sided Error

Frequency estimation is one of the most fundamental problems in streamin...
research
06/06/2023

Statistical inference for sketching algorithms

Sketching algorithms use random projections to generate a smaller sketch...
research
06/11/2022

Sampling-based Estimation of the Number of Distinct Values in Distributed Environment

In data mining, estimating the number of distinct values (NDV) is a fund...
research
03/28/2022

A Formal Analysis of the Count-Min Sketch with Conservative Updates

Count-Min Sketch with Conservative Updates (CMS-CU) is a popular algorit...
research
02/24/2021

SALSA: Self-Adjusting Lean Streaming Analytics

Counters are the fundamental building block of many data sketching schem...
research
06/03/2021

Cause specific rate functions for panel count data with multiple modes of recurrence

Panel count data arise from longitudinal studies on recurrent events whe...
