Simpler and Better Cardinality Estimators for HyperLogLog and PCSA

08/22/2022
by   Seth Pettie, et al.

Cardinality Estimation (aka Distinct Elements) is a classic problem in sketching with many industrial applications. Although sketching algorithms are fairly simple, analyzing the cardinality estimators is notoriously difficult, and even today the state-of-the-art sketches such as HyperLogLog and (compressed) PCSA are not covered in graduate-level Big Data courses. In this paper we define a class of generalized remaining area (τ-GRA) estimators, and observe that HyperLogLog, LogLog, and some estimators for PCSA are merely instantiations of τ-GRA for various integral values of τ. We then analyze the limiting relative variance of τ-GRA estimators. It turns out that the standard estimators for HyperLogLog and PCSA can be improved by choosing a fractional value of τ. The resulting estimators come very close to the Cramér-Rao lower bounds for HyperLogLog and PCSA derived from their Fisher information. Although the Cramér-Rao lower bound can be achieved with the Maximum Likelihood Estimator (MLE), the MLE is cumbersome to compute and dynamically update. In contrast, τ-GRA estimators are trivial to update in constant time. Our presentation assumes only basic calculus and probability, not any complex analysis.
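To make the constant-time update claim concrete, here is a minimal Python sketch of a HyperLogLog sketch whose estimate is maintained incrementally. It uses the standard HyperLogLog estimator (one of the integral-τ cases mentioned above), not the paper's fractional-τ estimator, whose constants are not given in this abstract. The class name `HyperLogLogSketch`, the choice of SHA-1 as the hash, the default `p = 10`, and the field `harmonic_sum` are all assumptions made for illustration. Small-range and large-range corrections are deliberately omitted.

```python
import hashlib


class HyperLogLogSketch:
    """Toy HyperLogLog with m = 2**p registers and an O(1)-updatable estimate."""

    HASH_BITS = 64

    def __init__(self, p=10):
        if p < 7:
            # The closed-form alpha below is the usual approximation for m >= 128.
            raise ValueError("this sketch assumes p >= 7 (m >= 128)")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Running value of sum_j 2**(-registers[j]); every register starts at 0.
        self.harmonic_sum = float(self.m)
        # Standard HyperLogLog bias-correction constant for m >= 128.
        self.alpha = 0.7213 / (1.0 + 1.079 / self.m)

    def _hash64(self, item):
        # Assumption: the first 8 bytes of SHA-1 behave like a uniform 64-bit hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        h = self._hash64(item)
        tail_bits = self.HASH_BITS - self.p
        j = h >> tail_bits                    # top p bits select a register
        rest = h & ((1 << tail_bits) - 1)     # remaining bits determine the rank
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = tail_bits - rest.bit_length() + 1
        old = self.registers[j]
        if rank > old:
            self.registers[j] = rank
            # Constant-time adjustment of the sum the estimator depends on.
            self.harmonic_sum += 2.0 ** (-rank) - 2.0 ** (-old)

    def estimate(self):
        # Standard HyperLogLog estimator: alpha * m^2 / sum_j 2^(-M[j]).
        return self.alpha * self.m * self.m / self.harmonic_sum


if __name__ == "__main__":
    sketch = HyperLogLogSketch(p=10)
    for i in range(100_000):
        sketch.add(f"item-{i}")
    print(round(sketch.estimate()))
```

Because the estimator depends on the registers only through a single sum, each register change adjusts that sum by one term, so reading the estimate never requires a pass over all m registers. Re-inserting an item that was already seen leaves both the registers and the estimate unchanged.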


