Simple and Efficient Cardinality Estimation in Data Streams

08/20/2020
by   Seth Pettie, et al.

We study sketching schemes for the cardinality estimation problem in data streams, and advocate for measuring the efficiency of such a scheme in terms of its MVP: Memory-Variance Product, i.e., the product of its space, in bits, and the relative variance of its estimates. Under this natural metric, the celebrated HyperLogLog sketch of Flajolet et al. (2007) has an MVP approaching 6(3 ln 2 - 1) ≈ 6.48 for estimating cardinalities up to 2^64. Applying the Cohen/Ting (2014) martingale transformation results in a sketch, Martingale HyperLogLog, with MVP ≈ 4.16, though it is not composable. Recently Pettie and Wang (2020) proved that it is possible to achieve an MVP approaching 1.98 with a composable sketch called Fishmonger, though the time required to update this sketch is not constant. Our aim in this paper is to strike a nice balance between extreme simplicity (exemplified by (Martingale) (Hyper)LogLog) and extreme information-theoretic efficiency (exemplified by Fishmonger). We develop a new class of "curtain" sketches that are a bit more complex than Martingale LogLog but with substantially better MVPs, e.g., Martingale Curtain has MVP ≈ 2.31. We also prove that Martingale Fishmonger has an MVP of around 1.63, and conjecture this to be an information-theoretic lower bound on the problem, independent of update time.
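To make the MVP metric concrete, the following minimal Python sketch recomputes the HyperLogLog figure quoted above and converts each quoted MVP into an approximate relative standard error for a fixed memory budget. It relies only on the definition MVP = (memory in bits) × (relative variance); the function names, the 6-bit register width, and the 6144-bit budget are illustrative assumptions rather than anything taken from the paper, and the numbers are asymptotic values that ignore small-cardinality effects.

```python
import math


def hyperloglog_mvp(bits_per_register: int = 6) -> float:
    """Limiting MVP of HyperLogLog, assuming fixed-width registers.

    With m registers the relative variance tends to (3 ln 2 - 1) / m and the
    memory is bits_per_register * m bits, so m cancels in the product.
    Six bits per register is the assumption matching cardinalities up to 2^64.
    """
    return bits_per_register * (3 * math.log(2) - 1)


def relative_std_error(mvp: float, memory_bits: int) -> float:
    """Relative standard error implied by an MVP at a given memory budget.

    By definition, relative variance = MVP / memory_bits.
    """
    return math.sqrt(mvp / memory_bits)


# MVP values as quoted in the abstract (rounded, asymptotic).
QUOTED_MVP = {
    "HyperLogLog": 6.48,
    "Martingale HyperLogLog": 4.16,
    "Martingale Curtain": 2.31,
    "Fishmonger": 1.98,
    "Martingale Fishmonger": 1.63,
}

if __name__ == "__main__":
    budget_bits = 6 * 1024  # hypothetical budget: 1024 six-bit registers
    print(f"HyperLogLog MVP from formula: {hyperloglog_mvp():.2f}")
    for name, mvp in QUOTED_MVP.items():
        err = relative_std_error(mvp, budget_bits)
        print(f"{name:>24}: relative std. error ~ {100 * err:.2f}%")
```

Because the standard error scales as the square root of MVP / memory, halving the MVP at a fixed error target halves the memory required, which is why the metric allows sketches of different sizes to be compared directly.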
