Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon

07/16/2020
βˆ™
by   Seth Pettie, et al.
βˆ™
0
βˆ™

In this paper we study the intrinsic tradeoff between the space complexity of the sketch and its estimation error in the Random Oracle model. We define a new measure of efficiency for cardinality estimators called the Fisher-Shannon (π–₯π—‚π—Œπ—) number β„‹/ℐ. It captures the tension between the limiting Shannon entropy (β„‹) of the sketch and its normalized Fisher information (ℐ), which characterizes (asymptotically) the variance of a statistically efficient estimator. We prove that many variants of the 𝖯𝖒𝖲𝖠 sketch of Flajolet and Martin have π–₯π—‚π—Œπ— number H_0/I_0, where H_0,I_0 are two precisely-defined constants, and that all base-q generalizations of (Hyper)LogLog are strictly worse than H_0/I_0, but tend to H_0/I_0 in the limit as qβ†’βˆž. All other known sketches have even worse π–₯π—‚π—Œπ—-numbers. We introduce a new sketch called π–₯π—‚π—Œπ—π—†π—ˆπ—‡π—€π–Ύπ—‹ that is based on a smoothed, compressed version of 𝖯𝖒𝖲𝖠 with a different estimation function. π–₯π—‚π—Œπ—π—†π—ˆπ—‡π—€π–Ύπ—‹ has π–₯π—‚π—Œπ— number H_0/I_0 β‰ˆ 1.98. It stores O(log^2log U) + (H_0/I_0)b β‰ˆ 1.98b bits, and estimates cardinalities of multisets of [U] with a standard error of (1+o(1))/√(b). π–₯π—‚π—Œπ—π—†π—ˆπ—‡π—€π–Ύπ—‹'s space-error tradeoff improves on state-of-the-art sketches like HyperLogLog, or even compressed representations of it. π–₯π—‚π—Œπ—π—†π—ˆπ—‡π—€π–Ύπ—‹ can be used in a distributed environment (where substreams are sketched separately and composed later). We conjecture that the π–₯π—‚π—Œπ—-number H_0/I_0 is a universal lower bound for any such composable sketch.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
βˆ™ 08/20/2020

Simple and Efficient Cardinality Estimation in Data Streams

We study sketching schemes for the cardinality estimation problem in dat...
research
βˆ™ 05/23/2022

HyperLogLogLog: Cardinality Estimation With One Log More

We present HyperLogLogLog, a practical compression of the HyperLogLog sk...
research
βˆ™ 08/22/2022

Simpler and Better Cardinality Estimators for HyperLogLog and PCSA

Cardinality Estimation (aka Distinct Elements) is a classic problem in s...
research
βˆ™ 07/10/2018

Understanding VAEs in Fisher-Shannon Plane

In information theory, Fisher information and Shannon information (entro...
research
βˆ™ 01/31/2019

The Relation Between Bayesian Fisher Information and Shannon Information for Detecting a Change in a Parameter

We derive a connection between performance of estimators the performance...
research
βˆ™ 10/23/2017

HyperMinHash: Jaccard index sketching in LogLog space

In this extended abstract, we describe and analyse a streaming probabili...
research
βˆ™ 11/24/2019

Secure Sketch for All Noisy Sources

Secure sketch produces public information of its input w without reveali...

Please sign up or login with your details

Forgot password? Click here to reset