Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon

by Seth Pettie, et al.

In this paper we study the intrinsic tradeoff between the space complexity of the sketch and its estimation error in the Random Oracle model. We define a new measure of efficiency for cardinality estimators called the Fisher-Shannon (Fish) number ℋ/ℐ. It captures the tension between the limiting Shannon entropy (ℋ) of the sketch and its normalized Fisher information (ℐ), which characterizes (asymptotically) the variance of a statistically efficient estimator. We prove that many variants of the PCSA sketch of Flajolet and Martin have Fish number H_0/I_0, where H_0, I_0 are two precisely-defined constants, and that all base-q generalizations of (Hyper)LogLog are strictly worse than H_0/I_0, but tend to H_0/I_0 in the limit as q→∞. All other known sketches have even worse Fish-numbers. We introduce a new sketch called Fishmonger that is based on a smoothed, compressed version of PCSA with a different estimation function. Fishmonger has Fish number H_0/I_0 ≈ 1.98. It stores O(log^2 log U) + (H_0/I_0)b ≈ 1.98b bits, and estimates cardinalities of multisets of [U] with a standard error of (1+o(1))/√b. Fishmonger's space-error tradeoff improves on state-of-the-art sketches like HyperLogLog, or even compressed representations of it. Fishmonger can be used in a distributed environment (where substreams are sketched separately and composed later). We conjecture that the Fish-number H_0/I_0 is a universal lower bound for any such composable sketch.
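
To make the quoted space-error tradeoff concrete, here is a minimal Python sketch of the leading-order arithmetic only. It assumes the approximate value H_0/I_0 ≈ 1.98 from the abstract, drops the additive O(log^2 log U) term and the (1+o(1)) factor, and uses hypothetical function names that do not come from the paper. It does not implement Fishmonger itself.

```python
import math

# Approximate Fish number H_0/I_0 as quoted in the abstract (assumption: 1.98).
FISH_NUMBER = 1.98


def sketch_bits(b: float) -> float:
    """Leading-order sketch size in bits; ignores the O(log^2 log U) term."""
    return FISH_NUMBER * b


def standard_error(b: float) -> float:
    """Leading-order relative standard error; ignores the (1+o(1)) factor."""
    return 1.0 / math.sqrt(b)


def bits_for_error(target_error: float) -> float:
    """Bits needed (to leading order) to reach a target standard error."""
    b = 1.0 / target_error ** 2  # invert standard_error(b) = target_error
    return sketch_bits(b)


if __name__ == "__main__":
    for err in (0.05, 0.02, 0.01):
        print(f"target error {err:.2%}: about {bits_for_error(err):,.0f} bits")
```

Under these simplifications, the product of the size in bits and the squared standard error is 1.98 for every b, which is the sense in which the Fish number summarizes the tradeoff.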


∙ 08/20/2020

Simple and Efficient Cardinality Estimation in Data Streams

We study sketching schemes for the cardinality estimation problem in dat...
∙ 05/23/2022

HyperLogLogLog: Cardinality Estimation With One Log More

We present HyperLogLogLog, a practical compression of the HyperLogLog sk...
∙ 08/22/2022

Simpler and Better Cardinality Estimators for HyperLogLog and PCSA

Cardinality Estimation (aka Distinct Elements) is a classic problem in s...
∙ 07/10/2018

Understanding VAEs in Fisher-Shannon Plane

In information theory, Fisher information and Shannon information (entro...
∙ 01/31/2019

The Relation Between Bayesian Fisher Information and Shannon Information for Detecting a Change in a Parameter

We derive a connection between performance of estimators the performance...
∙ 10/23/2017

HyperMinHash: Jaccard index sketching in LogLog space

In this extended abstract, we describe and analyse a streaming probabili...
∙ 05/09/2012

Improving Compressed Counting

Compressed Counting (CC) [22] was recently proposed for estimating the a...