Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon

07/16/2020
βˆ™
by   Seth Pettie, et al.
βˆ™
0
βˆ™

In this paper we study the intrinsic tradeoff between the space complexity of the sketch and its estimation error in the Random Oracle model. We define a new measure of efficiency for cardinality estimators called the Fisher-Shannon (π–₯π—‚π—Œπ—) number β„‹/ℐ. It captures the tension between the limiting Shannon entropy (β„‹) of the sketch and its normalized Fisher information (ℐ), which characterizes (asymptotically) the variance of a statistically efficient estimator. We prove that many variants of the 𝖯𝖒𝖲𝖠 sketch of Flajolet and Martin have π–₯π—‚π—Œπ— number H_0/I_0, where H_0,I_0 are two precisely-defined constants, and that all base-q generalizations of (Hyper)LogLog are strictly worse than H_0/I_0, but tend to H_0/I_0 in the limit as qβ†’βˆž. All other known sketches have even worse π–₯π—‚π—Œπ—-numbers. We introduce a new sketch called π–₯π—‚π—Œπ—π—†π—ˆπ—‡π—€π–Ύπ—‹ that is based on a smoothed, compressed version of 𝖯𝖒𝖲𝖠 with a different estimation function. π–₯π—‚π—Œπ—π—†π—ˆπ—‡π—€π–Ύπ—‹ has π–₯π—‚π—Œπ— number H_0/I_0 β‰ˆ 1.98. It stores O(log^2log U) + (H_0/I_0)b β‰ˆ 1.98b bits, and estimates cardinalities of multisets of [U] with a standard error of (1+o(1))/√(b). π–₯π—‚π—Œπ—π—†π—ˆπ—‡π—€π–Ύπ—‹'s space-error tradeoff improves on state-of-the-art sketches like HyperLogLog, or even compressed representations of it. π–₯π—‚π—Œπ—π—†π—ˆπ—‡π—€π–Ύπ—‹ can be used in a distributed environment (where substreams are sketched separately and composed later). We conjecture that the π–₯π—‚π—Œπ—-number H_0/I_0 is a universal lower bound for any such composable sketch.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset