An embarrassingly parallel optimal-space cardinality estimation algorithm

07/03/2023
by Emin Karayel, et al.

In 2020, Błasiok (ACM Trans. Algorithms 16(2), 3:1-3:28) constructed an optimal-space streaming algorithm for the cardinality estimation problem with space complexity 𝒪(ε^{-2} ln(δ^{-1}) + ln n), where ε, δ and n denote the relative accuracy, failure probability and universe size, respectively. However, his solution requires the stream to be processed sequentially. On the other hand, there are algorithms that admit a merge operation; they can be used in a distributed setting, allowing parallel processing of sections of the stream, and are highly relevant for large-scale distributed applications. Unfortunately, the best-known such algorithm has a space complexity exceeding Ω(ln(δ^{-1}) (ε^{-2} ln ln n + ln n)). This work presents a new algorithm that improves on the solution by Błasiok: it preserves the space complexity while also admitting such a merge operation, thus providing an optimal solution for the problem in both sequential and parallel applications. Independently of this, the new algorithm also improves on Błasiok's solution (even in the sequential setting) by reducing its implementation complexity and requiring fewer distinct pseudo-random objects.
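To make the notion of a merge operation concrete, here is a minimal sketch of a mergeable cardinality estimator. It is a plain k-minimum-values sketch, which is not the algorithm of this work and is not space-optimal; it only illustrates the property that sketches built independently on sections of a stream can be combined into the sketch of the whole stream. The class name `KMVSketch`, the parameter `k`, and the use of SHA-256 as the hash function are assumptions made for this illustration.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative k-minimum-values sketch (hypothetical, not the paper's algorithm)."""

    def __init__(self, k=256):
        self.k = k
        self.values = set()  # the (at most) k smallest hash values seen so far

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform value in [0, 1) via SHA-256.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        self.values.add(self._hash(item))
        if len(self.values) > self.k:
            self.values.remove(max(self.values))

    def merge(self, other):
        # Merging keeps the k smallest values of the union, so the result
        # equals the sketch of the concatenated streams.
        assert self.k == other.k, "sketches must share the same k"
        merged = KMVSketch(self.k)
        merged.values = set(heapq.nsmallest(self.k, self.values | other.values))
        return merged

    def estimate(self):
        # Fewer than k distinct hashes: the count is exact (up to collisions).
        if len(self.values) < self.k:
            return float(len(self.values))
        return (self.k - 1) / max(self.values)


# Two workers process overlapping sections of a stream in parallel.
left, right = KMVSketch(), KMVSketch()
for x in range(0, 6000):
    left.add(x)
for x in range(4000, 10000):
    right.add(x)
combined = left.merge(right)
print(round(combined.estimate()))  # rough estimate of the 10000 distinct items
```

Because the merged state depends only on the set of distinct items, the order in which sections are processed or combined does not matter, which is what makes such sketches suitable for distributed settings.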


