SetSketch: Filling the Gap between MinHash and HyperLogLog

01/01/2021
by Otmar Ertl, et al.

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting distinct elements with very little space, MinHash is suited to the fast comparison of sets, as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Robust and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The developed methods can also be applied to HyperLogLog sketches, where they estimate joint quantities such as the intersection size with a smaller error than the common estimation approach based on the inclusion-exclusion principle.
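To make the properties named in the abstract concrete, the following is a minimal sketch of plain MinHash, not of SetSketch itself. It illustrates a commutative and idempotent insert, a mergeable state (element-wise minimum), and a Jaccard estimate taken as the fraction of equal registers. The class name `MinHashSketch`, the register count `m`, and the use of BLAKE2b as the hash function are assumptions made for illustration and do not come from the paper.

```python
import hashlib


class MinHashSketch:
    """Plain MinHash with m registers (illustrative only, not SetSketch)."""

    def __init__(self, m=256):
        self.m = m
        # Each register holds the smallest hash value seen so far.
        self.registers = [float("inf")] * m

    def _hash(self, item, i):
        # Assumption: one 64-bit BLAKE2b hash per (register index, item) pair.
        data = f"{i}:{item}".encode("utf-8")
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def insert(self, item):
        # Taking a minimum is commutative and idempotent:
        # insertion order and duplicates do not change the state.
        for i in range(self.m):
            h = self._hash(item, i)
            if h < self.registers[i]:
                self.registers[i] = h

    def merge(self, other):
        # The merged state equals the sketch of the union of both sets.
        if self.m != other.m:
            raise ValueError("sketches must have the same number of registers")
        merged = MinHashSketch(self.m)
        merged.registers = [min(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def jaccard(self, other):
        # Fraction of registers that agree estimates the Jaccard similarity.
        if self.m != other.m:
            raise ValueError("sketches must have the same number of registers")
        equal = sum(1 for a, b in zip(self.registers, other.registers) if a == b)
        return equal / self.m


if __name__ == "__main__":
    a, b = MinHashSketch(), MinHashSketch()
    for x in range(0, 300):
        a.insert(x)
    for x in range(150, 450):
        b.insert(x)
    # The true Jaccard similarity is 150 / 450 = 1/3; the estimate scatters around it.
    print(a.jaccard(b))
```

Because every register only ever decreases, sketches built on different machines can be merged in any order, which is the property that makes this family of sketches convenient in distributed settings. This plain MinHash variant spends m hash evaluations per insert and stores full 64-bit values, so it does not reflect the space or speed trade-offs that the paper addresses.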

research
06/11/2021

ExtendedHyperLogLog: Analysis of a new Cardinality Estimator

We discuss the problem of counting distinct elements in a stream. A stre...
research
07/29/2022

Quantifying uncertain system outputs via the multi-level Monte Carlo method – distribution and robustness measures

In this work, we consider the problem of estimating the probability dist...
research
07/03/2023

An embarrassingly parallel optimal-space cardinality estimation algorithm

In 2020 Blasiok (ACM Trans. Algorithms 16(2) 3:1-3:28) constructed an op...
research
07/26/2021

A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation

Cardinality estimation is a fundamental problem in database systems. To ...
research
07/12/2018

Tails and probabilities for extreme outliers

The task of estimation of the tails of probability distributions having ...
research
05/03/2021

Model Counting meets F0 Estimation

Constraint satisfaction problems (CSP's) and data stream models are two ...
research
08/17/2020

Cardinality estimation using Gumbel distribution

Cardinality estimation is the task of approximating the number of distin...
