HyperMinHash: MinHash in LogLog space

10/23/2017
by Yun William Yu, et al.

In this extended abstract, we describe and analyse a streaming probabilistic sketch, HyperMinHash, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets A and B. HyperMinHash can be thought of as a compression of standard log n-space MinHash by building off of a HyperLogLog count-distinct sketch. For a multiplicative approximation error 1 + ϵ on a Jaccard index t, given a random oracle, HyperMinHash needs O(ϵ^-2 (log log n + log(1/(tϵ)))) space. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. Our new algorithm allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 10^19 with relative error of around 10% using 64KiB of memory; MinHash can only estimate Jaccard indices for cardinalities of 10^10 with the same memory consumption. Note that we operate in the unbounded data stream model and assume both a random oracle and shared randomness.
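The following is a minimal, illustrative Python sketch of the idea described above, not the authors' reference implementation. Each bucket keeps the minimum hash seen, stored only as a leading-zero count (as in HyperLogLog) plus a few mantissa bits, so streaming updates and unions still work bucket by bucket. The parameters `P` and `R`, the use of SHA-256 as a stand-in for the random oracle, drawing the mantissa from separate hash bits, and the plain match-fraction estimator (with no correction for accidental collisions) are all assumptions made for illustration.

```python
import hashlib

P = 8                 # assumed: 2**P buckets
R = 10                # assumed: mantissa bits kept per bucket
FIELD = 64 - P        # bits left for the leading-zero count


def _register(item: str):
    """Map an item to (bucket index, (leading zeros, mantissa))."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()  # stand-in for a random oracle
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    bucket = h1 >> FIELD
    rest = h1 & ((1 << FIELD) - 1)
    lz = FIELD - rest.bit_length()      # leading zeros within the FIELD-bit value
    mantissa = h2 >> (64 - R)           # R extra bits, taken from independent hash bits
    return bucket, (lz, mantissa)


def _smaller(a, b):
    """Return the register that encodes the smaller hash; None means empty."""
    if a is None:
        return b
    if b is None:
        return a
    # More leading zeros means a smaller hash; ties are broken by the mantissa.
    return a if (-a[0], a[1]) <= (-b[0], b[1]) else b


def sketch(items):
    """Build a sketch with one streaming pass over the items."""
    regs = [None] * (1 << P)
    for item in items:
        bucket, value = _register(item)
        regs[bucket] = _smaller(regs[bucket], value)
    return regs


def union(s, t):
    """Sketch of the union of two sets, computed from their sketches alone."""
    return [_smaller(a, b) for a, b in zip(s, t)]


def jaccard(s, t):
    """Fraction of non-empty buckets whose registers agree (uncorrected estimate)."""
    filled = sum(1 for a, b in zip(s, t) if a is not None or b is not None)
    same = sum(1 for a, b in zip(s, t) if a is not None and a == b)
    return same / filled if filled else 0.0


if __name__ == "__main__":
    a = sketch(f"x{i}" for i in range(10_000))
    b = sketch(f"x{i}" for i in range(5_000, 15_000))
    print("estimated Jaccard index:", jaccard(a, b))  # true value is 1/3
```

Because each register stores only a leading-zero count and `R` mantissa bits rather than a full hash value, the per-bucket cost grows with log log n instead of log n. The price is a small chance that two different minimum hashes share the same truncated encoding, which this simplified estimator does not correct for.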

research
10/23/2017

HyperMinHash: Jaccard index sketching in LogLog space

In this extended abstract, we describe and analyse a streaming probabili...
research
05/22/2019

A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

Estimating set similarity and detecting highly similar sets are fundamen...
research
11/06/2021

Frequency Estimation with One-Sided Error

Frequency estimation is one of the most fundamental problems in streamin...
research
06/11/2021

ExtendedHyperLogLog: Analysis of a new Cardinality Estimator

We discuss the problem of counting distinct elements in a stream. A stre...
research
05/07/2019

Exponential Separations Between Turnstile Streaming and Linear Sketching

Almost every known turnstile streaming algorithm is implementable as a l...
research
07/17/2018

Tracking the ℓ_2 Norm with Constant Update Time

The ℓ_2 tracking problem is the task of obtaining a streaming algorithm ...
research
02/09/2016

Graphical Model Sketch

Structured high-cardinality data arises in many domains, and poses a maj...
