HyperMinHash: Jaccard index sketching in LogLog space

10/23/2017
by Yun William Yu, et al.

In this extended abstract, we describe and analyse a streaming probabilistic sketch, HyperMinHash, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets A and B. HyperMinHash can be thought of as a compression of standard MinHash that builds on a HyperLogLog count-distinct sketch. Given Jaccard index δ, using k buckets of size O(log(l) + log log(|A ∪ B|)) (in practice, typically 2 bytes) per set, HyperMinHash streams over A and B and generates an estimate of the Jaccard index δ with error O(1/l + √(k/δ)). This improves on the best previously known sketch, MinHash, which requires the same number of storage units (buckets) but uses O(log(|A ∪ B|)) bits per bucket. For instance, our new algorithm allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 10^19 with relative error of around 5% using 64KiB of memory; the previous state-of-the-art MinHash can only estimate Jaccard indices for cardinalities of 10^10 with the same memory consumption. Alternately, one can think of HyperMinHash as an augmentation of b-bit MinHash that enables streaming updates, unions, and cardinality estimation (and thus intersection cardinality by way of Jaccard), while using log log extra bits.
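To make the idea concrete, here is a minimal Python sketch of a HyperMinHash-style structure. It is an illustration under stated assumptions, not the authors' implementation. Each bucket stores a capped leading-zero count (as a HyperLogLog register would) plus a few mantissa bits of the minimum hash. The parameter names P, Q and R, the choice of BLAKE2b as the hash function, and the helper names are all hypothetical. The estimator shown is the plain fraction of matching buckets and omits any correction for accidental register collisions.

```python
import hashlib

# Illustrative parameters (assumptions, not values taken from the abstract).
P = 10   # 2**P buckets, so k = 1024
Q = 6    # bits for the leading-zero count, capped at 2**Q - 1
R = 10   # mantissa bits kept after the leading one
HASH_BITS = 128
REST_BITS = HASH_BITS - P
MAX_EXP = (1 << Q) - 1


def register_for(item):
    """Return (bucket index, (exponent, mantissa)) for one item."""
    digest = hashlib.blake2b(str(item).encode(), digest_size=16).digest()
    h = int.from_bytes(digest, "big")
    bucket = h >> REST_BITS                # top P bits pick the bucket
    rest = h & ((1 << REST_BITS) - 1)      # remaining bits act as the hash value
    lz = REST_BITS - rest.bit_length()     # leading zeros of the remainder
    if lz >= MAX_EXP:
        exp = MAX_EXP                      # exponent saturates at the cap
        shift = REST_BITS - MAX_EXP - R
    else:
        exp = lz
        shift = REST_BITS - lz - 1 - R     # skip the leading one itself
    mantissa = (rest >> shift) & ((1 << R) - 1)
    return bucket, (exp, mantissa)


def smaller(a, b):
    """Pick the register encoding the smaller hash (None means empty)."""
    if a is None:
        return b
    if b is None:
        return a
    # More leading zeros means a smaller hash; ties break on the mantissa.
    return min(a, b, key=lambda reg: (-reg[0], reg[1]))


def sketch(items):
    """Stream over items, keeping one compressed minimum per bucket."""
    regs = [None] * (1 << P)
    for item in items:
        i, reg = register_for(item)
        regs[i] = smaller(regs[i], reg)
    return regs


def union(s, t):
    """Sketch of the union: bucket-wise minimum of the two sketches."""
    return [smaller(a, b) for a, b in zip(s, t)]


def jaccard(s, t):
    """Uncorrected estimate: matching buckets over non-empty union buckets."""
    nonempty = sum(1 for u in union(s, t) if u is not None)
    matches = sum(1 for a, b in zip(s, t) if a is not None and a == b)
    return matches / nonempty if nonempty else 0.0


a = sketch(range(0, 10_000))
b = sketch(range(5_000, 15_000))
print(jaccard(a, b))  # the true Jaccard index of these two sets is 1/3
```

With Q = 6 and R = 10, each register fits in 16 bits, in line with the 2 bytes per bucket mentioned in the abstract. At 2 bytes per bucket, a 64KiB sketch would hold 32768 = 2^15 buckets.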


