HyperMinHash: Jaccard index sketching in LogLog space

10/23/2017
by   Yun William Yu, et al.
0

In this extended abstract, we describe and analyse a streaming probabilistic sketch, HYPERMINHASH, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets A and B. HyperMinHash can be thought of as a compression of standard MinHash by building off of a HyperLogLog count-distinct sketch. Given Jaccard index δ, using k buckets of size O((l) + (|A ∪ B|)) (in practice, typically 2 bytes) per set, HyperMinHash streams over A and B and generates an estimate of the Jaccard index δ with error O(1/l + √(k/δ)). This improves on the best previously known sketch, MinHash, which requires the same number of storage units (buckets), but using O((|A ∪ B|)) bit per bucket. For instance, our new algorithm allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 10^19 with relative error of around 5 64KiB of memory; the previous state-of-the-art MinHash can only estimate Jaccard indices for cardinalities of 10^10 with the same memory consumption. Alternately, one can think of HyperMinHash as an augmentation of b-bit MinHash that enables streaming updates, unions, and cardinality estimation (and thus intersection cardinality by way of Jaccard), while using extra bits.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset