ProbMinHash – A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity

11/02/2019
by Otmar Ertl, et al.

The probability Jaccard similarity was recently proposed as a natural generalization of the Jaccard similarity to measure the proximity of sets whose elements are associated with relative frequencies or probabilities. Combined with a hash algorithm that maps such weighted sets to compact signatures from which pairwise similarities can be estimated quickly, it is a valuable method for big data applications such as near-duplicate detection, nearest neighbor search, and clustering. This paper introduces a class of locality-sensitive one-pass hash algorithms that are orders of magnitude faster than the original approach. The performance gain is achieved by calculating signature components collectively rather than independently. Four different algorithms are proposed based on this idea. Two of them are statistically equivalent to the original approach and can be used as direct replacements. The other two may even reduce the estimation error by breaking the statistical independence of signature components. Moreover, the presented techniques can be specialized for the conventional Jaccard similarity, resulting in highly efficient algorithms that outperform traditional minwise hashing.
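To make the setting concrete, the following is a minimal sketch of the baseline that the abstract calls "the original approach", in which every signature component is computed independently. It is not one of the four ProbMinHash algorithms. It assumes the definition of the probability Jaccard similarity commonly given in the literature, J_P(x, y) = sum over elements i with x_i > 0 and y_i > 0 of 1 / sum_j max(x_j / x_i, y_j / y_i), which this abstract does not spell out. It also assumes that each component selects the element minimizing an exponential variate with rate equal to its weight. All function names (`probability_jaccard`, `naive_signature`, `estimate_similarity`) and the hash-based random number source are hypothetical choices made for illustration.

```python
import hashlib
import math


def probability_jaccard(x, y):
    """Exact J_P for two weight dicts (element -> non-negative weight).

    Assumed definition: sum over shared elements i of
    1 / sum_j max(x_j / x_i, y_j / y_i).
    """
    keys = set(x) | set(y)
    total = 0.0
    for i in keys:
        xi, yi = x.get(i, 0.0), y.get(i, 0.0)
        if xi <= 0.0 or yi <= 0.0:
            continue  # only elements present in both sets contribute
        denom = sum(max(x.get(j, 0.0) / xi, y.get(j, 0.0) / yi) for j in keys)
        total += 1.0 / denom
    return total


def _uniform(element, k, seed=0):
    """Deterministic pseudo-random number in (0, 1] for the pair (element, k)."""
    data = f"{seed}|{k}|{element!r}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / 2.0**64


def naive_signature(weights, m, seed=0):
    """Signature of m components, each computed independently: O(n * m) work."""
    signature = []
    for k in range(m):
        best_element, best_value = None, math.inf
        for element, w in weights.items():
            if w <= 0.0:
                continue
            # Exponential variate with rate w; the smallest one wins.
            value = -math.log(_uniform(element, k, seed)) / w
            if value < best_value:
                best_element, best_value = element, value
        signature.append(best_element)
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of equal components, used as an estimate of J_P."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


x = {"a": 0.5, "b": 0.3, "c": 0.2}
y = {"b": 0.6, "c": 0.1, "d": 0.3}
exact = probability_jaccard(x, y)
approx = estimate_similarity(naive_signature(x, 1024), naive_signature(y, 1024))
print(exact, approx)
```

The nested loop in `naive_signature` touches every element once per signature component. This per-component cost is what the abstract describes the proposed one-pass algorithms as avoiding, by calculating the components collectively.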

research
09/11/2018

Maximally Consistent Sampling and the Jaccard Index of Probability Distributions

We introduce simple, efficient algorithms for computing a MinHash of a p...
research
02/12/2018

BagMinHash - Minwise Hashing Algorithm for Weighted Sets

Minwise hashing has become a standard tool to calculate signatures which...
research
09/17/2015

Learning to Hash for Indexing Big Data - A Survey

The explosive growth in big data has attracted much attention in designi...
research
05/23/2020

DartMinHash: Fast Sketching for Weighted Sets

Weighted minwise hashing is a standard dimensionality reduction techniqu...
research
06/22/2019

Algorithms for Similarity Search and Pseudorandomness

We study the problem of approximate near neighbor (ANN) search and show ...
research
06/01/2016

A Survey on Learning to Hash

Nearest neighbor search is a problem of finding the data points from the...
research
04/24/2022

Locality Sensitive Hashing for Structured Data: A Survey

Data similarity (or distance) computation is a fundamental research topi...
