In Defense of MinHash Over SimHash

07/16/2014
by Anshumali Shrivastava, et al.

MinHash and SimHash are two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem is an important question that has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments): MinHash virtually always outperforms SimHash when the data are binary, as is common in practice, for example in search. The collision probability of MinHash is a function of resemblance similarity (R), while the collision probability of SimHash is a function of cosine similarity (S). To provide a common basis for comparison, we evaluate retrieval results in terms of S for both MinHash and SimHash. This evaluation is valid because we can prove that MinHash is a valid LSH with respect to S, using the general inequality S^2 ≤ R ≤ S/(2-S). Our worst-case analysis shows that MinHash significantly outperforms SimHash in the high-similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even on datasets where most of the data points are not too similar to each other. This is partly because, in practical data, R ≥ S/(z-S) often holds, where z is only slightly larger than 2 (e.g., z ≤ 2.1). Our restricted worst-case analysis, which assumes S/(z-S) ≤ R ≤ S/(2-S), shows that MinHash indeed significantly outperforms SimHash even in the low-similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.
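
The inequality S^2 ≤ R ≤ S/(2-S) relating the two similarities can be checked numerically on binary data. The sketch below is an illustration only, not the authors' code: it represents binary vectors as Python sets of nonzero indices, and the helper names (`resemblance`, `cosine`, `simhash_collision`, `bounds_hold`, `random_pair`) are hypothetical. It also assumes the standard per-hash collision probabilities, which the abstract does not spell out: R for MinHash, and 1 - arccos(S)/π for SimHash implemented as sign random projections.

```python
import math
import random


def resemblance(a, b):
    """Resemblance (Jaccard) similarity R = |a ∩ b| / |a ∪ b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(a, b):
    """Cosine similarity S of the binary indicator vectors of a and b."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def simhash_collision(s):
    """Assumed SimHash (sign random projection) collision probability."""
    s = max(-1.0, min(1.0, s))  # guard against rounding outside [-1, 1]
    return 1.0 - math.acos(s) / math.pi


def bounds_hold(a, b, tol=1e-12):
    """Check S^2 <= R <= S/(2-S) for one pair, up to a float tolerance."""
    r, s = resemblance(a, b), cosine(a, b)
    return s * s - tol <= r <= s / (2.0 - s) + tol


def random_pair(rng, dim=1000):
    """Hypothetical generator: two sparse sets sharing a random common core."""
    items = rng.sample(range(dim), 300)  # 300 distinct indices
    core = rng.randint(0, 100)
    extra_a = rng.randint(1, 100)
    extra_b = rng.randint(1, 100)
    shared = set(items[:core])
    a = shared | set(items[100:100 + extra_a])
    b = shared | set(items[200:200 + extra_b])
    return a, b


if __name__ == "__main__":
    rng = random.Random(0)
    pairs = [random_pair(rng) for _ in range(10_000)]
    print("bounds hold on all pairs:", all(bounds_hold(a, b) for a, b in pairs))

    a, b = {1, 2, 3, 4}, {3, 4, 5, 6}
    r, s = resemblance(a, b), cosine(a, b)
    print("R =", r, "S =", s)
    print("MinHash collision probability:", r)
    print("SimHash collision probability:", simhash_collision(s))
```

For the small pair at the end, both sets have four elements and share two, so R = 2/6 and S = 2/4. The upper bound S/(2-S) is tight here because the two sets have the same size.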

research
10/31/2022

Using Locality-sensitive Hashing for Rendezvous Search

The multichannel rendezvous problem is a fundamental problem for neighbo...
research
07/08/2021

A Triangle Inequality for Cosine Similarity

Similarity search is a fundamental problem for many data analysis techni...
research
04/14/2023

Optimal Uncoordinated Unique IDs

In the Uncoordinated Unique Identifiers Problem (UUIDP) there are n inde...
research
05/25/2020

On the Problem of p_1^-1 in Locality-Sensitive Hashing

A Locality-Sensitive Hash (LSH) function is called (r,cr,p_1,p_2)-sensit...
research
08/11/2021

Learning to Hash Robustly, with Guarantees

The indexing algorithms for the high-dimensional nearest neighbor search...
research
12/15/2015

Data Driven Resource Allocation for Distributed Learning

In distributed machine learning, data is dispatched to multiple machines...
research
04/09/2018

Set Similarity Search for Skewed Data

Set similarity join, as well as the corresponding indexing problem set s...