Bayesian Locality Sensitive Hashing for Fast Similarity Search

10/06/2011
by Venu Satuluri, et al.

Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects whose similarity exceeds a user-specified threshold. Methods based on locality-sensitive hashing (LSH) have become a very popular approach to this problem. However, most such methods use LSH only for the first phase of similarity search, i.e., efficient indexing for candidate generation. In this paper, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search: performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. BayesLSH is able to quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, both in terms of accuracy and recall. Finally, the quality of BayesLSH's output can be easily tuned and, unlike standard approaches, does not require manually setting the number of hashes to use for similarity estimation. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range 2x-20x for a wide variety of datasets.
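
The candidate-pruning idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes Jaccard similarity estimated with min-hashes and a uniform Beta(1, 1) prior on the similarity, neither of which is stated in the abstract. The function names, the batch size of 32, and the pruning cutoff `epsilon` are all hypothetical. The sketch compares hashes in batches and drops a candidate pair as soon as the posterior probability that its similarity reaches the threshold falls below `epsilon`; it covers pruning only, not similarity estimation.

```python
import math
import random

PRIME = (1 << 61) - 1  # modulus for the illustrative hash family (assumption)


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]


def minhash(items, params):
    """Min-hash signature of a set of non-negative integers below PRIME."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def prob_similarity_at_least(t, matches, n):
    """P(S >= t | matches out of n hashes agree), uniform prior assumed.

    The posterior is Beta(matches + 1, n - matches + 1). For integer
    parameters its upper tail equals P(Binomial(n + 1, t) <= matches).
    """
    return sum(
        math.comb(n + 1, k) * t**k * (1 - t) ** (n + 1 - k)
        for k in range(matches + 1)
    )


def bayes_prune(sig_x, sig_y, threshold, epsilon=0.03, batch=32):
    """Compare signatures batch by batch; stop early if the pair looks hopeless.

    Returns (pruned, hashes_compared, matches).
    """
    matches, n = 0, 0
    while n < len(sig_x):
        end = min(n + batch, len(sig_x))
        matches += sum(1 for i in range(n, end) if sig_x[i] == sig_y[i])
        n = end
        if prob_similarity_at_least(threshold, matches, n) < epsilon:
            return True, n, matches  # unlikely to reach the threshold
    return False, n, matches  # survived every check


params = make_hash_params(128)
sig_a = minhash(set(range(100)), params)
sig_b = minhash(set(range(1000, 1100)), params)  # disjoint from the first set
print(bayes_prune(sig_a, sig_b, threshold=0.7))
print(bayes_prune(sig_a, sig_a, threshold=0.7))
```

In this sketch the disjoint pair is discarded after the first batch of hashes, while the identical pair survives all 128 comparisons. The number of hashes examined therefore adapts to each pair rather than being fixed in advance, which is the behavior the abstract attributes to BayesLSH.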

research
02/10/2020

Locality-sensitive hashing in function spaces

We discuss the problem of performing similarity search over function spa...
research
03/06/2020

LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew

All-pairs set similarity is a widely used data mining task, even for lar...
research
09/07/2018

FRESH: Fréchet Similarity with Hashing

Massive datasets of curves, such as time series and trajectories, are co...
research
05/21/2020

Succinct Trit-array Trie for Scalable Trajectory Similarity Search

Massive datasets of spatial trajectories representing the mobility of a ...
research
06/08/2018

A neural network catalyzer for multi-dimensional similarity search

This paper aims at learning a function mapping input vectors to an outpu...
research
10/06/2021

A Fast Randomized Algorithm for Massive Text Normalization

Many popular machine learning techniques in natural language processing ...
research
04/13/2020

SLIM: Scalable Linkage of Mobility Data

We present a scalable solution to link entities across mobility datasets...
