Similarity Join Size Estimation using Locality Sensitive Hashing

04/16/2011
by Hongrae Lee, et al.

Similarity joins are important operations with a broad range of applications. In this paper, we study the problem of vector similarity join size estimation (VSJ). VSJ generalizes the previously studied set similarity join size estimation (SSJ) problem and can handle more interesting cases such as TF-IDF vectors. One of the key challenges in similarity join size estimation is that the join size can change dramatically depending on the input similarity threshold. We propose a sampling-based algorithm, LSH-SS, built on the Locality Sensitive Hashing (LSH) scheme. LSH-SS uses an LSH index to enable effective sampling even at high thresholds. We compare the proposed technique with random sampling and with the state-of-the-art technique for SSJ (adapted to VSJ), and demonstrate on real-world data sets that LSH-SS offers more accurate estimates, with small variance, at both high and low similarity thresholds.
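
To make the problem statement concrete, the following is a minimal sketch of the exact join size and of the plain random-sampling baseline that the abstract compares against. It is not LSH-SS. The function names (`cosine`, `exact_join_size`, `sampled_join_size`), the choice of cosine similarity, and the restriction to a self-join over unordered pairs are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def exact_join_size(vectors, tau):
    """Count unordered pairs whose similarity is at least tau (self-join)."""
    return sum(1 for u, v in combinations(vectors, 2) if cosine(u, v) >= tau)


def sampled_join_size(vectors, tau, num_samples, rng):
    """Random-sampling baseline: draw pairs uniformly, scale up the hit rate."""
    n = len(vectors)
    total_pairs = n * (n - 1) // 2
    hits = 0
    for _ in range(num_samples):
        i, j = rng.sample(range(n), 2)  # two distinct indices
        if cosine(vectors[i], vectors[j]) >= tau:
            hits += 1
    return total_pairs * hits / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: 200 random non-negative vectors in 5 dimensions.
    data = [[rng.random() for _ in range(5)] for _ in range(200)]
    for tau in (0.5, 0.9, 0.99):
        exact = exact_join_size(data, tau)
        estimate = sampled_join_size(data, tau, 500, rng)
        print(tau, exact, estimate)
```

Under these assumptions, the exact count typically shrinks sharply as `tau` rises, and at high thresholds a uniform sample may contain few or no qualifying pairs, which makes the scaled-up estimate unstable. That is the difficulty the abstract says the LSH index is meant to address.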

research
06/08/2018

Similarity Join and Similarity Self-Join Size Estimation in a Streaming Environment

We study the problem of similarity self-join and similarity join size es...
research
03/06/2020

LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew

All-pairs set similarity is a widely used data mining task, even for lar...
research
04/16/2018

Adaptive MapReduce Similarity Joins

Similarity joins are a fundamental database operation. Given data sets S...
research
07/04/2019

Hardness of Bichromatic Closest Pair with Jaccard Similarity

Consider collections A and B of red and blue sets, respectively. Bichrom...
research
06/08/2018

Similarity Join and Self-Join Size Estimation in a Streaming Environment

We study the problem of similarity self-join and similarity join size es...
research
06/13/2017

Preference-driven Similarity Join

Similarity join, which can find similar objects (e.g., products, names, ...
research
07/02/2018

Distributed Statistical Estimation of Matrix Products with Applications

We consider statistical estimations of a matrix product over the integer...
