Dimension Independent Similarity Computation

06/11/2012
by Reza Bosagh Zadeh, et al.

We present a suite of algorithms for Dimension Independent Similarity Computation (DISCO) that compute all pairwise similarities between very high-dimensional sparse vectors. All of our results are provably independent of dimension: apart from the initial cost of reading in the data, all subsequent operations are independent of the dimension, so the dimension can be very large. We study the Cosine, Dice, Overlap, and Jaccard similarity measures. For Jaccard similarity we include an improved version of MinHash. Our results are geared toward the MapReduce framework. We empirically validate our theorems at large scale using data from the social networking site Twitter. At the time of writing, our algorithms are live in production at twitter.com.
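For readers unfamiliar with the four measures named above, the following is a minimal sketch of their standard set-based (binary vector) definitions, plus a textbook MinHash estimator for Jaccard similarity. It is not the DISCO sampling scheme and not the improved MinHash from the paper. All function names, the `num_hashes` and `seed` parameters, and the representation of each sparse vector as a set of integer indices are assumptions made for illustration.

```python
import math
import random

# Each sparse binary vector is represented as the set of its nonzero indices
# (an assumption for this sketch).

def cosine(a, b):
    """|A ∩ B| / sqrt(|A| * |B|) for binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

_PRIME = (1 << 61) - 1  # large Mersenne prime used as the hash modulus

def minhash_signature(items, num_hashes=128, seed=0):
    """Textbook MinHash: the minimum of each random linear hash over a
    non-empty set of integers. Using the same seed for every set ensures
    all signatures share the same hash functions."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
              for _ in range(num_hashes)]
    return [min((a * x + b) % _PRIME for x in items) for a, b in params]

def minhash_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

if __name__ == "__main__":
    a, b = {1, 2, 3, 4}, {3, 4, 5}
    print(cosine(a, b), dice(a, b), overlap(a, b), jaccard(a, b))
    print(minhash_jaccard(minhash_signature(a), minhash_signature(b)))
```

The exact functions cost time proportional to the number of nonzero entries rather than the ambient dimension, which is the property that makes sparse representations attractive here. The MinHash estimate is random and only approximates the exact Jaccard value.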


research
08/21/2023

Dimension Independent Helly Theorem for Lines and Flats

We give a generalization of dimension independent Helly Theorem of Adipr...
research
10/21/2019

A Comparison of Semantic Similarity Methods for Maximum Human Interpretability

The inclusion of semantic information in any similarity measures improve...
research
11/01/2019

Notes on the dimension dependence in high-dimensional central limit theorems for hyperrectangles

Let X_1,...,X_n be independent centered random vectors in R^d. This pape...
research
02/04/2019

What is the dimension of your binary data?

Many 0/1 datasets have a very large number of variables; on the other ha...
research
11/16/2021

Isotropic vectors over global fields

We present a complete suite of algorithms for finding isotropic vectors ...
research
06/30/2018

The Historical Significance of Textual Distances

Measuring similarity is a basic task in information retrieval, and now o...
