OPORP: One Permutation + One Random Projection

02/07/2023
by   Ping Li, et al.
0

Consider two D-dimensional data vectors (e.g., embeddings): u, v. In many embedding-based retrieval (EBR) applications where the vectors are generated from trained models, D=256∼ 1024 are common. In this paper, OPORP (one permutation + one random projection) uses a variant of the “count-sketch” type of data structures for achieving data reduction/compression. With OPORP, we first apply a permutation on the data vectors. A random vector r is generated i.i.d. with moments: E(r_i) = 0, E(r_i^2)=1, E(r_i^3) =0, E(r_i^4)=s. We multiply (as dot product) r with all permuted data vectors. Then we break the D columns into k equal-length bins and aggregate (i.e., sum) the values in each bin to obtain k samples from each data vector. One crucial step is to normalize the k samples to the unit l_2 norm. We show that the estimation variance is essentially: (s-1)A + D-k/D-11/k[ (1-ρ^2)^2 -2A], where A≥ 0 is a function of the data (u,v). This formula reveals several key properties: (1) We need s=1. (2) The factor D-k/D-1 can be highly beneficial in reducing variances. (3) The term 1/k(1-ρ^2)^2 is actually the asymptotic variance of the classical correlation estimator. We illustrate that by letting the k in OPORP to be k=1 and repeat the procedure m times, we exactly recover the work of “very spars random projections” (VSRP). This immediately leads to a normalized estimator for VSRP which substantially improves the original estimator of VSRP. In summary, with OPORP, the two key steps: (i) the normalization and (ii) the fixed-length binning scheme, have considerably improved the accuracy in estimating the cosine similarity, which is a routine (and crucial) task in modern embedding-based retrieval (EBR) applications.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/26/2018

Sign-Full Random Projections

The method of 1-bit ("sign-sign") random projections has been a popular ...
research
12/01/2021

Invariance principle of random projection for the norm

Johnson-Lindenstrauss guarantees certain topological structure is preser...
research
09/10/2021

C-MinHash: Practically Reducing Two Permutations to Just One

Traditional minwise hashing (MinHash) requires applying K independent pe...
research
05/22/2023

Differential Privacy with Random Projections and Sign Random Projections

In this paper, we develop a series of differential privacy (DP) algorith...
research
01/20/2020

On the Joint Typicality of Permutations of Sequences of Random Variables

Permutations of correlated sequences of random variables appear naturall...
research
11/18/2021

C-OPH: Improving the Accuracy of One Permutation Hashing (OPH) with Circulant Permutations

Minwise hashing (MinHash) is a classical method for efficiently estimati...
research
10/10/2020

Temporal Vectorization for Stencils

Stencil computations represent a very common class of nested loops in sc...

Please sign up or login with your details

Forgot password? Click here to reset