2-Bit Random Projections, NonLinear Estimators, and Approximate Near Neighbor Search

02/21/2016
by Ping Li, et al.

The method of random projections has become a standard tool for machine learning, data mining, and search with massive data at Web scale. The effective use of random projections requires efficient coding schemes for quantizing (real-valued) projected data into integers. In this paper, we focus on a simple 2-bit coding scheme. In particular, we develop accurate nonlinear estimators of data similarity based on the 2-bit strategy. This work has important practical applications. For example, in the task of near neighbor search, a crucial step (often called re-ranking) is to compute or estimate data similarities once a set of candidate data points has been identified by hash table techniques. This re-ranking step can take advantage of the proposed coding scheme and estimator. As a related task, we also study a simple uniform quantization scheme for building hash tables with projected data. Our analysis shows that typically only a small number of bits is needed. For example, when the target similarity level is high, 2 or 3 bits might be sufficient. When the target similarity level is not so high, it is preferable to use only 1 or 2 bits. Therefore, a 2-bit scheme appears to be a good overall choice for sublinear-time approximate near neighbor search via hash tables. Combining these results, we conclude that 2-bit random projections should be recommended for approximate near neighbor search and similarity estimation. Extensive experimental results are provided.
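
As a rough illustration of the 2-bit coding idea in the abstract, the Python sketch below projects two unit-norm vectors onto Gaussian random directions and quantizes each projected value into one of four regions. The function names, the threshold `w = 0.75`, and the synthetic data are assumptions made for illustration. The similarity estimate shown is the classical 1-bit sign-agreement estimator, which relies on Pr(signs agree) = 1 - θ/π and is used here only as a stand-in; it is not the nonlinear 2-bit estimator developed in the paper.

```python
import math
import random


def unit(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def project(vec, directions):
    """Inner product of vec with each random direction."""
    return [sum(r * x for r, x in zip(row, vec)) for row in directions]


def code_2bit(x, w):
    """Map a projected value to a 2-bit code using cut points -w, 0, w.

    Regions: (-inf, -w) -> 0, [-w, 0) -> 1, [0, w) -> 2, [w, inf) -> 3.
    """
    if x < -w:
        return 0
    if x < 0.0:
        return 1
    if x < w:
        return 2
    return 3


def sign_agreement_estimate(codes_a, codes_b):
    """Stand-in estimator: uses only the sign bit of each 2-bit code.

    For Gaussian projections, Pr(signs agree) = 1 - theta / pi,
    where theta is the angle between the two original vectors.
    """
    agree = sum((a >= 2) == (b >= 2) for a, b in zip(codes_a, codes_b))
    p = agree / len(codes_a)
    return math.cos(math.pi * (1.0 - p))


# Synthetic example; all values below are illustrative assumptions.
rng = random.Random(0)
dim, k, w = 50, 2000, 0.75

base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
u = unit(base)
v = unit([b + 0.5 * rng.gauss(0.0, 1.0) for b in base])
rho_true = sum(a * b for a, b in zip(u, v))

directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
codes_u = [code_2bit(x, w) for x in project(u, directions)]
codes_v = [code_2bit(x, w) for x in project(v, directions)]

rho_hat = sign_agreement_estimate(codes_u, codes_v)
collision_rate = sum(a == b for a, b in zip(codes_u, codes_v)) / k

print("true cosine similarity:", round(rho_true, 3))
print("sign-agreement estimate:", round(rho_hat, 3))
print("2-bit code collision rate:", round(collision_rate, 3))
```

Each stored code needs only 2 bits per projection, so a candidate set returned by a hash table could be re-ranked from the compact codes alone. The sign-agreement estimator ignores the magnitude bit; an estimator that uses all four regions, as the paper proposes, would be expected to extract more information from the same codes.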


research
02/25/2021

Quantization Algorithms for Random Fourier Features

The method of random projection (RP) is the standard technique in machin...
research
10/17/2011

Anti-sparse coding for approximate nearest neighbor search

This paper proposes a binarization scheme for vectors of high dimension ...
research
04/18/2019

Query-Adaptive Hash Code Ranking for Large-Scale Multi-View Visual Search

Hash based nearest neighbor search has become attractive in many applica...
research
06/13/2023

Practice with Graph-based ANN Algorithms on Sparse Data: Chi-square Two-tower model, HNSW, Sign Cauchy Projections

Sparse data are common. The traditional “handcrafted” features are often...
research
04/27/2015

Sign Stable Random Projections for Large-Scale Learning

We study the use of "sign α-stable random projections" (where 0<α≤ 2) fo...
research
04/26/2018

Sign-Full Random Projections

The method of 1-bit ("sign-sign") random projections has been a popular ...
research
09/07/2021

C-MinHash: Rigorously Reducing K Permutations to Two

Minwise hashing (MinHash) is an important and practical algorithm for ge...
