Optimized Cartesian K-Means

05/16/2014
by Jianfeng Wang, et al.

Product quantization-based approaches are effective for encoding high-dimensional data points for approximate nearest neighbor search. The space is decomposed into a Cartesian product of low-dimensional subspaces, each of which generates a sub codebook. Data points are encoded as compact binary codes using these sub codebooks, and the distance between two data points can be approximated efficiently from their codes using precomputed lookup tables. Traditionally, to encode the subvector of a data point in a subspace, only one sub codeword from the corresponding sub codebook is selected, which may limit the search accuracy. In this paper, we propose a novel approach, named Optimized Cartesian K-Means (OCKM), to better encode data points for more accurate approximate nearest neighbor search. In OCKM, multiple sub codewords are used to encode the subvector of a data point in each subspace. Each sub codeword stems from a different sub codebook in that subspace, and the sub codebooks are generated optimally with regard to minimizing the distortion error. The high-dimensional data point is then encoded as the concatenation of the indices of the multiple sub codewords from all subspaces. This provides more flexibility and lower distortion errors than traditional methods. Experimental results on standard real-life datasets demonstrate the superiority of OCKM over state-of-the-art approaches for approximate nearest neighbor search.
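
To make the quantization pipeline concrete, the sketch below illustrates standard product quantization encoding with lookup-table distance approximation, alongside a simplified OCKM-style encoder that picks multiple sub codewords per subspace, each from a separate sub codebook. The sizes, the random codebooks, and the greedy residual-based selection are illustrative assumptions for this sketch only; they are not the paper's optimized codebook learning or encoding procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumptions, not the paper's settings):
# D-dimensional vectors, M subspaces, K sub codewords per sub codebook.
D, M, K = 8, 2, 16
d = D // M                      # dimension of each subspace

# Assume sub codebooks were already learned (e.g. by k-means per subspace).
codebooks = rng.normal(size=(M, K, d))

def pq_encode(x):
    """Standard product quantization: one sub codeword index per subspace."""
    code = np.empty(M, dtype=np.uint8)
    for m in range(M):
        sub = x[m * d:(m + 1) * d]
        # pick the nearest sub codeword in this subspace
        code[m] = np.argmin(((codebooks[m] - sub) ** 2).sum(axis=1))
    return code

def pq_asymmetric_distance(q, code):
    """Approximate ||q - x||^2 from x's code via per-subspace lookup tables."""
    # Distance from each query subvector to every sub codeword: shape (M, K).
    tables = np.stack([
        ((codebooks[m] - q[m * d:(m + 1) * d]) ** 2).sum(axis=1)
        for m in range(M)
    ])
    return sum(tables[m, code[m]] for m in range(M))

# OCKM-style encoding (sketch): T sub codewords per subspace, each drawn from
# a different sub codebook, so the code stores T indices per subspace.
T = 2
ockm_codebooks = rng.normal(size=(M, T, K, d))   # assumed already optimized

def ockm_encode(x):
    """Greedy illustration: in each subspace, pick one codeword from each of
    the T sub codebooks so their sum approximates the subvector."""
    code = np.empty((M, T), dtype=np.uint8)
    for m in range(M):
        residual = x[m * d:(m + 1) * d].copy()
        for t in range(T):
            idx = np.argmin(((ockm_codebooks[m, t] - residual) ** 2).sum(axis=1))
            code[m, t] = idx
            residual -= ockm_codebooks[m, t, idx]
    return code

x = rng.normal(size=D)
q = rng.normal(size=D)
print("PQ code:", pq_encode(x))
print("approx dist:", pq_asymmetric_distance(q, pq_encode(x)))
print("OCKM-style code:", ockm_encode(x).ravel())
```

With the tables precomputed once per query, approximating the distance to any database point costs only a handful of table lookups per subspace, which is what makes code-based search fast in both the traditional and the multi-codeword settings.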
